Re: UUID version 6 proposal, initial feedback

Brad Peabody <bradgareth@gmail.com> Fri, 31 January 2020 20:52 UTC

Subject: Re: UUID version 6 proposal, initial feedback
From: Brad Peabody <bradgareth@gmail.com>
To: IETF discussion list <ietf@ietf.org>
Date: Fri, 31 Jan 2020 12:52:18 -0800
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/TUHRhmb96wIQhbXGGa3rssnZekE>

I've given this UUID idea a lot more thought and had more discussion on 
it and I wanted to get feedback on these latest ideas.  To summarize:

- Instead of defining one single way to generate UUIDs, the spec should 
offer a set of options from which an application picks the features it 
needs.  Properties like case-insensitivity, how much random data is 
included, time ordering with a timestamp, and others, can be highly 
application-specific.  Trying to specify one set that works for everyone 
is unrealistic and unnecessary.

- There appear to be three main aspects to generating a UUID: timestamp, 
random data and the text encoding of it.  Most of the options are 
centered around these three aspects.  Examples: timestamps can be in 
second-precise ("ts"), millisecond-precise ("tms") and 
nanosecond-precise ("tns") forms.  Random data can be in various 
lengths: "r64" for 8 bytes of random data might be okay for many 
applications, some might want "r128" for the additional uniqueness, and 
others might only want "r32" where occasional duplication is not dire 
(example: a unique ID for an error message used for debugging) and it's 
better to have a shorter ID.  Various encoding forms like "b32", "b64" 
(and some alphabet variations on these) and "hex" can be used.  These 
are "format options" and together they make a "format spec".

- There's also an option for some static bytes to be appended - for 
example a machine ID in a cluster. Useful for fully guaranteeing 
uniqueness in cases where there's a controlled environment.

- In practical terms and as an example, a format spec of "tns r64 b32" 
would result in a nanosecond-precise timestamp (64 bits) followed by 64 
bits of random data, encoded in base32. Alternatively, if you want to 
save on text length a bit and case-insensitivity is not important for 
your application, use "tns r64 b64" instead.

- UUIDs in text form are parseable if one knows the format spec used.  
(This addresses the case of some applications needing to extract the 
timestamp, without over complicating the UUID text format by trying to 
somehow include the format spec in it.) Otherwise it can also of course 
be used as an opaque string.

- The special format option "rfc4122" results in a timestamp in the 
format of that RFC, the number 6 in the version bit field, and the lower 
64 bits filled with random data.   This addresses the backward 
compatibility concern while allowing newer applications to be more 
flexible (using other options instead).

This is the document I've been putting together; it's still very rough, 
but has a fair amount of detail: 
https://docs.google.com/document/d/1bctTr14CrxzjHUIRAkT8jB46Jomr9aB2JQ9hDCh3cJg

Any feedback on this would be greatly appreciated.  Overall my plan is to 
complete the document above and then write an implementation in Go and 
in C (and maybe it's time to learn Rust, lol) to make sure it works in 
practice.  Then I'd convert the document into an IETF draft and try to 
move it toward becoming an RFC.

Best, Brad


On 7/4/18 1:44 AM, Brad Peabody wrote:
> Thanks Martin,
>
> Those are really good points and I agree.
>
> I’m a bit concerned about moving entirely away from the existing 
> format, as I think there are some useful properties - particularly the 
> ability to extract the time stamp. Also by producing newer versions of 
> UUIDs that are the same size and layout a lot of existing software 
> will continue to work (for example, Cassandra understands UUIDs and 
> works with my prototype “version 6” UUIDs without any modification).
>
> But the points you bring up are quite relevant.  I’m not sure how to 
> address the clock skew leakage that might occur, but definitely the 
> counters and MAC address can go the way of the buffalo. Simply reading 
> from /dev/urandom, et al makes so much more sense these days.
>
> I think there is also real potential in the idea of making them 
> variable length.  If you need compatibility with existing UUID schemes 
> you can use the 128 bit length, but you can also go longer if you 
> require more guarantee of uniqueness or un-guess-ability.
>
> The only other thing that has me pondering on this is how important is 
> the property of being able to tell “is this a valid UUID, and if so 
> what type”.  If we don’t care about this, then it might make sense to 
> have two additional formats - a base64 which is as compact as 
> reasonably possible while still being ASCII, and a base32 which is 
> case-insensitive - they both have merit depending on the use case. 
> But then add the fact that it’s variable length and it becomes 
> impossible to distinguish the format reliably (i.e. you might read the 
> base32 value as base64 by accident).  I can think of some ways to work 
> around this in the format using specific characters in one and not the 
> other, but it starts to feel really hacky.  I’m just not sure if this 
> is actually important for real-world applications.  Many applications 
> just use the UUID opaquely, but not all.  And again, extracting the 
> timestamp can be useful - but then is it reasonable to also say “if 
> you want the time, you’ll need to know the format, you won’t be able 
> to automatically tell the difference between a base64 and base32 
> UUID”.  One possible simple solution: put a period after the first 128 
> bits.
>
> Still not sure on this, input welcome.
>
> All that said, it seems to me we’re headed for a spec that has the 
> following properties:
>
> - Encodes timestamp
> - Sorts correctly as raw bytes
> - Has the version field in the same place for compatibility
> - Supports the existing 128-bit hex representation 
> (NNNNNNNN-NNNN-NNNN-NNNN-NNNNNNNNNNNN) for compatibility but also...
> - Specifies alternative (url-safe) base64 encoding for compact string 
> representation while still sorting correctly; and also a base32 
> encoding where case-insensitivity is required
> - Instead of existing dangerous bits like counter and MAC address, use 
> secure random data available from OS
> - Can be arbitrarily longer as needed, filled with additional random data.
>
> Does that seem about right?
>
> Also, when things settle in terms of the high level requirements, at 
> what point should I do a rough draft of the spec on this?  (And is 
> there some sort of outline I would follow?)
>
> -Brad
>
>
>> On Jul 3, 2018, at 7:11 PM, Martin Thomson <martin.thomson@gmail.com> 
>> wrote:
>>
>> Nice presentation.
>>
>> I've always been skeptical of the value of UUIDs.  It's very easy to
>> make a unique string.  Formatting it in a particular way, however,
>> serves no real purpose and it might not be necessary for your use
>> case.
>>
>> UUIDs (aside from v4) also have some nasty properties.  For instance,
>> revealing a MAC address is now considered problematic.  As is exposing
>> the output of a clock, which has been used in interesting ways to
>> undermine privacy (clock skew can be used for de-anonymisation, even
>> for determining CPU load).  Counters and timers also allow an observer
>> to correlate the creation of different identifiers.
>>
>> Why not make a string with the properties you desire?  You might need
>> more than 120 bits to achieve all that, but that's just another reason
>> not to use UUIDs.  A library that provided these functions (maybe
>> while addressing the above concerns), would be a valuable addition.
>> On Wed, Jul 4, 2018 at 10:42 AM Brad Peabody <bradgareth@gmail.com> 
>> wrote:
>>>
>>> I would like to get some initial feedback, and suggested next steps, 
>>> on the idea of making an RFC that covers a version 6 for UUIDs.  It 
>>> would have an embedded timestamp and sorting by its raw bytes 
>>> results in a sequence equivalent to sorting by time.  This is not 
>>> addressed by existing UUID types.  (The closest one is version 1, 
>>> but due to the byte encoding it does not sort correctly.)
>>>
>>> There also seems to be some ambiguity in the existing RFC on when 
>>> to increment the counter field which I think can be clarified.
>>>
>>> I did a prototype implementation and basic write-up on the details 
>>> here: https://bradleypeabody.github.io/uuidv6/
>>>
>>> I’m also thinking of covering the base64 encoding form which retains 
>>> the sorting properties but makes it human readable and is 
>>> significantly shorter than the long hex encoded form.
>>>
>>> Having these things is very useful when considering UUIDs as 
>>> database primary keys, which is becoming more and more common in 
>>> distributed systems; and indeed this is the main motivation.
>>>
>>> Any help or advice on moving forward with this would be appreciated.
>>>
>>> - Brad
>>>
>