Re: UUID version 6 proposal, initial feedback

Brad Peabody <bradgareth@gmail.com> Wed, 04 July 2018 18:15 UTC

Return-Path: <bradgareth@gmail.com>
X-Original-To: ietf@ietfa.amsl.com
Delivered-To: ietf@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6003B130E30 for <ietf@ietfa.amsl.com>; Wed, 4 Jul 2018 11:15:40 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -0.692
X-Spam-Level:
X-Spam-Status: No, score=-0.692 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, TRACKER_ID=1.306, URIBL_BLOCKED=0.001] autolearn=no autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 9rolorOjZVBb for <ietf@ietfa.amsl.com>; Wed, 4 Jul 2018 11:15:37 -0700 (PDT)
Received: from mail-pf0-x22d.google.com (mail-pf0-x22d.google.com [IPv6:2607:f8b0:400e:c00::22d]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id ED4C2130E6D for <ietf@ietf.org>; Wed, 4 Jul 2018 11:15:36 -0700 (PDT)
Received: by mail-pf0-x22d.google.com with SMTP id u16-v6so3226276pfh.3 for <ietf@ietf.org>; Wed, 04 Jul 2018 11:15:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=m2Y+HsaTX4kEZTXbLuxgBWy9VpckYn50Hm3leB8Ea8I=; b=GsCFveB9kieUYIIBUEwkEEgumybFQ83X/ti2HOEt7vkLpU/zsIwQ7CgA7WWDypWh5+ r6g73+fkeAyBsPPi63EjVBXPk4TE/rO0ZWC2++ks/AsbXhHcgojN5j6UW0/1DK8aXZfu G44C8MWc0Cx4d0r9NWssn5Ybxp6IYb9kGZUUrtlDOu15c1oEVcx8Tq6cZKrU0wSPYVD1 zVGUBKmCNBBWBErrzTNebfSP+N4izcze1DrjQt/CTlEZr8q52MuJ0FvdVdRA8by0yT2y BfCsvx9BhIxBCh5XJYykBZaGJ5Z1C0p2H5L3C9zs3jLzgJ3VtxRMQfof+DyvgKjXY+jN H6xQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=m2Y+HsaTX4kEZTXbLuxgBWy9VpckYn50Hm3leB8Ea8I=; b=bjE7TfnGQ77xs4UVlj5IdM0/nj77HDkxQNbZPxzZyx9KiPFa4X1FkVEX3ByDMb7DyG bP6ZHurzVitoV3Xis7ISO9vU1u3cYM6y3p/GQER2d0Zg9/AhGuMT3tmwW79bgGw1mnpA zQWEyxS0CXm3MNze/3zsCUJrMkz+9+rvL1USPGLp+1asZ/mtxb77BvpDX/VK//cwjuqJ 6MFJP9rxubYoWH9WQVCK0mIc/4Oh9PI2nhAgesYE3RlAy1eICCjGnOT6f64P1M97UbzR c/0qcl8bh3vQA7xhqyZd6dNQQN5pSdUMiL5jz3fBiuhCvquCDW7+78f+eQbBm//mb9kG IMHA==
X-Gm-Message-State: APt69E2SgiRQ2gySf8hK+Fxt0RIl/vkH6wtOBFpzocOuDLHTbwfoIkwD q9wVOutKhFy2VTDaMBBQKtc=
X-Google-Smtp-Source: AAOMgpcI2m+g10jlUM2sqlXz5o3onwsPAHkmn3eWPzXWARvBCNYwOrdsciFnEmlrWjbq17u8e+sQUg==
X-Received: by 2002:a62:4898:: with SMTP id q24-v6mr3204871pfi.58.1530728136544; Wed, 04 Jul 2018 11:15:36 -0700 (PDT)
Received: from [192.168.1.102] (cpe-23-242-42-48.socal.res.rr.com. [23.242.42.48]) by smtp.gmail.com with ESMTPSA id v66-v6sm9189836pfb.84.2018.07.04.11.15.35 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Jul 2018 11:15:35 -0700 (PDT)
From: Brad Peabody <bradgareth@gmail.com>
Message-Id: <E1877C87-E8D8-4A8C-89F8-B519E3A8499F@gmail.com>
Content-Type: multipart/alternative; boundary="Apple-Mail=_83D620EB-51FB-4A07-AA2A-B3B6E917C6B4"
Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\))
Subject: Re: UUID version 6 proposal, initial feedback
Date: Wed, 04 Jul 2018 11:15:34 -0700
In-Reply-To: <CABkgnnUO0sQfT94hE27MrL47XUH1AQ42KAGPzuB+d2YDz0FjCg@mail.gmail.com>
Cc: IETF discussion list <ietf@ietf.org>, ben@benramsey.com, Tony Finch <dot@dotat.at>
To: Martin Thomson <martin.thomson@gmail.com>
References: <D0894516-3F20-4545-BD7D-BE4FA96FAF75@gmail.com> <CABkgnnXSxqqinyK4QiwVv-VuzAraHFUGCrm0K0e9dJX_F80bWg@mail.gmail.com> <076C4555-C150-43F0-9D09-CFDF09D0F0A8@gmail.com> <CABkgnnUO0sQfT94hE27MrL47XUH1AQ42KAGPzuB+d2YDz0FjCg@mail.gmail.com>
X-Mailer: Apple Mail (2.3273)
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf/CQghioUyUpjMaJ21MJlDrwDy38k>
X-BeenThere: ietf@ietf.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: IETF-Discussion <ietf.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ietf>, <mailto:ietf-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ietf/>
List-Post: <mailto:ietf@ietf.org>
List-Help: <mailto:ietf-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ietf>, <mailto:ietf-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 04 Jul 2018 18:15:41 -0000

The primary motivation for compatibility is that there is a lot of existing software that expects to parse a UUID as:

NNNNNNNN-NNNN-NNNN-NNNN-NNNNNNNNNNNN

And store it as:

byte[16]

Examples:

https://github.com/s-u/uuid/blob/master/src/uuid.h <https://github.com/s-u/uuid/blob/master/src/uuid.h>
https://docs.oracle.com/javase/6/docs/api/java/util/UUID.html <https://docs.oracle.com/javase/6/docs/api/java/util/UUID.html>
https://docs.python.org/3/library/uuid.html <https://docs.python.org/3/library/uuid.html>

The binary form is more compact, especially when compared to hex.   It seems reasonable to me that applications should be able to continue to parse a text UUID and store it in it’s most compact form in memory or on disk.    An opaque string is useful in a lot of cases, but I think it’s also totally possible and useful to support a standard text format so programs can get at the raw data for space-saving concerns.

That said, I do want to make sure I understand the alternatives.  Are you thinking that maybe the clock resolution should be lowered (100-ns resolution is not needed for index locality) and then the additional bits can be used for randomness?  (Indeed this is the approached used by some of the projects linked to in the message from Tony Finch.)

If so, my only concern is that the ability to use the timestamp as a creation time is useful for some applications.

One idea would be to introduce two new versions, one with the extractable and high-resolution timestamp, one with a lower resolution time focused on providing as many random bits as possible.  Both with the improved text formats (base64 and base32) and variable length additions mentioned previously.  What do you think about that?


> On Jul 4, 2018, at 2:40 AM, Martin Thomson <martin.thomson@gmail.com> wrote:
> 
> Still not sure that I understand the compatibility motivation.  Unless
> you are trying to fill pre-existing fields that only accept UUID, then
> an text or octet string key is superior in almost every way to a UUID.
> 
> I appreciate the value of locality of reference (assuming null
> hashing, contiguous or near-contiguous keys would end up in close
> proximity), but that's better served by text or octet string as well.
> When you consider birthday paradox and the need for global uniqueness,
> the 120 bits you have available to you needs to be spent mostly on
> randomness unless your database is so small as to make this effort
> essentially more academic than useful.
> On Wed, Jul 4, 2018 at 6:52 PM Brad Peabody <bradgareth@gmail.com> wrote:
>> 
>> Thanks Martin,
>> 
>> Those are really good points and I agree.
>> 
>> I’m a bit concerned about moving entirely away from the existing format, as I think there are some useful properties - particularly the ability to extract the time stamp. Also by producing newer versions of UUIDs that are the same size and layout a lot of existing software will continue to work (for example, Cassandra understands UUIDs and works with my prototype “version 6” UUIDs without any modification).
>> 
>> But the points you bring up are quite relevant.  I’m not sure how to address the clock skew leakage that might occur, but definitely the counters and MAC address can go the way of the buffalo. Simply reading from /dev/urandom, et al makes so much more sense these days.
>> 
>> I think there is also real potential in the idea of making them variable length.  If you need compatibility with existing UUID schemes you can use the 128 bit length, but you can also go longer if you require more guarantee of uniqueness or un-guess-ability.
>> 
>> The only other thing that has me pondering on this is how important is the property of being able to tell “is this a valid UUID, and if so what type”.  If we don’t care about this, then it might make sense to have two additional formats - a base64 which is as compact as reasonably possible while still being ASCII, and a base32 which is case-insensitive - they both have merit depending on the use case.  But then add the fact that it’s variable length and it becomes impossible to distinguish the format reliably (i.e you might read the base32 value as base64 by accident).  I can think of some ways to work around this in the format using specific characters in one and not the other, but it starts to feel really hacky.  I’m just not sure if this is actually important for real-world applications.  Many applications just use the UUID opaquely, but not all.  And again, extracting the timestamp can be useful - but then is it reasonable to also say “if you want the time, you’ll need to know the format, you won’t be able to automatically tell the difference between a base64 and base32 UUID”.  One possible simple solution: For UUIDs longer than 128 bits, put a period after the first 128 bits and then encode the rest.  This would make it possible with a few basic checks to tell which encoding format, and would remain URL safe.  And is simple to implement.
>> 
>> All that said, it seems to me we’re headed for a spec that has the following properties:
>> 
>> - Encodes timestamp
>> - Sorts correctly as raw bytes
>> - Has the version field in the same place for compatibility
>> - Supports the existing 128-bit hex representation (NNNNNNNN-NNNN-NNNN-NNNN-NNNNNNNNNNNN) for compatibility but also...
>> - Specifies alternative (url-safe) base64 encoding for compact string representation while still sorting correctly; and also a base32 encoding where case-insensitivity is required (ideally with a way to scan the string and easily tell which encoding format is being used)
>> - Instead of existing dangerous bits like counter and MAC address, use secure random data available from OS
>> - Can be arbitrarily longer as needed, filled with additional random data.
>> 
>> Does that seem about right?
>> 
>> Also, when things settle in terms of the high level requirements, at what point should a rough draft of the spec on this be written?  (And is there some sort of outline I would follow?)
>> 
>> -Brad
>> 
>> 
>>> On Jul 3, 2018, at 7:11 PM, Martin Thomson <martin.thomson@gmail.com> wrote:
>>> 
>>> Nice presentation.
>>> 
>>> I've always been skeptical of the value of UUIDs.  It's very easy to
>>> make a unique string.  Formatting it in a particular way, however,
>>> serves no real purpose and it might not be necessary for your use
>>> case.
>>> 
>>> UUIDs (aside from v4) also have some nasty properties.  For instance,
>>> revealing a MAC address is now considered problematic.  As is exposing
>>> the output of a clock, which has been used in interesting ways to
>>> undermine privacy (clock skew can be used for de-anonymisation, even
>>> for determining CPU load).  Counters and timers also allow an observer
>>> to correlate the creation of different identifiers.
>>> 
>>> Why not make a string with the properties you desire?  You might need
>>> more than 120 bits to achieve all that, but that's just another reason
>>> not to use UUIDs.  A library that provided these functions (maybe
>>> while addressing the above concerns), would be a valuable addition.
>>> On Wed, Jul 4, 2018 at 10:42 AM Brad Peabody <bradgareth@gmail.com> wrote:
>>>> 
>>>> I would like to get some initial feedback, and suggested next steps, on the idea of making an RFC that covers a version 6 for UUIDs.  It would have an embedded timestamp and sorting by its raw bytes results in a sequence equivalent to sorting by time.  This is not addressed by existing UUID types.  (The closest one is version 1, but due to the byte encoding it does not sort correctly.)
>>>> 
>>>> There is also seems to be some ambiguity in the existing RFC on when to increment the counter field which I think can be clarified.
>>>> 
>>>> I did a prototype implementation and basic write-up on the details here: https://bradleypeabody.github.io/uuidv6/
>>>> 
>>>> I’m also thinking of covering the base64 encoding form which retains the sorting properties but makes it human readable and is significantly shorter than the long hex encoded form.
>>>> 
>>>> Having these things is very useful when considering UUIDs as database primary keys, which is becoming more and more common in distributed systems; and indeed this is the main motivation.
>>>> 
>>>> Any help or advice on moving forward with this would be appreciated.
>>>> 
>>>> - Brad
>>>> 
>>