[openpgp] User ID conventions (it's not really a RFC2822 name-addr)
Daniel Kahn Gillmor <dkg@fifthhorseman.net> Mon, 16 September 2019 22:35 UTC
Message-ID: <87woe7zx7o.fsf@fifthhorseman.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/openpgp/wNo27-0STfGR9JZSlC7s6OYOJkI>

Hey OpenPGP folks--

I'd like to have a clearer understanding about the actual conventions
for OpenPGP User IDs in the context of e-mail.

The standards currently say that the convention is an RFC 2822
"name-addr", but (as detailed below) that does not appear to be the
actual convention in practice.  While we're updating RFC 4880, we
should fix the standards to reflect reality.

There are two proposals at the end that i'd love feedback on.  I
prefer proposal 2.

Claims about name-addr
----------------------

RFC 4880 says the following:

    5.11.  User ID Packet (Tag 13)

       A User ID packet consists of UTF-8 text that is intended to
       represent the name and email address of the key holder.  By
       convention, it includes an RFC 2822 [RFC2822] mail name-addr,
       but there are no restrictions on its content.  The packet
       length in the header specifies the length of the User ID.

RFC4880bis repeats the above, and adds:

    5.13.2.  User ID Attribute Subpacket

       […]

       A User ID Attribute subpacket, just like a User ID packet,
       consists of UTF-8 text that is intended to represent the name
       and email address of the key holder.  By convention, it
       includes an RFC 2822 [RFC2822] mail name-addr, but there are no
       restrictions on its content.  For devices using OpenPGP for
       device certificates, it may just be the device identifier.
       The packet length in the header specifies the length of the
       User ID.

Both of these references to RFC 2822 are problematic.  Real user IDs
don't look like this, and other implementations won't parse things
this way either, so implementers might be led astray by this
documentation.

User ID convention is not a name-addr
-------------------------------------

Here are a few concrete reasons why the convention is not actually an
RFC 2822 name-addr:

 a) name-addr in RFC 2822 is defined to be a US-ASCII field,
    potentially charset-switched with RFC 2047 extensions in at least
    the display-name part.  But User IDs are native UTF-8.
    For example, compare the following strings:

      1) Björn Björnson <bjoern@example.net>
      2) Bj=?utf-8?q?=C3=B6?=rn Bj=?utf-8?q?=C3=B6?=rnson <bjoern@example.net>

    We expect User IDs to look like (1), even though (2) is
    technically an RFC 2822 name-addr.  We don't want people to
    generate user IDs like (2), and we don't want implementations to
    have to apply RFC 2047 decoding to the contents of a user ID
    packet in order to display it.

 b) name-addr doesn't allow non-quoted internal commas or apostrophes,
    so the following common User ID patterns are not technically
    name-addrs either, though implementations generate them, and
    people use them just fine in the real world:

      3) Acme Industries, Inc. <info@acme.example>
      4) Michael O'Brian <obrian@example.biz>
      5) Smith, John <jsmith@example.com>

 c) in RFC 2822, a "name-addr" is not the same as a "mailbox" -- a
    "mailbox" is either a "name-addr" (which contains an "addr-spec")
    or an "addr-spec" on its own.  But we have many examples in flight
    today of user IDs that are just a raw "addr-spec" (without
    angle-brackets), and those tend to be accepted by many OpenPGP
    implementations:

      6) mariag@example.org

 d) the "display-name" part of an RFC 2822 "name-addr" is a "phrase"
    (a series of "word"s, which are either "atom"s or
    "quoted-string"s).  An "atom" cannot contain the "@" symbol, so
    the display-name cannot contain an unquoted @.  However, due to
    infelicities in common interfaces, we also see a large number of
    user IDs that simply replicate the addr-spec as though it were
    the display name.  This is not a valid name-addr, but it is
    accepted by most OpenPGP implementations.  For example:

      7) joe@example.net <joe@example.net>

These differences between RFC 2822's name-addr and the actual user IDs
in use today suggest that the guidance that they are "by convention" a
name-addr is a mistake, and a potentially damaging one at that.
It's likely to cause implementers to do expensive implementations of
the complex name-addr syntax, which they then have to make exceptions
for when they encounter all the real-world counterexamples.

At the same time, we don't want implementers to each have their own
arbitrary deviations from the convention -- the more uniform we can
make the convention, the more likely we'll be to have
interoperability.

Goals
-----

AFAICT, there is one main, uncontroversial technical goal for an
e-mail-focused OpenPGP implementation when dealing with user IDs:

 A) extract the addr-spec

    If the implementation can't figure out the addr-spec, it can't
    use the certificate to learn how to contact the key holder.  And
    if the implementation can't index internally by addr-spec, then
    it can't find the appropriate certificate to use when trying to
    contact a given e-mail address.

What we really want is for every implementation to do this in a robust
and predictable way, including for all of the common non-name-addr
forms described above.

Are there any other goals that people think this convention should
cover?

Some (possibly-contentious) additional goals:

 B) accepting UTF-8 addr-specs

    Recent RFCs about internationalization accept non-ASCII characters
    in domain names and local-parts of the addr-spec:

      https://tools.ietf.org/html/rfc6530#section-10.1
      https://tools.ietf.org/html/rfc6532#section-3.2

    Do we expect user agents to be able to extract addr-specs that
    look like:

      иван.сергеев@пример.рф
      Dörte@Sörensen.example.com

    (These examples are from
    https://en.m.wikipedia.org/wiki/International_email)

 C) accepting really unusual addr-specs:

    The addr-spec definition formally includes some really bizarre
    structures that (while probably in use on some legacy systems) are
    a really bad idea.
    For example, local-parts that are wrapped in double-quotes and
    contain otherwise-forbidden characters can be problematic:

      "Abc@def"@example.com
      "Fred Bloggs"@example.com

    It looks to me like RFC 5322 even allows CFWS in the local-part,
    ugh.  Do we expect user agents to do anything sensible with these
    addresses?

A non-goal (does anyone want this?):

 D) be able to distinguish the "comment" from the "name" in
    display-name:

    Despite several implementations appearing to distinguish "Comment"
    from "Name" in the display-name, it's not clear that anyone *does*
    anything with that information, so it's mainly clutter and
    confusion.  On top of that, there are probably more useless
    comments than useful ones, so i'd be happy to let this misfeature
    die out.

Proposal 1: unicode maybe-wrapped addr-spec
-------------------------------------------

We can address goals A, B, and C with some sort of language that
acknowledges reality if we accept the following:

 * addr-spec from RFC 5322 is augmented by the definitions in RFC 6532
   section 3

 * there is no structure that we care about in what we would have
   called the "display-name" part of the supposed name-addr.

Then the user ID convention becomes (again, assuming atext as
augmented by 6532 §3):

    pgp-uid-prefix-char = atext / specials
    pgp-uid-convention = addr-spec /
                         *pgp-uid-prefix-char "<" addr-spec ">"

Proposal 2: simplify, simplify
------------------------------

Proposal 1 is still pretty ugly due to the inherent complexities of
addr-spec itself.  We can simplify the formal addr-spec greatly if:

 - we don't allow CFWS or quoted-string in the local-part, and
 - we don't allow CFWS or domain-literal addresses in the domain, and
 - we drop all the obsolete variants ("obs-*" labels in RFC 5322 ABNF)

CFWS is "comments and folding whitespace".  Dropping comments is
justified by the argument that comments can go elsewhere in the user
ID.
Folding-whitespace isn't necessary due to the structure of the user
ID itself -- we're not in an e-mail message header.  Dropping obsolete
parts is justified because they're obsolete.  Dropping quoted-string
is justified because it's rarely used, and likely to break in reality.
And dropping domain-literal parts is justified because no one delivers
e-mail to raw IP addresses anyway.

Note that yes, this will discard some legitimate (if odd) addresses
(e.g. ones with CFWS or quoted-string), and it may fail to recognize
some legacy (odd) user IDs (obs-* or domain-literal).  But we're
describing a convention here, not making a normative statement, and we
can do much better than the convention described in the current text,
which pretty much every implementation fails to follow.

Using the definitions in RFC 5322 and RFC 5234, as augmented by RFC
6532 section 3, we can implement this simplification like so:

    pgp-addr-spec = dot-atom-text "@" dot-atom-text
    pgp-uid-prefix-char = atext / specials
    pgp-uid-convention = pgp-addr-spec /
                         *pgp-uid-prefix-char "<" pgp-addr-spec ">"

Note that every pgp-addr-spec is by definition an addr-spec (though
not all addr-specs are a pgp-addr-spec).

I believe that proposal 2 is closer to what most implementations do
today, and it handles goals A and B.  I don't mind it failing at goal
C because of how much simpler the matching rule is.

Conclusion
----------

My preference is to replace the text about User ID conventions in RFC
4880bis with proposal 2, but i'd be open to hearing other suggestions
if anyone has them.

       --dkg

PS in researching other ways to solve this problem, i came up with an
   approach that relies on Unicode character properties, in particular
   Grapheme_Base and Grapheme_Extend, as a way to exclude control
   chars and other non-printables.  This is a more
   sophisticated/nuanced approach than the RFC 6532 ABNF extensions to
   atext.
   But specifying it requires a character class set subtraction
   operation (you want to subtract "<" and ">" and "@" and " " from
   the Grapheme_* classes), which isn't listed in IETF's ABNF
   definition in RFC 5234.  And implementing it requires a toolkit
   capable of discerning and acting on Unicode properties (e.g. the
   python regex module from PyPI, but not the re module from python's
   stdlib).

   That's too bad, because 6532 §3 effectively makes things like
   U+200B ZERO WIDTH SPACE allowable within dot-atom-text, which is
   uncomfortable and weird.  But other implementers reliant on 6532
   might accept such a local-part anyway.

   These costs don't appear to be worth the minor gain compared to
   proposal 2, so i've stopped attempting to document that approach.
   If anyone wants to take a crack at it though, i'm happy to share my
   notes.
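
For anyone who wants to poke at proposal 2, here is a minimal Python
sketch of its ABNF as a regular expression.  It is an illustration
under stated assumptions, not a reference implementation: the function
name extract_addr_spec is hypothetical, the non-ASCII range is only a
rough stand-in for UTF8-non-ascii from RFC 6532 section 3, and it
assumes that a plain space is acceptable before the "<" (the ABNF as
written lists only atext and specials there, but every bracketed
example in this message contains a space).

```python
import re
from typing import Optional

# atext from RFC 5322 section 3.2.3.  The trailing range admits every
# non-ASCII code point: a rough stand-in (assumption) for the
# UTF8-non-ascii extension in RFC 6532 section 3.
_ATEXT_CHARS = (
    "A-Za-z0-9" + re.escape("!#$%&'*+-/=?^_`{|}~") + "\u0080-\U0010FFFF"
)
# specials from RFC 5322 section 3.2.3.
_SPECIALS = re.escape('()<>[]:;@\\,."')

_ATEXT = "[" + _ATEXT_CHARS + "]"
# dot-atom-text = 1*atext *("." 1*atext)
_DOT_ATOM_TEXT = _ATEXT + r"+(?:\." + _ATEXT + "+)*"
# pgp-addr-spec = dot-atom-text "@" dot-atom-text
_PGP_ADDR_SPEC = _DOT_ATOM_TEXT + "@" + _DOT_ATOM_TEXT
# pgp-uid-prefix-char = atext / specials, plus a space (assumption,
# not part of the ABNF as written).
_PREFIX_CHAR = "[" + _ATEXT_CHARS + _SPECIALS + " ]"

# pgp-uid-convention = pgp-addr-spec /
#                      *pgp-uid-prefix-char "<" pgp-addr-spec ">"
_UID_RE = re.compile(
    "(?:"
    + _PREFIX_CHAR + "*<(" + _PGP_ADDR_SPEC + ")>"
    + "|(" + _PGP_ADDR_SPEC + ")"
    + ")"
)


def extract_addr_spec(uid: str) -> Optional[str]:
    """Return the addr-spec of a User ID, or None if it does not fit."""
    match = _UID_RE.fullmatch(uid)
    if match is None:
        return None
    # Group 1 is the angle-bracketed form, group 2 the bare addr-spec.
    return match.group(1) or match.group(2)


if __name__ == "__main__":
    for uid in (
        "Björn Björnson <bjoern@example.net>",
        "Smith, John <jsmith@example.com>",
        "mariag@example.org",
        "joe@example.net <joe@example.net>",
        '"Fred Bloggs"@example.com',
    ):
        print(repr(uid), "->", repr(extract_addr_spec(uid)))
```

Under these assumptions, examples 1 and 3 through 7 above all yield
their addr-spec, the quoted local-parts from goal C and
domain-literals are rejected, and a local-part with an embedded
zero-width character such as U+200B is still accepted, which is the
discomfort described in the PS.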