Re: [precis] [media-types] Internet media type application/pkcs8-encrypted rev 2

On 11/23/2015 5:31 PM, Martin J. Dürst wrote:
> Hello Sean,
>
> I have cc'ed the precis mailing list because some of what I'll write 
> below is relevant for the discussion you have started there. This is 
> also the reason why I'm keeping most of the previous context.
>
> On 2015/11/11 00:25, Sean Leonard wrote:
>> Hello Martin,
>>
>> On Nov 10, 2015, at 1:45 AM, Martin J. Dürst <duerst@it.aoyama.ac.jp> 
>> wrote:
>>
>>> Hello Sean,
>>>
>>> I have a few questions re. your registration below.
>>>
>>> On 2015/11/05 14:57, Sean Leonard wrote:
>>>> Hello:
>>>>
>>>> To keep this moving, trying a different thing. Please review.
>>>>
>>>> Sean
>>>>
>>>> *****
>>>>
>>>> Type name: application
>>>>
>>>> Subtype name: pkcs8-encrypted
>>>>
>>>> Required parameters: N/A
>>>>
>>>> Optional parameters:
>>>> charset: When the private key encryption algorithm incorporates a 
>>>> “password” that is an octet string, a mapping between user input 
>>>> and the octet string is desirable. PKCS #5 [RFC2898] Section 3 
>>>> recommends "that applications follow some common text encoding 
>>>> rules"; it then suggests, but does not recommend, ASCII and UTF-8. 
>>>> This parameter specifies the charset that a recipient SHOULD 
>>>> attempt first when mapping user input to the octet string. It has 
>>>> the same semantics as the charset parameter from text/plain, except 
>>>> that it only applies to the user’s input of the password. There is 
>>>> no default value.
>>>
>>> Why does it say "This parameter specifies the charset that a 
>>> recipient SHOULD attempt *first*" here? Can't that encoding just be 
>>> specified as such?
>>>
>>> At least for future, similar efforts, it would be extremely 
>>> desirable to not leave character encoding open like this, but just 
>>> to nail it down to UTF-8.
>>
>> There seems to be something of a “cultural disconnect” between the 
>> security people and the I18N/UI/UX people.
>>
>> The I18N/UI/UX people want well-defined interfaces that work with 
>> users “in their own language”, whether that language is visual, 
>> aural, tactile, symbolic, pictorial, etc. Invariably this involves 
>> Unicode and a large character repertoire such as 💩 and 大便所.
>>
>> In contrast, the security people find open-ended things like Unicode 
>> to be anathema and would much rather restrict the range of inputs to 
>> a small and preferably uniformly distributed set of values. And there 
>> are good reasons for that, because when you introduce bias into 
>> cryptographic protocols, it turns out that it is a lot easier to 
>> cryptanalyze the results.
>>
>> The common security protocols that I have seen that take passwords, 
>> hand-wave about character sets and encodings and define the password 
>> to be an octet string. This is great for universality but bad for 
>> human input. PBKDF2 (PKCS #5, on which this PKCS #8 
>> EncryptedPrivateKeyInfo registration is based) is a leading example 
>> of the “octet string” approach. Ultimately, the algorithms don’t care 
>> what encoding it’s in, as long as they get a blob of bits (octets).
>>
>> My knowledge of implementations of PKCS #5/#8/#12 suggests that there 
>> are many applications out there that give zero thought to the 
>> encoding issue, which means that they will take user input “As-Is”, 
>> i.e., in the current code page.
>>
>> Note that PKCS #12 defines the input to this structure as a UTF-16LE 
>> encoded character string, *with* a terminating U+0000 NULL character 
>> (i.e., the octets 00 00). This is really “weird” except of course for 
>> the fact that Microsoft invented it and then shipped it without too 
>> much thought, in which case, all weirdness can be explained.
>>
>> It is a design criterion that if you extract such an 
>> EncryptedPrivateKeyInfo blob from a PKCS #12 file, that you should be 
>> able to process it. If you specify UTF-8 as the one, single, true 
>> encoding of the password for application/pkcs8-encrypted, that can’t 
>> happen.
>
> That's just fine, in this specific case. I have explicitly prefaced my 
> remark above with "At least in the future".
>
> But if we know that the password is encoded in UTF-16LE, then why 
> doesn't your registration just say "This parameter specifies the 
> charset" rather than the handwavy "This parameter specifies the 
> charset that a recipient SHOULD attempt *first*".

See below...

>
>
>> Furthermore, UTF-8 is not uniformly distributed across the octet 
>> range. If your users are in US-English they are highly likely to have 
>> octets in 20-7E. Octets in 00-1F will be pretty rare. And if you 
>> choose scalar values randomly in Unicode (regardless of assignment), 
>> you will see a *lot* of F0-F4 but virtually none in 00-7F. And in 
>> spite of all this, octets F5-FF will *never* appear in UTF-8.
>>
>> It turns out that we have a pretty good source of uniformity and 
>> universality: characters in the US-ASCII range 20-7E. Many password 
>> input boxes will only accept US-ASCII and so user’s non-US-English 
>> keyboards will switch to US-ASCII mode for the purpose of providing 
>> input to such boxes. What matters is not so much the specific 
>> characters, so much as a reasonable selection of arbitrary buttons 
>> that a user can push *across a wide range of devices*. This ends up 
>> giving you 5-6 bits of entropy per user input. So the need for UTF-8 
>> or any particular encoding is actually not as great as some people 
>> perceive.
>
> My comment was specifically trying to say: If you use something more 
> than US-ASCII, make it UTF-8. I think that's also the general policy 
> of the IETF. As for entropy, the entropy needs to be measured over the 
> whole string. It's clear that in UTF-8 bytes, a password in the ASCII 
> range is shorter than a similar-length (in terms of characters) 
> password in a non-Latin script. The entropy of each byte will be 
> lower, but the entropy of the overall password should be about the same.
>
> Something that's very important for passwords is how easy they are to 
> remember for actual people. It should be obvious that it's easier for 
> somebody to remember a password in the language/script they use every 
> day than in some foreign gibberish.

Actually, the most important criterion for passwords is that they are 
inputtable: capable of being input into a computing system. The 
passphrase "éclairs sont si délicieuX" is nice and memorable, but if you 
cannot put in "é" (or "X") in the computer systems where you're going to 
use the thing, you're screwed.

Viewed through this lens, many cryptographic systems are defined to 
accept secret input as octet strings, where neither the mathematical 
algorithms nor the protocols inherently restrict the range of the 
octets. Whether "é" maps to E9, C3 A9, E9 00, or 82 [code page 437] is 
relevant because if the user tries to input é but the math part doesn't 
get the original octets on the input, it won't work.
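
As a rough illustration of the four mappings just listed, here is a 
minimal Python sketch. It is not taken from any of the specifications 
discussed; the codec names are Python's own, and the names ENCODINGS 
and octets_for are made up for this example.

```python
# Octets produced for the single character "é" (U+00E9) under the
# four encodings mentioned above. Codec names are Python's.
ENCODINGS = {
    "latin-1": "ISO 8859-1",
    "utf-8": "UTF-8",
    "utf-16-le": "UTF-16LE",
    "cp437": "code page 437",
}

def octets_for(text):
    """Return {codec: space-separated uppercase hex of the encoded octets}."""
    return {codec: text.encode(codec).hex(" ").upper() for codec in ENCODINGS}

for codec, hexed in octets_for("\u00e9").items():
    print(f"{ENCODINGS[codec]:>14}: {hexed}")
```

The same keystroke yields four different octet strings, and a 
password-based key derivation would treat each as a different password.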

Scenarios where PKCS #8 blobs might be accessed at device boot-time are 
possible. In those scenarios, the character set/encoding of the system 
may well be something other than UTF-8. Regardless of the encoding, the 
boot-time input methods may be severely restricted. (There is a recent 
thread on the Unicode mailing list, related to passwords, that raises 
this issue.)

PKCS #8/PKCS #5 mention ASCII and UTF-8 as complementary 
possibilities--this registration does not attempt to disturb that.

The handwavy text, "This parameter specifies the charset that a 
recipient SHOULD attempt *first*", recognizes the mathematical reality. 
Suppose an Evildoer (pick your preferred one) encodes all of its private 
keys with the parameter charset=us-ascii. But the actual input has some 
character ñ that is clearly outside the range. When the Good Guys™ try 
to break the encryption, if they only attempt characters in us-ascii 
range, they are never going to decrypt the thing. Ditto for UTF-8 
encoding if the correct input (meaning the input that actually results 
in successfully decrypting the private key payload) contains FFh 
somehow, or some overlong encoding.

If this parameter were baked into the PKCS #8 format, perhaps a stronger 
claim could be made in the specification text. (Compare with PKCS #12.) 
But as it stands, the parameter is just a "hint"...and for a long time, 
implementations will probably not bother to put this parameter in.
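
A sketch of what "attempt first" could look like in a recipient, 
assuming Python and a hypothetical helper named candidate_octets. The 
fallback list is an arbitrary choice for illustration, not something 
the registration specifies.

```python
def candidate_octets(password, hinted_charset=None,
                     fallbacks=("utf-8", "latin-1", "cp437", "utf-16-le")):
    """Yield (charset, octets) pairs for `password`, hinted charset first.

    The hint only sets the order of attempts; it does not limit them.
    Encodings that produce octets already yielded are skipped.
    """
    seen = set()
    order = ([hinted_charset] if hinted_charset else []) + list(fallbacks)
    for charset in order:
        try:
            octets = password.encode(charset)
        except (UnicodeEncodeError, LookupError):
            continue  # not representable, or unknown charset label
        if octets not in seen:
            seen.add(octets)
            yield charset, octets

# A password with "ñ" cannot be encoded as us-ascii, so the hint is
# skipped and the fallbacks are tried instead.
for charset, octets in candidate_octets("se\u00f1al", "us-ascii"):
    print(charset, octets.hex(" "))
```

Each candidate would then be fed to the key derivation until one 
decrypts the payload or the list is exhausted.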

>
>> Overall I think that a standard such as IEEE 802.11 strikes a 
>> reasonable balance. (See 802.11-2012 Annex M.4, which is informative, 
>> but is pretty much the worldwide de-facto standard practice.) In 
>> 802.11, the input to PBKDF2 is between 8-63 ASCII-encoded characters 
>> in the range 20-7E, or 64 hexadecimal characters that convert 
>> directly to 32 octets.
>
> So it's up to 63 ASCII characters but only up to 32 octets that may 
> e.g. be used for UTF-8? That doesn't strike me as a reasonable 
> balance; it puts a much stronger length limitation on some scripts 
> outside ASCII.

Take a look at 802.11-2012 Annex M.4. Here is a direct link: 
<http://standards.ieee.org/about/get/802/802.11.html>.

When the input is 64 hexadecimal characters, PBKDF2 is bypassed: the 
resulting 32 octets are used directly as the RSNA pre-shared key (PSK). 
The advantage of defining the protocol that way is that a user interface 
(aka, password prompt) for Wi-Fi devices only needs to have one text box 
field: the implementation then auto-detects which path to use based on 
the length. Otherwise, all Wi-Fi devices on the planet would have to 
have user interfaces designed with additional dropdowns, checkboxes, 
radio buttons, etc. to pick between the (ASCII) password and direct 
input of the PSK. (It stands to reason that every password input for 
every SSID has a corresponding valid 64-hexadecimal-character PSK.)
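
The single-text-box behavior can be sketched as follows. This is an 
illustration in Python, not text from the standard: the function name 
wpa_psk is made up, and the PBKDF2 parameters (HMAC-SHA1, the SSID as 
salt, 4096 iterations, 32 octets of output) reflect my understanding of 
the passphrase mapping in the Annex.

```python
import hashlib
import string

def wpa_psk(user_input, ssid):
    """Derive the 32-octet PSK from one text field; `ssid` is bytes."""
    if len(user_input) == 64 and all(c in string.hexdigits for c in user_input):
        # 64 hexadecimal characters: used directly, PBKDF2 is bypassed.
        return bytes.fromhex(user_input)
    if not 8 <= len(user_input) <= 63:
        raise ValueError("passphrase must be 8 to 63 characters")
    if any(not 0x20 <= ord(ch) <= 0x7E for ch in user_input):
        raise ValueError("passphrase characters must be in the range 20-7E")
    # Passphrase path: PBKDF2-HMAC-SHA1, SSID as salt, 4096 iterations.
    return hashlib.pbkdf2_hmac("sha1", user_input.encode("ascii"), ssid, 4096, 32)

psk = wpa_psk("password", b"IEEE")
print(psk.hex())
# Feeding the 64 hex characters back in returns the same key directly.
assert wpa_psk(psk.hex(), b"IEEE") == psk
```

Note that the sketch simply rejects anything outside 20-7E, which is 
one possible behavior among several for the undefined case.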

802.11-2012 does not define an algorithm for cases when the passphrase 
has characters that encode outside of the range of 32 to 126. I do not 
know what other implementations do worldwide if non-ASCII characters are 
entered. As some have observed (on the Unicode mailing list), some 
operating systems, e.g., iOS, always enforce Latin script for password 
box entry.

>
>> ***
>> To answer your questions directly:
>>> Why does it say "This parameter specifies the charset that a 
>>> recipient SHOULD attempt *first*" here?
>>> Can't that encoding just be specified as such?
>>
>>
>> The parameter is not cryptographically protected so it is subject to 
>> tampering or substitution. Furthermore, a good-faith but naïve sender 
>> may put some encoding (e.g., UTF-8) but not have the means to verify 
>> that the encoding actually works, because the user did not supply the 
>> password. Basically it’s a good-faith first effort, but this 
>> parameter can’t meaningfully restrict what the sender or receiver 
>> attempt to do.
>
> That essentially applies to any single parameter in any single media 
> type registration, and in much more of what the IETF does. Yet this is 
> virtually never called out, because otherwise, IETF documents would be 
> full of such stuff and very hard to read.

Perhaps. First of all, such warnings about "not being cryptographically 
protected" tend to show up in many Security Considerations sections, of 
which I have seen many in the IETF. Second of all, the content inside 
PKCS #8 EncryptedPrivateKeyInfo *is* cryptographically protected (to a 
certain extent, depends on the algorithms being used), but an 
implementation really needs to be vigilant not to treat the stuff 
outside, i.e., the media type parameters, as the same as on the inside.

>
>> Also, I am not sure how to specify the NULL suffix in the PKCS 
>> #12-extracted case.
>
> That may suggest that you are going down the wrong path here.
>
>> I suppose it could just be “+0” or something.
>>
>>>
>>>> ualg: When the charset is a Unicode-based encoding, this parameter 
>>>> is a space-delimited list of Unicode algorithms that a recipient 
>>>> SHOULD first attempt to apply to the Unicode user input in 
>>>> succession, in order to derive the octet string. The list of 
>>>> algorithm keywords is defined by [UNICODE]. “Tailored operations” 
>>>> are operations that are sensitive to language, which must be 
>>>> provided as an input parameter. If a tailored operation is called 
>>>> for, the exclamation mark followed by the [BCP47] language tag 
>>>> specifies the language. For example, "toNFD toNFKC_Casefold!tr" 
>>>> first applies Normalization Form D, followed by Normalization Form 
>>>> KC with Case Folding in the Turkish language, according to 
>>>> [UNICODE] and [UAX31]. The default value of this parameter is 
>>>> empty, and leaves the matter of whether to normalize, case fold, or 
>>>> apply other transformations unspecified.
>>>
>>> "When the charset is": Is this the charset parameter, or the actual 
>>> encoding of the password?
>>
>> Admittedly this was vague. First draft. I am not sure what it should 
>> be. Per PKCS #5, the "Actual Encoding" is just an octet string of 
>> arbitrary length.
>>
>> I would limit this to cases when the charset parameter is present and 
>> defined. Makes it easier.
>>
>>>
>>> What is a "Unicode algorithm”?
>>
>> Conformance Clause D17.
>
> Well, this, via the term "Named Unicode Algorithm" points to table 3.1 
> (page 93 in Unicode V 8.0).
>
>
>>> Reading on and looking at the examples, the intent becomes clearer, 
>>> at least to somebody who has seen things such as toNFD and toNFKC 
>>> and Casefold, but I hope we can avoid "specification by example" here.
>>
>> In fairness, “toNFD” and “toNFKC” are not defined terms. However, NFD 
>> (D118) and NFKC (D121) are.
>
> Yes, but not as (Named) Unicode Algorithms.
>
>> I would rather not create Yet Another Registry of things.
>
> I'd agree in principle.
>
>> The terms are in fact defined in [UNICODE] in the conformance clauses.
>
> Yes, but there are many other things defined there, too.

Yes...
If you meant to refer to [UNICODE] Table 3.1 (page 93, Unicode V8.0) as 
a canonical list of names...I do not find that satisfactory because the 
only entries for normalization are "Normalization" and "Identifier 
Normalization", but the most important kind of transformation is to pick 
which Normalization Form you want: NFC, NFD, NFKC, or NFKD. (I am under 
the impression that NFKC is the "Best" for passwords, but the PRECIS 
profile for passwords uses NFC. Not sure exactly why but I will leave 
that one alone.)
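
To show why the choice of Normalization Form matters for the octet 
string, here is a small Python sketch using the standard unicodedata 
module. The sample input and the helper name normalized_octets are 
made up for illustration.

```python
import unicodedata

def normalized_octets(text, form):
    """Normalize `text` to NFC, NFD, NFKC, or NFKD, then encode as UTF-8."""
    return unicodedata.normalize(form, text).encode("utf-8")

# Precomposed "é" (U+00E9) followed by the "fi" ligature (U+FB01).
password = "\u00e9\ufb01"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(f"{form:>4}: {normalized_octets(password, form).hex(' ')}")
```

The four forms give four different octet strings for the same visible 
input: the canonical forms differ on "é", and the compatibility forms 
additionally replace the ligature with the two letters "f" and "i".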

Really, algorithms to do character transformations, prohibitions, 
substitutions, etc. in the Internet context, would seem to fall under 
PRECIS. Therefore maybe the proper pre-existing "registry" of things is 
just PRECIS profiles.

Developing from my last PRECIS e-mail on the topic, I am leaning towards 
a single parameter:
pw-mapping

with special values:
*pkcs12  = UTF-16LE with U+0000 NULL terminator
*precis    = PRECIS password profile, i.e., OpaqueString from Section 4 
of RFC 7613 (always UTF-8)
*precis-XXX     = PRECIS profile named by XXX
*hex        = hexadecimal input: as with 802.11, the input is mapped to 
0-9, A-F, and then converted directly to octets. If there are an odd 
number of hex digits, the final digit 0 is appended, or an error 
condition may be raised.
*dtmf      = The characters "0"-"9", "A"-"D", "*", and "#", which map to 
their corresponding ASCII codes. (This is to support restricted-input 
devices, i.e., telephones and telephone-like equipment.)

Otherwise, pw-mapping is a charset.
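
A partial sketch of how a recipient might dispatch on this parameter, 
in Python. Everything here is hypothetical: the function name 
map_password, the choice to accept lowercase hex digits by folding 
them to uppercase, and the choice to pad odd-length hex input rather 
than raise an error. The *pkcs12 and *precis values are left 
unimplemented.

```python
DTMF_CHARS = set("0123456789ABCD*#")
HEX_CHARS = set("0123456789ABCDEF")

def map_password(user_input, pw_mapping):
    """Map user input to the password octet string for a pw-mapping value."""
    if pw_mapping == "*hex":
        digits = user_input.upper()  # assumption: fold a-f to A-F
        if any(ch not in HEX_CHARS for ch in digits):
            raise ValueError("non-hexadecimal character in input")
        if len(digits) % 2:
            digits += "0"  # odd length: append a final 0 (could raise instead)
        return bytes.fromhex(digits)
    if pw_mapping == "*dtmf":
        if any(ch not in DTMF_CHARS for ch in user_input):
            raise ValueError("character not available on a DTMF keypad")
        return user_input.encode("ascii")
    if pw_mapping.startswith("*"):
        # *pkcs12, *precis, and *precis-XXX are not covered by this sketch.
        raise NotImplementedError(pw_mapping)
    return user_input.encode(pw_mapping)  # otherwise, a charset label

print(map_password("abc", "*hex").hex())     # odd length, padded with 0
print(map_password("12#", "*dtmf"))
print(map_password("\u00e9", "utf-8").hex())
```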

The parameter previously called "ualg" is interesting but maybe it's a 
bit over-specified. If *precis-XXX can do the job, I am fine with 
specifying "pw-mapping" only, and removing "ualg".

Personally I don't see why PRECIS prohibits control characters like HT 
in passwords. It seems to me that those sorts of characters are legit 
for password purposes. But that is just my personal view and I don't 
care enough about HT, BEL, ENQ, or similar in this context to fight 
about it. :)

>
>> My usability perception is that if people really want to use Unicode 
>> in their passwords, canonicalization is a very useful property to 
>> preserve. Case folding/case mapping are not so useful, as most 
>> systems like to have case-sensitive passwords for greater entropy, 
>> but “most systems” is not “all systems” so we shouldn’t preclude the 
>> use of case algorithms. As for other algorithms such as line 
>> breaking, character segmentation, Hangul syllable name generation, 
>> etc., the short answer is “I don’t know”. (These are all reasons why 
>> people stick with ASCII passwords, by the way.)
>
> Line breaking, character segmentation, Hangul syllable name 
> generation,... are completely irrelevant for passwords and passphrases.
>
> Also, many algorithms come with options or parameters.
>
>
>>> Also, if there is indeed a list of algorithm identifiers in 
>>> [UNICODE], then it would be good to give a Section number. Is the 
>>> intent that each and every algorithm named somewhere in [UNICODE] is 
>>> implemented? My rough guess would be that the average password input 
>>> implementation implements only the identity transform. [I would of 
>>> course be positively surprised if I were wrong.]
>>
>> See above; main thing that worries me is Normalization Forms.
>>
>>>
>>> Also, references for [UNICODE], [BCP47], and [UAX31] should be given 
>>> so that this registration is self-contained.
>>
>> Ok.
>>
>> Another possibility is that this registration goes back to “rev 1”, 
>> i.e., no optional parameters about the character encoding at all. I 
>> think that is perfectly defensible. But it is not particularly 
>> i18n-friendly.
>
> I'm not sufficiently familiar with the format and the actual use 
> cases, but my suggestion would be to check what's actually out there 
> in the field (such as the Microsoft UTF-16LE including final NULL), 
> and select or create a list of parameters/algorithms (with a registry 
> if it turns out to be needed). To that, add a way to reference PRECIS, 
> even if it's not currently used, because that includes the 
> expertise/recommendations of experts.
>
> The current proposal is essentially just saying: Unicode may define some 
> of the pieces you may want to use here, and may have labels for them, 
> so just give it a try. I'm not at all sure this will help 
> interoperability, except by similar accidents like the Microsoft one 
> that you described above.

Right. Well, maybe PRECIS to the rescue? David Wheeler's aphorism is 
that all problems in computer science can be solved by another level of 
indirection. If PRECIS can handle this stuff, that's good enough for me.

Sean

>
> Regards,    Martin.
>
>> Regards,
>>
>> Sean
>>
>>>
>>> Regards,   Martin.
>>>
>>>> Encoding considerations: binary
>>>>
>>>> Security considerations:
>>>> Carries a cryptographic private key. See Section 6 of RFC 5958.
>>>> EncryptedPrivateKeyInfo PKCS #8 data contains exactly one private 
>>>> key. Poor password choices, weak algorithms, or improper parameter 
>>>> selections (e.g., insufficient salting rounds) will make the 
>>>> confidential payloads much easier to compromise.
>>>>
>>>> Interoperability considerations:
>>>> PKCS #8 is a widely recognized format for private key information 
>>>> on all modern cryptographic stacks. The encrypted variation in this 
>>>> registration, EncryptedPrivateKeyInfo (Section 3, Encrypted Private 
>>>> Key Info, of RFC 5958, and Section 6 of PKCS #8), is less widely 
>>>> used for exchange than PKCS #12, but it is much simpler to 
>>>> implement. The contents are exactly one private key (with optional 
>>>> attributes), so the possibility for hidden "easter eggs" in the 
>>>> payload such as unexpected certificates or miscellaneous secrets is 
>>>> drastically reduced.
>>>>
>>>> Published specification:
>>>> PKCS #8 v1.2, November 1993 (republished as RFC 5208, May 2008); 
>>>> RFC 5958, August 2010
>>>>
>>>> Applications that use this media type:
>>>> Machines, applications, browsers, Internet kiosks, and so on, that 
>>>> support this standard allow a user to import, export, and exercise 
>>>> a single private key.
>>>>
>>>> Fragment identifier considerations: N/A
>>>>
>>>> Additional information:
>>>>
>>>> Deprecated alias names for this type: N/A
>>>> Magic number(s): None.
>>>> File extension(s): .p8e
>>>> Macintosh file type code(s): N/A
>>>>
>>>> Person & email address to contact for further information:
>>>> Sean Leonard <dev+ietf&seantek.com>
>>>>
>>>> Intended usage: COMMON
>>>>
>>>> Restrictions on usage: None.
>>>>
>>>> Author:
>>>> RSA, EMC, IETF
>>>>
>>>> Change controller: The IETF
>>>>
>>>> Provisional registration? (standards tree only): No
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> media-types mailing list
>>>> media-types@ietf.org
>>>> https://www.ietf.org/mailman/listinfo/media-types
>>>>
>>
