Re: Gen-ART LC review of draft-ietf-eai-utf8headers-09.txt

One addition to Harald's comments...

--On Sunday, 23 March, 2008 20:43 +0100 Harald Tveit Alvestrand
<harald@alvestrand.no> wrote:

>>   Because internationalized local parts may cause email
>>   addresses to be longer, processes which parse, store, or
>>   handle email addresses or local parts must take extra care
>>   not to overflow buffers, truncate addresses, exceed storage
>>   allotments, or, when comparing, fail to use the entire
>>   length.
>> 
>> technical: this is great advice, but I don't understand how
>> UTF-8 changes the situation. If you aren't changing the
>> 998-octet requirement, software that breaks for UTF-8 would
>> also break for ASCII headers with the same octet length.

> If someone uses another representation internally (for
> instance UTF-16), and has a 998-character buffer, that will
> sometimes fit into 998 octets of UTF-8, and sometimes not.
> The same goes in the other direction....  I'm sure others will
> think of other cases.

Spencer, I'm a little confused by your even asking the question,
so let me try for a slightly different answer in case you were
asking a different question.   Two of the advantages we have
with ASCII (and the closely-related ISO 8859 coded character
sets) are that every character is the same length as every other
character and that every character is exactly one octet.    As a
consequence of that relationship, we have clutter in many places
in the RFC space, and probably in implementations, in which
"character" and "octet" are used interchangeably when referring
to lengths. 

I note that you carefully, and correctly, said "same octet
length" above and not the "same length in characters".   But RFC
821 talks about lengths in characters and, to my astonishment
and shame, so does section 4.5.3.1 of rfc2821bis (I've just
flagged that to the relevant ADs and will try to get it fixed
before the thing is published).   But that is the definitional
problem, and perhaps the new risk, in a nutshell.

Now, if one goes to UTF-32, the characters are all the same
length, but four octets instead of one.  An implementation that
counts characters, but allocates buffers in octets (assuming
that they are the same thing) is obviously headed for trouble,
but computing the length from the character count or vice versa
is pretty straightforward.
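
To make the fixed-width case concrete, here is a minimal sketch in
Python.  The helper name fixed_width_octets and the sample local part
are made up for this note; "utf-32-be" is used so that no byte order
mark is added to the count.

```python
def fixed_width_octets(char_count: int, octets_per_char: int) -> int:
    """Octet length for an encoding where every character has the same width."""
    return char_count * octets_per_char


# Hypothetical sample local part, chosen only for illustration.
local_part = "postmaster"
char_count = len(local_part)  # Python counts code points here

# ASCII: one octet per character.
ascii_octets = len(local_part.encode("ascii"))
# UTF-32 (big-endian, no byte order mark): four octets per character.
utf32_octets = len(local_part.encode("utf-32-be"))

print(char_count, ascii_octets, utf32_octets)
print(ascii_octets == fixed_width_octets(char_count, 1))
print(utf32_octets == fixed_width_octets(char_count, 4))
```

A buffer sized from the character count alone is wrong by a factor of
four for UTF-32, but the correction is a single multiplication.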

UTF-8 (and technically UTF-16) break both of those original
assumptions.  The characters may be more than one octet long and
one cannot compute the number of octets from the number of
characters (UTF-8 is aggressively variable-length; UTF-16
occupies either two or four octets per character depending on
whether the character has a high enough code point that
surrogate pairs are needed).
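
A rough sketch of that point, again in Python.  The function name
encoded_lengths and the sample strings are invented for illustration,
and "character" here means a Unicode code point, which is itself a
simplification.  All four samples have the same character count but
different octet counts.

```python
def encoded_lengths(text: str) -> dict:
    """Return the code point count and the UTF-8 / UTF-16 octet counts."""
    return {
        "characters": len(text),
        "utf8_octets": len(text.encode("utf-8")),
        # Big-endian form, so no byte order mark is included in the count.
        "utf16_octets": len(text.encode("utf-16-be")),
    }


# Hypothetical four-character local parts, for illustration only.
samples = [
    "user",                      # ASCII only
    "\u00fcs\u00e9r",            # two Latin-1 letters with diacritics
    "\u7528\u6237\u540d\u5b57",  # four CJK characters in the BMP
    "\U00010348" * 4,            # four characters outside the BMP
]

for sample in samples:
    print(encoded_lengths(sample))
```

The UTF-8 counts come out as 4, 6, 12 and 16 octets, and the UTF-16
counts as 8, 8, 8 and 16, so neither can be derived from the character
count of 4 without looking at the characters themselves.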

>...
>> 9.2.  Informative References
>> 
>> 
>>   [Hoffman-utf8-headers]
>>              Hoffman, P., "SMTP Service Extensions or
>>              Transmission of Headers in UTF-8 Encoding",
>>              draft-hoffman-utf8headers-00.txt (work in
>>              progress), December 2003.
>> 
>> Technical: I know this is how we refer to Internet Drafts,
>> but "2003" isn't "work in progress". You might s/work in
>> progress/expired Internet Draft/, or (probably better) simply
>> move the rest of the full citation to the Acknowledgements
>> section - it didn't seem like you really expected anyone to
>> actually refer to this reference, anyway :-)

> It's a part of the history, and we can probably safely lose it.

It is referenced, and its historical role mentioned, in RFC
4952, so it can almost certainly be dropped from utf8headers.

On the more general subject, I've tried raising the issue of
these documents that are referenced for historical reasons and
hence, IMO, should not say "work in progress" and should include
the exact file name so that people can find them if interested.
I've gotten nowhere, so it is someone else's turn.   What is
really needed, I think, is a policy on these sorts of things,
with corresponding modifications to tools like xml2rfc, etc.  I
don't think hiding the references in inline text is the right
answer, but that is just my opinion.

best,
    john
