Re: Gen-ART LC review of draft-ietf-eai-utf8headers-09.txt
John C Klensin <john-ietf@jck.com> Sun, 23 March 2008 23:38 UTC
Date: Sun, 23 Mar 2008 19:35:43 -0400
From: John C Klensin <john-ietf@jck.com>
To: Harald Tveit Alvestrand <harald@alvestrand.no>, Spencer Dawkins <spencer@wonderhamster.org>
Subject: Re: Gen-ART LC review of draft-ietf-eai-utf8headers-09.txt
Message-ID: <67F9E9613733D4D739377E7D@p3.JCK.COM>
In-Reply-To: <47E6B2EF.50109@alvestrand.no>
References: <001001c88c18$a841afc0$6501a8c0@china.huawei.com> <47E6B2EF.50109@alvestrand.no>
Cc: General Area Review Team <gen-art@ietf.org>, ietf@ietf.org, "Abel Yang \\(editor\\)" <abelyang@twnic.net.tw>

One addition to Harald's comments...

--On Sunday, 23 March, 2008 20:43 +0100 Harald Tveit Alvestrand
<harald@alvestrand.no> wrote:

>>    Because internationalized local parts may cause email
>>    addresses to be longer, processes which parse, store, or
>>    handle email addresses or local parts must take extra care
>>    not to overflow buffers, truncate addresses, exceed storage
>>    allotments, or, when comparing, fail to use the entire
>>    length.
>>
>> technical: this is great advice, but I don't understand how
>> UTF-8 changes the situation. If you aren't changing the
>> 998-octet requirement, software that breaks for UTF-8 would
>> also break for ASCII headers with the same octet
>> length.
> If someone uses another representation internally (for
> instance UTF-16), and has a 998-character buffer, that will
> sometimes fit into 998 octets of UTF-8, and sometimes not.
> The same goes in the other direction.... I'm sure others will
> think of other cases.

Spencer,

I'm a little confused by your even asking the question, so let
me try for a slightly different answer in case you were asking a
different question.

Two of the advantages we have with ASCII (and the
closely-related ISO 8859 code character sets) are that every
character is the same length as every other character and that
every character is exactly one octet. As a consequence of that
relationship, we have clutter in many places in the RFC space,
and probably in implementations, in which "character" and
"octet" are used interchangeably when referring to lengths. I
note that you carefully, and correctly, said "same octet length"
above and not the "same length in characters". But RFC 821
talks about lengths in characters and, to my astonishment and
shame, so does section 4.5.3.1 of rfc2821bis (I've just flagged
that to the relevant ADs and will try to get it fixed before the
thing is published). But that is the definitional problem, and
perhaps the new risk, in a nutshell.
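
The case Harald describes above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
helper names (utf8_octets, utf16_units, fits_in_limit) and the sample
strings are hypothetical and do not come from the draft or from any
implementation. It shows two strings that each occupy 998 UTF-16 code
units, only one of which fits in 998 octets of UTF-8.

```python
# Sketch only: the helper names and the sample strings are hypothetical.
LINE_LIMIT_OCTETS = 998  # the 998-octet limit mentioned in the quoted review


def utf8_octets(text: str) -> int:
    """Number of octets the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    # "utf-16-le" is used so that no byte-order mark is prepended.
    return len(text.encode("utf-16-le")) // 2


def fits_in_limit(text: str, limit: int = LINE_LIMIT_OCTETS) -> bool:
    """True if the UTF-8 form of the text fits in `limit` octets."""
    return utf8_octets(text) <= limit


ascii_text = "a" * 998     # 998 ASCII characters, one octet each in UTF-8
cjk_text = "\u7528" * 998  # 998 copies of U+7528, three octets each in UTF-8

for label, text in (("ascii", ascii_text), ("cjk", cjk_text)):
    # Both strings need 998 UTF-16 code units; only the first fits the limit.
    print(label, utf16_units(text), utf8_octets(text), fits_in_limit(text))
```

The point of the sketch is that a length check has to be made in the
unit the limit is stated in (here, UTF-8 octets), not in whatever unit
the internal representation happens to count.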

Now, if one goes to UTF-32, the characters are all the same
length, but four octets instead of one. An implementation that
counts characters, but allocates buffers in octets (assuming
that they are the same thing) is obviously headed for trouble,
but computing the length from the character count or vice versa
is pretty straightforward. UTF-8 (and technically UTF-16) break
both of those original assumptions. The characters may be more
than one octet long and one cannot compute the number of octets
from the number of characters (UTF-8 is aggressively
variable-length; UTF-16 occupies either two or four octets per
character depending on whether the character has a high enough
code point that surrogate pairs are needed).

>...
>> 9.2. Informative References
>>
>>
>>    [Hoffman-utf8-headers]
>>               Hoffman, P., "SMTP Service Extensions or
>>               Transmission of Headers in UTF-8 Encoding",
>>               draft-hoffman-utf8headers-00.txt (work in
>>               progress), December 2003.
>>
>> Technical: I know this is how we refer to Internet Drafts,
>> but "2003" isn't
>> "work in progress". You might s/work in progress/expired
>> Internet Draft/, or
>> (probably better) simply move the rest of the full citation
>> to the Acknowledgements section - it didn't seem like you
>> really expected anyone to
>> actually refer to this reference, anyway :-)
> It's a part of the history, and we can probably safely lose it.

It is referenced, and its historical role mentioned, in RFC
4952, so can almost certainly be dropped from utf8headers.

On the more general subject, I've tried raising the issue of
these documents that are referenced for historical reasons and
hence, IMO, should not say "work in progress" and should include
the exact file name so that people can find them if interested.
I've gotten nowhere, so it is someone else's turn. What is
really needed, I think, is a policy on these sorts of things,
corresponding modifications to tools like xml2rfc, etc.
I don't think hiding the references in inline text is the right
answer, but that is just my opinion.

best,
   john

_______________________________________________
IETF mailing list
IETF@ietf.org
https://www.ietf.org/mailman/listinfo/ietf