Re: [EAI] draft-klensin-encoded-word-type-u-00

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Fri, 25 November 2011 05:52 UTC

Message-ID: <4ECF2D27.7060906@it.aoyama.ac.jp>
Date: Fri, 25 Nov 2011 14:52:39 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>
References: <79084A029BB2424F3BBF8A65@PST.JCK.COM>
In-Reply-To: <79084A029BB2424F3BBF8A65@PST.JCK.COM>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: ima@ietf.org
Subject: Re: [EAI] draft-klensin-encoded-word-type-u-00

Hello John,

On 2011/11/25 6:58, John C Klensin wrote:
> Hi.
>
> One of John Levine's comments about the mailing list document and
> some issues with pop-imap-downgrade have convinced me that there
> is a problem with the use of encoded words in strings that users
> are expected to see and work with that we might actually know
> how to fix.
>
> Historically, encoded words have been used in contexts where we
> expected MUAs to turn them back into their original (native
> character) forms, with display of those encoded forms to users
> being a (hopefully infrequent) last resort.   That decoding is
> less likely in the pop-imap-downgrade and mailing list
> situations: in both, if the relevant clients could handle native
> UTF-8 addresses, the encoded word forms would not be necessary.

My understanding is that, in this day and age, most (if not all) 
clients can decode "Q" and "B". There may be some restrictions: in 
Japan, for example, there are still clients that can decode 
iso-2022-jp (and these days also UTF-8), but not many other encodings. 
Some of them may use a Japanese encoding (Shift_JIS or EUC-JP) 
internally, even if they accept UTF-8 (with subsetting) externally.

There are as yet no clients that can handle raw UTF-8 addresses, and 
of course there are no clients that can handle encoded-word-type-u. 
What we want is to get them to handle raw UTF-8, so I really don't 
understand where encoded-word-type-u would be helpful.

> The problem with encoded words in those contexts is that
> encoding form "Q" is pretty useless except for mostly-ASCII text
> (and of somewhat dubious value even then), and encoding form "B"
> pretty much requires a computer to decode.

Yes indeed.
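(As an aside, the "computer to decode" part is cheap these days. A minimal sketch of what an MUA does with "B" and "Q" encoded-words, using Python's standard library; the sample encoded-words below are my own illustrations, not strings from this thread:

```python
# Decode RFC 2047 encoded-words ("B" = base64, "Q" = quoted-printable-like)
# the way an MUA would. Sample inputs are illustrative.
from email.header import decode_header

for ew in ("=?UTF-8?B?RMO8cnN0?=", "=?UTF-8?Q?D=C3=BCrst?="):
    raw, charset = decode_header(ew)[0]  # -> (raw bytes, charset label)
    print(raw.decode(charset))           # both print "Dürst"
```

The point stands, of course, that no human can do the "B" form in their head.)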

> %-encoded UTF-8 octets are
> even worse: one needs a computer or a subtle calculation to turn
> them into a Unicode code point reference which then must be
> looked up in a table and one actually has to understand how
> UTF-8 works to be able to tell whether %NN%MM%OO is one
> character, two characters, or three characters.

Well, I might be slightly biased, but I'd choose %-encoded UTF-8 over 
base64-encoded UTF-8 every single time.

For base64, I have:
- Convert character to bit pattern (use a table)
- Collect bit patterns into bytes
- Decode bytes according to UTF-8
- Look up characters in Unicode code table

For %-encoding, I have:
- Convert from hex to bits (trivial)
- Decode bytes according to UTF-8
- Look up characters in Unicode code table

So I don't see what would be worse about %-encoded UTF-8 octets 
compared with base64. Actually, for %-encoding, I can use 
http://rishida.net/tools/conversion/ (fourth input field) and get 
everything done. Of course, it would be possible to do the same for 
base64, too, if somebody took the time to put together a page.
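The two pipelines above can be sketched in a few lines of Python; the sample values ("Dürst", "あ") are my own illustrations. Both routes end in the same "decode bytes as UTF-8" step; %-encoding just skips the bit-pattern bookkeeping:

```python
# The two manual decoding pipelines, mechanized. Sample values are illustrative.
import base64
import urllib.parse

# base64: characters -> bit patterns -> bytes -> UTF-8
b64_bytes = base64.b64decode("RMO8cnN0")
assert b64_bytes.decode("utf-8") == "Dürst"

# %-encoding: hex pairs -> bytes -> UTF-8 (one fewer non-trivial step)
pct_bytes = urllib.parse.unquote_to_bytes("D%C3%BCrst")
assert pct_bytes.decode("utf-8") == "Dürst"

# John's point about octet counts: %E3%81%82 is three octets but one character.
assert urllib.parse.unquote_to_bytes("%E3%81%82").decode("utf-8") == "あ"
```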

> One possible solution is to mix the encoded word strategy with
> direct encoding of Unicode characters by code point.  I've just
> posted draft-klensin-encoded-word-type-u-00 as a strawman for
> doing just that.  It is not a WG document.  It is not even a
> serious proposal at this stage.  But it provides a relatively
> concrete proposal about "something else" that we might do to get
> at least slightly dug out of the present hole that some of you
> might consider worth thinking about.
>
> I have deliberately not added normalization considerations to
> that draft, but it could easily be done and I suspect it would
> be necessary if the proposal was to be as useful as it might be.

People don't want to look numbers up, whether it be on 
http://rishida.net/tools/conversion/ or on 
http://www.unicode.org/charts/ or wherever. The nerds on this list, me 
included, are the exception, not the rule. For 99.9% or so of users, 
things either work or they are broken. For the rest, things such as 
base64 or %-encoding are not an issue. If anything needs to be done, 
putting up a web site that automatically converts between (currently 
defined) encoded word syntax and actual text is way more useful than 
writing a draft for yet another convention. One big advantage of the 
former (and http://rishida.net/tools/conversion/) is that conversion is 
for a whole string, rather than character by character, as any human 
lookup would be.

So, let humans do what they are good at, and keep the code conversion 
for computers.

Regards,    Martin.