Re: [rfc-i] Unicode in ABNF (in RFC) draft-seantek-unicode-in-abnf-01.txt

Sean Leonard <dev+ietf@seantek.com> Tue, 04 October 2016 18:06 UTC

To: Martin J. Dürst <duerst@it.aoyama.ac.jp>, "abnf-discuss@ietf.org" <abnf-discuss@ietf.org>
From: Sean Leonard <dev+ietf@seantek.com>
Date: Tue, 4 Oct 2016 11:08:33 -0700
Cc: Chris Newman <chris.newman@oracle.com>, "rfc-interest@rfc-editor.org" <rfc-interest@rfc-editor.org>

On 10/3/2016 2:53 AM, Martin J. Dürst wrote:
> I don't see the need to use %su for Unicode strings. The code points 
> speak for themselves, just use %s. Leaving %i/%iu undefined for 
> Unicode is indeed advisable, although it could be based on default 
> case folding, but we know that this would be imperfect, in particular 
> for Turkish.

I like %su because it notifies the reader, and a parser, to expect UTF-8 
and "deal with it" in a way that %s alone doesn't. For example, accented 
e can be é (U+00E9) or é (U+0065 U+0301). When printed or in a medium 
that doesn't provide direct access to the encoded data (screenshot? 
mobile app? etc.), the quoted string is ambiguous. Saying %s"foo" means 
you know that foo is always in the ASCII range, and can't possibly be 
composed of anything else (including, for example, FULLWIDTH ASCII in 
U+FF01-U+FF5E). Are %s"foo" and %s"ｆｏｏ" the same? How about %s"·˙•․‥…‧"? 
%s"µ" and %s"μ"? And the bajillion different dashes? Then there is the 
issue that even if a code point is objectively, graphically distinct in 
this version of Unicode, some future version may assign a code point to 
a character that commonly looks exactly the same as an existing character.
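
The ambiguity described above can be checked mechanically. The following is a minimal sketch using Python's standard unicodedata module; the helper names code_points and is_ascii are illustrative only and are not part of the draft or of ABNF.

```python
import unicodedata

# Two spellings of "accented e" that usually render identically.
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301

def code_points(s):
    """Spell out a string as U+XXXX values, which a rendered glyph cannot show."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

def is_ascii(s):
    """True if every code point is in the ASCII range."""
    return all(ord(c) < 0x80 for c in s)

# The raw strings differ, but they agree after NFC normalization.
same_raw = precomposed == decomposed
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

# FULLWIDTH "foo" is not ASCII, yet NFKC maps it onto ASCII "foo".
fullwidth_foo = "\uff46\uff4f\uff4f"
folds_to_ascii = unicodedata.normalize("NFKC", fullwidth_foo) == "foo"

# MICRO SIGN (U+00B5) and GREEK SMALL LETTER MU (U+03BC) are distinct
# code points; NFKC maps the former onto the latter.
micro, mu = "\u00b5", "\u03bc"
micro_nfkc_is_mu = unicodedata.normalize("NFKC", micro) == mu

print(code_points(precomposed), "vs", code_points(decomposed))
print(same_raw, same_nfc, folds_to_ascii, micro_nfkc_is_mu)
```

Whether a parser should normalize at all, and with which form, is exactly the kind of decision that a distinct marker such as %su would force a specification to state.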

Responding to your point, defining %su"" would mean %s"" is undefined 
for Unicode, which avoids the temptation of %i"" (or nothing "" aka the 
traditional approach). Perhaps if a need develops for a case-insensitive 
version in a protocol, %iu could take a parameter that indicates the 
language tailoring, such as %iu[tr]"çilek". (But, I suppose, one could 
make a converse argument that %i[tr]"çilek" would be a natural evolution.)
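
To illustrate why a language parameter might matter, here is a rough sketch of case-insensitive matching with and without a Turkish tailoring. It assumes Python's built-in str.casefold() as a stand-in for default case folding; the TAILORINGS table and the fold and matches functions are hypothetical and only cover the dotted/dotless i pair, so this is not a definition of what %iu[tr] would mean.

```python
# Hypothetical tailoring table, keyed by language tag. The "tr" entry
# covers only the Turkish dotted/dotless i convention:
#   I (U+0049) lowercases to dotless i (U+0131)
#   I-with-dot (U+0130) lowercases to plain i (U+0069)
TAILORINGS = {
    "tr": {"I": "\u0131", "\u0130": "i"},
}

def fold(s, lang=None):
    """Apply the language-specific mappings first, then default case folding."""
    table = TAILORINGS.get(lang, {})
    pre = "".join(table.get(c, c) for c in s)
    return pre.casefold()

def matches(literal, candidate, lang=None):
    """Case-insensitive comparison of a grammar literal against input text."""
    return fold(literal, lang) == fold(candidate, lang)

upper_cilek = "\u00c7\u0130LEK"   # uppercase Turkish spelling of "çilek"

# Default folding turns U+0130 into "i" + U+0307, so the match fails.
default_match = matches("\u00e7ilek", upper_cilek)

# With the Turkish tailoring the two spellings compare equal.
turkish_match = matches("\u00e7ilek", upper_cilek, lang="tr")

print(default_match, turkish_match)
```

The same input therefore matches or fails to match depending on the tailoring, which is the imperfection of default case folding noted in the quoted text.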

Those are a couple of arguments. I am happy to go with whatever (rough) 
consensus emerges, however.

Regards,

Sean

