Re: [Idna-update] Genart telechat review of draft-faltstrom-unicode11-08

Martin J. Dürst <duerst@it.aoyama.ac.jp> Tue, 19 March 2019 10:19 UTC

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
To: Asmus Freytag <asmusf@ix.netcom.com>, "idna-update@ietf.org" <idna-update@ietf.org>
Date: Tue, 19 Mar 2019 10:19:34 +0000
Message-ID: <3a7bd491-ab06-b8c7-9a8b-862c7a3cd122@it.aoyama.ac.jp>
References: <155289429627.26188.2047331005281292889@ietfa.amsl.com> <458987D953A5B3227D3A791F@PSB> <EA2B2A09-152C-4AF3-B0C8-0D352CCA6647@netnod.se> <6b149a8d-9102-ea1a-5048-b83842fc66c0@ix.netcom.com>
In-Reply-To: <6b149a8d-9102-ea1a-5048-b83842fc66c0@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/E5Pq7ahk0Z1s92FTDpqoBBIGtts>

Hello Asmus, Patrik,

On 2019/03/19 16:55, Asmus Freytag wrote:
> On 3/18/2019 11:47 PM, Patrik Fältström wrote:
>> 6. The fact it was an issue was of course also brought up to the 
>> Unicode Consortium as I am (handy enough) also the liaison from IETF 
>> to Unicode Consortium. No response. They do push, as we know, TR#46 
>> which is a different kind of animal than IDNA2008. Oh well. Things 
>> were fine, and we moved on. And IETF could focus on what IETF really 
>> should focus on, to ensure the algorithm itself was still "ok". I.e. 
>> without looking at individual code points, but more in general terms, 
>> was IDNA2008 still good enough? Lots of good feedback from people like 
>> Asmus that was fighting like mad (and still is) trying to a. convince 
>> registries they MUST come up with a subset of IDNA2008 permissible 
>> code points when they decide what can be used in their zones and b. 
>> come up with processes and rules for how that work should be done. 
>> Specifically work did go on for a long time regarding what code points 
>> can be used in the root zone, i.e. for TLDs. This is managed in the 
>> IAB document, and the Klensin draft about and more.
>>
>> 6. Now in version 11.0.0, the non-backward-compatible change happened 
>> again. Sigh.
> 
> I think it's worth injecting some pragmatism here.

I mostly agree with Asmus here. I think it's also worth looking at the 
actual cases. Looking at 
https://tools.ietf.org/html/draft-faltstrom-unicode11-08#section-4.3, in 
the 11.0.0 timeframe, we have a single previous character that changed 
from DISALLOWED to PVALID. See also 
https://tools.ietf.org/html/draft-faltstrom-unicode11-08#section-5, 
which essentially says that there's not much of a problem either way.

In the 6.0.0 timeframe, there were both more actual changes (3), and 
there was one change where we moved from PVALID to DISALLOWED, which I 
personally think is the tougher one (see 
https://tools.ietf.org/html/rfc6452#section-1.3). Looking at the 
code chart at https://www.unicode.org/charts/PDF/U1980.pdf, one may even 
reasonably argue that keeping U+19DA NEW TAI LUE THAM DIGIT ONE as PVALID 
would have been the better choice. U+19B1 NEW TAI LUE VOWEL SIGN AA and 
U+19D1 NEW TAI LUE DIGIT ONE have (at least to a first approximation) the 
same glyph shape, so one can easily guess that U+19DA NEW TAI LUE THAM 
_DIGIT_ ONE is used exactly in situations where the distinction between 
letters and digits is important, of which we know domain names are a good 
example. (This is only a quick conjecture, which would have to be 
confirmed.)


> We know that only about 35K code points out of 130K total are actually 
> used in someone's orthography, where that someone is a speaker of a 
> language used for daily activities and with good intergenerational 
> transmission (= not likely to die out soon).
> 
> All the other 100K code points are either symbols or punctuation or, 
> more often, code points that are recognized only by scholars.
> 
> Very little activity in Unicode affects that first set of 35K code 
> points: future changes may be possible due to orthography reforms, as 
> well as some population acquiring widespread literacy for the first 
> time, but other than that, there's not much going on that needs new code 
> points. There's the occasional case where the properties for some of the 
> lesser used code points in this set may turn out to require corrections 
> as result of software implementations for these becoming more mainstream 
> and thus discovering issues. Generally, these affect languages where 
> digital deployment itself is still nascent; and only a small subset 
> potentially affects IDNA category.
> 
> The upshot is that the practical effect of such activity in the margins 
> is effectively invisible.
> 
> The overwhelming part of the additions and other changes are within the 
> remaining set of 100K specialized characters. These code points are 
> either misused for malicious registrations, relying on users' 
> unfamiliarity with these rare code points or of use only to specialized 
> audiences.
> 
> The practical effect of any incompatibilities in that set is, again, 
> effectively minuscule; even given the realistic expectation that the 
> property data for these won't get rigorously tested in practical 
> applications and therefore could be found incorrect at some future point.
> 
> 
>>   I was at that time on my way to just give up, and recommend IETF 
>> that what we do today, to reference something that we thought was 
>> normative, stable etc (i.e. the Unicode Standard) did not work. That 
>> IETF instead would pick up and use some relationship with ISO that 
>> should be formalized, and instead of referencing the Unicode 
>> Consortium IETF should reference ISO, and then ISO should have in its 
>> rules that backward compatibility would be a requirement.
> 
> 
> This is a complete nonstarter.
> 
> Nobody wins if we encourage the existence of two different encoding 
> standards - and the current process of cooperation works precisely 
> because it doesn't work the way it's sketched below.

I completely agree. First, ISO mostly delegates properties to Unicode. 
Second, Unicode is where the feedback from implementers and users is 
mostly coming in, and it's that feedback that leads to corrections. 
Third, while in ISO there's a chance that they would leave some 
properties untouched just because they don't understand them or are not 
interested, there's also the chance that they would tweak something if 
for whatever reason one national body or the other (or to be precise, 
the representative of that national body) insists on a change.

>> If then Unicode did ship something to ISO for "approval", then ISO 
>> would say "no, try again" or else ISO would make a process violation 
>> that members could appeal within ISO. But after sleeping on this for a 
>> bit, I felt the issue with 11.0.0 also happened with 7.0.0 and maybe 
>> one should first have a discussion within IETF again before the bolts 
>> are blown regarding relationship with Unicode Consortium. That just 
>> started when the I18N directorate was closed, and I ended up in a 
>> void. After long time, pushing and what not (let's discuss in the bar) 
>> this group finally was created where this VERY SERIOUS ISSUE could be 
>> discussed, and as John explained, there are many things to discuss, 
>> where we can boil it down to two:
>>
>> A. Should the non-backward compatible change in 11.0.0 result in a 
>> change in rule [G] in RFC 5892, or should we accept a non-backward 
>> compatible change? To trigger the discussion, I proposed the same 
>> result as for 7.0.0, to NOT update [G], and simply accept it. This 
>> also because the next issue is the important one.
>>
>> B. What should we do with IDNA2008? Obviously Unicode Standard is not 
>> stable enough. Or is it? What should we do with review? Should we have 
>> to start doing what Martin just did with 12.0.0?
> 
> What Martin did with Unicode 12.0.0 is what I call a sanity check. I 
> think it's a reasonable activity (if resources exist), but I would not 
> expect to find any issues that are of enough practical importance to 
> actually matter. (See above).
> 
> 
>>   Do we IETF have the expertise? Can we rely on individuals like 
>> Martin, Asmus, myself and John and very few more to be around? Can we 
>> rely on Unicode Consortium? Is this the time to instead move to ISO? 
>> Like keeping IDNA2008 algorithm BUT tie it to ISO approved charset, 
>> and then ask ISO to protect the backward compatibility?

If we needed that, we could simply also use the rule [G] override. It 
might be a lot cheaper both for us and for others.

While at it, I'd like to discuss some assumptions about implementations. 
https://tools.ietf.org/html/draft-faltstrom-unicode11-08#section-5 says:
                 As including an exception would require implementation
    changes in deployed implementations of IDNA20008, the editor proposes
    that such a BackwardCompatible rule NOT to be added to IDNA2008.

This seems to assume the following implementation strategy: The code of 
an IDNA library is version-agnostic, relies on some lower layer for 
providing property information, and this lower layer gets updated 
automa[c/t]ically when new Unicode information is published. That would 
e.g. be the case with an implementation in pure Ruby that used \p{} 
regular expression properties (the way I did it in my script). As long 
as there are no additions to the rule [G] set, such an implementation 
would need no updates of its own when the Unicode version changes, only 
an update of Ruby itself.
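
A minimal sketch of such a version-agnostic implementation might look as 
follows. It is not the actual script mentioned above and not a complete 
IDNA2008 implementation: it models only the BackwardCompatible (rule [G]), 
Unstable, and LetterDigits rules of RFC 5892, and the names 
BACKWARD_COMPATIBLE, LETTER_DIGITS, unstable? and derived_property are 
hypothetical.

```ruby
# Simplified sketch. Omitted RFC 5892 rules: Exceptions, Unassigned, LDH,
# JoinControl, IgnorableProperties, IgnorableBlocks, OldHangulJamo.

# Rule [G] (BackwardCompatible): code point => forced derived property.
# Empty here; entries would have to be added by hand.
BACKWARD_COMPATIBLE = {}

# LetterDigits: general categories Ll, Lu, Lo, Nd, Lm, Mn, Mc.
# The property data comes from the Ruby runtime, so it follows whatever
# Unicode version that Ruby build supports.
LETTER_DIGITS = /\A[\p{Ll}\p{Lu}\p{Lo}\p{Nd}\p{Lm}\p{Mn}\p{Mc}]\z/

# Unstable: the character changes under NFKC(casefold(NFKC(cp))).
def unstable?(char)
  char.unicode_normalize(:nfkc).downcase(:fold).unicode_normalize(:nfkc) != char
end

def derived_property(codepoint)
  # Rule [G] takes precedence over the property-based rules modelled here.
  return BACKWARD_COMPATIBLE[codepoint] if BACKWARD_COMPATIBLE.key?(codepoint)

  char = [codepoint].pack('U')
  return :DISALLOWED if unstable?(char)

  LETTER_DIGITS.match?(char) ? :PVALID : :DISALLOWED
end
```

In this sketch the only version-specific part is the BACKWARD_COMPATIBLE 
hash; everything else changes silently with the Ruby runtime.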

While it is my understanding that e.g. fonts for emoji in new Unicode 
versions get deployed to many OSes pretty quickly, IDNA implementations 
(I just looked at https://gitlab.com/libidn/libidn2/) seem to work 
differently. They assume that they are on their own with respect to 
properties, and compile in all the property data they need (which may be 
just the end values of PVALID/DISALLOWED/...). That means that for them, 
updating to a new Unicode version means adjusting the source code and 
recompiling. With such a strategy, adding a codepoint or two to the 
rule [G] set would be easy.
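
The compiled-in strategy can be sketched roughly as below. This does not 
reflect libidn2's actual code or data layout; GENERATED_TABLE, lookup and 
the rule_g argument are hypothetical names, and the table is a tiny 
hand-written excerpt rather than a generated one. The rule [G] entry used 
in the usage note is purely illustrative: no such entry exists in RFC 5892 
or RFC 6452.

```ruby
# Excerpt of the kind of table a library might generate offline for one
# Unicode version and compile in. A real table covers all code points.
GENERATED_TABLE = [
  [0x002D..0x002D, :PVALID],      # HYPHEN-MINUS
  [0x0030..0x0039, :PVALID],      # DIGIT ZERO..DIGIT NINE
  [0x0041..0x005A, :DISALLOWED],  # LATIN CAPITAL LETTER A..Z
  [0x0061..0x007A, :PVALID],      # LATIN SMALL LETTER A..Z
  [0x19DA..0x19DA, :DISALLOWED],  # NEW TAI LUE THAM DIGIT ONE (from 6.0.0)
].freeze

# Look up the derived property; rule [G] entries, if any, win over the
# generated table. Returns nil for code points outside this excerpt.
def lookup(codepoint, rule_g = {})
  return rule_g[codepoint] if rule_g.key?(codepoint)

  entry = GENERATED_TABLE.find { |range, _| range.cover?(codepoint) }
  entry && entry[1]
end
```

Calling lookup(0x19DA) gives the derived value, while 
lookup(0x19DA, { 0x19DA => :PVALID }) shows what a one-line rule [G] 
addition would do. Either way the library has to be regenerated and 
recompiled for each new Unicode version.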

It might be good to inspect some of the implementations we know about to 
see which strategy they follow.

>> Or are we 
>> done, so that we simply freeze IDN to a specific Unicode version, and 
>> simply ignore all added code points after that version? I.e. at 5.2.0 
>> we knew some interesting code points were still to be added, but 
>> after 11.0.0?
>>
>> What I now know is that IESG have told me that we do have agreement in 
>> IETF on draft-faltstrom-unicode11-08.txt which implies moving to IDNA 
>> 11.0.0 without adding things to [G] is the path forward for now.
>>
>> What to do with 12.0.0 and future versions is still up in the air.
> 
> 
> I believe the correct answer is to have a very serious discussion on how 
> to help registries to make sure their support for the widely needed 35K 
> is more robust and not to get sidetracked by the vast wasteland of 
> obsolescent, obsolete, ancient or otherwise specialized use code points 
> (whether for liturgical, phonetic, or poetic usage).
> 
> Continue to keep a watchful eye over the application of the algorithm 
> for future versions, but don't expect that any cases will rise to the 
> threshold where any exceptions need to be baked into the protocol - with 
> all the attendant issues for updating software libraries.

I fully agree with Asmus here.

Regards,   Martin.


> But focus on solving issues that have at least practical impact.
> 
> A./
>