Re: [Tools-discuss] [xml2rfc] [Rfc-markdown] New xml2rfc release: v3.16.0 (details of Unicode)

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Fri, 20 January 2023 07:30 UTC

Message-ID: <d40cf4bc-c96b-0f1f-27ff-10289af29c53@it.aoyama.ac.jp>
Date: Fri, 20 Jan 2023 16:30:11 +0900
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.6.1
Content-Language: en-US
To: John C Klensin <john-ietf@jck.com>, Jay Daley <exec-director@ietf.org>, Marc Petit-Huguenin <marc@petit-huguenin.org>
Cc: xml2rfc@ietf.org, tools-discuss <tools-discuss@ietf.org>
References: <CAD2=Z87EMetcpv66YY_b2+X1-yFy4cTpKMjPoJL=cH99c7P_Uw@mail.gmail.com> <9d719176-a4eb-7cce-e706-10325700531c@petit-huguenin.org> <F1A5624B-16D0-4463-AC5F-B0A03F3B94B6@ietf.org> <8f5a497e-4135-7c0c-46cb-c3fe4791e9f3@petit-huguenin.org> <3B53040D-9B5F-4410-9029-459729ADFDF8@ietf.org> <7d532a76-c750-8cb6-fc86-f6242da2bc77@petit-huguenin.org> <27182BDD-899E-4238-9DF8-7AE3E0F0C18F@akamai.com> <11a20f42-4fe6-8d8d-1d76-54049d0bdb68@petit-huguenin.org> <6179D431-EC33-49B9-A793-805E926C1050@akamai.com> <6d55cf67-1ce2-8094-dda4-ab877ac2a1d7@petit-huguenin.org> <1573D125-B3A9-4839-A1B2-DAD87FBB5DA6@ietf.org> <91C1EBAB771E004FD6EB9761@PSB>
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
In-Reply-To: <91C1EBAB771E004FD6EB9761@PSB>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/l1v_mcUGrWci7alVhnr80EO52ts>
Subject: Re: [Tools-discuss] [xml2rfc] [Rfc-markdown] New xml2rfc release: v3.16.0 (details of Unicode)

On 2023-01-20 12:43, John C Klensin wrote:

> I agree, but it is probably worth digging a bit further into
> this.  Assume valid UTF-8 [1][2], just to make things a bit more
> simple than it would be otherwise [3]. Because of issues with,
> e.g., ordering of combining characters as well as issues
> specific to particular scripts and use of some scripts by
> particular languages, "valid UTF-8" is not sufficient to imply
> strings that can be rendered in a particular or reasonable way.
> Some strings that are valid UTF-8 are, because of different
> sorts of peculiarities, going to render differently by different
> engines.

Can we please assume the following three points?

a) As with all other aspects of draft/rfc-to-be creation, the
    authors/editors will use non-ASCII characters carefully and
    diligently to the best of their knowledge. They will be
    familiar with the characters and scripts in question e.g.
    because it's their name or address, or otherwise. They will
    also check these characters carefully in AUTH48, including
    the .pdf version.
b) As with all other aspects of draft/rfc-to-be review, the wide
    range of people (WG members, chairs, shepherds, ADs,...)
    who will review non-ASCII characters in a document will do
    so with as much care as for the other aspects of a document
    (and because of their number, they will do so using a wide
    range of tools such as plain text editors, browsers, MUAs,
    ..., on a wide range of OSes).
c) The RPC will check non-ASCII characters carefully the same
    way they check other aspects of an rfc-to-be, and will get
    back to editors/authors/WG/ADs/... if they see any need.

Given the above, I'm rather confident that in general, things will work 
out just fine. Of course, very occasionally something might still go 
wrong. But that happens with other aspects of RFCs, too. For a clear 
example, check out RFC 7158/7159.


> Sometimes that makes a difference, sometimes it
> doesn't.  Combining characters that modify preceding ones
> appearing first in a string aside, the most commonly cited
> troublesome examples involve
> 
>   * mixtures of characters drawn from scripts
> 	characterized as right-to-left with characters that are
> 	not associated with inherent directionality (e.g., the
> 	digits associated with contemporary Latin-based scripts)
> 	and characters from scripts characterized as
> 	left-to-right and
> 	
>   * rendering issues with Emoji and other graphemes that
> 	can look very different in different environments.
> 
>   * Emoji combining sequences, where they are and are not
> 	valid, and how they are interpreted and rendered.
> 
> I don't see the type of markup under discussion as having much
> to do with the above and hence, again, agree with what you have
> said as far as it goes.  At the same time, asking the RPC to
> sort out that whole range of issues may require skill sets that
> the RPC does not have today and, even then, additional markup
> may be needed to give them adequate information.   In
> particular, unless you / the RPC want to hire staff or
> consultants whose skills extend across the full range of writing
> systems and possible uses of Unicode, caution may be in order.
> It might be relevant --if it is not already generally understood
> on these lists-- that those skill sets are actually fairly rare
> (possibly translating into "hard to obtain" and/or "expensive").

Finding a consultant whose skills extend across the full range of 
writing systems and possible uses of Unicode may simply be impossible. 
I know quite a few people who come close, but only close.

But I know that in the community as a whole, we have many people who 
have expertise in specific scripts, languages, and encoding mechanisms. 
And I hope we and the RPC can leverage this expertise. I for one 
volunteer to help with any specific questions the RPC has with respect 
to Unicode.


> Especially if that is not the plan, it may be worth considering:
> 
> (1) Enhancing the Style Guide to indicate which scripts and
> languages can be used freely and which ones require, e.g.,
> advanced consultation with the RPC.  Maybe normalization
> suggestions belong there too, maybe not.  If that list starts
> with, e.g., European scripts derived from Greek or Latin plus
> maybe Chinese script and gradually expands, I don't see much, or
> any, harm.

Such a list would probably do more harm than good. The average 
draft/rfc-to-be author/editor doesn't use Unicode characters because 
they want to cheat the system, but because they need them for specific 
names/addresses/examples/... If it so happened that they needed Egyptian 
Hieroglyphs (very rare case, but let's just assume), it won't help them 
if the RPC says "we currently have Greek and Chinese, why don't you try 
one of these" or "at our current pace of adding scripts to our 'allowed' 
list, come back in 20 years".

Please note that RFC 9290, which initiated the current discussion, 
needed an RTL example, and chose Hebrew שלום because that avoided any 
potential ligaturing problems with Arabic. This shows both how authors 
are careful and how your idea of an "allowed" list would have failed.
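
The directionality properties at issue here can be inspected with Python's 
standard unicodedata module. The following is only a minimal sketch; the 
helper name bidi_classes is hypothetical, and the sample strings are chosen 
for illustration. It lists the Unicode bidirectional class of each character, 
which shows why Hebrew letters, Latin letters, and European digits interact 
when mixed in one line.

```python
import unicodedata

def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# The Hebrew example word, written with escapes to avoid display reordering:
# U+05E9 SHIN, U+05DC LAMED, U+05D5 VAV, U+05DD FINAL MEM.
shalom = "\u05e9\u05dc\u05d5\u05dd"

print(bidi_classes(shalom))      # classes of the Hebrew letters
print(bidi_classes("RFC 9290"))  # Latin letters, a space, European digits
print(bidi_classes("\u0633"))    # one Arabic letter (U+0633 SEEN)
```

Hebrew letters carry class R, Arabic letters class AL, Latin letters class L, 
and the digits 0-9 class EN, so a renderer has to resolve the order of mixed 
runs at display time.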


> (2) Retaining the <u> element, not as a requirement for anytime
> Unicode is being used (or used in an unusual context) but to
> allow, not just the existing attributes but specification of
> direction and language.
> 
> The two obviously interact because one could then make a rule
> that <u> is optional for language-script pairs enumerated in the
> Style guide but must be specified for all other cases and any
> additional cases where the author believes it is needed
> (including as advice for the RPC).
> 
> And, btw, I do not quite understand what happens to information
> now specified in the "format" attribute if <u> is dropped
> generally.
> 
>      john
> 
> p.s. to John Levine and others who might have noticed: After
> looking at the <u> section (3.62) of
> draft-irse-draft-irse-xml2rfcv3-implemented-03, "Unicide" sounds
> like a term that it might be useful to define and use, perhaps
> for particular flavors of Unicode abuse such as, in other
> contexts, deliberately using non-obvious Unicode strings to
> create confusion or deception.  An alternate definition
> involving committing, or wanting to commit, acts of violence
> against Unicode designers would obviously be inappropriate.
> 
> _Notes_
> 
> [1] Note the difference between RFC 3629 and 2279 and, in turn
> between 2279 and 2044.   Also note that 3269, despite its
> Internet Standard status, should probably not be the last word
> on the subject.

[RFC 3269 is "Author Guidelines for Reliable Multicast Transport (RMT) 
Building Blocks and Protocol Instantiation documents". I assume you 
meant RFC 3629.]

> [2] FWIW, the current text-form output from xml2rfc violates a
> SHOULD in RFC 3629.  As we consider how and where Unicode code
> sequences may be used, that should either be fixed or documented
> and justified in some clear way and prominent place.

RFC 3629 contains four SHOULDs (one of them a SHOULD NOT). Could you 
indicate which one you think is violated?

All the SHOULDs revolve around the use of U+FEFF as a signature. I just 
carefully re-read them, but can't find any way in which xml2rfc 
currently would violate any of these SHOULDs.
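
Whether a given output file begins with U+FEFF as a signature is easy to 
check. The following is a minimal sketch, not part of xml2rfc; the function 
names are hypothetical. It tests whether a byte string starts with the UTF-8 
encoding of U+FEFF (the bytes EF BB BF) and strips it if present.

```python
import codecs

def starts_with_utf8_signature(data: bytes) -> bool:
    """True if data begins with U+FEFF encoded as UTF-8 (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_signature(data: bytes) -> bytes:
    """Remove a single leading UTF-8 signature, if there is one."""
    if starts_with_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# Illustrative input: the same text with and without the signature.
with_signature = "\ufeffRFC".encode("utf-8")
without_signature = "RFC".encode("utf-8")

print(starts_with_utf8_signature(with_signature))
print(starts_with_utf8_signature(without_signature))
print(strip_utf8_signature(with_signature) == without_signature)
```

To apply this to a file, read its first three bytes in binary mode and pass 
them to starts_with_utf8_signature.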


> [3]  Noting that the native Unicode encoding of some important
> operating systems and tools is not UTF-8, so some caution is
> required even there.

The relevant operating systems actually apply the necessary caution. 
It's extremely rare these days that some action is needed by users.


Regards,   Martin.