Re: [xml2rfc] [Tools-discuss] [Rfc-markdown] New xml2rfc release: v3.16.0

John C Klensin <> Fri, 20 January 2023 03:43 UTC

Date: Thu, 19 Jan 2023 22:43:06 -0500
From: John C Klensin <>
To: Jay Daley <>, Marc Petit-Huguenin <>
cc:, tools-discuss <>
Message-ID: <91C1EBAB771E004FD6EB9761@PSB>

--On Thursday, January 19, 2023 21:00 +0000 Jay Daley
<> wrote:

>> On 19 Jan 2023, at 20:46, Marc Petit-Huguenin
>> <> wrote:
>> On 1/19/23 12:35, Salz, Rich wrote:
>>> Thank you for your reply.
>>>> Obviously when I say add a new element to the language, it
>>>> means create a RFC7991bis that does that modification. And
>>>> then modify the xml2rfc tool to match the new spec.
>>> Wasn't obvious, I appreciate the clarification. But if
>>> you're modifying the XML language, why not just remove the
>>> limitations on non-ASCII that have been found to be
>>> problematic?
>> Because what I find to be problematic is allowing Unicode
>> everywhere.  My proposal of following RFC 7997 to the letter
>> instead of sneakily circumvent it has, among others, the
>> advantage of not allowing emojis in a document generated by
>> xml2rfc.  I cannot think of a more important goal for the
>> future of humanity.
> I can only repeat what I've said before in the hope that
> doing so leads to some recognition - all this change does is
> move the enforcement of RFC 7997 from the tool to the RPC
> editors and stream managers.  That same set of people already
> enforce many other rules quite successfully.  This is not
> opening the floodgates.


I agree, but it is probably worth digging a bit further into
this.  Assume valid UTF-8 [1][2], just to make things a bit
simpler than they would be otherwise [3].  Because of issues with,
e.g., ordering of combining characters as well as issues
specific to particular scripts and use of some scripts by
particular languages, "valid UTF-8" is not sufficient to imply
strings that can be rendered in a particular or reasonable way.
Some strings that are valid UTF-8 are, because of different
sorts of peculiarities, going to be rendered differently by
different engines.  Sometimes that makes a difference, sometimes
it doesn't.  Combining characters that modify preceding ones
appearing first in a string aside, the most commonly cited
troublesome examples involve

 * mixtures of characters drawn from scripts
	characterized as right-to-left with characters that are
	not associated with inherent directionality (e.g., the
	digits associated with contemporary Latin-based scripts)
	and characters from scripts characterized as
	left-to-right,

 * rendering issues with Emoji and other graphemes that
	can look very different in different environments, and

 * Emoji combining sequences, where they are and are not
	valid, and how they are interpreted and rendered.
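
To make a few of those cases a bit more concrete, here is a rough
sketch using Python's standard unicodedata module.  The function
names are hypothetical, invented for this illustration, and the
ZWJ test is only a crude heuristic; none of this reflects what
xml2rfc or the RPC actually does.  Every sample string involved
encodes as perfectly valid UTF-8, which is the point.

```python
import unicodedata


def starts_with_combining_mark(text):
    """True if the first code point is a combining mark (category M*)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


def strong_directions(text):
    """Return the strong bidi directions present: 'L' and/or 'R'."""
    found = set()
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            found.add("L")
        elif bidi in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            found.add("R")
    return found


def has_mixed_direction(text):
    """True if both left-to-right and right-to-left letters occur."""
    return strong_directions(text) == {"L", "R"}


def has_zwj_sequence(text):
    """Crude heuristic: ZERO WIDTH JOINER (U+200D) appears in the string."""
    return "\u200d" in text


samples = {
    "leading combining acute": "\u0301abc",
    "Latin, digits, Hebrew": "RFC 7997 \u05d0\u05d1\u05d2",
    "emoji joined by ZWJ": "\U0001F468\u200d\U0001F469\u200d\U0001F467",
}

for label, s in samples.items():
    s.encode("utf-8")  # succeeds for all samples: each is valid UTF-8
    print(label,
          starts_with_combining_mark(s),
          has_mixed_direction(s),
          has_zwj_sequence(s))
```

Checks of this sort can flag strings for a closer look, but they
say nothing about how any particular engine will render them.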

I don't see the type of markup under discussion as having much
to do with the above and hence, again, agree with what you have
said as far as it goes.  At the same time, asking the RPC to
sort out that whole range of issues may require skill sets that
the RPC does not have today and, even then, additional markup
may be needed to give them adequate information.   In
particular, unless you / the RPC want to hire staff or
consultants whose skills extend across the full range of writing
systems and possible uses of Unicode, caution may be in order.
It might be relevant --if it is not already generally understood
on these lists-- that those skill sets are actually fairly rare
(possibly translating into "hard to obtain" and/or "expensive").
Especially if that is not the plan, it may be worth considering:

(1) Enhancing the Style Guide to indicate which scripts and
languages can be used freely and which ones require, e.g.,
advanced consultation with the RPC.  Maybe normalization
suggestions belong there too, maybe not.  If that list starts
with, e.g., European scripts derived from Greek or Latin plus
maybe Chinese script and gradually expands, I don't see much, or
any, harm.
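
On the normalization point, a small sketch of why it might matter,
again using Python's standard unicodedata module (the helper name
is made up for this example): two strings that a reader would
consider the same can differ as code point sequences, and as
UTF-8 bytes, until they are normalized.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_after_nfc(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both forms are valid UTF-8, but they are different byte sequences.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
print(composed == decomposed)
print(same_after_nfc(composed, decomposed))
```

Whether the Style Guide should recommend a particular form is a
separate question; the sketch only shows that the choice is
visible to tools.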

(2) Retaining the <u> element, not as a requirement any time
Unicode is used (or used in an unusual context), but to allow
not just the existing attributes but also specification of
direction and language.

The two obviously interact because one could then make a rule
that <u> is optional for language-script pairs enumerated in the
Style Guide but must be specified for all other cases and any
additional cases where the author believes it is needed
(including as advice for the RPC).

And, btw, I do not quite understand what happens to information
now specified in the "format" attribute if <u> is dropped.


p.s. to John Levine and others who might have noticed: After
looking at the <u> section (3.62) of
draft-irse-draft-irse-xml2rfcv3-implemented-03, "Unicide" sounds
like a term that it might be useful to define and use, perhaps
for particular flavors of Unicode abuse such as, in other
contexts, deliberately using non-obvious Unicode strings to
create confusion or deception.  An alternate definition
involving committing, or wanting to commit, acts of violence
against Unicode designers would obviously be inappropriate.


[1] Note the difference between RFC 3629 and 2279 and, in turn,
between 2279 and 2044.  Also note that 3629, despite its
Internet Standard status, should probably not be the last word
on the subject.

[2] FWIW, the current text-form output from xml2rfc violates a
SHOULD in RFC 3629.  As we consider how and where Unicode code
sequences may be used, that should either be fixed or documented
and justified in some clear way and prominent place. 

[3] Noting that the native Unicode encoding of some important
operating systems and tools is not UTF-8, so some caution is
required even there.