Re: [Rfc-markdown] [xml2rfc] [irsg] character sets, was UPDATE regarding <u>

Martin Thomson <mt@lowentropy.net> Sun, 05 March 2023 23:43 UTC

Date: Mon, 06 Mar 2023 10:42:35 +1100
From: Martin Thomson <mt@lowentropy.net>
To: rfc-markdown@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/rfc-markdown/zuNCTI7_bnowSyEP5UWtRZMVTHI>

Carsten, have you considered how the report might differ for <artwork>?  I see a few documents with a lot of symbols in artwork, which might make the report noisier than is ideal.

Handling <artwork> separately would probably mean working from the XML rather than the text output, which has some interesting implications for the characters that xml2rfc treats specially, like non-breaking spaces and the various kinds of zero-width space.
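
To make the idea concrete, here is a rough sketch of counting non-ASCII characters from the XML with <artwork> content tallied separately. It is not how `echars` works; the function name `count_non_ascii` and the sample document are made up for illustration, and it only looks at element text, not attribute values. Note that the XML parser resolves numeric character references, so a character written as a reference is counted the same as a literal one.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_non_ascii(xml_text):
    """Return (prose, artwork) Counters of non-ASCII characters.

    Hypothetical helper: text inside <artwork> (at any depth) is
    tallied separately from all other element text.
    """
    prose, artwork = Counter(), Counter()

    def walk(elem, in_artwork):
        inside = in_artwork or elem.tag == "artwork"
        target = artwork if inside else prose
        target.update(c for c in (elem.text or "") if ord(c) > 0x7F)
        for child in elem:
            walk(child, inside)
            # A child's tail text sits inside this element, not the child.
            target.update(c for c in (child.tail or "") if ord(c) > 0x7F)

    walk(ET.fromstring(xml_text), False)
    return prose, artwork


# Made-up sample document, not taken from any real RFC.
SAMPLE = """<rfc><middle><section>
<t>Gr\u00fc\u00dfe and a no-break\u00a0space.</t>
<artwork>A \u2794 B \u2794 C</artwork>
</section></middle></rfc>"""

prose, artwork = count_non_ascii(SAMPLE)
for label, counts in (("prose", prose), ("artwork", artwork)):
    for ch, n in sorted(counts.items()):
        print(f"{label:8} U+{ord(ch):04X} {n:4}")
```

Keeping two counters means the artwork symbols can be reported in their own section, or dropped, without losing the prose counts.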

On Mon, Mar 6, 2023, at 00:19, Carsten Bormann wrote:
> On 2023-03-04, at 20:03, John Levine <johnl@taugh.com> wrote:
>> 
>> One issue is our policy about where any non-ASCII goes, but a separate
>> issue that Carsten has run into is exotic characters beyond the 3000
>> or so that are in the fonts we normally use.
>
> Kramdown-rfc 1.6.26 now has an `echars` utility that is more talkative 
> about Unicode blocks and Unicode scripts.  Examples below.  I hope this 
> is useful in our quest to rescue these Unicode arcana from being held 
> captive by Unicode super-geeks.
>
> Regards, Carsten
>
>
> This is the RFC-to-be where RPC happened to pick a character from the 
> Dingbats block that started this discussion:
>
> $ echars rfc/authors/rfc9340.txt
> *** Latin-1 Supplement (Latin)
> ß: U+00DF    1 LATIN SMALL LETTER SHARP S
> á: U+00E1    3 LATIN SMALL LETTER A WITH ACUTE
> ä: U+00E4    1 LATIN SMALL LETTER A WITH DIAERESIS
> é: U+00E9    5 LATIN SMALL LETTER E WITH ACUTE
> ó: U+00F3    1 LATIN SMALL LETTER O WITH ACUTE
> ø: U+00F8    1 LATIN SMALL LETTER O WITH STROKE
> ü: U+00FC    7 LATIN SMALL LETTER U WITH DIAERESIS
> *** Latin Extended-A (Latin)
> ć: U+0107    2 LATIN SMALL LETTER C WITH ACUTE
> č: U+010D    2 LATIN SMALL LETTER C WITH CARON
> ę: U+0119    1 LATIN SMALL LETTER E WITH OGONEK
> ł: U+0142    2 LATIN SMALL LETTER L WITH STROKE
> š: U+0161    1 LATIN SMALL LETTER S WITH CARON
> *** General Punctuation (Common)
> –: U+2013    1 EN DASH
> *** Dingbats (Common)
> ➔: U+2794    2 HEAVY WIDE-HEADED RIGHTWARDS ARROW
> *** Miscellaneous Mathematical Symbols-A (Common)
> ⟩: U+27E9   61 MATHEMATICAL RIGHT ANGLE BRACKET
> *** Arabic Presentation Forms-B (Common)
> : U+FEFF    1 ZERO WIDTH NO-BREAK SPACE
>
>
> For comparison, an RFC out of the pre-v3 times:
>
> $ echars rfc/rfc8265.txt
> *** Basic Latin (Common)
> "\f": U+000C   25 <control-000C>
> *** Latin-1 Supplement
> ¹: U+00B9    1 SUPERSCRIPT ONE (Common)
> ß: U+00DF    3 LATIN SMALL LETTER SHARP S (Latin)
> å: U+00E5    1 LATIN SMALL LETTER A WITH RING ABOVE (Latin)
> *** Latin Extended-A (Latin)
> ſ: U+017F    2 LATIN SMALL LETTER LONG S
> *** Greek and Coptic (Greek)
> Σ: U+03A3    2 GREEK CAPITAL LETTER SIGMA
> π: U+03C0    2 GREEK SMALL LETTER PI
> ς: U+03C2    2 GREEK SMALL LETTER FINAL SIGMA
> σ: U+03C3    2 GREEK SMALL LETTER SIGMA
> *** Ogham (Ogham)
>  : U+1680    1 OGHAM SPACE MARK
> *** Number Forms (Latin)
> Ⅳ: U+2163    6 ROMAN NUMERAL FOUR
> *** Mathematical Operators (Common)
> ∞: U+221E    2 INFINITY
> *** Miscellaneous Symbols (Common)
> ♦: U+2666    1 BLACK DIAMOND SUIT
> *** Alphabetic Presentation Forms (Latin)
> fi: U+FB01    2 LATIN SMALL LIGATURE FI
> *** Arabic Presentation Forms-B (Common)
> : U+FEFF    1 ZERO WIDTH NO-BREAK SPACE
>
> You can see the form feeds we used before RFC 8650, as well as the
> dreadful BOM, which was already there (it is in the Arabic Presentation
> Forms-B block, in case you didn’t know that).
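
For readers without kramdown-rfc installed, a report loosely in the spirit of the listings above can be approximated as follows. This is only a sketch and not the `echars` implementation: `char_report` is a made-up name, and because Python's standard `unicodedata` module exposes character names but not blocks or scripts, the `***` block headers are omitted.

```python
import unicodedata
from collections import Counter


def char_report(text):
    """Return one report line per non-ASCII or unusual control character.

    Hypothetical helper: each line carries the code point, the number
    of occurrences, and the Unicode character name.
    """
    counts = Counter(
        c for c in text
        if ord(c) > 0x7F or (ord(c) < 0x20 and c not in "\n\r\t")
    )
    lines = []
    for ch, n in sorted(counts.items()):
        # Control characters have no name; fall back to a placeholder.
        name = unicodedata.name(ch, f"<control-{ord(ch):04X}>")
        lines.append(f"U+{ord(ch):04X} {n:4} {name}")
    return lines


if __name__ == "__main__":
    import sys

    # Usage (assumed): python report.py some-rfc.txt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            print("\n".join(char_report(f.read())))
```

Grouping by block or script, as `echars` does, would need the Unicode `Blocks.txt` and `Scripts.txt` data files or a third-party library.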