Re: [xml2rfc] [irsg] character sets, was UPDATE regarding <u>

Carsten Bormann <cabo@tzi.org> Sun, 05 March 2023 13:19 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <20230304190316.05346A51F3D2@ary.qy>
Date: Sun, 05 Mar 2023 14:19:21 +0100
Cc: xml2rfc@ietf.org, rfc-markdown@ietf.org
Message-Id: <5081F069-705D-4707-85EB-DBA11D594D19@tzi.org>
References: <20230304190316.05346A51F3D2@ary.qy>
To: "John R. Levine" <johnl@taugh.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/CM7oMM2MbL4_YnZWZv2cdHn1mmg>
Subject: Re: [xml2rfc] [irsg] character sets, was UPDATE regarding <u>

On 2023-03-04, at 20:03, John Levine <johnl@taugh.com> wrote:
> 
> One issue is our policy about where any non-ASCII goes, but a separate
> issue that Carsten has run into is exotic characters beyond the 3000
> or so that are in the fonts we normally use.

Kramdown-rfc 1.6.26 now has an `echars` utility that is more talkative about Unicode blocks and Unicode scripts.  Examples below.  I hope this is useful in our quest to rescue these Unicode arcana from being held captive by Unicode super-geeks.

Grüße, Carsten


This is the RFC-to-be in which the RPC happened to pick a character from the Dingbats block, which is what started this discussion:

$ echars rfc/authors/rfc9340.txt
*** Latin-1 Supplement (Latin)
ß: U+00DF    1 LATIN SMALL LETTER SHARP S
á: U+00E1    3 LATIN SMALL LETTER A WITH ACUTE
ä: U+00E4    1 LATIN SMALL LETTER A WITH DIAERESIS
é: U+00E9    5 LATIN SMALL LETTER E WITH ACUTE
ó: U+00F3    1 LATIN SMALL LETTER O WITH ACUTE
ø: U+00F8    1 LATIN SMALL LETTER O WITH STROKE
ü: U+00FC    7 LATIN SMALL LETTER U WITH DIAERESIS
*** Latin Extended-A (Latin)
ć: U+0107    2 LATIN SMALL LETTER C WITH ACUTE
č: U+010D    2 LATIN SMALL LETTER C WITH CARON
ę: U+0119    1 LATIN SMALL LETTER E WITH OGONEK
ł: U+0142    2 LATIN SMALL LETTER L WITH STROKE
š: U+0161    1 LATIN SMALL LETTER S WITH CARON
*** General Punctuation (Common)
–: U+2013    1 EN DASH
*** Dingbats (Common)
➔: U+2794    2 HEAVY WIDE-HEADED RIGHTWARDS ARROW
*** Miscellaneous Mathematical Symbols-A (Common)
⟩: U+27E9   61 MATHEMATICAL RIGHT ANGLE BRACKET
*** Arabic Presentation Forms-B (Common)
: U+FEFF    1 ZERO WIDTH NO-BREAK SPACE


For comparison, an RFC out of the pre-v3 times:

$ echars rfc/rfc8265.txt
*** Basic Latin (Common)
"\f": U+000C   25 <control-000C>
*** Latin-1 Supplement
¹: U+00B9    1 SUPERSCRIPT ONE (Common)
ß: U+00DF    3 LATIN SMALL LETTER SHARP S (Latin)
å: U+00E5    1 LATIN SMALL LETTER A WITH RING ABOVE (Latin)
*** Latin Extended-A (Latin)
ſ: U+017F    2 LATIN SMALL LETTER LONG S
*** Greek and Coptic (Greek)
Σ: U+03A3    2 GREEK CAPITAL LETTER SIGMA
π: U+03C0    2 GREEK SMALL LETTER PI
ς: U+03C2    2 GREEK SMALL LETTER FINAL SIGMA
σ: U+03C3    2 GREEK SMALL LETTER SIGMA
*** Ogham (Ogham)
 : U+1680    1 OGHAM SPACE MARK
*** Number Forms (Latin)
Ⅳ: U+2163    6 ROMAN NUMERAL FOUR
*** Mathematical Operators (Common)
∞: U+221E    2 INFINITY
*** Miscellaneous Symbols (Common)
♦: U+2666    1 BLACK DIAMOND SUIT
*** Alphabetic Presentation Forms (Latin)
fi: U+FB01    2 LATIN SMALL LIGATURE FI
*** Arabic Presentation Forms-B (Common)
: U+FEFF    1 ZERO WIDTH NO-BREAK SPACE

You can see the form feeds we used before RFC 8650, and the dreadful BOM was already there (it sits in the Arabic Presentation Forms-B block, in case you didn’t know that).
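
For readers without kramdown-rfc at hand, a report of this general shape can be approximated with a minimal Python sketch. This is not how `echars` is implemented; it only illustrates the idea of counting unusual characters and grouping them by block. The `BLOCKS` table is a tiny hand-copied excerpt (the full list is in the Unicode Character Database file Blocks.txt), the names `block_of`, `is_exotic`, and `report` are made up for illustration, and the rule for which characters count as "exotic" is an assumption. Script names are left out because Python's standard library does not expose the script property.

```python
import sys
import unicodedata
from collections import Counter

# Hypothetical, deliberately incomplete block table: (first, last, name).
# The authoritative list is Blocks.txt in the Unicode Character Database.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2700, 0x27BF, "Dingbats"),
    (0x27C0, 0x27EF, "Miscellaneous Mathematical Symbols-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]


def block_of(cp):
    """Return the block name for a code point, if the sketch table has it."""
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "(block not in sketch table)"


def is_exotic(ch):
    """Assumed rule: anything beyond printable ASCII, except TAB, LF, CR."""
    cp = ord(ch)
    return cp > 0x7E or (cp < 0x20 and ch not in "\t\n\r")


def report(text):
    """Return report lines: a '***' header per block, then one line per character."""
    counts = Counter(ch for ch in text if is_exotic(ch))
    lines = []
    current_block = None
    for ch in sorted(counts):  # sorted by code point
        cp = ord(ch)
        block = block_of(cp)
        if block != current_block:
            lines.append(f"*** {block}")
            current_block = block
        # Control characters have no name; fall back to a placeholder.
        name = unicodedata.name(ch, f"<control-{cp:04X}>")
        # repr() makes invisible characters such as the BOM visible.
        lines.append(f"{ch!r}: U+{cp:04X} {counts[ch]:4d} {name}")
    return lines


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            print("\n".join(report(f.read())))
```

Saved under a name of your choice and given one or more UTF-8 text files as arguments, it prints one block header per group and, below it, each character with its code point, count, and name.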