Re: [Rfc-markdown] [xml2rfc] [irsg] character sets, was UPDATE regarding <u>

Carsten Bormann <cabo@tzi.org> Sun, 05 March 2023 23:55 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <a39b8c32-6f4f-4caf-8400-1846ea25faa2@betaapp.fastmail.com>
Date: Mon, 06 Mar 2023 00:55:47 +0100
Cc: rfc-markdown@ietf.org
X-Mao-Original-Outgoing-Id: 699753347.8447371-ae5160646e4781813cfea620f3e285da
Content-Transfer-Encoding: quoted-printable
Message-Id: <940B4C2A-9253-4E05-AF01-0BA123BAE072@tzi.org>
References: <20230304190316.05346A51F3D2@ary.qy> <5081F069-705D-4707-85EB-DBA11D594D19@tzi.org> <a39b8c32-6f4f-4caf-8400-1846ea25faa2@betaapp.fastmail.com>
To: Martin Thomson <mt@lowentropy.net>
X-Mailer: Apple Mail (2.3608.120.23.2.7)
Archived-At: <https://mailarchive.ietf.org/arch/msg/rfc-markdown/9CtdtiFztfaJ_5llrPW8KwaUg50>
Subject: Re: [Rfc-markdown] [xml2rfc] [irsg] character sets, was UPDATE regarding <u>

On 2023-03-06, at 00:42, Martin Thomson <mt@lowentropy.net> wrote:
> 
> Carsten, have you considered how the report might differ for <artwork>?  I see a few documents with a lot of symbols in artwork, which might make the report noisier than is ideal.

I’m not sure — in the end these characters also must be in xml2rfc's PDF repertoire.

> That would maybe imply working from XML, which has some interesting implications for those characters that xml2rfc treats specially, like non-breaking spaces and zero-width spaces of different types.

I think it would be interesting to create a variety of targeted text extractors for RFCXML.

(Note that RFCXML does store a lot of text in attributes, which therefore also need to be extracted and sorted into the buckets we want to have.)
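To illustrate the kind of extractor meant here, a minimal sketch in Python using only the standard library. The bucket names (`text`, `verbatim`, `attributes`) and the function name `extract_buckets` are made up for this example; treating `<artwork>` and `<sourcecode>` as the "verbatim" elements is an assumption about what one would want to report separately, not something xml2rfc or kramdown-rfc does today.

```python
import xml.etree.ElementTree as ET

# Elements whose content goes into its own bucket (an assumption for this sketch).
VERBATIM = {"artwork", "sourcecode"}

def extract_buckets(xml_text):
    """Split the character data of an RFCXML document into three buckets."""
    buckets = {"text": [], "verbatim": [], "attributes": []}

    def walk(elem, in_verbatim):
        in_verbatim = in_verbatim or elem.tag in VERBATIM
        # Attribute values carry text too (names, titles, ...).
        buckets["attributes"].extend(elem.attrib.values())
        target = buckets["verbatim" if in_verbatim else "text"]
        if elem.text:
            target.append(elem.text)
        for child in elem:
            walk(child, in_verbatim)
            # Text following a child element belongs to the current element.
            if child.tail:
                target.append(child.tail)

    walk(ET.fromstring(xml_text), False)
    return {name: "".join(parts) for name, parts in buckets.items()}

# Hypothetical input; the author name is invented for the example.
sample = (
    '<rfc><front><author fullname="Zo\u00eb Example"/></front>'
    "<middle><t>a \u2794 b<em>c</em>d</t>"
    "<artwork>\u27e9\u27e9</artwork></middle></rfc>"
)
buckets = extract_buckets(sample)
```

Each bucket could then be fed to a character report separately, so that symbols in artwork do not drown out the ones in running text. Note that the default parser drops comments and processing instructions, and that this sketch does nothing special for the characters xml2rfc itself treats specially.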

Grüße, Carsten

> 
> On Mon, Mar 6, 2023, at 00:19, Carsten Bormann wrote:
>> On 2023-03-04, at 20:03, John Levine <johnl@taugh.com> wrote:
>>> 
>>> One issue is our policy about where any non-ASCII goes, but a separate
>>> issue that Carsten has run into is exotic characters beyond the 3000
>>> or so that are in the fonts we normally use.
>> 
>> Kramdown-rfc 1.6.26 now has an `echars` utility that is more talkative 
>> about Unicode blocks and Unicode scripts.  Examples below.  I hope this 
>> is useful in our quest to rescue these Unicode arcana from being held 
>> captive by Unicode super-geeks.
>> 
>> Grüße, Carsten
>> 
>> 
>> This is the RFC-to-be where RPC happened to pick a character from the 
>> Dingbats block that started this discussion:
>> 
>> $ echars rfc/authors/rfc9340.txt
>> *** Latin-1 Supplement (Latin)
>> ß: U+00DF    1 LATIN SMALL LETTER SHARP S
>> á: U+00E1    3 LATIN SMALL LETTER A WITH ACUTE
>> ä: U+00E4    1 LATIN SMALL LETTER A WITH DIAERESIS
>> é: U+00E9    5 LATIN SMALL LETTER E WITH ACUTE
>> ó: U+00F3    1 LATIN SMALL LETTER O WITH ACUTE
>> ø: U+00F8    1 LATIN SMALL LETTER O WITH STROKE
>> ü: U+00FC    7 LATIN SMALL LETTER U WITH DIAERESIS
>> *** Latin Extended-A (Latin)
>> ć: U+0107    2 LATIN SMALL LETTER C WITH ACUTE
>> č: U+010D    2 LATIN SMALL LETTER C WITH CARON
>> ę: U+0119    1 LATIN SMALL LETTER E WITH OGONEK
>> ł: U+0142    2 LATIN SMALL LETTER L WITH STROKE
>> š: U+0161    1 LATIN SMALL LETTER S WITH CARON
>> *** General Punctuation (Common)
>> –: U+2013    1 EN DASH
>> *** Dingbats (Common)
>> ➔: U+2794    2 HEAVY WIDE-HEADED RIGHTWARDS ARROW
>> *** Miscellaneous Mathematical Symbols-A (Common)
>> ⟩: U+27E9   61 MATHEMATICAL RIGHT ANGLE BRACKET
>> *** Arabic Presentation Forms-B (Common)
>> : U+FEFF    1 ZERO WIDTH NO-BREAK SPACE
>> 
>> 
>> For comparison, an RFC out of the pre-v3 times:
>> 
>> $ echars rfc/rfc8265.txt
>> *** Basic Latin (Common)
>> "\f": U+000C   25 <control-000C>
>> *** Latin-1 Supplement
>> ¹: U+00B9    1 SUPERSCRIPT ONE (Common)
>> ß: U+00DF    3 LATIN SMALL LETTER SHARP S (Latin)
>> å: U+00E5    1 LATIN SMALL LETTER A WITH RING ABOVE (Latin)
>> *** Latin Extended-A (Latin)
>> ſ: U+017F    2 LATIN SMALL LETTER LONG S
>> *** Greek and Coptic (Greek)
>> Σ: U+03A3    2 GREEK CAPITAL LETTER SIGMA
>> π: U+03C0    2 GREEK SMALL LETTER PI
>> ς: U+03C2    2 GREEK SMALL LETTER FINAL SIGMA
>> σ: U+03C3    2 GREEK SMALL LETTER SIGMA
>> *** Ogham (Ogham)
>>  : U+1680    1 OGHAM SPACE MARK
>> *** Number Forms (Latin)
>> Ⅳ: U+2163    6 ROMAN NUMERAL FOUR
>> *** Mathematical Operators (Common)
>> ∞: U+221E    2 INFINITY
>> *** Miscellaneous Symbols (Common)
>> ♦: U+2666    1 BLACK DIAMOND SUIT
>> *** Alphabetic Presentation Forms (Latin)
>> fi: U+FB01    2 LATIN SMALL LIGATURE FI
>> *** Arabic Presentation Forms-B (Common)
>> : U+FEFF    1 ZERO WIDTH NO-BREAK SPACE
>> 
>> You can see the form feeds we used before RFC8650, as well as the 
>> dreadful BOM already (which is in the Arabic Presentation Forms-B 
>> block, in case you didn’t know that).
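For readers who want to reproduce a report of this shape without kramdown-rfc, here is a rough sketch in Python. It is not the `echars` implementation: the function names `block_of` and `report` are invented for this example, the block table is deliberately partial (the standard library has no block or script data, so only a few ranges from the Unicode block list are hardcoded), scripts are not reported at all, and only non-ASCII characters are counted, so form feeds would not show up.

```python
import unicodedata
from collections import Counter

# Partial block table for illustration only; a real tool would load
# the complete list from the Unicode Character Database.
BLOCKS = [
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2700, 0x27BF, "Dingbats"),
    (0x27C0, 0x27EF, "Miscellaneous Mathematical Symbols-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]

def block_of(cp):
    """Return the block name for a code point, if it is in the partial table."""
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "(block not in table)"

def report(text):
    """List each non-ASCII character with code point, count, and name."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    lines = []
    current = None
    for ch in sorted(counts):
        cp = ord(ch)
        block = block_of(cp)
        if block != current:
            lines.append(f"*** {block}")
            current = block
        name = unicodedata.name(ch, f"<unnamed-{cp:04X}>")
        lines.append(f"{ch}: U+{cp:04X} {counts[ch]:4d} {name}")
    return lines

# The leading U+FEFF plays the role of the BOM mentioned above.
for line in report("\ufeffGr\u00fc\u00dfe \u2794 \u2794"):
    print(line)
```

Running this on a string that starts with a BOM lists U+FEFF under Arabic Presentation Forms-B, which is where the block boundaries put it.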
>> 
>> _______________________________________________
>> Rfc-markdown mailing list
>> Rfc-markdown@ietf.org
>> https://www.ietf.org/mailman/listinfo/rfc-markdown