Re: [Rfc-markdown] [xml2rfc] [irsg] character sets, was UPDATE regarding <u>
Carsten Bormann <cabo@tzi.org> Sun, 05 March 2023 23:55 UTC
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.7\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <a39b8c32-6f4f-4caf-8400-1846ea25faa2@betaapp.fastmail.com>
Date: Mon, 06 Mar 2023 00:55:47 +0100
Cc: rfc-markdown@ietf.org
X-Mao-Original-Outgoing-Id: 699753347.8447371-ae5160646e4781813cfea620f3e285da
Content-Transfer-Encoding: quoted-printable
Message-Id: <940B4C2A-9253-4E05-AF01-0BA123BAE072@tzi.org>
References: <20230304190316.05346A51F3D2@ary.qy> <5081F069-705D-4707-85EB-DBA11D594D19@tzi.org> <a39b8c32-6f4f-4caf-8400-1846ea25faa2@betaapp.fastmail.com>
To: Martin Thomson <mt@lowentropy.net>
X-Mailer: Apple Mail (2.3608.120.23.2.7)
Archived-At: <https://mailarchive.ietf.org/arch/msg/rfc-markdown/9CtdtiFztfaJ_5llrPW8KwaUg50>
Subject: Re: [Rfc-markdown] [xml2rfc] [irsg] character sets, was UPDATE regarding <u>
X-BeenThere: rfc-markdown@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: "rfc-markdown is a discussion list for people writing I-Ds and RFCs in Markdown and the authors of the tools used for that." <rfc-markdown.ietf.org>
On 2023-03-06, at 00:42, Martin Thomson <mt@lowentropy.net> wrote:
>
> Carsten, have you considered how the report might differ for <artwork>? I see a few documents with a lot of symbols in artwork, which might make the report noisier than is ideal.

I’m not sure -- in the end these characters also must be in xml2rfc's PDF repertoire.

> That would maybe imply working from XML, which has some interesting implications for those characters that xml2rfc treats specially, like non-breaking spaces and zero-width spaces of different types.

I think it would be interesting to create a variety of targeted text extractors for RFCXML.
(Note that RFCXML does store a lot of text in attributes, which therefore also need to be extracted and sorted into the buckets we want to have.)

Grüße, Carsten

>
> On Mon, Mar 6, 2023, at 00:19, Carsten Bormann wrote:
>> On 2023-03-04, at 20:03, John Levine <johnl@taugh.com> wrote:
>>>
>>> One issue is our policy about where any non-ASCII goes, but a separate
>>> issue that Carsten has run into is exotic characters beyond the 3000
>>> or so that are in the fonts we normally use.
>>
>> Kramdown-rfc 1.6.26 now has an `echars` utility that is more talkative
>> about unicode blocks and unicode scripts. Examples below. I hope this
>> is useful in our quest to rescue these Unicode arcana from being held
>> captive by Unicode super-geeks.
>>
>> Grüße, Carsten
>>
>>
>> This is the RFC-to-be where RPC happened to pick a character from the
>> Dingbats block that started this discussion:
>>
>> $ echars rfc/authors/rfc9340.txt
>> *** Latin-1 Supplement (Latin)
>> ß: U+00DF  1 LATIN SMALL LETTER SHARP S
>> á: U+00E1  3 LATIN SMALL LETTER A WITH ACUTE
>> ä: U+00E4  1 LATIN SMALL LETTER A WITH DIAERESIS
>> é: U+00E9  5 LATIN SMALL LETTER E WITH ACUTE
>> ó: U+00F3  1 LATIN SMALL LETTER O WITH ACUTE
>> ø: U+00F8  1 LATIN SMALL LETTER O WITH STROKE
>> ü: U+00FC  7 LATIN SMALL LETTER U WITH DIAERESIS
>> *** Latin Extended-A (Latin)
>> ć: U+0107  2 LATIN SMALL LETTER C WITH ACUTE
>> č: U+010D  2 LATIN SMALL LETTER C WITH CARON
>> ę: U+0119  1 LATIN SMALL LETTER E WITH OGONEK
>> ł: U+0142  2 LATIN SMALL LETTER L WITH STROKE
>> š: U+0161  1 LATIN SMALL LETTER S WITH CARON
>> *** General Punctuation (Common)
>> –: U+2013  1 EN DASH
>> *** Dingbats (Common)
>> ➔: U+2794  2 HEAVY WIDE-HEADED RIGHTWARDS ARROW
>> *** Miscellaneous Mathematical Symbols-A (Common)
>> ⟩: U+27E9 61 MATHEMATICAL RIGHT ANGLE BRACKET
>> *** Arabic Presentation Forms-B (Common)
>> : U+FEFF  1 ZERO WIDTH NO-BREAK SPACE
>>
>>
>> For comparison, an RFC out of the pre-v3 times:
>>
>> $ echars rfc/rfc8265.txt
>> *** Basic Latin (Common)
>> "\f": U+000C 25 <control-000C>
>> *** Latin-1 Supplement
>> ¹: U+00B9  1 SUPERSCRIPT ONE (Common)
>> ß: U+00DF  3 LATIN SMALL LETTER SHARP S (Latin)
>> å: U+00E5  1 LATIN SMALL LETTER A WITH RING ABOVE (Latin)
>> *** Latin Extended-A (Latin)
>> ſ: U+017F  2 LATIN SMALL LETTER LONG S
>> *** Greek and Coptic (Greek)
>> Σ: U+03A3  2 GREEK CAPITAL LETTER SIGMA
>> π: U+03C0  2 GREEK SMALL LETTER PI
>> ς: U+03C2  2 GREEK SMALL LETTER FINAL SIGMA
>> σ: U+03C3  2 GREEK SMALL LETTER SIGMA
>> *** Ogham (Ogham)
>> : U+1680  1 OGHAM SPACE MARK
>> *** Number Forms (Latin)
>> Ⅳ: U+2163  6 ROMAN NUMERAL FOUR
>> *** Mathematical Operators (Common)
>> ∞: U+221E  2 INFINITY
>> *** Miscellaneous Symbols (Common)
>> ♦: U+2666  1 BLACK DIAMOND SUIT
>> *** Alphabetic Presentation Forms (Latin)
>> ﬁ: U+FB01  2 LATIN SMALL LIGATURE FI
>> *** Arabic Presentation Forms-B (Common)
>> : U+FEFF  1 ZERO WIDTH NO-BREAK SPACE
>>
>> You can see the form feeds we used before RFC8650, as well as the
>> dreadful BOM already (which is in the Arabic Presentation Forms-B
>> block, in case you didn’t know that).
>>
>> _______________________________________________
>> Rfc-markdown mailing list
>> Rfc-markdown@ietf.org
>> https://www.ietf.org/mailman/listinfo/rfc-markdown
>
> _______________________________________________
> Rfc-markdown mailing list
> Rfc-markdown@ietf.org
> https://www.ietf.org/mailman/listinfo/rfc-markdown
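
The "targeted text extractor" idea from the message above could look roughly like the sketch below. This is not how `echars` or xml2rfc work; it is a minimal Python illustration that assumes three hypothetical buckets ("artwork", "attribute", "text"), treats `<artwork>` and `<sourcecode>` alike, and uses a made-up sample document. It reports code point, count, and character name only, since Python's standard library does not expose Unicode block or script data.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Assumption: elements whose content counts as "artwork" for the report.
ARTWORK_TAGS = {"artwork", "sourcecode"}


def bucket_text(xml_source):
    """Sort the text of an RFCXML document into hypothetical buckets."""
    buckets = defaultdict(list)

    def walk(elem, in_artwork):
        in_artwork = in_artwork or elem.tag in ARTWORK_TAGS
        name = "artwork" if in_artwork else "text"
        # Attribute values carry text too, so they get a bucket of their own.
        buckets["attribute"].extend(elem.attrib.values())
        if elem.text:
            buckets[name].append(elem.text)
        for child in elem:
            walk(child, in_artwork)
            # Text following a child belongs to the enclosing element.
            if child.tail:
                buckets[name].append(child.tail)

    walk(ET.fromstring(xml_source), False)
    return {key: "".join(parts) for key, parts in buckets.items()}


def non_ascii_report(text):
    """Return (code point, count, name) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    return [
        (f"U+{ord(ch):04X}", n, unicodedata.name(ch, "<unnamed>"))
        for ch, n in sorted(counts.items())
    ]


# Made-up sample, not taken from any real RFC.
SAMPLE = (
    '<rfc><front><author fullname="Jörg Example"/></front>'
    "<middle><section><t>An arrow ➔ in prose.</t>"
    "<artwork>|a⟩ + |b⟩</artwork></section></middle></rfc>"
)

if __name__ == "__main__":
    for bucket, text in sorted(bucket_text(SAMPLE).items()):
        print(f"*** {bucket}")
        for code_point, count, name in non_ascii_report(text):
            print(f"{code_point} {count:3d} {name}")
```

Keeping artwork in its own bucket would let a report show or suppress it separately, which is one possible answer to the noise concern raised about `<artwork>`.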