Re: [rfc-i] entities and unicode

Carsten Bormann <cabo@tzi.org> Fri, 03 December 2021 10:37 UTC

On 2021-12-03, at 11:17, Miek Gieben <miek@miek.nl> wrote:
> 
> Hello all,
> 
> In https://www.rfc-editor.org/materials/FAQ-xml2rfcv3.html it says I need to wrap unicode
> characters in <u> tags (which is already a bit confusing:
> https://github.com/rfc-format/draft-iab-xml2rfc-v3-bis/issues/205).
> 
> Due to some other bug, I was testing (html) entities and if I put:
> 
>    <t>this is some dashes &#x2011;</t>
> 
> In the XML, xml2rfc --text renders the "-" (but then the proper unicode one) in the text
> document... which now puzzles me.

Can’t reproduce (xml2rfc 3.11.1).

The v2 support in xml2rfc turns the en-dash into a single hyphen/minus and the em-dash into a double hyphen/minus. Umlauts turn into language-specific substitutes, e.g., Keränen turns into Keraenen (great for German, ouch for Finnish).

But if I put these characters into a v3 source, they are all converted into a single hyphen/minus during rendering.
(Umlauts turn into a beautiful literal &#228; in the plaintext and HTML output [the latter with &amp; in the HTML source], so Keränen comes out as Ker&#228;nen. I’d rather have xml2rfc turn into green smoke here.)
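
To make the two outcomes described above concrete, here is a minimal Python sketch. It is not xml2rfc's code: the mapping table and both function names are hypothetical, and they only mirror the substitutions reported in this message (dashes to hyphen/minus, German-style umlaut transliteration, and non-ASCII characters leaking out as numeric character references).

```python
# Illustrative only: this is NOT xml2rfc's implementation. The table and
# function names are hypothetical and mirror the behaviour described above.
V2_STYLE_SUBSTITUTES = {
    "\u2013": "-",    # en-dash -> single hyphen/minus
    "\u2014": "--",   # em-dash -> double hyphen/minus
    "\u00e4": "ae",   # German-style umlaut transliteration
    "\u00f6": "oe",
    "\u00fc": "ue",
}


def downgrade_v2_style(text: str) -> str:
    """Replace each mapped character with its ASCII substitute."""
    return "".join(V2_STYLE_SUBSTITUTES.get(ch, ch) for ch in text)


def as_char_refs(text: str) -> str:
    """Render every non-ASCII character as a decimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


print(downgrade_v2_style("Ker\u00e4nen"))  # Keraenen
print(as_char_refs("Ker\u00e4nen"))        # Ker&#228;nen
```

The first function shows why a single fixed transliteration table suits German names but not Finnish ones; the second reproduces the kind of output that should arguably be an error rather than rendered text.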

> Is <u> really needed? Or are entities not allowed? Or something else that I'm not seeing?

There is also <contact>, which the grammar unfortunately breaks in most contexts, but which does work in plain paragraphs.

https://mailarchive.ietf.org/arch/msg/rfc-markdown/7spMIPZdz6S8_NbZZ2WcSB-ScPY

Grüße, Carsten
