[Rfc-markdown] The <tt> train wreck

Carsten Bormann <cabo@tzi.org> Fri, 13 August 2021 23:13 UTC

From: Carsten Bormann <cabo@tzi.org>
Date: Sat, 14 Aug 2021 01:13:32 +0200
Cc: rfc-markdown@ietf.org
Message-Id: <04BFB6A7-7601-409D-8101-237242F6F38A@tzi.org>
To: rfc-interest@rfc-editor.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/rfc-markdown/sJhKqGSnSyG85JcV_gXKvgT_uII>
Subject: [Rfc-markdown] The <tt> train wreck

The original RFCXML RFC 2629 did not have any elements to indicate emphasis (often rendered as italic/oblique and/or bold type).  More normative in practice, XML2RFCv1 had »<spanx style=«, which provided text spans with the following properties:

| style   | nobreak | decorator | type      | tt |
|---------|---------|-----------|-----------|----|
| emph    |         | _         | em        | -  |
| strong  |         | *         | strong    | -  |
| nobreak | x       | none      | -         | -  |
| vbare   | x       | none      | -         | x  |
| verb    | x       | "         | -         | x  |
| vemph   | x       | _         | em        | x  |
| vstrong | x       | *         | strong    | x  |
| vdeluxe | x       | *_ and _* | em strong | x  |

The “style” column is the style attribute that could be given to spanx.

The “nobreak” column indicates whether non-breaking semantics were intended (trying harder to keep pieces together around /@&|-+#%: characters).

For TXT, the “decorator” column indicates which characters are used _around_ the span in plaintext rendering.  Note that only “nobreak” and “vbare” had no decorator.  Using »"« as a decorator for “verb” certainly was an unloved compromise that did, however, work well enough.
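
As a rough illustration (not XML2RFCv1's actual code), the decorator column can be read as a lookup table.  The names `SPANX_TXT_DECORATORS` and `render_spanx_txt` are made up for this sketch, and line-breaking behaviour is deliberately left out:

```python
# Plaintext decorators per spanx style, transcribed from the table above.
# Each value is (opening, closing); "none" in the table becomes empty strings.
SPANX_TXT_DECORATORS = {
    "emph": ("_", "_"),
    "strong": ("*", "*"),
    "nobreak": ("", ""),
    "vbare": ("", ""),
    "verb": ('"', '"'),
    "vemph": ("_", "_"),
    "vstrong": ("*", "*"),
    "vdeluxe": ("*_", "_*"),
}


def render_spanx_txt(style: str, text: str) -> str:
    """Wrap text in the plaintext decorator for the given spanx style."""
    opening, closing = SPANX_TXT_DECORATORS[style]
    return f"{opening}{text}{closing}"
```

With this table, `render_spanx_txt("verb", "date-time")` yields the quoted form, while `render_spanx_txt("vbare", "date-time")` leaves the text bare.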

For HTML, I note that the nobreak functionality was commented out in my version of XML2RFCv1 (there seemed to be no easy way to translate it into HTML at the time).  The columns “type” and “tt” indicate the font to be used in HTML: “type” provides the variation of the base font, and “tt” indicates whether the base font is monospaced or not.
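
The “type” and “tt” columns can be sketched the same way.  This is only an assumption about how such a mapping might look: the element nesting order, the use of a literal `tt` element, and the names `SPANX_HTML` and `render_spanx_html` are not taken from XML2RFCv1.  The nobreak column is ignored, in line with it having been commented out for HTML:

```python
from html import escape

# (font-variation elements, monospaced?) per spanx style, from the table above.
SPANX_HTML = {
    "emph": (["em"], False),
    "strong": (["strong"], False),
    "nobreak": ([], False),
    "vbare": ([], True),
    "verb": ([], True),
    "vemph": (["em"], True),
    "vstrong": (["strong"], True),
    "vdeluxe": (["em", "strong"], True),
}


def render_spanx_html(style: str, text: str) -> str:
    """Wrap escaped text in the elements implied by the style (assumed nesting)."""
    variations, monospaced = SPANX_HTML[style]
    tags = (["tt"] if monospaced else []) + variations
    html = escape(text)
    # Wrap from the innermost element outwards so that tags[0] ends up outermost.
    for tag in reversed(tags):
        html = f"<{tag}>{html}</{tag}>"
    return html
```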

Much of this functionality was known only to people who actually looked into the source code of XML2RFC.  I haven’t checked XML2RFCv2, but it seems that some of these features, which should have survived into v2, have decayed.  Since almost nobody cared about the HTML renderings (they were irrelevant for actual RFC publishing), I’m not sure that full support was checked extensively; XML2RFCv1 remained available and could be used by authors who did need the full functionality.

Enter v3.

I can’t find any horizontal non-breaking support (except for that which Unicode provides).  Inexplicably, whole-document options like --table-hyphen-breaks were introduced to create tweaks that should have been applied to specific items.

Emph and strong were finally spread out into their own elements, <em and <strong.
These can be combined with each other and with monospaced font selection, so the latter was turned into its own element, <tt.
As <spanx style=“verb”> was the only supported form of monospacing in xml2rfcv2 at the time, <tt putatively was its replacement.

So we have the last two columns of my table above covered, but not the second and the third.

Little thought was wasted on the plaintext rendering of these new span elements; after all, XML2RFCv2 had emph, strong, and verb, and these seemed to work.
In fact, the de-facto v3 manual https://xml2rfc.tools.ietf.org/xml2rfc-doc.html to this day doesn’t mention plaintext rendering of these elements.

So, since RFC8650, v3 documents have been keyboarded and proof-read under the assumption that <em, <strong, and <tt are the replacements of the spanx styles “emph”, “strong”, and “verb”.

Note that these span elements not only set the font for each of the characters in the span; they also have delimiting semantics.  This is obvious when decorators are used in the plaintext form, but it also holds on the HTML side: the <tt element is rendered as a separate HTML element, whose CSS styles make sure that <tt>a </tt> (note the space after the a) looks different from <tt>a</tt>.

The »"« decorator for what is now <tt continues to be unloved.
Since 2020-06-20, there has been some weird code in XML2RFCv3 that guesses that these decorators are unwanted in certain table contexts; the fact that this code tends to guess wrong (of course!) has already led to a bug report [0].
Apparently this will be fixed by removing the decorators entirely [1].

So, after 400+ RFCs have been published under the assumption of (and proofread against) decorators enabling understanding the plaintext rendering of <tt, the meaning of <tt will be retroactively changed from <spanx style=“verb”> to <spanx style=“vbare”> (which hasn’t even been available for the decade most people used XML2RFCv2).

I have pointed out why the decorators are needed in certain cases [2].
As several people point out, there are also cases where the <tt decorators are unneeded or even somewhat ugly.
Which of these is the case depends on the authors’ intent with the <tt.
As that is not captured (as it used to be in verb vs. vbare), there is no hope to get this decision right in the two formatters, TXT and HTML/PDF.

TL;DR:

The decision to always remove the decorations from the TXT rendering of <tt> is wrong.
This is because unfortunately <tt> is broken, i.e., ill-conceived, for its application.
As this has been enshrined, there can be no backward-compatible “right thing”, only repairs going forward.
(The decision also points out that the way we currently reach these decisions is broken; maybe we can revisit that particular point when we come up with a new decision structure.)

Grüße, Carsten

PS.: I have CCed this to rfc-markdown because there also is no good way in markdown to keyboard the distinction between a decorative <tt that can be ignored in plaintext and one that *needs* a representation (“decorators”):  Like this part of XML2RFCv3, markdown also was not designed to format into plaintext.  It would be nice if there were a way to create a workaround from kramdown-rfc, but the problem is that the single piece of XML that kramdown-rfc puts out needs to generate both the TXT and the HTML/PDF version, and there is no way in RFCXMLv3 to indicate the variant processing needed for TXT.

[0]: <https://mailarchive.ietf.org/arch/msg/xml2rfc/30tTnMMcJHCIH8t8-s_NVLZYrFg>
[1]: https://trac.ietf.org/trac/xml2rfc/ticket/600
[2]: E-mail that unfortunately only went to a subgroup of people discussing the issue and therefore isn’t archived; reproduced below.  It first discussed the need to resurrect some non-breaking semantics, and then (search for “txtquotes”) discussed the need for the author to actively choose between decorated (»txtquotes=“true”«) and undecorated (»txtquotes=“false”«) variants.

> On 7/8/21 1:33 AM, Carsten Bormann wrote:
>> I probably should add that on the authoring side, a variant of <code> with non-breaking semantics is needed.
>> But that can be done in the authoring tool (kramdown-rfc) by transliterating space, hyphen etc. into their non-breaking equivalents, so it probably doesn’t need support from xml2rfc.
[…]
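
The transliteration mentioned in the quoted message could look roughly like the following.  This is a minimal sketch, not kramdown-rfc's actual code: the function name `make_non_breaking` is hypothetical, and only the two characters named explicitly (space and hyphen) are covered, since the “etc.” is not spelled out.

```python
# Map breakable characters to their Unicode non-breaking equivalents.
NON_BREAKING = str.maketrans({
    " ": "\u00a0",  # NO-BREAK SPACE
    "-": "\u2011",  # NON-BREAKING HYPHEN
})


def make_non_breaking(text: str) -> str:
    """Return text with spaces and hyphens replaced by non-breaking forms."""
    return text.translate(NON_BREAKING)
```

Applied to `reg-name`, this produces the same character that the `&nbhy;` entity stands for in the XML snippets below.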

I ran into this requirement (non-breaking semantics) when I converted the XML for RFC 6125, where the RFC-editor (I assume) had converted some, but not all, hyphens in quoted syntax snippets into non-breaking hyphens.
(Note that they didn’t bother with the then-clumsy <spanx style=“verb”> but directly put in the quotes that it would have generated anyway, as that is equivalent for TXT-only production.)

 (or its equivalent) containing a "reg&nbhy;name".  (Matching only the
 "reg&nbhy;name" rule from <xref target='URI'/> limits verification to DNS
 domain names, thereby differentiating a URI&nbhy;ID from a

 A certificate for this service might include SRV-IDs of
 "_xmpp&nbhy;client.im.example.org" and "_xmpp&nbhy;server.im.example.org"
 (see <xref target='XMPP'/>), a DNS-ID of "im.example.org", and an XMPP-specific

Clearly, those quotes were desired in the TXT output here (but, I think, would not have been in the HTML), so these all would be specified as <tt txtquotes=“true”> in my imagined new world.

On the matter of my suggested txtquotes bit, I also just looked into RFC 8949, because I’m familiar with it and it is a non-trivial document.

The instances of <tt>false… (true, …) would actually improve with txtquotes=“false”.

The instances of <tt>0 and <tt>0.0 also work with txtquotes=“false”.

The reference to

 the <tt>date-time</tt> production in <xref target="RFC3339" 

can be understood either way, but for txtquotes=“false” that understandability hinges on the production having a nominal name; if it were <tt txtquotes=“false”>second</tt>, this would not work at all.

Similar with

 doesn't match the <tt>URI-reference</tt> production, the string is invalid.</li>

Neutral for

 the encoded text string <tt>0x62c0ae</tt>

Getting more of a problem:

 interested in this information.  For example, <tt>_</tt> or <tt>_3</tt>.

Completely broken:

     <t indent="0" pn="section-appendix.c-5">Note that <tt>well_formed</tt>
     returns the major type for well-formed

Note that one is the function name and the other one is the property defined in this RFC.
(Yes, in this case the reader can guess which is which by one being snake_case and the other kebab-case.  But ouch.)
So to make the TXT acceptable, the XML and thus the HTML would need to be changed here.

In RFCXMLv2 times, authors could decide whether they wanted TXT quotes and just leave the decoration off if they didn’t.  But with better styling available on the HTML side, they’ll want to switch on <tt> and will sometimes regret it if those semantics are suppressed on the TXT side.

By the way, that same “switch off the fallback” bit would tremendously improve <em> and <strong> as well in certain cases:

  *  *SHA-384* and *SHA-512* hash functions are efficient for 64-bit
     hardware.

(I’ve seen much worse, but can’t find an example of that right now.  I did see one in the course of today!)

So maybe this would be ignore-in-plain-text=“true” instead of txtquotes=“false”.
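
To make the proposal concrete, here is a minimal sketch of how a TXT formatter might honour such a bit.  Everything in it is hypothetical: `txtquotes` does not exist in RFCXMLv3, the function name `paragraph_to_txt` is made up, and defaulting to decorated output when the attribute is absent is an assumption chosen to match the behaviour documents have been proofread against.

```python
import xml.etree.ElementTree as ET


def paragraph_to_txt(xml_fragment: str) -> str:
    """Render a single paragraph element to plaintext.

    <tt> spans get quote decorators unless txtquotes="false" is given
    (hypothetical attribute; absent means "true" in this sketch).
    """
    root = ET.fromstring(xml_fragment)
    parts = [root.text or ""]
    for child in root:
        inner = "".join(child.itertext())
        if child.tag == "tt" and child.get("txtquotes", "true") == "true":
            inner = f'"{inner}"'
        parts.append(inner)
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)
```

With this, an author could leave `<tt txtquotes="false">date-time</tt>` undecorated while keeping the quotes around `<tt>well_formed</tt>`, where the delimiters carry meaning.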

Grüße, Carsten
