Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Carsten Bormann <cabo@tzi.org> Sun, 28 February 2021 22:07 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/Jjd7xVnKk7v8VbWy4XdqiSmp5_8>

> 
> Two things are being muddled here:

Indeed.


> 1) two spaces at end of sentences in .txt output;

And 1a) how to render sentence spacing in the HTML (which, AFAIK, is not helping here).

A sentence spacing of two character positions has been the traditional rendering in plain text.  
That is suboptimal, of course (as many things are in monospaced environments).  


> 2) how to distinguish sentence endings by xml2rfc in xml input.
> 
> There has been *some* discussion of using two spaces in the input for (2), but it doesn't work that way now and there are many issues in changing it to work that way. It isn't evident to me that it is a serious proposal.

So this is about sentence detection, not about sentence spacing.
(Which in turn can make use of sentence spacing in the input, but that is orthogonal.)

I’m trying to understand why the traditional method doesn’t work.
With traditional keyboarding, where each sentence starts on a new input line, there is never any ambiguity, except that one has to be careful not to break the input line after a [.!?] that occurs within a sentence.

The question is whether there is a need to accommodate multiple sentences per input line for proper sentence detection.
People differ in their style here.
The newline after a sentence rule helps with version control as well, so I have a strong preference for that style.
But for people who prefer to keep going on the same input line, the two-space convention has been working well.
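
To make the two conventions concrete, here is a minimal sketch in Python of sentence detection that relies only on input layout. It is not xml2rfc code; the function name `split_sentences`, the `two_space` flag and the sample text are all made up for illustration. A sentence ends where [.!?] is followed by a line break or, optionally, by two or more spaces; a period followed by a single space never ends a sentence.

```python
import re

def split_sentences(text, two_space=True):
    """Split text into sentences using input layout only (hypothetical helper).

    A sentence ends at [.!?] followed by a line break, or (if two_space is
    set) by two or more spaces.  A single space never ends a sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # simplification: blank lines are just skipped
        # Split inside the line only at [.!?] followed by 2+ spaces.
        pieces = re.split(r'(?<=[.!?]) {2,}', line) if two_space else [line]
        for piece in pieces[:-1]:
            current.append(piece)
            sentences.append(' '.join(current))
            current = []
        last = pieces[-1]
        current.append(last)
        # The end of an input line ends a sentence only after [.!?].
        if last.endswith(('.', '!', '?')):
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

sample = (
    "Written by Philip R. Zimmermann in 1991.\n"
    "It was widely used.  It still is.\n"
    "See RFC 4880\n"
    "for details."
)

for s in split_sentences(sample):
    print(s)
```

With `two_space=False`, only line ends are recognized, so the second input line stays a single unit. In both modes "Philip R. Zimmermann" is left alone, but an input line that happened to end in "Philip R." would still be misread as a sentence end, which is the care mentioned above.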

XML may not "preserve whitespace", but what that exactly means here is not clear to me.
Double spaces in XML input are copied verbatim into the HTML (where they then are swallowed by the HTML processor), so it is not like the processor is not seeing them.
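
A quick way to check that an XML parser does hand double spaces through to the application is the sketch below. It uses Python's standard `xml.etree.ElementTree`, which is an assumption for illustration and says nothing about what xml2rfc's own pipeline does with the text afterwards; the `<t>` content is invented.

```python
import xml.etree.ElementTree as ET

# Parse a paragraph element whose text contains a double space.
t = ET.fromstring("<t>First sentence.  Second sentence.</t>")

# The text comes back with both spaces intact.
print(repr(t.text))
```

So the information is available at the parser level; whether and how a processor uses it is a separate decision.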

An easy fallback position is to no longer recognize sentence ends within an input line; that certainly solves the Philip R. Zimmermann issue (which shouldn’t be there as we have a <contact> element for natural names, but I digress).

> *If* we had a reliable method for (2) then I doubt there would be much issue with (1). The problem is that the existing method for (2) isn't reliable.

I’d say get rid of the heuristics.

But then, the thinking in the Python textwrap module that is being used here to do the heavy lifting is extremely confused, so I don't know whether that can be parameterized to be sane.
xml2rfc certainly tries...
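
For reference, this is what the stock heuristic looks like when called directly. The sketch uses only the standard library's `textwrap.fill` with `fix_sentence_endings=True`; how xml2rfc configures or modifies the module is not shown here, and the sample text is invented. The standard-library documentation describes the rule as a lowercase letter followed by one of [.!?] and a space, and itself notes that this cannot tell "Dr." from a real sentence end.

```python
import textwrap

text = "Dr. Smith reviewed the draft. It was approved."

# fix_sentence_endings turns the single space after an assumed
# sentence end into two spaces.
wrapped = textwrap.fill(text, width=72, fix_sentence_endings=True)
print(wrapped)
# -> Dr.  Smith reviewed the draft.  It was approved.
```

The real sentence end after "draft." gets its two spaces, but so does the abbreviation "Dr.".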

> I haven't checked, but I presume the current problems (2) are also exhibited in html output.

There is no attempt in xml2rfc that I can recognize to have sentence spacing in HTML output.

> ISTM that the real question is whether authors will be willing to manually annotate the xml input to indicate sentence endings.

The answer is: Absolutely not.

> I haven't seen any proposal mentioned that I would willingly use on a regular basis. I would rather suffer with the existing heuristic.

A heuristic causes people to want to work around its limitations.  
Adding a zero-width space after the end of a middle name is a simple workaround for the current heuristic, but, ugh.  
As long as the future of the <contact> element is unclear (and the grammar bugs around it aren’t fixed), I’d also not want to steer people towards that.
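
To show why the zero-width space workaround works (and why it is ugly), here is a small sketch against the standard library's textwrap heuristic, used as a stand-in; xml2rfc's actual heuristic may differ. Since the stock rule requires a lowercase letter before the period, the example uses "Dr." rather than a middle initial. The variable names are made up.

```python
import textwrap

ZWSP = "\u200b"  # ZERO WIDTH SPACE

plain = "Dr. Smith reviewed the draft."
patched = "Dr." + ZWSP + " Smith reviewed the draft."

# Without the workaround, the heuristic doubles the space after "Dr.".
plain_out = textwrap.fill(plain, fix_sentence_endings=True)

# With an invisible character between the period and the space,
# the word no longer ends in a period, so the heuristic does not fire.
patched_out = textwrap.fill(patched, fix_sentence_endings=True)

print(repr(plain_out))
print(repr(patched_out))
```

The output looks right, but the source now carries an invisible character that exists only to defeat the heuristic.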

Grüße, Carsten