Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Brian E Carpenter <brian.e.carpenter@gmail.com> Sun, 28 February 2021 21:59 UTC

To: Paul Kyzivat <pkyzivat@alum.mit.edu>, xml2rfc@ietf.org
References: <20210227191644.165F76F105E2@ary.qy> <28B528D6-7CBA-4735-A5EE-C7061D1C1D0C@tzi.org> <3dc1abe5-24bf-3b12-7b58-d06c7cde428e@taugh.com> <BBA9B16E-5B06-419D-9ABE-BFB7E69B54C9@tzi.org> <6603926-561f-c9b8-2612-2afb9847b71@taugh.com> <20210228173825.GE30153@localhost> <14ad2b3e-852a-28b1-27ae-5e25ec7823bc@taugh.com> <a7734631-a4f3-cee1-1ee7-e9e0bd3d534a@gmail.com> <d96fc964-f367-dc8f-bdf3-a76b90abd042@alum.mit.edu>
From: Brian E Carpenter <brian.e.carpenter@gmail.com>
Message-ID: <3d0300d1-b9de-ffe6-7b87-6726ab6228cd@gmail.com>
Date: Mon, 1 Mar 2021 10:59:16 +1300
In-Reply-To: <d96fc964-f367-dc8f-bdf3-a76b90abd042@alum.mit.edu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/EtKhu6ITAx7T6JWdUCyQkwahzYk>

On 01-Mar-21 09:55, Paul Kyzivat wrote:
> On 2/28/21 2:51 PM, Brian E Carpenter wrote:
>> On 01-Mar-21 06:54, John R Levine wrote:
>>>> Provided it doesn't also lose alternative Unicode whitespace characters,
>>>> using &emsp; is an option. In a pinch we could have an element to mark
>>>> the end of a sentence (<s/>).
>>>
>>> At the end of every sentence? That's, uh, quite a stretch. Are we sure
>>> this problem is worth that much effort by every author?
>>
>> Since we're designing on the hoof here, I suggest you'd need a construct like
>> <literal value="Philip R. Zimmermann"/>.
>>
>> But much simpler to scrap the double space rule.
> 
> Two things are being muddled here:
> 
> 1) two spaces at end of sentences in .txt output;
> 
> 2) how to distinguish sentence endings by xml2rfc in xml input.
> 
> There has been *some* discussion of using two spaces in the input for 
> (2), but it doesn't work that way now and there are many issues in 
> changing it to work that way. It isn't evident to me that it is a 
> serious proposal.
> 
> *If* we had a reliable method for (2) then I doubt there would be much 
> issue with (1). The problem is that the existing method for (2) isn't 
> reliable.
> 
> I haven't checked, but I presume the current problems with (2) are also
> exhibited in html output.

Why would they be? The html format has single spaces. (Just checked in
RFC8981, which was announced half an hour ago.)

> ISTM that the real question is whether authors will be willing to 
> manually annotate the xml input to indicate sentence endings. I haven't 
> seen any proposal mentioned that I would willingly use on a regular 
> basis. I would rather suffer with the existing heuristic.

But much simpler to scrap the double space rule.
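
To make the failure concrete, here is a minimal sketch in Python of the
kind of heuristic under discussion. It is an assumption for illustration
only, not xml2rfc's actual code: the function name naive_double_space and
the regular expression are made up. The rule treats ".", "!" or "?"
followed by whitespace and an uppercase letter as a sentence ending, and
emits two spaces there.

```python
import re

# Hypothetical illustration only; this is NOT xml2rfc's actual logic.
# Naive rule: ".", "!" or "?" followed by whitespace and an uppercase
# letter is taken to end a sentence, so the gap becomes two spaces.
_SENTENCE_END = re.compile(r"([.!?])\s+(?=[A-Z])")


def naive_double_space(text: str) -> str:
    """Put two spaces after every apparent sentence ending."""
    return _SENTENCE_END.sub(r"\1  ", text)


sample = "The key belongs to Philip R. Zimmermann. It is used below."
print(naive_double_space(sample))
```

The initial in the name is indistinguishable from a real sentence ending
under this rule, so "R. Zimmermann" gets the double space too. Any rule
based only on the neighbouring characters has the same blind spot.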

    Brian