Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Julian Reschke <julian.reschke@gmx.de> Mon, 01 March 2021 08:39 UTC

To: xml2rfc@ietf.org
References: <20210227191644.165F76F105E2@ary.qy> <28B528D6-7CBA-4735-A5EE-C7061D1C1D0C@tzi.org> <3dc1abe5-24bf-3b12-7b58-d06c7cde428e@taugh.com> <BBA9B16E-5B06-419D-9ABE-BFB7E69B54C9@tzi.org> <6603926-561f-c9b8-2612-2afb9847b71@taugh.com> <20210228173825.GE30153@localhost> <14ad2b3e-852a-28b1-27ae-5e25ec7823bc@taugh.com> <a7734631-a4f3-cee1-1ee7-e9e0bd3d534a@gmail.com> <d96fc964-f367-dc8f-bdf3-a76b90abd042@alum.mit.edu> <26DCBA0D-AA14-461F-9992-CC631774877E@tzi.org>
From: Julian Reschke <julian.reschke@gmx.de>
Message-ID: <45ca32a4-65df-7eea-84f0-b5451698a27b@gmx.de>
Date: Mon, 1 Mar 2021 09:33:57 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.8.0
MIME-Version: 1.0
In-Reply-To: <26DCBA0D-AA14-461F-9992-CC631774877E@tzi.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/yOeziVCq7fcU022P5-R-Th3KSMw>
Subject: Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

On 28.02.2021 at 23:07, Carsten Bormann wrote:
> ...
> So this is about sentence detection, not about sentence spacing.
> (Which in turn can make use of sentence spacing in the input, but that is orthogonal.)
>
> I’m trying to understand why the traditional method doesn’t work.
> There is never any ambiguity with traditional keyboarding (a new line starts after a sentence); except that one has to be careful not to do an input line-break after [.!?] that is within a sentence.
>
> The question is whether there is a need to accommodate multiple sentences per input line for proper sentence detection.
> People differ in their style here.
> The newline after a sentence rule helps with version control as well, so I have a strong preference for that style.
> But for people who like running on on the same input line, the two-space convention has been working well.
>
> XML may not "preserve whitespace", but what that exactly means here is not clear to me.

XML parsers *do* preserve whitespace in element contents.

The tricky part is changing *applications* that use XML from one content
model (whitespace is insignificant) to another model (whitespace is
"mostly" insignificant) after the fact.
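
To illustrate the difference between the two layers, here is a minimal
sketch using Python's standard xml.etree.ElementTree parser. The sample
input and the helper names (parsed_text, collapsed_text) are made up for
this illustration, and the collapsing step is only an assumption about
what a "whitespace is insignificant" application might do; it is not
taken from xml2rfc itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two spaces after the first sentence, then a line break.
SOURCE = "<t>First sentence.  Second sentence.\n    Third sentence.</t>"


def parsed_text(xml_source: str) -> str:
    """Return the root element's character data exactly as the parser reports it."""
    return ET.fromstring(xml_source).text


def collapsed_text(xml_source: str) -> str:
    """Assumed application-level step: collapse every whitespace run to one space."""
    return " ".join(parsed_text(xml_source).split())


# The parser hands over the double space and the newline untouched.
print(repr(parsed_text(SOURCE)))
# Only the application-level step discards them.
print(repr(collapsed_text(SOURCE)))
```

The parser output still contains the double space and the line break; any
loss happens in the application layer on top of it.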

I also fear that we'll end up with *other* edge cases then.

For instance, if I have this source:

<t>
    PGP is a family of software systems developed by Philip R.
    Zimmermann from which OpenPGP is based.
</t>

Is the "R." a sentence end?
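
As a rough sketch of why this is ambiguous, assume a detector that applies
the "a new line starts after a sentence" rule mechanically: any input line
ending in ".", "!" or "?" is treated as a sentence end. The function name
and the rule as coded here are my own simplification for illustration, not
the behaviour of any existing tool.

```python
import re

# Text content of the <t> element above, one entry per input line.
LINES = [
    "PGP is a family of software systems developed by Philip R.",
    "Zimmermann from which OpenPGP is based.",
]


def sentence_end_lines(lines):
    """Return indexes of lines that the end-of-line rule flags as sentence ends."""
    return [
        index
        for index, line in enumerate(lines)
        if re.search(r"[.!?]$", line.rstrip())
    ]


# Both lines are flagged, including the one that ends in the initial "R.".
print(sentence_end_lines(LINES))
```

With this rule the line ending in "R." is indistinguishable from a real
sentence end, so the author would have to re-wrap the input to avoid it.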

> Double spaces in XML input are copied verbatim into the HTML (where they then are swallowed by the HTML processor), so it is not like the processor is not seeing them.

But do they survive the preptool step? If so, that would IMHO be a bug.

> ...

Best regards, Julian