Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Paul Kyzivat <pkyzivat@alum.mit.edu> Mon, 01 March 2021 04:58 UTC

To: Brian E Carpenter <brian.e.carpenter@gmail.com>, xml2rfc@ietf.org
References: <20210227191644.165F76F105E2@ary.qy> <28B528D6-7CBA-4735-A5EE-C7061D1C1D0C@tzi.org> <3dc1abe5-24bf-3b12-7b58-d06c7cde428e@taugh.com> <BBA9B16E-5B06-419D-9ABE-BFB7E69B54C9@tzi.org> <6603926-561f-c9b8-2612-2afb9847b71@taugh.com> <20210228173825.GE30153@localhost> <14ad2b3e-852a-28b1-27ae-5e25ec7823bc@taugh.com> <a7734631-a4f3-cee1-1ee7-e9e0bd3d534a@gmail.com> <d96fc964-f367-dc8f-bdf3-a76b90abd042@alum.mit.edu> <3d0300d1-b9de-ffe6-7b87-6726ab6228cd@gmail.com>
From: Paul Kyzivat <pkyzivat@alum.mit.edu>
Message-ID: <05b3065e-737d-e282-15f8-8327617a001d@alum.mit.edu>
Date: Sun, 28 Feb 2021 23:58:31 -0500
In-Reply-To: <3d0300d1-b9de-ffe6-7b87-6726ab6228cd@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/RqDr7ZzI9UGPXt6w9XaOEELoI08>

On 2/28/21 4:59 PM, Brian E Carpenter wrote:
> On 01-Mar-21 09:55, Paul Kyzivat wrote:
>> On 2/28/21 2:51 PM, Brian E Carpenter wrote:
>>> On 01-Mar-21 06:54, John R Levine wrote:
>>>>> Provided it doesn't also lose alternative Unicode whitespace characters,
>>>>> using &emsp; is an option. In a pinch we could have an element to mark
>>>>> the end of a sentence (<s/>).
>>>>
>>>> At the end of every sentence? That's, uh, quite a stretch. Are we sure
>>>> this problem is worth that much effort by every author?
>>>
>>> Since we're designing on the hoof here, I suggest you'd need a construct like
>>> <literal value="Philip R. Zimmermann"/>.
>>>
>>> But much simpler to scrap the double space rule.
>>
>> Two things are being muddled here:
>>
>> 1) two spaces at end of sentences in .txt output;
>>
>> 2) how to distinguish sentence endings by xml2rfc in xml input.
>>
>> There has been *some* discussion of using two spaces in the input for
>> (2), but it doesn't work that way now and there are many issues in
>> changing it to work that way. It isn't evident to me that it is a
>> serious proposal.
>>
>> *If* we had a reliable method for (2) then I doubt there would be much
>> issue with (1). The problem is that the existing method for (2) isn't
>> reliable.
>>
>> I haven't checked, but I presume the current problems with (2) are also
>> exhibited in html output.
> 
> Why would they be? The html format has single spaces. (Just checked in
> RFC8981, which was announced half an hour ago.)

I didn't check, but I gathered from the discussion that the html output
had wide spaces of some sort between sentences. That would seem to be the
analog of two spaces in txt. And if we don't expect that in html, then
there is even less justification for the two spaces in txt.

>> ISTM that the real question is whether authors will be willing to
>> manually annotate the xml input to indicate sentence endings. I haven't
>> seen any proposal mentioned that I would willingly use on a regular
>> basis. I would rather suffer with the existing heuristic.
> 
> But much simpler to scrap the double space rule.

Yes.
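
For illustration only, the kind of failure discussed in this thread can be
reproduced with a naive rule. The sketch below is not xml2rfc's actual code;
the function name and the regular expression are hypothetical, and it assumes
the simplest possible heuristic: every period followed by a single space is
treated as a sentence ending and padded to two spaces.

```python
import re


def double_space_after_periods(text: str) -> str:
    """Hypothetical naive heuristic, not xml2rfc's implementation.

    Every period followed by exactly one space and then a non-space
    character is assumed to end a sentence, so the space is doubled.
    """
    return re.sub(r"\. (?=\S)", ".  ", text)


sample = "Ask Philip R. Zimmermann. Then proceed."
# The initial "R." is wrongly treated as a sentence ending,
# so it also receives a double space.
print(double_space_after_periods(sample))
```

The rule has no way to tell the initial in "Philip R. Zimmermann" from a real
sentence ending, which is why any fix needs either author annotation in the
input or no double spacing at all.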