Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Paul Kyzivat <pkyzivat@alum.mit.edu> Sun, 28 February 2021 20:55 UTC

To: xml2rfc@ietf.org
From: Paul Kyzivat <pkyzivat@alum.mit.edu>
Message-ID: <d96fc964-f367-dc8f-bdf3-a76b90abd042@alum.mit.edu>
Date: Sun, 28 Feb 2021 15:55:50 -0500
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/RB7M00BDnlU2TrLc_j2UONkBzqI>

On 2/28/21 2:51 PM, Brian E Carpenter wrote:
> On 01-Mar-21 06:54, John R Levine wrote:
>>> Provided it doesn't also lose alternative Unicode whitespace characters,
>>> using &emsp; is an option. In a pinch we could have an element to mark
>>> the end of a sentence (<s/>).
>>
>> At the end of every sentence? That's, uh, quite a stretch. Are we sure
>> this problem is worth that much effort by every author?
> 
> Since we're designing on the hoof here, I suggest you'd need a construct like
> <literal value="Philip R. Zimmermann"/>.
> 
> But much simpler to scrap the double space rule.

Two things are being muddled here:

1) two spaces at end of sentences in .txt output;

2) how xml2rfc is to distinguish sentence endings in the xml input.

There has been *some* discussion of using two spaces in the input for 
(2), but it doesn't work that way now and there are many issues in 
changing it to work that way. It isn't evident to me that it is a 
serious proposal.

*If* we had a reliable method for (2) then I doubt there would be much 
issue with (1). The problem is that the existing method for (2) isn't 
reliable.
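
To make the unreliability concrete, here is a minimal Python sketch. It is
not xml2rfc's actual code: the rule (a period followed by whitespace and an
uppercase letter ends a sentence) and the function name are assumptions made
up purely for illustration. It shows how such a rule trips over the initial
in a name like the one Brian mentioned.

```python
import re


def naive_sentence_ends(text):
    """Return the index of every period a naive rule treats as a sentence end.

    Hypothetical rule, for illustration only: a period counts as a
    sentence end when it is followed by whitespace and an uppercase letter.
    """
    return [m.start() for m in re.finditer(r"\.(?=\s+[A-Z])", text)]


sample = "It was written by Philip R. Zimmermann. He released it in 1991."
ends = naive_sentence_ends(sample)

# Show the text leading up to each period the rule flagged.
for i in ends:
    print(repr(sample[: i + 1][-12:]))
```

The rule flags the period after the initial "R" as well as the real
sentence end after "Zimmermann", so a formatter relying on it would put
two spaces in the middle of the name.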

I haven't checked, but I presume the current problems with (2) are also
exhibited in html output.

ISTM that the real question is whether authors will be willing to 
manually annotate the xml input to indicate sentence endings. I haven't 
seen any proposal mentioned that I would willingly use on a regular 
basis. I would rather suffer with the existing heuristic.

	Thanks,
	Paul