Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

John R Levine <johnl@taugh.com> Sun, 28 February 2021 17:14 UTC

Date: 28 Feb 2021 12:14:27 -0500
Message-ID: <6603926-561f-c9b8-2612-2afb9847b71@taugh.com>
From: "John R Levine" <johnl@taugh.com>
To: "Carsten Bormann" <cabo@tzi.org>
Cc: xml2rfc@ietf.org
In-Reply-To: <BBA9B16E-5B06-419D-9ABE-BFB7E69B54C9@tzi.org>
References: <20210227191644.165F76F105E2@ary.qy> <28B528D6-7CBA-4735-A5EE-C7061D1C1D0C@tzi.org> <3dc1abe5-24bf-3b12-7b58-d06c7cde428e@taugh.com> <BBA9B16E-5B06-419D-9ABE-BFB7E69B54C9@tzi.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="0-1810996299-1614532467=:97350"
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/rdb2vE7I50caXDX4zcQxDR9fgz4>
Subject: Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

>> Having been through the publishing process in a lot of books, I can report that no matter how good your tools are, the only way to typeset stuff of professional quality is to do hand tweaks where the tools don't get it quite right.  For a bunch of reasons we have decided we're not doing that and I would prefer not to say oh, but THIS tweak is worth it.
>
> For properly doing sentence spacing, what is needed is a way to signal sentence ends.
> For 50 years, the convention in keyboarding manuscripts has been that dots at the end of the input line and dots followed by two spaces (here we are actually using two spaces — in the manuscript!) are periods (i.e., sentence ends).
> That works exceedingly well.

I don't see how that would work in xml2rfc, where <t> elements don't 
preserve spacing.
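
To illustrate the problem, here is a minimal Python sketch. It assumes 
the content of a <t> element is whitespace-normalized roughly the way 
collapse_whitespace() does below; that function and 
count_sentence_ends() are hypothetical helpers, not anything in 
xml2rfc. The regex encodes the quoted convention: a dot followed by 
two spaces, or a dot at the end of an input line, is a sentence end.

```python
import re

# Convention from the quoted message: a dot followed by two spaces, or a
# dot at the end of an input line, marks a sentence end.
SENTENCE_END = re.compile(r"\.(?:  |$)", re.MULTILINE)


def count_sentence_ends(text: str) -> int:
    """Count the dots that the two-space / end-of-line convention marks."""
    return len(SENTENCE_END.findall(text))


def collapse_whitespace(text: str) -> str:
    """Assumed stand-in for normalization that does not preserve spacing."""
    return " ".join(text.split())


manuscript = "Thanks to Philip R. Zimmermann.  He wrote PGP.\nIt is widely used."

print(count_sentence_ends(manuscript))                       # 3
print(count_sentence_ends(collapse_whitespace(manuscript)))  # 1
```

Once the runs of spaces and the line break are collapsed, only the 
dot at the very end still matches, so the other two sentence ends 
can no longer be told apart from the dot in "R.".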

> The discussion came up because xml2rfc treated the dot in “Philip R. Zimmermann” as a sentence end.
> This is a mere bug, and bugs can be fixed.

It is a hard bug to fix.  You can add a heuristic not to treat a single 
letter followed by a dot as a sentence end, but that is sometimes wrong, 
since the single-letter word "I" occasionally ends a sentence.
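
A rough Python sketch of that heuristic, to show where it breaks. 
This is not how xml2rfc does it; split_sentences() and the CANDIDATE 
pattern are made up for illustration, and the sketch assumes a 
sentence end is a dot followed by whitespace and an uppercase letter.

```python
import re

# Candidate sentence end: a word, a dot, whitespace, then an uppercase letter.
CANDIDATE = re.compile(r"(\w+)\.\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    """Split on candidate dots, skipping dots that follow a single letter."""
    sentences = []
    start = 0
    for m in CANDIDATE.finditer(text):
        if len(m.group(1)) == 1:
            # Heuristic: a single letter before the dot is taken to be an
            # initial, so this dot is not treated as a sentence end.
            continue
        sentences.append(text[start:m.end(1) + 1])
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# The initial is handled correctly: two sentences.
print(split_sentences("Thanks to Philip R. Zimmermann. He wrote PGP."))

# The same rule misses the sentence end after "I": one sentence comes back.
print(split_sentences("So do I. You should too."))
```

The first call returns two sentences, as intended.  The second 
returns the whole input as a single sentence, because "I." looks 
exactly like an initial.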

As I said, you can make guesses but you can't fully automate it.

Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly