Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Nico Williams <> Sun, 28 February 2021 17:38 UTC

Date: Sun, 28 Feb 2021 11:38:26 -0600
From: Nico Williams <>
To: John R Levine <>
Cc: Carsten Bormann <>,
Message-ID: <20210228173825.GE30153@localhost>

On Sun, Feb 28, 2021 at 12:14:27PM -0500, John R Levine wrote:
> I don't see how that would work in xml2rfc where <t> elements don't preserve
> spacing.

Provided it doesn't also lose alternative Unicode whitespace characters,
using &emsp; is an option.  In a pinch we could have an element to mark
the end of a sentence (<s/>).
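
To make the proviso concrete, here is a minimal sketch of the two ways a
whitespace normalizer could behave. It is not xml2rfc's actual code; both
normalizer functions are hypothetical, and the input uses the numeric
reference &#x2003; (EM SPACE) because &emsp; is not one of XML's predefined
entities and would need a DTD to resolve.

```python
import re
import xml.etree.ElementTree as ET

# &#x2003; is EM SPACE, the character that the HTML entity &emsp; names.
t = ET.fromstring("<t>First sentence.&#x2003;Second\n   sentence.</t>")
text = t.text


def collapse_all(s):
    """Hypothetical normalizer A: collapse every run of Unicode whitespace.

    str.split() with no arguments treats EM SPACE as whitespace, so the
    sentence-end marker is lost.
    """
    return " ".join(s.split())


def collapse_ascii(s):
    """Hypothetical normalizer B: collapse only space, tab, CR and LF.

    EM SPACE is left alone, so it survives as an end-of-sentence marker.
    """
    return re.sub(r"[ \t\r\n]+", " ", s).strip()


print(repr(collapse_all(text)))
print(repr(collapse_ascii(text)))
```

Only a normalizer of the second kind would let an author use an em space
to say "a sentence ends here".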

> > The discussion came up because xml2rfc treated the dot in “Philip R. Zimmermann” as a sentence end.
> > This is a mere bug, and bugs can be fixed.
> It is a hard bug to fix.  You can add a heuristic not to treat a single
> letter followed by a dot as a sentence end, but it's wrong sometimes since
> the single letter word "I" occasionally ends a sentence.

Indeed.  "U.S." is a perfect example: it can appear in the middle of a
sentence or at the end of one, and you can't even use the case of the
first letter of the next word to tell the two cases apart, because the
names of principal U.S. institutions are capitalized after "U.S."
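
A rough sketch of the single-letter heuristic discussed above, to show
where it breaks. This is an illustration under my own assumptions, not
xml2rfc's implementation; the function names and the exact rule (a word
ending in "." ends a sentence unless what precedes the dot is a single
letter) are hypothetical.

```python
import re


def naive_sentence_ends(text):
    """Return offsets just past each period guessed to end a sentence."""
    ends = []
    for m in re.finditer(r"\S+", text):
        word = m.group()
        if not word.endswith("."):
            continue
        stem = word[:-1]
        # Assumed heuristic: a single letter plus a dot is an initial.
        if len(stem) == 1 and stem.isalpha():
            continue
        ends.append(m.end())
    return ends


def split_sentences(text):
    """Split text at the guessed sentence ends."""
    out, start = [], 0
    for end in naive_sentence_ends(text):
        out.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        out.append(tail)
    return out


# The initial is handled as intended.
print(split_sentences("Philip R. Zimmermann wrote PGP."))
# "I." really ends a sentence here, but the heuristic skips it.
print(split_sentences("So do I. Thanks."))
# "U.S." is mid-sentence here, but the heuristic splits after it.
print(split_sentences("The U.S. Senate voted."))
# Same surface form, and this time the split happens to be right.
print(split_sentences("He left the U.S. Senate hearings followed."))
```

The last two inputs look the same to any rule based on the dot and the
case of the following word, which is why explicit markup is attractive.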