Re: [xml2rfc] assuming that period (.) ends a sentence is sometimes wrong

Nico Williams <nico@cryptonector.com> Mon, 01 March 2021 05:23 UTC

Date: Sun, 28 Feb 2021 23:23:31 -0600
From: Nico Williams <nico@cryptonector.com>
To: Brian E Carpenter <brian.e.carpenter@gmail.com>
Cc: Paul Kyzivat <pkyzivat@alum.mit.edu>, xml2rfc@ietf.org
Message-ID: <20210301052330.GG30153@localhost>
Archived-At: <https://mailarchive.ietf.org/arch/msg/xml2rfc/lP6zL1cC99R801aLXC7mO4bD67k>

On Mon, Mar 01, 2021 at 10:59:16AM +1300, Brian E Carpenter wrote:
> > On 2/28/21 2:51 PM, Brian E Carpenter wrote:
> >> Since we're designing on the hoof here, I suggest you'd need a construct like
> >> <literal value="Philip R. Zimmermann"/>.

Consider these examples:

  It's the biggest stimulus in the U.S. Congress budget history.

Now, with the power of grammar we can parse that as one sentence.

But now try this one:

  Some wonder what is the biggest problem facing the U.S. Congress.

Hmmm.  Well, yes, a stickler for grammar will tell you that parses as
just one sentence.

There is no heuristic you can implement for that case, and wider space
after sentence-ending periods really does help readers parse sentences.

One could say "don't write ambiguous sentences" / "rewrite to avoid
ambiguity".  Like:

  Some wonder what is the biggest problem facing the United States
  Congress.

Well, OK, if there's no desire to do anything about this in xml2rfc, I
guess that will be authors' only resort.

Here's a possible scheme that doesn't count on XML preserving any space
for xml2rfc:

  if s[i] == '.' && (s[i-1] == '.' ||   # E.g., ellipsis
                     s[i-2] != '.')     # I.e., not a dotted acronym
    if s[i+1] == &nbsp;
      # sentence does not end
    else
      # sentence ends
  else if s[i] == '.' && s[i-2] == '.'  # I.e., a dotted acronym
    if s[i+1] == &emsp;
      # sentence ends
    else
      # sentence does not end

Replace &nbsp; and &emsp; with elements if need be.
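
For concreteness, here is a rough Python sketch of that scheme.  It is
only an illustration, not anything xml2rfc does today.  The names
ends_sentence and sentence_end_positions are made up for this example,
&nbsp; and &emsp; are taken to be U+00A0 and U+2003, and looking only at
periods followed by a space or the end of the text is an extra
assumption that the scheme itself does not state.

```python
NBSP = "\u00a0"  # &nbsp;: author says "this period does NOT end the sentence"
EMSP = "\u2003"  # &emsp;: author says "this period DOES end the sentence"


def ends_sentence(s: str, i: int) -> bool:
    """Apply the scheme to the period at s[i]; True if it ends a sentence."""
    if s[i] != ".":
        return False
    prev1 = s[i - 1] if i >= 1 else ""
    prev2 = s[i - 2] if i >= 2 else ""
    nxt = s[i + 1] if i + 1 < len(s) else ""
    if prev1 == "." or prev2 != ".":
        # Ellipsis, or not a dotted acronym: ends unless &nbsp; follows.
        return nxt != NBSP
    # Dotted acronym (a period two characters back): ends only if &emsp; follows.
    return nxt == EMSP


def sentence_end_positions(s: str) -> list[int]:
    """Indices of sentence-ending periods.

    Assumption (not part of the scheme): only periods followed by a space
    character or by the end of the text are candidates, so the first dot
    of "U.S." is never examined.
    """
    spaces = {" ", NBSP, EMSP}
    return [
        i
        for i, ch in enumerate(s)
        if ch == "."
        and (i + 1 == len(s) or s[i + 1] in spaces)
        and ends_sentence(s, i)
    ]


# Default for a dotted acronym: the sentence continues after "U.S."
print(sentence_end_positions("the U.S. Congress."))
# The author overrides both defaults with explicit markup.
print(sentence_end_positions("in the U.S." + EMSP + "Then"))
print(sentence_end_positions("Philip R." + NBSP + "Zimmermann."))
```

The point of the design is that the common case needs no markup: a plain
period ends a sentence and a dotted acronym does not, so authors only
mark the exceptions.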

Nico
--