Re: [abnf-discuss] constrained-01 - advantage?

Sean Leonard <dev+ietf@seantek.com> Mon, 14 November 2016 04:55 UTC

From: Sean Leonard <dev+ietf@seantek.com>
In-Reply-To: <a5d764a8-c560-bed3-095f-f1a1a5e35688@gmx.de>
Date: Mon, 14 Nov 2016 13:55:32 +0900
Content-Transfer-Encoding: quoted-printable
Message-Id: <F49254F8-41B7-499F-8745-5F2374693AA7@seantek.com>
References: <5828DD42.8010009@gmail.com> <36FC0A35-2ADA-4710-ABFB-08E8B916718E@seantek.com> <a5d764a8-c560-bed3-095f-f1a1a5e35688@gmx.de>
To: Julian Reschke <julian.reschke@gmx.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/Kwm6yIHHYFjeDRq5avrQRaAFvUk>
Cc: Doug Royer <douglasroyer@gmail.com>, abnf-discuss@ietf.org
Subject: Re: [abnf-discuss] constrained-01 - advantage?

> On Nov 14, 2016, at 12:39 PM, Julian Reschke <julian.reschke@gmx.de> wrote:
> 
> On 2016-11-14 00:21, Sean Leonard wrote:
>> ...
>> Well:
>> 
>>> resent-field    = "Resent-" field-name ":" unstructured CRLF
>> 
>> 
>> Is simpler than:
>> 
>>> resent-field  = resent-unstructured / resent-date / resent-from / resent-sender / ...
>> 
>> 
>> First of all, it looks simpler, because it is.
>> 
>> Second of all, <resent-date>, <resent-from>, <resent-sender> etc. are all “subsumed” in <resent-unstructured>, leading to an intentionally ambiguous grammar. The problem with the existing way of writing things out is that the former are all superfluous when the latter is present. Another common way to write it is:
>> 
>> field = date / from / sender / … / extension-field
>> 
>> extension-field = field-name ":" unstructured CRLF
>> 
>> Well it makes perfect sense when you read it, of course, but when a computer / parser parses productions, every <date> is going to match with <extension-field> as well. The point being <extension-field> is not really doing what you think it’s doing in ABNF: it’s really <any-and-every-field>.
>> 
>> There is nothing “wrong” with creating an ambiguous grammar, but it can lead to subtle parsing errors if you are not careful. Usually ambiguous grammars are undesirable because you want your code to go down one, and only one, path. An automated ABNF validator should be able to detect ambiguous grammars and inform you about that.
>> 
>> 
>> When you actually go to program what the ABNF represents, most people are not going to program:
>> 
>> if (chars[0] == 'd' && chars[1] == 'a' && chars[2] == 't' && chars[3] == 'e' && chars[4] == ':') {
>>  // parse as date
>> } else if (chars[0] == 'f' …) {
>>  // parse as from
>> } else if (…) {
>>  // ...
>> } else {
>>  // field-name = everything before ':'
>>  // unstructured = everything between ':' and CRLF
>> }
>> ...
> 
> If this is the only use case (is it?), then I'd say that you're dealing with a problem you could avoid in the first place.
> 
> See how RFC 7230 distinguishes between parsing the HTTP message info fields and field values, and then has completely *separate* ABNFs for each field value.

Yes, I saw that. It is a different approach from the “mail standards” approach, which makes heavy use of / and =/ and extension-header syntaxes.

A disadvantage of the RFC 7230 approach is that the relationship between the generic header production and the specific headers is not formalized. For example, RFC 7230 says:

   HTTP-message = start-line *( header-field CRLF ) CRLF
                  [ message-body ]

   header-field = field-name ":" OWS field-value OWS

   Via = *( "," OWS ) ( received-protocol RWS received-by
         [ RWS comment ] ) *( OWS "," [ OWS ( received-protocol RWS
         received-by [ RWS comment ] ) ] )


All that is very interesting, but if you want to verify that the ABNF is correct (or that sample HTTP messages conform to the ABNF), you have to take extra steps: extract each header-field, then match its field-value against the production for that particular field.

It is desirable to express the relationship between the <Via> production, and the <field-name> “Via”. Specifically, when <field-name> is “Via”, then <field-value> is <Via>.
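
As a rough illustration of that extra step, here is a minimal Python sketch of the glue a validator has to supply by hand today. Everything in it is an assumption made for illustration: the FIELD_VALUE_CHECKS table and check_header_field are hypothetical names, and the "via" pattern is a deliberately loose stand-in, not a translation of the real <Via> rule.

```python
import re

# Generic rule: header-field = field-name ":" OWS field-value OWS
# (field-name is a token; the value is captured without surrounding OWS).
HEADER_FIELD = re.compile(r"([!#$%&'*+\-.^_`|~0-9A-Za-z]+):[ \t]*(.*?)[ \t]*")

# Hypothetical table tying a field-name to the check for its field-value.
# This mapping is exactly the relationship the ABNF itself does not express.
FIELD_VALUE_CHECKS = {
    # Loose stand-in for <Via>: "protocol by" pairs separated by commas.
    "via": re.compile(
        r"[^\s,]+[ \t]+[^\s,]+(?:[ \t]*,[ \t]*[^\s,]+[ \t]+[^\s,]+)*"
    ),
}


def check_header_field(line):
    """Return True if line matches the generic rule and, when a specific
    check is registered for its field-name, that check as well."""
    match = HEADER_FIELD.fullmatch(line)
    if match is None:
        return False
    name, value = match.group(1), match.group(2)
    checker = FIELD_VALUE_CHECKS.get(name.lower())
    if checker is None:
        return True  # no specific production known: generic rule only
    return checker.fullmatch(value) is not None


print(check_header_field("Via: 1.1 proxy.example"))
print(check_header_field("Via: nonsense"))
print(check_header_field("X-Custom: anything goes"))
```

The lookup table lives entirely outside the grammar, so nothing stops it from drifting out of step with the productions it is supposed to mirror.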

A recent place where this leads to a protocol problem is the PKCS #11 URI scheme. RFC 7512 says that “|” can appear in <pk11-query-res-avail>, as a delimiter to some command-path. But “|” is not a part of URIs under [RFC3986]. Therefore, any [RFC3986] conforming URI parser is going to reject what RFC 7512 says are valid pkcs11: URIs. If the relationship between URI@[RFC3986] and pk11-URI@[RFC7512] could have been expressed formally (namely that pk11-URI is a subset of URI), then a validator could have easily flagged that problem during the editorial process.
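
The kind of check a validator could run is roughly a character-set containment test. The sketch below is only an approximation under stated assumptions: it compares literal characters rather than whole grammars, the function name chars_outside_rfc3986_query is made up for illustration, and the sample set is a small hypothetical excerpt rather than the full RFC 7512 production.

```python
import string

# Characters that may appear literally in an RFC 3986 query component:
#   query = *( pchar / "/" / "?" )
#   pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# "%" is allowed only as the start of a pct-encoded triplet.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
RFC3986_QUERY_CHARS = UNRESERVED | SUB_DELIMS | set(":@/?%")


def chars_outside_rfc3986_query(candidate_chars):
    """Return the candidate characters that RFC 3986 does not permit
    unescaped in a query component."""
    return set(candidate_chars) - RFC3986_QUERY_CHARS


# Hypothetical excerpt of characters a derived scheme wants to allow.
sample = {"/", "?", "|"}
print(sorted(chars_outside_rfc3986_query(sample)))
```

A non-empty result means the derived grammar admits strings that the base URI grammar rejects, which is the situation described above for “|”.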


Regards,

Sean