Re: [abnf-discuss] constrained-01 - advantage?

Sean Leonard <dev+ietf@seantek.com> Mon, 14 November 2016 04:55 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: Sean Leonard <dev+ietf@seantek.com>
In-Reply-To: <a5d764a8-c560-bed3-095f-f1a1a5e35688@gmx.de>
Date: Mon, 14 Nov 2016 13:55:32 +0900
Content-Transfer-Encoding: quoted-printable
Message-Id: <F49254F8-41B7-499F-8745-5F2374693AA7@seantek.com>
References: <5828DD42.8010009@gmail.com> <36FC0A35-2ADA-4710-ABFB-08E8B916718E@seantek.com> <a5d764a8-c560-bed3-095f-f1a1a5e35688@gmx.de>
To: Julian Reschke <julian.reschke@gmx.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/Kwm6yIHHYFjeDRq5avrQRaAFvUk>
Cc: Doug Royer <douglasroyer@gmail.com>, abnf-discuss@ietf.org
Subject: Re: [abnf-discuss] constrained-01 - advantage?
Precedence: list

> On Nov 14, 2016, at 12:39 PM, Julian Reschke <julian.reschke@gmx.de> wrote:
> 
> On 2016-11-14 00:21, Sean Leonard wrote:
>> ...
>> Well:
>> 
>>> resent-field    = "Resent-" field-name ":" unstructured CRLF
>> 
>> 
>> Is simpler than:
>> 
>>> resent-field  = resent-unstructured / resent-date / resent-from / resent-sender / ...
>> 
>> 
>> First of all, it looks simpler, because it is.
>> 
>> Second of all, <resent-date>, <resent-from>, <resent-sender> etc. are all “subsumed” in <resent-unstructured>, leading to an intentionally ambiguous grammar. The problem with the existing way of writing things out is that the former are all superfluous when the latter is present. Another common way to write it is:
>> 
>> field = date / from / sender / … / extension-field
>> 
>> extension-field = field-name ":" unstructured CRLF
>> 
>> Well it makes perfect sense when you read it, of course, but when a computer / parser parses productions, every <date> is going to match with <extension-field> as well. The point being <extension-field> is not really doing what you think it’s doing in ABNF: it’s really <any-and-every-field>.
>> 
>> There is nothing “wrong” with creating an ambiguous grammar, but it can lead to subtle parsing errors if you are not careful. Usually ambiguous grammars are undesirable because you want your code to go down one, and only one, path. An automated ABNF validator should be able to detect ambiguous grammars and inform you about that.
>> 
>> 
>> When you actually go to program what the ABNF represents, most people are not going to program:
>> 
>> if (chars[0] == 'd' && chars[1] == 'a' && chars[2] == 't' && chars[3] == 'e' && chars[4] == ':') {
>>  // parse as date
>> } else if (chars[0] == 'f' …) {
>>  // parse as from
>> } else if (…) {
>>  // ...
>> } else {
>>  // field-name = everything before ':'
>>  // unstructured = everything between ':' and CRLF
>> }
>> ...
> 
> If this is the only use case (is it?), then I'd say that you're dealing with a problem you could avoid in the first place.
> 
> See how RFC 7230 distinguishes between parsing the HTTP message info fields and field values, and then has completely *separate* ABNFs for each field value.

Yes, I saw that. It is a different approach from the “mail standards” approach, which uses a lot of / and =/ and extension-header syntaxes.

A disadvantage of the RFC 7230 approach is that the relationship between the generic header production and specific headers is not formalized. For example, 7230 says:

   HTTP-message = start-line *( header-field CRLF ) CRLF [ message-body
    ]

   header-field = field-name ":" OWS field-value OWS

   Via = *( "," OWS ) ( received-protocol RWS received-by [ RWS comment
    ] ) *( OWS "," [ OWS ( received-protocol RWS received-by [ RWS
    comment ] ) ] )

All that is very interesting, but if you want to verify that the ABNF is correct (or if you want to verify that sample HTTP messages conform to the ABNF), you have to take extra steps outside the ABNF: extract each header-field, then match its field-value against the production for that particular field.

It is desirable to express the relationship between the <Via> production and the <field-name> “Via”. Specifically, when <field-name> is “Via”, then <field-value> is <Via>.
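
As a rough illustration of that extra step, the binding between a field name and its value grammar can be kept in a lookup table outside the ABNF. This is only a minimal Python sketch under stated assumptions: the names HEADER_FIELD, FIELD_VALUE_RULES, split_header_field and check_header_field are hypothetical, and the "via" pattern is a deliberately simplified stand-in for the real <Via> production, not a translation of it.

```python
import re

# Generic rule: header-field = field-name ":" OWS field-value OWS
# (simplified: field-name is approximated as a run of non-colon,
# non-whitespace characters rather than a strict token)
HEADER_FIELD = re.compile(r"^(?P<name>[^:\s]+):[ \t]*(?P<value>.*?)[ \t]*$")

# Hypothetical binding table: "when field-name is X, field-value is <X>".
# The Via pattern is a simplified stand-in, NOT the full RFC 7230 production.
_VIA_ITEM = r"(?:[A-Za-z]+/)?[0-9.]+[ \t]+[^\s,]+"
FIELD_VALUE_RULES = {
    "via": re.compile(r"^%s(?:[ \t]*,[ \t]*%s)*$" % (_VIA_ITEM, _VIA_ITEM)),
}


def split_header_field(line):
    """Step 1: apply the generic production and return (name, value)."""
    m = HEADER_FIELD.match(line)
    if m is None:
        raise ValueError("not a header-field: %r" % line)
    return m.group("name"), m.group("value")


def check_header_field(line):
    """Step 2: look up the field-specific rule and apply it to the value."""
    name, value = split_header_field(line)
    rule = FIELD_VALUE_RULES.get(name.lower())  # field names are case-insensitive
    if rule is None:
        return True  # no specific rule registered; only the generic rule applies
    return rule.match(value) is not None


print(check_header_field("Via: 1.1 proxy.example.com"))
print(check_header_field("Via: nonsense"))
```

The table is exactly the piece that the ABNF itself does not express: nothing in the grammar ties the key "via" to the <Via> rule, so a validator has to be told about it separately.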

A recent example of where the lack of such a formal relationship leads to a protocol problem is the PKCS #11 URI scheme. RFC 7512 says that “|” can appear in <pk11-query-res-avail>, as a delimiter to some command-path. But “|” is not a character that [RFC3986] permits unencoded in URIs. Therefore, any [RFC3986]-conforming URI parser is going to reject what RFC 7512 says are valid pkcs11: URIs. If the relationship between URI@[RFC3986] and pk11-URI@[RFC7512] could have been expressed formally (namely that pk11-URI is a subset of URI), then a validator could have easily flagged that problem during the editorial process.
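
The kind of subset check meant here can be approximated by hand for single characters. The following Python sketch builds the set of characters that RFC 3986 allows literally in a query component and reports any candidate outside it. The function name not_allowed_in_query is hypothetical, and the input string is illustrative only: it contains the “|” discussed above plus “/” and “?” for contrast, and is not the full <pk11-query-res-avail> production.

```python
import string

# RFC 3986:
#   query      = *( pchar / "/" / "?" )
#   pchar      = unreserved / pct-encoded / sub-delims / ":" / "@"
#   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
# "%" is left out on purpose: it is only valid as the start of a pct-encoded triplet.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
QUERY_LITERALS = UNRESERVED | SUB_DELIMS | set(":@/?")


def not_allowed_in_query(candidates):
    """Return the candidate characters RFC 3986 does not allow unencoded in a query."""
    return sorted(c for c in set(candidates) if c not in QUERY_LITERALS)


# Illustrative input only, not the complete RFC 7512 rule.
print(not_allowed_in_query("/?|"))  # prints ['|']
```

A real validator would compare whole productions rather than character sets, but even this character-level check is enough to surface the “|” conflict.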

Regards,

Sean
