Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf

Sean Leonard <dev+ietf@seantek.com> Fri, 08 July 2016 16:23 UTC

Return-Path: <dev+ietf@seantek.com>
X-Original-To: abnf-discuss@ietfa.amsl.com
Delivered-To: abnf-discuss@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id D255012D518; Fri, 8 Jul 2016 09:23:56 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.6
X-Spam-Level:
X-Spam-Status: No, score=-2.6 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_HELO_PASS=-0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 4QBJoLKA3xkT; Fri, 8 Jul 2016 09:23:54 -0700 (PDT)
Received: from mxout-08.mxes.net (mxout-08.mxes.net [216.86.168.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 0ABCF12D7E9; Fri, 8 Jul 2016 09:23:42 -0700 (PDT)
Received: from [192.168.123.7] (unknown [75.83.2.34]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by smtp.mxes.net (Postfix) with ESMTPSA id 47CF550A85; Fri, 8 Jul 2016 12:23:41 -0400 (EDT)
To: Paul Kyzivat <pkyzivat@alum.mit.edu>, Jonathan Hansford <jonathan@hansfords.net>, "draft-seantek-constrained-abnf@ietf.org" <draft-seantek-constrained-abnf@ietf.org>
References: <20160708121202.A3E5D12D19F@ietfa.amsl.com> <43d3ffea-57de-f7ef-e740-e448564008ed@alum.mit.edu>
From: Sean Leonard <dev+ietf@seantek.com>
Message-ID: <bf12cdc4-8d17-d6aa-cc01-5afad19127ac@seantek.com>
Date: Fri, 08 Jul 2016 09:23:02 -0700
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
MIME-Version: 1.0
In-Reply-To: <43d3ffea-57de-f7ef-e740-e448564008ed@alum.mit.edu>
Content-Type: multipart/alternative; boundary="------------AB5C8BEC35837BE2D0F6DA3C"
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/WKXyIMVDrIq5vu-kOCtoRezBHvY>
Cc: "abnf-discuss@ietf.org" <abnf-discuss@ietf.org>
Subject: Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf
X-BeenThere: abnf-discuss@ietf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: "General discussion about tools, activities and capabilities involving the ABNF meta-language" <abnf-discuss.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/abnf-discuss>, <mailto:abnf-discuss-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/abnf-discuss/>
List-Post: <mailto:abnf-discuss@ietf.org>
List-Help: <mailto:abnf-discuss-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/abnf-discuss>, <mailto:abnf-discuss-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 08 Jul 2016 16:23:57 -0000

+abnf-discuss@

On 7/8/2016 8:06 AM, Paul Kyzivat wrote:
> On 7/8/16 8:11 AM, Jonathan Hansford wrote:
>> Hi,
>>
>>
>>
>> I’ve just discovered draft-seantek-constrained-abnf-00 and have a couple
>> of questions:
>
> Note that while Sean and I have been bouncing ideas around about this 
> for some time, the draft is a trial balloon and the specific text is 
> all Sean's. I'm not yet wild about some of the syntax, but it is good 
> to have something tangible written down to talk about.
>
> So far it seems that only Sean and I are interested. If you are, then 
> that would make three.

Yay for three!

>
>> 1)      Does Constrained ABNF work with RFC 7405 “Case-Sensitive String
>> Support in ABNF”? Either way, it might be worth mentioning that RFC in
>> your I-D.
>
> Certainly it does. The functionality is entirely orthogonal.

Yes, what Paul said.

>
>> 2)      Has there been any call to further constrain ABNF such that
>> rules can have length constraints applied to them? I’m not sure how much
>> benefit there would be for RFCs (though I can think of examples where
>> identifying such a constraint within the ABNF rather than just within
>> the accompanying text would be beneficial), but a Constrained ABNF
>> compliant parser that included length constraints would be good.
>
> I'm not aware of any such thing. Of course it already has repetition 
> constraints, but I guess that is not what you are looking for.
>
> I have difficulty imagining how you would fit such constraints into 
> the syntax. The syntax of ABNF is pretty nice for what it does, but it 
> doesn't leave much room for extension because we are running out of 
> special characters. (Perhaps the possibilities could be expanded by 
> using unicode characters. Imagine what could be done with emojis as 
> operator characters.)

Length constraints are expressible with current ABNF. Generically, 
suppose you have an identifier comprised of ALPHAs and DIGITs:

identifier = 1*(ALPHA / DIGIT)

Well, if you want to limit identifier to 20 chars, then you can say max 
twenty:

identifier = 1*20(ALPHA / DIGIT)

As Paul says, this is a repetition constraints, and is easy to analyze 
with the repeated symbols are fixed-width. But what about variable width?

Well, suppose you have Domain from RFC 5321:

    Domain         = sub-domain *("." sub-domain)

    sub-domain     = Let-dig [Ldh-str]
    Let-dig        = ALPHA / DIGIT
    Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig



Now suppose you want to constrain Domain to be 255 chars. Such a 
constrained domain might be called DNSDomain. Then you can say:

    DNSDomain ^ Domain = 1*255ASCII


...and that's it! Basically you have to compose a production that 
evaluates to a single symbol (character/integer), where the domain of 
the symbol is any possible character that can appear in the production. 
The minimum conforming production would be:

    DNSDomain ^ Domain = 1*255(ALPHA / DIGIT / "-" / ".")


The maximum conforming production would be:

    DNSDomain ^ Domain = 1*255(%x00-FFFFFFFFFFFFFFFFFFFFFFFFF...(infinity))


Reducing the production from a non-maximum to the minimum form, can be 
evaluated programmatically. A parser would need to evaluate the 
constrained production for all "reachable" single terminal symbols 
(characters/integers that may appear at least once), which can be done 
more-or-less in O(n) time. It adds all such integers to a set, and then 
expresses the set.

It can also be shown that you can also integrate the constraint into the 
base production, at least for all non-recursive productions. In this 
case, you would have to break down Domain into a very, very long string 
of terminal symbol possibilities (essentially, expressing it as a huge 
finite state machine). Once you get all the possible states, you can 
limit each symbol combination so that the total combinations do not 
exceed 255. This can probably be done computationally, but the proof 
would require a lot of math that I do not care to figure out right now. 
:) And it's unclear how computationally difficult it would be to do it 
in the general case.

Actually now that I think about it, if your constraint is to limit the 
total number of symbols, you can probably prove that any recursive 
expression can be expressed as a finite state machine. Therefore, 
integrating length constraints would actually be computable. It would 
also be super-ugly, but you get the point.

***

Regarding Paul's Unicode point: it would be nice for ABNF to have an 
extension that defines the domain of the input symbols, with specific 
options for: ASCII-only, octets, fixed-bit-width units, UTF-8, 
UTF-16LE/BE, and Unicode scalar values (0-0xD7FF, 0xE000-0x10FFFF), and 
Unicode characters (non-characters such as 0xFFFF are excluded). The 
most important of these are ASCII-only, octets, and Unicode scalar values.

I thought about writing such an I-D, but, it would take more work than I 
have time for right now, and one would have to address a few of issues.

Currently the domain of ABNF is any unsigned integer, which is an 
infinite set. This is good in that you can express anything positive, 
and bad in that you can't express negative symbol productions (e.g., all 
symbols EXCEPT "Q"). Once you constrain the set to finite, you can 
express things like the regex [^Q]. This makes ABNF compilers more 
complex. If you really want to express exclusionary productions, maybe 
BNF languages aren't the right tool for the job?

A compatibility problem arises when you want to reference productions 
from other specs, that have different domains. Suppose that one spec is 
authored in the UTF-8 domain (see LDAP specs e.g., RFC 4512), and other 
spec is authored in the Unicode scalar value domain. This is a real 
issue that I am encountering in draft-seantek-certspec, which I 
encourage you to read...I wanted to import (reference) distinguishedName 
from LDAP / RFC 4514, but distinguishedName includes UTF-8 multibyte 
productions, and draft-seantek-certspec is authored to be at least 
potentially character set neutral. It would be okay (in 
draft-seantek-certspec) to write the string in ISO-2022-JP and then 
convert it to something else.

Converting between domains is computable (e.g., the Unicode scalar value 
expressed as %x1F60A 😊 can be broken down into %xD83D.DE0A or 
%xF0.9F.98.8A), but symbology for doing it would be arcane. We want to 
keep ABNF as simple as possible, and that stuff strikes me as calling 
for a complex syntax.

Any proposal that involves Unicode is going to have to at least think 
about character properties and classes. See UTS #18.

Both the Unicode Consortium and W3C use BNF flavors to describe Unicode 
things. As ABNF uses the same mathematical theories and their BNF 
flavors, it would be good to replicate syntax where possible.

Regards,

Sean

>
>     Thanks,
>     Paul
>