Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf

"Jonathan Hansford" <jonathan@hansfords.net> Fri, 08 July 2016 21:08 UTC

Return-Path: <jonathan@hansfords.net>
X-Original-To: abnf-discuss@ietfa.amsl.com
Delivered-To: abnf-discuss@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id BA7DD12D51D; Fri, 8 Jul 2016 14:08:51 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.989
X-Spam-Level:
X-Spam-Status: No, score=-1.989 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, T_KAM_HTML_FONT_INVALID=0.01] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=hansfords.net
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id S6lotVOjbB1N; Fri, 8 Jul 2016 14:08:47 -0700 (PDT)
Received: from server.myfast.site (server.myfast.site [212.113.130.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 8B1B712D76B; Fri, 8 Jul 2016 14:08:47 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=hansfords.net; s=default; h=Content-Type:MIME-Version:Message-ID:Date: Subject:In-Reply-To:References:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=9/HDsmLtk2in6ogh5KwFTq7m91n6OGLvSMqULickxug=; b=e61mfEfp9zmdqrWpqyISTtpCu r+TAWHnzIGCVLvSaY/HbD8l/i9LR3lV7xHZN/t0bbnOA51OQKilYmIQ8h3bushKOMlre04SWP9Im+ kRrl5r4LKFxNvQjgHadNozCApYI2uyuKzaTn2uG890MI0UxPfxtX6pNBKqpe5tN9Gizg9QACfia84 np41yW8yXsWWfCUlckAhExZylbbOU6NVgUhG1e3ROPa8GZTRmaLke77JZ3rqGrxvoWdcoAt+h6YU8 Fh9nkEEbpJh5JGEmka+8EOmMuEl+FFT133DEQ2ZOOc0VYt/L+AHzy7IESKAZp9mPMIpCECfUDpKaM z7FwFBgbw==;
Received: from hansfords.plus.com ([84.92.116.209]:39528 helo=Vanguard) by server.myfast.site with esmtpsa (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.87) (envelope-from <jonathan@hansfords.net>) id 1bLd0u-001QAo-Ai; Fri, 08 Jul 2016 22:08:44 +0100
From: Jonathan Hansford <jonathan@hansfords.net>
To: 'Sean Leonard' <dev+ietf@seantek.com>, 'Paul Kyzivat' <pkyzivat@alum.mit.edu>, draft-seantek-constrained-abnf@ietf.org
References: <20160708121202.A3E5D12D19F@ietfa.amsl.com> <43d3ffea-57de-f7ef-e740-e448564008ed@alum.mit.edu> <bf12cdc4-8d17-d6aa-cc01-5afad19127ac@seantek.com>
In-Reply-To: <bf12cdc4-8d17-d6aa-cc01-5afad19127ac@seantek.com>
Date: Fri, 08 Jul 2016 22:08:48 +0100
Message-ID: <017701d1d95c$ebf89310$c3e9b930$@hansfords.net>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0178_01D1D965.4DBE0C80"
X-Mailer: Microsoft Outlook 15.0
Thread-Index: AQFxNY8WXtK2YNa1UkIKc51t946CUgHbOpfBAeisLe+gsi5jwA==
Content-Language: en-gb
X-AntiAbuse: This header was added to track abuse, please include it with any abuse report
X-AntiAbuse: Primary Hostname - server.myfast.site
X-AntiAbuse: Original Domain - ietf.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - hansfords.net
X-Get-Message-Sender-Via: server.myfast.site: authenticated_id: jonathan@hansfords.net
X-Authenticated-Sender: server.myfast.site: jonathan@hansfords.net
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/7sZFWgjUbXvTWaTBINbM9FLiKCs>
Cc: abnf-discuss@ietf.org
Subject: Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf
X-BeenThere: abnf-discuss@ietf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: "General discussion about tools, activities and capabilities involving the ABNF meta-language" <abnf-discuss.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/abnf-discuss>, <mailto:abnf-discuss-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/abnf-discuss/>
List-Post: <mailto:abnf-discuss@ietf.org>
List-Help: <mailto:abnf-discuss-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/abnf-discuss>, <mailto:abnf-discuss-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 08 Jul 2016 21:08:52 -0000

Paul and Sean,

 

Thanks for the replies. Life is too hectic at the moment, but I’m hoping I will find some time to properly read the I-D and your replies and see how it would all work for me. Is there a parser for constrained ABNF at the moment?

 

Jonathan

 

=O)

 

From: Sean Leonard [mailto:dev+ietf@seantek.com] 
Sent: 08 July 2016 17:23
To: Paul Kyzivat <pkyzivat@alum.mit.edu>; Jonathan Hansford <jonathan@hansfords.net>; draft-seantek-constrained-abnf@ietf.org
Cc: abnf-discuss@ietf.org
Subject: Re: Mail regarding draft-seantek-constrained-abnf

 

+abnf-discuss@

On 7/8/2016 8:06 AM, Paul Kyzivat wrote:

On 7/8/16 8:11 AM, Jonathan Hansford wrote: 



Hi, 



I’ve just discovered draft-seantek-constrained-abnf-00 and have a couple 
of questions: 


Note that while Sean and I have been bouncing ideas around about this for some time, the draft is a trial balloon and the specific text is all Sean's. I'm not yet wild about some of the syntax, but it is good to have something tangible written down to talk about. 

So far it seems that only Sean and I are interested. If you are, then that would make three. 


Yay for three!








1)      Does Constrained ABNF work with RFC 7405 “Case-Sensitive String 
Support in ABNF”? Either way, it might be worth mentioning that RFC in 
your I-D. 


Certainly it does. The functionality is entirely orthogonal. 


Yes, what Paul said.








2)      Has there been any call to further constrain ABNF such that 
rules can have length constraints applied to them? I’m not sure how much 
benefit there would be for RFCs (though I can think of examples where 
identifying such a constraint within the ABNF rather than just within 
the accompanying text would be beneficial), but a Constrained ABNF 
compliant parser that included length constraints would be good. 


I'm not aware of any such thing. Of course it already has repetition constraints, but I guess that is not what you are looking for. 

I have difficulty imagining how you would fit such constraints into the syntax. The syntax of ABNF is pretty nice for what it does, but it doesn't leave much room for extension because we are running out of special characters. (Perhaps the possibilities could be expanded by using unicode characters. Imagine what could be done with emojis as operator characters.) 


Length constraints are expressible with current ABNF. Generically, suppose you have an identifier comprised of ALPHAs and DIGITs:

identifier = 1*(ALPHA / DIGIT)

Well, if you want to limit identifier to 20 chars, then you can say max twenty:

identifier = 1*20(ALPHA / DIGIT)

As Paul says, this is a repetition constraints, and is easy to analyze with the repeated symbols are fixed-width. But what about variable width?

Well, suppose you have Domain from RFC 5321:



 
   Domain         = sub-domain *("." sub-domain)
   sub-domain     = Let-dig [Ldh-str]
   Let-dig        = ALPHA / DIGIT
   Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig



Now suppose you want to constrain Domain to be 255 chars. Such a constrained domain might be called DNSDomain. Then you can say:




   DNSDomain ^ Domain = 1*255ASCII


...and that's it! Basically you have to compose a production that evaluates to a single symbol (character/integer), where the domain of the symbol is any possible character that can appear in the production. The minimum conforming production would be:




   DNSDomain ^ Domain = 1*255(ALPHA / DIGIT / "-" / ".")


The maximum conforming production would be:




   DNSDomain ^ Domain = 1*255(%x00-FFFFFFFFFFFFFFFFFFFFFFFFF...(infinity))


Reducing the production from a non-maximum to the minimum form, can be evaluated programmatically. A parser would need to evaluate the constrained production for all "reachable" single terminal symbols (characters/integers that may appear at least once), which can be done more-or-less in O(n) time. It adds all such integers to a set, and then expresses the set.

It can also be shown that you can also integrate the constraint into the base production, at least for all non-recursive productions. In this case, you would have to break down Domain into a very, very long string of terminal symbol possibilities (essentially, expressing it as a huge finite state machine). Once you get all the possible states, you can limit each symbol combination so that the total combinations do not exceed 255. This can probably be done computationally, but the proof would require a lot of math that I do not care to figure out right now. :) And it's unclear how computationally difficult it would be to do it in the general case.

Actually now that I think about it, if your constraint is to limit the total number of symbols, you can probably prove that any recursive expression can be expressed as a finite state machine. Therefore, integrating length constraints would actually be computable. It would also be super-ugly, but you get the point.

***

Regarding Paul's Unicode point: it would be nice for ABNF to have an extension that defines the domain of the input symbols, with specific options for: ASCII-only, octets, fixed-bit-width units, UTF-8, UTF-16LE/BE, and Unicode scalar values (0-0xD7FF, 0xE000-0x10FFFF), and Unicode characters (non-characters such as 0xFFFF are excluded). The most important of these are ASCII-only, octets, and Unicode scalar values.

I thought about writing such an I-D, but, it would take more work than I have time for right now, and one would have to address a few of issues.

Currently the domain of ABNF is any unsigned integer, which is an infinite set. This is good in that you can express anything positive, and bad in that you can't express negative symbol productions (e.g., all symbols EXCEPT "Q"). Once you constrain the set to finite, you can express things like the regex [^Q]. This makes ABNF compilers more complex. If you really want to express exclusionary productions, maybe BNF languages aren't the right tool for the job?

A compatibility problem arises when you want to reference productions from other specs, that have different domains. Suppose that one spec is authored in the UTF-8 domain (see LDAP specs e.g., RFC 4512), and other spec is authored in the Unicode scalar value domain. This is a real issue that I am encountering in draft-seantek-certspec, which I encourage you to read...I wanted to import (reference) distinguishedName from LDAP / RFC 4514, but distinguishedName includes UTF-8 multibyte productions, and draft-seantek-certspec is authored to be at least potentially character set neutral. It would be okay (in draft-seantek-certspec) to write the string in ISO-2022-JP and then convert it to something else.

Converting between domains is computable (e.g., the Unicode scalar value expressed as %x1F60A 😊 can be broken down into %xD83D.DE0A or %xF0.9F.98.8A), but symbology for doing it would be arcane. We want to keep ABNF as simple as possible, and that stuff strikes me as calling for a complex syntax.

Any proposal that involves Unicode is going to have to at least think about character properties and classes. See UTS #18.

Both the Unicode Consortium and W3C use BNF flavors to describe Unicode things. As ABNF uses the same mathematical theories and their BNF flavors, it would be good to replicate syntax where possible.

Regards,

Sean





    Thanks, 
    Paul