Re: Syntax

"Clive D.W. Feather" <clive@demon.net> Wed, 10 January 2007 10:46 UTC

Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1H4ayf-0007oB-85; Wed, 10 Jan 2007 05:46:41 -0500
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1H4aye-0007o4-VR for cosmogol@ietf.org; Wed, 10 Jan 2007 05:46:40 -0500
Received: from anchor-internal-1.mail.demon.net ([195.173.56.100]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1H4ayd-0007fw-Hw for cosmogol@ietf.org; Wed, 10 Jan 2007 05:46:40 -0500
Received: from finch-staff-1.server.demon.net (finch-staff-1.server.demon.net [193.195.224.1]) by anchor-internal-1.mail.demon.net with ESMTP id l0AAkcNQ023282; Wed, 10 Jan 2007 10:46:38 GMT
Received: from clive by finch-staff-1.server.demon.net with local (Exim 3.36 #1) id 1H4ayb-0008aE-00; Wed, 10 Jan 2007 10:46:37 +0000
Date: Wed, 10 Jan 2007 10:46:37 +0000
From: "Clive D.W. Feather" <clive@demon.net>
To: Julian Reschke <julian.reschke@gmx.de>
Message-ID: <20070110104637.GA32555@finch-staff-1.thus.net>
References: <45A129E9.50905@gmx.de> <20070107205255.GA14621@sources.org> <45A20F62.9060306@gmx.de> <20070108204618.GA29407@sources.org> <45A34BC6.3050407@gmx.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <45A34BC6.3050407@gmx.de>
User-Agent: Mutt/1.5.3i
X-Spam-Score: 0.0 (/)
X-Scan-Signature: 082a9cbf4d599f360ac7f815372a6a15
Cc: Stephane Bortzmeyer <bortzmeyer@nic.fr>, cosmogol@ietf.org
Subject: Re: Syntax
X-BeenThere: cosmogol@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: DIscussion on state machine specification in IETF protocols <cosmogol.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/cosmogol>, <mailto:cosmogol-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www1.ietf.org/pipermail/cosmogol>
List-Post: <mailto:cosmogol@ietf.org>
List-Help: <mailto:cosmogol-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/cosmogol>, <mailto:cosmogol-request@ietf.org?subject=subscribe>
Errors-To: cosmogol-bounces@ietf.org

Julian Reschke said:
> The reason why XML's production is complex is that it excludes
> characters that do not belong into identifiers.
>
> The escaping rule quoted above doesn't solve that problem at all; it's
> just an escaping rule.

Right.

This sort of exclusion is not easy to do in syntax. It's easier to do in a
separate restriction. In the C Standard, for example, the syntax simply says
you have a "universal character name", and then there's a separate
requirement:

       [#3] Each universal character name in  an  identifier  shall
       designate  a character whose encoding in ISO/IEC 10646 falls
       into one of the ranges specified in annex D.

>> Hmmm, how many IETF formats are in Unicode? (Apart from those based
>> only on XML, like Atom in RFC 4287.) ABNF is not, for instance (right,
>> it is not a new format; the RFC is recent but it derives from an older
>> format.)
> I just tried to understand how RFC4234 works with non-ASCII characters,
> and it's not obvious at all. Section 2.4 seems to deal with it but
> really sounds a bit like hand-waving.

I take it as saying that you have to write your grammar to show the
encoding that you want to use, and may want to have alternative grammars
for different contexts. It doesn't let you use non-ASCII characters in
grammars.

In RFC3977 we wrote the following, which seems to be acceptable to the IETF:

    UTF8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
    UTF8-2    = %xC2-DF UTF8-tail
    UTF8-3    = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
                %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4    = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
                %xF4 %x80-8F 2UTF8-tail
    UTF8-tail = %x80-BF
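
As a rough illustration of what these productions accept, here is a minimal
Python sketch that transcribes them into a byte-level regular expression.
The names UTF8_NON_ASCII and is_utf8_non_ascii are made up for this example
and do not come from RFC3977; the byte ranges are copied from the rules
above.

```python
import re

# Each pattern mirrors one ABNF rule quoted above (byte ranges copied as-is).
UTF8_TAIL = rb"[\x80-\xBF]"
UTF8_2 = rb"[\xC2-\xDF]" + UTF8_TAIL
UTF8_3 = (rb"(?:\xE0[\xA0-\xBF]" + UTF8_TAIL +
          rb"|[\xE1-\xEC]" + UTF8_TAIL + rb"{2}" +
          rb"|\xED[\x80-\x9F]" + UTF8_TAIL +
          rb"|[\xEE-\xEF]" + UTF8_TAIL + rb"{2})")
UTF8_4 = (rb"(?:\xF0[\x90-\xBF]" + UTF8_TAIL + rb"{2}" +
          rb"|[\xF1-\xF3]" + UTF8_TAIL + rb"{3}" +
          rb"|\xF4[\x80-\x8F]" + UTF8_TAIL + rb"{2})")

# UTF8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
UTF8_NON_ASCII = re.compile(
    rb"(?:" + UTF8_2 + rb"|" + UTF8_3 + rb"|" + UTF8_4 + rb")"
)


def is_utf8_non_ascii(data: bytes) -> bool:
    """Return True if data is exactly one UTF8-non-ascii sequence."""
    return UTF8_NON_ASCII.fullmatch(data) is not None
```

The split ranges in UTF8-3 and UTF8-4 are what reject overlong forms,
surrogates (%xED %xA0-BF ...) and anything above U+10FFFF, so a sequence
such as C0 80 or ED A0 80 does not match.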

So, putting it together, one approach would be something like this:

    name       = identifier / quoted-name

    identifier = id-initial *(["-"] 1*id-char)
    id-char    = id-initial / DIGIT
    id-initial = ALPHA / UTF8-non-ascii
        ; Each UTF8-non-ascii in an identifier shall designate a character
        ; whose encoding in ISO/IEC 10646 falls into one of the ranges
        ; specified in appendix X.

    quoted-name = DQUOTE 1*q-char DQUOTE
    q-char      = %x21 / %x23-5B / %x5D-7E / UTF8-non-ascii / q-escape
        ; excludes DQUOTE and BACKSLASH from ASCII
    q-escape    = %x5C.75 4HEXDIG / %x5C.55 8HEXDIG
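
To make the split between syntax and separate restriction concrete, here is
a minimal Python sketch of a checker for "name". It is an illustration under
stated assumptions, not a reference implementation: is_name, ALLOWED_RANGES
and the other names are hypothetical, and since "appendix X" is not defined
here the range table is a placeholder only. It also assumes that Python's
strict UTF-8 decoder can stand in for the UTF8-non-ascii productions, and it
rewrites *(["-"] 1*id-char) as the equivalent *id-char *("-" 1*id-char) to
avoid regex backtracking.

```python
import re

# PLACEHOLDER for "appendix X": the real ranges are not given in the message.
ALLOWED_RANGES = [(0x00C0, 0x024F)]  # assumed, for illustration only

NON_ASCII = r"[^\x00-\x7F]"
ID_INITIAL = rf"(?:[A-Za-z]|{NON_ASCII})"     # ALPHA / UTF8-non-ascii
ID_CHAR = rf"(?:[A-Za-z0-9]|{NON_ASCII})"     # id-initial / DIGIT
# Equivalent to: id-initial *(["-"] 1*id-char)
IDENTIFIER = re.compile(rf"{ID_INITIAL}{ID_CHAR}*(?:-{ID_CHAR}+)*")

Q_ESCAPE = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"
# %x21 / %x23-5B / %x5D-7E / UTF8-non-ascii / q-escape
Q_CHAR = rf"(?:[\x21\x23-\x5B\x5D-\x7E]|{NON_ASCII}|{Q_ESCAPE})"
QUOTED_NAME = re.compile(rf'"{Q_CHAR}+"')


def in_allowed_ranges(code_point: int) -> bool:
    return any(lo <= code_point <= hi for lo, hi in ALLOWED_RANGES)


def is_name(data: bytes) -> bool:
    """Check data against name = identifier / quoted-name."""
    try:
        text = data.decode("utf-8")  # strict: rejects malformed UTF-8
    except UnicodeDecodeError:
        return False
    if QUOTED_NAME.fullmatch(text):
        return True
    if not IDENTIFIER.fullmatch(text):
        return False
    # The separate restriction: applied after the syntax has matched.
    return all(ord(c) < 0x80 or in_allowed_ranges(ord(c)) for c in text)
```

With the placeholder table, "foo-bar2" and "café" are accepted as
identifiers, "foo--bar" fails the syntax, and "a→b" passes the syntax but
fails the range restriction.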

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |
