Re: [EAI] Erratum? Mixing character-based and byte based ABNF in RFC 6531

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Mon, 19 March 2012 11:39 UTC

Message-ID: <4F671AE5.3070305@it.aoyama.ac.jp>
Date: Mon, 19 Mar 2012 20:39:17 +0900
From: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>
References: <4F649967.7040403@it.aoyama.ac.jp> <FE6ACCC569526CA1BBAE2D1E@PST.JCK.COM>
In-Reply-To: <FE6ACCC569526CA1BBAE2D1E@PST.JCK.COM>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: ima@ietf.org
Subject: Re: [EAI] Erratum? Mixing character-based and byte based ABNF in RFC 6531
Precedence: list

Hello John,

Many thanks to you and others for your comments.

On 2012/03/18 1:47, John C Klensin wrote:
> Martin,
>
> While I think there are a number of things in RFC 6531 that
> could be expressed better (I think those who would like to see
> an Internet Standard version in September might reasonably start
> working on edits now ... although I'd rather see reviews of the
> pending set of documents),

These will be forthcoming, hopefully this week. I started with RFC 
6530/1/2 because I hadn't looked at the base specs in a while and 
because I needed to look at them anyway for the mailto: spec (which, 
among many other things, actually is on the WG's charter).

> I'm not sure I see an actual error
> here.
>
> Certainly you are correct that U-labels are defined in IDNA as a
> list of Unicode code points, independent of encoding.  IDNA
> itself is a layer (or half-layer) below anything in any of the
> base SMTPUTF8 documents, including RFC 6531.  But the EAI WG
> made a very explicit decision that anything that went on the
> wire was required to be in UTF-8.  If we were writing prose, it
> would be reasonable to express the rule as something like "...
> MUST be a U-label and any U-labels used in conformance with this
> standard MUST be encoded in UTF-8".

Yes, indeed.

> The best way to express that in ABNF is a separate question.

My understanding so far is that the main purpose of ABNF is direct or 
indirect executability. Although ABNF can express context-free 
languages, in the bulk of uses in the IETF, there is no recursion among 
rules, and in these cases, conversion to regular expressions is often 
done. There are lots of regular expression engines running on bytes, and 
there are many (sometimes the same ones) running on (Unicode or 
otherwise) characters, but I don't know of one where you could do both 
things at the same time. Maybe you do?
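
As an illustration of the point about regular expression engines, here is a
minimal sketch using Python's standard `re` module, which accepts either a
character pattern or a byte pattern but rejects a mix of the two. The sample
label and both patterns are assumptions made up for this example (the byte
pattern only covers two-octet UTF-8 sequences); they are not taken from any
RFC.

```python
import re

# Hypothetical sample: a label with one non-ASCII character ("bücher").
label = "b\u00fccher"            # a string of code points
octets = label.encode("utf-8")   # the same label as UTF-8 octets

# Terminals are characters (code points).
char_pattern = re.compile(r"[a-z\u00a0-\ud7ff]+")

# Terminals are octets; simplified to ASCII letters plus two-octet UTF-8.
byte_pattern = re.compile(rb"(?:[a-z]|[\xc2-\xdf][\x80-\xbf])+")


def mixing_fails() -> bool:
    """Return True if applying the character pattern to octets is rejected."""
    try:
        char_pattern.fullmatch(octets)
    except TypeError:
        return True
    return False


print(bool(char_pattern.fullmatch(label)))   # character rule on characters
print(bool(byte_pattern.fullmatch(octets)))  # octet rule on octets
print(mixing_fails())                        # character rule on octets
```

Each pattern works at its own level, but the engine has no way to interpret a
single rule set that switches between the two levels.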

> Personally, I think we are sooner or later going to have to open
> RFC 5234 and add some rules to deal smoothly with characters
> outside the ASCII range and with the difference between
> character-abstractions and particular encodings.  But those are
> not problems that 6531 can solve.

No need to worry. These problems are *already solved* at
http://tools.ietf.org/html/rfc5234#section-2.3:

 >>>>
2.3.  Terminal Values

    Rules resolve into a string of terminal values, sometimes called
    characters.  In ABNF, a character is merely a non-negative integer.
    In certain contexts, a specific mapping (encoding) of values into a
    character set (such as ASCII) will be specified.
 >>>>

(Also check out http://tools.ietf.org/html/rfc5234#section-2.4, but 
please note that this is about how ABNFs themselves are encoded, not 
about what they describe.)

While we are at it, please also note that the word "octet" appears only 
two times in RFC 5234, as opposed to about 25 mentions of "character".

Also, please note that this is not only theory, it has already been 
used, in http://tools.ietf.org/html/rfc3987#section-2.2, after due 
explanation:

 >>>>
                                     Character numbers are taken from the
    UCS, without implying any actual binary encoding.  Terminals in the
    ABNF are characters, not bytes.

 >>>>
    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                   / %xD0000-DFFFD / %xE1000-EFFFD

    iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
 >>>>

Some ABNF checkers barked on some parts of the ABNF used there, but that 
was the checker's fault, not the ABNF's.

> I look forward to your erratum, but will push for it to be
> classified as "hold for document update".

That would be fine by me.

> If Jiankang or others
> responded to it by creating a "needs to be revised to clarify
> how a collection of i18n issues should be handled" proposed
> erratum to 5234, that would be only fitting.

No, it wouldn't, because there is no need to fix 5234.

> The "mailto:" issue ultimately involves the same abstraction
> versus expression/encoding issues that, IMO, are the weak
> underbelly of the IRI work.

I agree that insofar as "mailto:" is an URI/IRI scheme, it will share 
encoding "issues" with IRIs in general. That should come as no surprise.

However, I have to clearly and strongly disagree with (casual or 
on-purpose) interjections such as "weak underbelly". I hope we can all 
avoid such terms in technical discussion.

> If you want to adhere to the letter
> and spirit of RFC 6531 (and SMTPUTF8 generally), you are stuck
> with normalized UTF-8 encoding -- no A-labels and no other
> Unicode encoding forms.

I would expect that not only I, but also everybody else, will adhere to 
the letter and spirit of RFC 6531, which says that *in the SMTP 
protocol*, it's UTF-8 and UTF-8 only. Same for RFC 6532 and header 
fields *in transit*.

However, in good IETF tradition, I'd expect that implementations of all 
kinds (MUAs, MTAs,...) have all the liberty they need to choose other 
encodings where that makes sense for them, as long as they respect the 
RFCs on the wire. I may even have seen wording to that effect in one or 
the other (or both) of the above RFCs, but I assumed that as a matter of 
course and so didn't pay enough attention to be sure about it. As just 
one concrete example, I would assume that an MUA written in Java would 
use UTF-16 to store email addresses in memory.
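
A small sketch of that separation, written in Python rather than Java: the
same address is held as UTF-16 internally and converted to UTF-8 only at the
wire boundary. The address and the function names are hypothetical and chosen
just for this illustration.

```python
# Hypothetical internationalized address, as a sequence of code points.
address = "us\u00e9r@ex\u00e4mple.example"


def to_memory(addr: str) -> bytes:
    """Internal storage as UTF-16, as a Java String would typically use."""
    return addr.encode("utf-16-be")


def to_wire(stored: bytes) -> bytes:
    """Convert the internal form to UTF-8, the only form allowed on the wire."""
    return stored.decode("utf-16-be").encode("utf-8")


stored = to_memory(address)
wire = to_wire(stored)

print(len(address), len(stored), len(wire))
print(wire.decode("utf-8") == address)
```

The octets differ between the two forms, but the code points, and therefore
the address, are the same.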

> If you think that "mailto:" should
> represent a different layer of abstraction,

It's not that *I* think, it's that because the "mailto:" scheme is an 
URI/IRI scheme, it has to work like an URI/IRI scheme.

Please note that this isn't just an IRI issue, the issue is present in 
URIs, too. Please see the first paragraph of 
http://tools.ietf.org/html/rfc3986#section-2. For an actual example, 
please think about an URI embedded in an EBCDIC-encoded text (for those 
not old enough to remember EBCDIC, please try with UTF-16).
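
The EBCDIC case can be sketched with Python's built-in `cp037` codec, one of
several EBCDIC code pages. The URI below is a made-up example; the point is
only that the octets change with the surrounding encoding while the
characters of the URI do not.

```python
# Hypothetical URI, defined as a sequence of characters.
uri = "mailto:user@example.org"

as_ascii = uri.encode("ascii")       # embedded in ASCII text
as_ebcdic = uri.encode("cp037")      # embedded in EBCDIC (code page 037) text
as_utf16 = uri.encode("utf-16-le")   # embedded in UTF-16 text

# Three different octet sequences...
print(as_ascii != as_ebcdic, as_ascii != as_utf16, as_ebcdic != as_utf16)

# ...that all decode to the same characters, hence the same URI.
print(as_ascii.decode("ascii") == as_ebcdic.decode("cp037")
      == as_utf16.decode("utf-16-le") == uri)
```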

> e.g., whatever the
> IRI abstraction du jour happens to be,

I'm in danger of repeating myself, but I have to clearly and strongly 
disagree with (casual or on-purpose) interjections such as "abstraction 
du jour". I hope we can all avoid such terms in technical discussion.

[To give everybody some background, the design alternatives listed in 
http://tools.ietf.org/html/rfc3987#appendix-A represent the state of 
affairs circa 1995, and around that time included the current solution (which of 
course doesn't have to be listed in that appendix). As early as 1996 
(mainly motivated by 
http://tools.ietf.org/html/draft-weider-iab-char-wrkshop-00), it was 
clear that the current solution was the right one. That's roughly 16 
years ago.]

> then go back to code
> points and let RFC 6531 and similar documents impose the
> encoding restrictions.

To get back to my main original point: It's of course clear to me that 
it's RFC 6531 and friends that have to impose the encoding restrictions, 
because otherwise implementations are overly constrained and humans 
(mail addresses have to work on paper and soundwaves, too) can't use 
these things.

But this is orthogonal to the question of whether the ABNF in RFC 6531 
should be written in terms of octets or in terms of (Unicode) code 
points. Either one is fine with me, but that still doesn't mean that 
mixing is a good idea.

Having thought about the text in RFC 6531 a bit more, and looked at 
http://tools.ietf.org/html/rfc5890#section-2.3.2.1 again, I actually 
think it's not even so much a problem of mixing two levels of 
description (characters and octets) as of trying to import an ABNF rule 
when there never was one. Here is what 
http://tools.ietf.org/html/rfc6531#section-3.3 says:

 >>>>
    The following ABNF rule will be imported from RFC 5234, Appendix B.1,
    directly:

    o  <DQUOTE>

    The following ABNF rule will be imported from RFC 5890, Section
    2.3.2.1, directly:

    o  <U-label>
 >>>>

If I go to http://tools.ietf.org/html/rfc5234#appendix-B.1, I indeed find:

 >>>>
          DQUOTE         =  %x22
                                 ; " (Double Quote)
 >>>>

(which I can add directly to the grammar if my ABNF infrastructure 
doesn't have it predefined already). On the other hand, if I go to 
http://tools.ietf.org/html/rfc5890#section-2.3.2.1, I see some very 
precisely worked out definitions that I should be able to turn into 
whatever I need for my implementation in due time if I'm not lucky 
enough to find a library that has already done that. However, (and 
actually for good reasons,) even the word "ABNF" is nowhere in sight in 
RFC 5890.

So I currently think that I will propose an erratum along the following 
lines:

In http://tools.ietf.org/html/rfc6531#section-3.3, replace

<<<<
    The following ABNF rule will be imported from RFC 5890, Section
    2.3.2.1, directly:

    o  <U-label>

    The following rules are extended in ABNF [RFC5234] as follows.

    sub-domain   =/  U-label
     ; extend the definition of sub-domain in RFC 5321, Section 4.1.2
<<<<

by

 >>>>
    The following rules are extended in ABNF [RFC5234] as follows.

    sub-domain   =/  <U-label as defined in RFC 5890, encoded in UTF-8>
     ; extend the definition of sub-domain in RFC 5321, Section 4.1.2
 >>>>

There are a few other places where RFC 5890 is mentioned in RFC 6531, 
and which may have to be tweaked, too.
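
To make the proposed wording concrete, here is a rough sketch of how an
octet-level matcher might treat the extended sub-domain rule: the RFC 5321
alternative is matched on octets, and the prose alternative is handled by
decoding UTF-8 and then applying a code-point-level check. The function
`looks_like_u_label` is a deliberately crude placeholder invented for this
example; it is NOT a U-label validator in the sense of RFC 5890, and a real
implementation would call an IDNA2008 library at that point.

```python
import re

# sub-domain = Let-dig [Ldh-str], per RFC 5321, Section 4.1.2 (octet level).
LDH_SUB_DOMAIN = re.compile(rb"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def looks_like_u_label(label: str) -> bool:
    """Placeholder only: merely checks for at least one non-ASCII character.

    A real check would apply the RFC 5890 U-label definition.
    """
    return any(ord(ch) > 0x7F for ch in label)


def is_sub_domain(octets: bytes) -> bool:
    """Match the extended rule: LDH form, or a U-label encoded in UTF-8."""
    if LDH_SUB_DOMAIN.fullmatch(octets):
        return True
    try:
        label = octets.decode("utf-8")   # the "encoded in UTF-8" part
    except UnicodeDecodeError:
        return False
    return looks_like_u_label(label)     # the "U-label" part, on code points


print(is_sub_domain(b"example"))
print(is_sub_domain("b\u00fccher".encode("utf-8")))
print(is_sub_domain("b\u00fccher".encode("latin-1")))
```

The two levels stay separate here: the grammar is stated in octets, and the
only place where code points appear is behind the explicit decoding step.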

Regards,    Martin.

> (speaking only for myself, I hope obviously)
>
>     john
