Re: [Sedate] Additional issues (was: Re: Follow-Up to Interim meeting today!)

--On Thursday, October 21, 2021 12:58 +0200 Carsten Bormann
<cabo@tzi.org> wrote:

> Hi John,
> 
> thank you for this reminder.
> 
> Clearly, where we provide extensibility, we need to make sure
> that these extensions are well-defined, and we need to write
> up some requirements so people cooking up some extensions know
> what to do.  The current draft interposes an organization that
> needs to do the work (multichar namespace), and I was
> proposing to disintermediate that.
> 
> The other half of your missive seems to be about finding
> alternative representations for Unicode information.  That
> presumes we do need Unicode information.  That is a question
> that I'd like to answer first.

Actually, I tried to raise that question too, but didn't want
the note to be even longer.   In some sense, several of my
comments there could be interpreted as "going to Unicode is a
big step; avoid it if possible".  On the other hand (and the
reason I went on so long), if we are ever going to need
non-ASCII characters, this is the time to get them in because
retrofitting them later is often much harder.

> If we do not need Unicode information in the timestamps (see
> also [0]), then we can restrict the charset of the 3339+
> string [1] to ASCII.  If we do, there is one simple and one
> complex approach: The simple one is to use UTF-8, the complex
> one is to cook up one more bespoke UTF, in the same general
> class as UTF-7, percent-encoding, Punycode, etc.  I'd rather
> use UTF-8.

Mostly agreed, although I'd substitute "obvious" for "simple"
because neither choice eliminates the comparison, normalization,
security, confusion (if we care), and similar problems.
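
To make that concrete, here is a minimal Python sketch.  The label
"Zürich" and the helper name encodings() are my own illustrative
assumptions, not anything taken from the draft.  It builds the same
visible string in composed (NFC) and decomposed (NFD) form and shows
that direct UTF-8, a UTF-8-based %-encoding, and a code-point-based
encoding (Punycode) all yield mismatched results until someone
normalizes:

```python
import unicodedata
from urllib.parse import quote

# Illustrative label only; nothing here comes from the SEDATE draft.
composed = unicodedata.normalize("NFC", "Zu\u0308rich")   # uses U+00FC
decomposed = unicodedata.normalize("NFD", composed)       # "u" + U+0308


def encodings(label):
    """Return three candidate wire forms of a label (hypothetical helper)."""
    return {
        "utf8": label.encode("utf-8"),            # UTF-8 used directly
        "percent": quote(label, safe=""),         # %-encoding of UTF-8 octets
        "punycode": label.encode("punycode"),     # code-point-based ACE
    }


for name, label in (("composed", composed), ("decomposed", decomposed)):
    print(name, encodings(label))

# The two labels look identical when rendered but compare unequal in
# every representation ...
assert composed != decomposed
assert all(
    encodings(composed)[k] != encodings(decomposed)[k]
    for k in ("utf8", "percent", "punycode")
)

# ... and agree only once both sides are normalized to the same form.
assert encodings(unicodedata.normalize("NFC", decomposed)) == encodings(composed)
```

Whichever encoding is chosen, the spec would still have to say which
normalization form (if any) is required and who applies it.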

> [0]:
> https://github.com/ietf-wg-sedate/draft-ietf-sedate-datetime-extended/issues/5
> [1]:
> https://github.com/ietf-wg-sedate/draft-ietf-sedate-datetime-extended/issues/9
> 
>> On 2021-10-21, at 12:40, John C Klensin <john-ietf@jck.com>
>> wrote:
>> 
>> In my last message, I mentioned that some principles (and
>> warnings) came up during the call that were not captured in
>> the issues list.  The ones I remember are summarized below.
>> Some of these comments were Ned Freed's (and he may want to
>> elaborate if I've gotten things wrong), some mine, and a few
>> from others ... with considerable overlap.
>> 
>> (1) We have a great deal of experience with leaving stubs for
>> extensions or writing specifications without specifying the
>> details of how those extensions will actually work and be
>> integrated.  Almost all of that experience has been very bad
>> and much of it has involved interoperability problems down the
>> line. One that featured prominently for me during the meeting
>> discussion was the suggestion that IDNA and Punycode encoding
>> might be a good example.  Actually, it is a terrible one: RFCs
>> 1034 and 1035 essentially specified that a DNS label could
>> contain any octet but that octets with the high bit off were
>> to be interpreted as ASCII.  They then specified case-matching.
>> So, when people came along and wanted to use non-ASCII
>> labels, they started making things up, initially including
>> various (unidentified) parts of ISO 8859 and assorted ideas
>> about how to handle case in scripts that had it and
>> relationships they considered equivalent in other scripts.
>> IDNA, or something very like it, was necessary to avoid older
>> applications going crazy when they expected ASCII and allowed
>> nothing else.  But, equally important, it was one of the few
>> options for dealing with all of the possible interpretations
>> of octets in labels with the high bit turned on (they might
>> not even be characters) without either a flag day or a DNS
>> extension (which would also cause older applications to not
>> work as intended).    As a rather different example, Ned or I
>> could tell you a lot about the "MIME-version: 1.0" header and
>> why it could not be useful if a different version were ever
>> needed.
>> 
>> Note also that the IAB has been doing work in this area,
>> particularly in its EDM Program.  See
>> https://www.iab.org/activities/programs/evolvability-deployability-maintainability-edm-program/
>> 
>> Lesson: If things are going to be put into the spec that
>> either explicitly allow extensions or invite them from
>> people who have special needs or want to show how clever they
>> can be (the latter invite security problems as well as
>> interoperability ones), we must specify how the extension
>> mechanism works, registries if they are needed, and how
>> applications are expected to behave if they see an extension
>> they don't recognize.  Examples are important.
>> 
>> 
>> (2) For extensions, alternate sets of time zones, and so on,
>> doing things that require guessing what was intended or using
>> heuristics on content or context is a horrible idea.
>> Implementations guess wrong, and so do users, and
>> interoperability and security problems (and just plain
>> confusion) follow.
>> 
>> Lesson: We need to be explicit about mechanisms, content, and
>> identification (with registered keywords and labels or
>> otherwise).  And examples are important.
>> 
>> 
>> (3) "UTF-8" is not an answer to a requirement for non-ASCII
>> [1] characters.  Think in terms of Unicode.  UTF-8 will
>> probably turn out to be the right encoding if used directly
>> (i.e., it actually appears in an extended date-time string).
>> However, if the WG decides to create what IDNA called an
>> ASCII-compatible encoding (ACE), it may turn out to be better to
>> base that on code points (as the Punycode encoding does)
>> rather than on UTF-8 (as in the %-encoding in URIs and other
>> web contexts).  I don't have an opinion about the best answer
>> at this point, but the question needs to be examined
>> carefully.  Also, as soon as one moves past ASCII (actually,
>> even with ASCII, but we usually cope), overlapping questions
>> of normalization, comparison, case-dependent and independent
>> matching, visual confusion, and so on arise and, if
>> interoperability is of concern, need to be dealt with.  There
>> are three documents that address various parts of the issues
>> or at least illustrate what they are about:
>> 
>> 	Unicode UTS #39, "Unicode Security Mechanisms",
>> 	http://www.unicode.org/reports/tr39/
>> 	
>> 	W3C Charmod-Norm, "Character Model for the World Wide
>> 	Web: String Matching".
>> 	https://www.w3.org/TR/charmod-norm/
>> 	
>> 	RFC 8264, "PRECIS Framework: Preparation, Enforcement,
>> 	and Comparison of Internationalized Strings in
>> 	Application Protocols".
>> 
>> At least in my experience, none of these is sufficient by
>> itself for gaining an understanding of the issues or even
>> where all of the traps lie.  They also leave choices to those
>> specifying protocols or formats.    And, unless what we
>> decide to do can build on and profile PRECIS, we are going to
>> need to sort it out specifically for SEDATE. 
>> 
>> Finally, while the distinction between "used only in
>> protocols" and "visible to users" is useful, there is a
>> variation on the former of treating whatever goes into the
>> protocol as a code that must be mapped to and from local
>> forms by applications.  It can be another way of keeping the
>> syntax ASCII but still having the user see something that is
>> appropriately localized.
>> 
>> Lesson: Moving beyond ASCII is a big step and saying "use
>> UTF-8" and then waving one's hands excludes possibly useful
>> and important alternatives and leads to problems.  And,
>> especially if the spec goes beyond ASCII, examples are
>> important.
>> 
>> best,
>>    john
>> 
>> 
>> [1] "ASCII", sometimes known in the IETF as "US-ASCII", is the
>> convenient and most-used term.  Those who are concerned about
>> its being US-centric should think in terms of ISO 646 Basic
>> Version.