Re: [Sedate] Additional issues (was: Re: Follow-Up to Interim meeting today!)

Hi John,

Thank you for this reminder.

Clearly, where we provide extensibility, we need to make sure that the extensions are well-defined, and we need to write up requirements so that people cooking up extensions know what to do.  The current draft interposes an organization that needs to do the work (multichar namespace), and I was proposing to disintermediate that.

The other half of your missive seems to be about finding alternative representations for Unicode information.  That presumes we do need Unicode information.  That is a question that I’d like to answer first.

If we do not need Unicode information in the timestamps (see also [0]), then we can restrict the charset of the 3339+ string [1] to ASCII.  If we do, there are two approaches, one simple and one complex: the simple one is to use UTF-8; the complex one is to cook up one more bespoke UTF, in the same general class as UTF-7, percent-encoding, Punycode, etc.  I’d rather use UTF-8.
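
To make the options concrete, here is a minimal Python sketch of the same non-ASCII value in each form.  The sample value and the helper name is_ascii_only are made up for illustration; nothing here is taken from the draft, and the ASCII check only stands in for "restrict the charset to ASCII".

```python
from urllib.parse import quote, unquote

# Hypothetical non-ASCII value; not taken from the draft.
value = "Grüße"

# Simple approach: carry the text directly as UTF-8 octets.
utf8_bytes = value.encode("utf-8")

# ASCII-only form derived from the UTF-8 octets (percent-encoding).
percent = quote(value, safe="")  # "Gr%C3%BC%C3%9Fe"

# ASCII-only form derived from the code points (Punycode codec).
puny = value.encode("punycode").decode("ascii")


def is_ascii_only(s: str) -> bool:
    """Return True if every character in s is in the ASCII range."""
    return s.isascii()


if __name__ == "__main__":
    print(utf8_bytes)
    print(percent, is_ascii_only(percent))
    print(puny, is_ascii_only(puny))
    # Both ASCII forms decode back to the original text.
    print(unquote(percent) == value)
    print(puny.encode("ascii").decode("punycode") == value)
```

Both ASCII forms round-trip, but each needs its own decoder on the receiving side, which is the extra machinery the UTF-8 option avoids.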

Grüße, Carsten

[0]: https://github.com/ietf-wg-sedate/draft-ietf-sedate-datetime-extended/issues/5
[1]: https://github.com/ietf-wg-sedate/draft-ietf-sedate-datetime-extended/issues/9

> On 2021-10-21, at 12:40, John C Klensin <john-ietf@jck.com> wrote:
> 
> In my last message, I mentioned that some principles (and
> warnings) came up during the call that were not captured in the
> issues list.  The ones I remember are summarized below.   Some
> of these comments were Ned Freed's (and he may want to elaborate
> if I've gotten things wrong), some mine, and a few from others
> ... with considerable overlap.
> 
> (1) We have a great deal of experience with leaving stubs for
> extensions or writing specifications without specifying the
> details of how those extensions will actually work and be
> integrated.  Almost all of that experience has been very bad and
> much of it has involved interoperability problems down the line.
> One that featured prominently for me during the meeting
> discussion was the suggestion that IDNA and Punycode encoding
> might be a good example.  Actually, it is a terrible one: RFCs
> 1034 and 1035 essentially specified that a DNS label could
> contain any octet but that octets with the high bit off were to
> be interpreted as ASCII.  It then specified case-matching.  So,
> when people came along and wanted to use non-ASCII labels, they
> started making things up, initially including various
> (unidentified) parts of ISO 8859 and assorted ideas about how to
> handle case in scripts that had it and relationships they
> considered equivalent in other scripts.  IDNA, or something very
> like it, was necessary to avoid older applications going crazy
> when they expected ASCII and allowed nothing else.  But, equally
> important, it was one of the few options for dealing with all of
> the possible interpretations of octets in labels with the high
> bit turned on (they might not even be characters) without either
> a flag day or a DNS extension (which would also cause older
> applications to not work as intended).    As a rather different
> example, Ned or I could tell you a lot about the "MIME-version:
> 1.0" header and why it could not be useful if a different
> version were ever needed.
> 
> Note also that the IAB has been doing work in this area,
> particularly in its EDM Program.  See
> https://www.iab.org/activities/programs/evolvability-deployability-maintainability-edm-program/
> 
> Lesson: If things are going to be put into the spec that either
> explicitly allow extensions or invite them from people who have
> special needs or want to show how clever they can be (the latter
> invite security problems as well as interoperability ones), then
> we must specify how the extension mechanism
> works, registries if they are needed, and how applications are
> expected to behave if they see an extension they don't
> recognize.  Examples are important.
> 
> 
> (2) For extensions, alternate sets of time zones, and so on,
> doing things that require guessing what was intended or using
> heuristics on content or context is a horrible idea.
> Implementations guess wrong, and so do users, and interoperability
> and security problems (and just plain confusion) follow.
> 
> Lesson: Need to be explicit about mechanisms, content, and
> identification (with registered keywords and labels or
> otherwise).  And examples are important.
> 
> 
> (3) "UTF-8" is not an answer to a requirement for non-ASCII [1]
> characters.  Think in terms of Unicode.  UTF-8 will probably
> turn out to be the right encoding if used directly (i.e., it
> actually appears in an extended date-time string).  However, if
> the WG decides to create what IDNA called an ASCII-compatible
> encoding (ACE), it may turn out to be better to base that on code
> points (as the Punycode encoding does) rather than on UTF-8 (as
> in the %-encoding in URIs and other web contexts).  I don't have
> an opinion about the best answer at this point, but the question
> needs to be examined carefully.  Also, as soon as one moves past
> ASCII (actually, even with ASCII, but we usually cope),
> overlapping questions of normalization, comparison,
> case-dependent and independent matching, visual confusion, and
> so on arise and, if interoperability is of concern, need to be
> dealt with.  There are three documents that address various
> parts of the issues or at least illustrate what they are about:
> 
> 	Unicode UTS #39, "Unicode Security Mechanisms",
> 	http://www.unicode.org/reports/tr39/
> 	
> 	W3C Charmod-Norm, "Character Model for the World Wide
> 	Web: String Matching".
> 	https://www.w3.org/TR/charmod-norm/
> 	
> 	RFC 8264, "PRECIS Framework: Preparation, Enforcement,
> 	and Comparison of Internationalized Strings in
> 	Application Protocols".
> 
> At least in my experience, none of these is sufficient by itself
> for gaining an understanding of the issues or even where all
> of the traps lie.  They also leave choices to those specifying
> protocols or formats.    And, unless what we decide to do can
> build on and profile PRECIS, we are going to need to sort it out
> specifically for SEDATE. 
> 
> Finally, while the distinction between "used only in protocols"
> and "visible to users" is useful, there is a variation on the
> former of treating whatever goes into the protocol as a code
> that must be mapped to and from local forms by applications.  It
> can be another way of keeping the syntax ASCII but still having
> the user see something that is appropriately localized.
> 
> Lesson: Moving beyond ASCII is a big step and saying "use UTF-8"
> and then waving one's hands excludes possibly useful and
> important alternatives and leads to problems.  And, especially
> if the spec goes beyond ASCII, examples are important.
> 
> best,
>    john
> 
> 
> [1] "ASCII", sometimes known in the IETF as "US-ASCII", is the
> convenient and most-used term.  Those who are concerned about
> its being US-centric should think in terms of ISO 646 Basic
> Version.
> 