Re: [MMUSIC] Resolving IESG issues with RFC4566bis-35: a=charset

On 6/8/19 1:52 PM, Christer Holmberg wrote:
> Hi,
> 
> I don't have a strong opinion, but I would really like to avoid a "MUST-MAY-SHOULD NOT" definition.
> 
> One bold approach would be to only allow, and mandate support of, a set of charsets. The definition and usage of any additional charsets need to be defined in a separate document.

While I think it would be possible to restrict what new usages do, the 
problem is covering existing usages. (Though this may turn out to be a 
non-existent problem.)

ISTM there is an interesting problem when using any charset that 
includes multibyte sequences. A charset-dependent part always begins 
midway through a line, and consists of a byte-string followed by CR LF. 
The byte-string is a sequence of bytes not including NUL, CR, or LF.

This means that anywhere the syntax resolves to byte-string, it allows 
sequences of bytes that may not be valid in the charset in which they 
are interpreted. This is *not* a syntax error, but presumably it should 
be treated as a semantic error in context. This is not currently called 
out anywhere, and it seems like it should be.
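
As a rough illustration of the distinction (a Python sketch; the helper 
names are made up for this example and do not come from the draft), the 
syntax check and the charset check are separate steps, and a byte-string 
can pass the first while failing the second:

```python
def is_byte_string(data: bytes) -> bool:
    """Syntax check only: any byte is allowed except NUL, LF, and CR."""
    return not any(b in (0x00, 0x0A, 0x0D) for b in data)


def decodes_in_charset(data: bytes, charset: str) -> bool:
    """Semantic check: is the byte-string a valid sequence in this charset?"""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# 0xC3 opens a two-byte UTF-8 sequence, but 0x28 is not a continuation byte.
bad_utf8 = b"caf\xc3\x28"
good_utf8 = "caf\u00e9".encode("utf-8")

syntax_ok = is_byte_string(bad_utf8)                  # passes the grammar
semantic_ok = decodes_in_charset(bad_utf8, "utf-8")   # fails in UTF-8
```

The same bytes decode without error as ISO-8859-1, since every byte value 
maps to a character there, so whether the value is an error depends 
entirely on the charset in effect.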

Maybe if we call out this problem, that can be considered enough to 
resolve the ambiguity in using troublesome charsets such as UTF-16. 
But this isn't just a problem with non-default charsets - it can also 
occur with UTF-8. If this is good enough then it is a simple solution to 
the problem.
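
To make the UTF-16 trouble concrete, here is a small Python sketch (the 
function name is hypothetical, and the choice of big-endian UTF-16 
without a BOM is just an assumption for the example). It reports which 
of the three prohibited byte values show up when text is encoded in a 
given charset:

```python
FORBIDDEN = {0x00, 0x0A, 0x0D}  # NUL, LF, CR


def forbidden_bytes(text: str, charset: str) -> set:
    """Return the prohibited byte values present in the encoded text."""
    return FORBIDDEN & set(text.encode(charset))


# Plain ASCII text picks up NUL bytes when encoded as UTF-16.
ascii_in_utf16 = forbidden_bytes("SDP", "utf-16-be")

# U+010A encodes as 0x01 0x0A in UTF-16BE, so it contains an LF byte.
u010a_in_utf16 = forbidden_bytes("\u010a", "utf-16-be")

# The same text in UTF-8 contains none of the prohibited bytes.
same_in_utf8 = forbidden_bytes("SDP \u010a", "utf-8")
```

In UTF-8 every byte of a multibyte sequence is 0x80 or above, so the 
prohibited bytes can only appear if the text itself contains NUL, LF, or 
CR. That is why UTF-8 avoids this particular problem while still being 
subject to the invalid-sequence problem described above.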

	Thanks,
	Paul

> Regards,
> 
> Christer
> 
> 
> On 07/06/2019, 17.38, "mmusic on behalf of Paul Kyzivat" <mmusic-bounces@ietf.org on behalf of pkyzivat@alum.mit.edu> wrote:
> 
>      MMUSIC SDP fans,
>      
>      The message below already went to mmusic, but here I'm reducing the
>      distribution list to only mmusic so we don't spam the iesg with our
>      internal discussion.
>      
>      It seems that Alexey doesn't want to let us sweep the charset issues
>      under the rug.
>      
>      Would his suggestion to restrict the charsets permitted to be used be
>      acceptable? Repeating it:
>      
>      >> I would actually suggest that the document should tighten the definition of which charsets are allowed. For textual media types we now recommend use of UTF-8 (which should be the default) and possibly allowing a few others.
>      >>
>      >> So I suggest that the new definition of a=charset be along the lines of "MUST support UTF-8 and US-ASCII. MAY support ISO-8859-1. SHOULD NOT use any other charsets".
>      
>      	Thanks,
>      	Paul
>      
>      On 6/7/19 10:24 AM, Paul Kyzivat wrote:
>      > Alexey,
>      >
>      > When we first realized the issues with charset we thought we were very
>      > near the end of these revisions, and there didn't seem to be much taste
>      > for opening this can of worms. But the collection of IESG comments has
>      > led me to do a fair number of revisions. So I will ask again if there is
>      > willingness to make this kind of change. The main concern is with
>      > backward compatibility: is there any use in the wild of other charsets?
>      > I doubt it, but don't have any data to back that up.
>      >
>      > (The whole a=charset thing is a pain without much gain. Much trouble
>      > identifying things that are and aren't charset-dependent.)
>      >
>      >      Thanks,
>      >      Paul
>      >
>      > On 6/7/19 8:09 AM, Alexey Melnikov wrote:
>      >> Hi Barry/Paul,
>      >>
>      >> On Mon, Jun 3, 2019, at 8:54 PM, Barry Leiba wrote:
>      >>> Hi, Paul.  Sticking my oar in with Alexey's here, just on a couple of
>      >>> items:
>      >>>
>      >>>>> In Section 1:
>      >>>>>
>      >>>>> electronic mail using the MIME extensions [RFC5322]
>      >>>>>
>      >>>>> This needs another reference for MIME. E.g. RFC 2045.
>      >>>>
>      >>>> I don't understand. This paragraph is referencing examples of protocols
>      >>>> that can be used to *transport* SDP. RFC5322 references the mail message
>      >>>> format that would be used to encapsulate SDP if it were transported via
>      >>>> email. (Though it doesn't actually mention the *transport* protocols
>      >>>> used for mail messages.)
>      >>>>
>      >>>> ISTM that it is the containing protocols that should reference rfc2045.
>      >>>> RFC5322 does so, and so says how to carry SDP in mail messages. SIP is
>      >>>> itself effectively an extension to RFC2045 though it doesn't say so.
>      >>>
>      >>> Alexey's point is that you explicitly mention "MIME extensions" and
>      >>> don't provide a reference for it.  I'll go a bit farther to say that
>      >>> you're not just talking about message *format* here, but also SMTP as
>      >>> the transport (more correctly, application-layer) protocol, yes?  So
>      >>> this should say something more like, "electronic mail [RFC5321] using
>      >>> the MIME extensions [RFC2045]".  I don't think you need 5322, because
>      >>> 822 is cited by 2045, and that is obsoleted by 2822, and that by 5322.
>      >>> But I think you do need to cite SMTP and MIME.
>      >>
>      >> Yes, exactly.
>      >>
>      >>>>> In 6.10:
>      >>>>>
>      >>>>>      Note that a character set specified MUST still prohibit the use of
>      >>>>>      bytes 0x00 (Nul), 0x0A (LF), and 0x0d (CR).
>      >>>>>
>      >>>>> This doesn’t actually say what you intended. None of the common charsets
>      >>>>> prohibit these bytes. I think you meant that when using such charsets,
>      >>>>> these characters MUST NOT be used in values.
>      >>>
>      >>> Adding to what Alexey says, and maybe clarifying a bit: Character set
>      >>> and encoding are different things.  The character set is the
>      >>> abstraction of the characters used, and the encoding is how they're
>      >>> represented.  The encoding is what creates the bytes on the wire.  One
>      >>> problem is that "ASCII" refers to both, so it's confusing.  But with
>      >>> Unicode, "Unicode" is the character set and "UTF-8" is (usually) the
>      >>> encoding.
>      >>
>      >> Right. And the term "charset" means an encoding of a particular
>      >> character set. It might be worth using it below.
>      >>
>      >>> But your point here is that the three byte values you list MUST NOT
>      >>> appear in the string, and that has nothing to do with the character
>      >>> set or the encoding.  Those three bytes are prohibited.
>      >>>
>      >>> You say that quite well in Section 5:
>      >>>
>      >>>     Text-containing fields such as the session-name-field and
>      >>>     information-field are octet strings that may contain any octet with
>      >>>     the exceptions of 0x00 (Nul), 0x0a (ASCII newline), and 0x0d (ASCII
>      >>>     carriage return).
>      >>>
>      >>> ... and in 5.13:
>      >>>
>      >>>     Attribute values are octet strings, and MAY use any octet value
>      >>>     except 0x00 (Nul), 0x0A (LF), and 0x0D (CR).
>      >>>
>      >>> But in 6.10 I think you want something more like this:
>      >>>
>      >>> OLD
>      >>>     Note that a character set specified MUST still prohibit the use of
>      >>>     bytes 0x00 (Nul), 0x0A (LF), and 0x0d (CR).  Character sets requiring
>      >>>     the use of these characters MUST define a quoting mechanism that
>      >>>     prevents these bytes from appearing within text fields.
>      >>> NEW
>      >>>     Note that the restriction specified in Section 5 applies: these strings
>      >>>     MUST NOT contain the bytes 0x00 (Nul), 0x0A (LF), and 0x0d (CR).
>      >>>     Character encodings that use these bytes MUST define a quoting
>      >>>     mechanism that prevents these bytes from appearing within the text
>      >>>     strings.
>      >>> END
>      >>
>      >> I think this is much better, although "use these bytes" is still
>      >> ambiguous. E.g. if these bytes are used to shift between encoding
>      >> modes within a particular charset, then there is a problem. If they
>      >> are just used to convey specific characters, it might not be.
>      >>
>      >> However, see my comment below.
>      >>
>      >>>> I don't recall what the state of character set definitions was in 1998
>      >>>> when this was first published. But it appears that they got carried away
>      >>>> and over-generalized. It is easy to understand how one might choose to
>      >>>> use ISO 8859-1 rather than UTF-8 since they are closely related and
>      >>>> byte-oriented. But it is unclear how one might use some other registered
>      >>>> charsets, such as EBCDIC, or other encodings of ISO 10646, such as
>      >>>> UTF-16.
>      >>>>
>      >>>> The bottom line is that use of alternate charsets other than 8859-1 is
>      >>>> underspecified. We considered revamping the definition of charset, but
>      >>>> didn't want to open that can of worms, since in practice it isn't an
>      >>>> issue.
>      >>>
>      >>> I appreciate that, and I think this isn't the place to tackle that.
>      >>> So we just need to get the text here to accurately reflect what you're
>      >>> trying to say.
>      >>
>      >> I would actually suggest that the document should tighten the
>      >> definition of which charsets are allowed. For textual media types we
>      >> now recommend use of UTF-8 (which should be the default) and possibly
>      >> allowing a few others.
>      >>
>      >> So I suggest that the new definition of a=charset be along the lines
>      >> of "MUST support UTF-8 and US-ASCII. MAY support ISO-8859-1. SHOULD
>      >> NOT use any other charsets".
>      >>
>      >>> Hoping to be helpful,
>      >>> Barry
>      >>>
>      >>
>      >
>      > _______________________________________________
>      > mmusic mailing list
>      > mmusic@ietf.org
>      > https://www.ietf.org/mailman/listinfo/mmusic
>      
>