Re: [apps-discuss] Objection to processing draft-ietf-appsawg-text-markdown-* documents as WG drafts (was: Re: Benoit Claise's Discuss on draft-ietf-appsawg-text-markdown-use-cases-02: (with DISCUSS and COMMENT))

Sean Leonard <dev+ietf@seantek.com> Tue, 14 July 2015 16:10 UTC

Message-ID: <55A53423.5000003@seantek.com>
Date: Tue, 14 Jul 2015 09:09:07 -0700
From: Sean Leonard <dev+ietf@seantek.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>, John C Klensin <john-ietf@jck.com>, The IESG <iesg@ietf.org>
References: <BC704810D276B2B3DD5EFBAE@JcK-HP8200.jck.com> <55A36D59.1010101@it.aoyama.ac.jp>
In-Reply-To: <55A36D59.1010101@it.aoyama.ac.jp>
Content-Type: multipart/signed; protocol="application/pkcs7-signature"; micalg="sha1"; boundary="------------ms030909010906030605030704"
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/RtpwhO6TUVC52ym0uPhWSrIOBes>
Cc: appsawg-chairs@ietf.org, apps-discuss@ietf.org, draft-ietf-appsawg-text-markdown-use-cases.shepherd@ietf.org, draft-ietf-appsawg-text-markdown-use-cases@ietf.org, draft-ietf-appsawg-text-markdown-use-cases.ad@ietf.org, Benoit Claise <bclaise@cisco.com>
Subject: Re: [apps-discuss] Objection to processing draft-ietf-appsawg-text-markdown-* documents as WG drafts (was: Re: Benoit Claise's Discuss on draft-ietf-appsawg-text-markdown-use-cases-02: (with DISCUSS and COMMENT))
Precedence: list

On 7/13/2015 12:48 AM, Martin J. Dürst wrote:
> On 2015/07/13 08:20, John C Klensin wrote:
>
>> For example, the first paragraph of Section 1.1 of
>> draft-ietf-appsawg-text-markdown includes "a linear sequence of
>> characters in some character set (code)".  That just isn't
>> acceptable terminology.  Not only does it not conform to the
>> recommendations of RFC 6365, but, in a slightly different
>> environment, it would probably be read as meaning something
>> entirely different from what was probably intended.
>
> Agreed. "Character set" as used in RFC 2046 is not a term we want to 
> use in 2015.

To be clear, the drafts got a significant amount of discussion across 
the board, which resulted in the documents submitted to the IESG.

The key objections causing disgruntlement seem to be about terminology 
in or related to the introductory matter. The introduction used to be 
significantly longer (see draft-ietf-appsawg-text-markdown-02), but was 
pared down after significant community input. If something is wrong 
with the history or the terminology, that can be corrected, but it is 
a historical or factual error; it is not something that compromises 
the technical integrity of the standard.

The point of the first paragraph is to acknowledge a continuum between 
textual data and binary data (binary data having been edited out, and 
subsequently restored--see prior e-mail), and then to acknowledge the 
limitations of in-band signaling with control characters that are part 
of the (coded) character set. The raison d'être for markup languages is 
that signaling with control characters has proven to be impractical or 
nonextensible. There are many standards (ISO 2022, ISO 6429/ECMA-48, 
etc.) that deal with control characters--even a family of standards for 
word processing interchange with control characters for formatting and 
whatnot. (I forgot that particular standard--does someone know it?) And 
most people don't use them these days, because markup is just easier to 
deal with.

When I (re)wrote the introduction, I relied on the Unicode definition of 
"plain text":

    /Plain Text/ <http://unicode.org/glossary/#plain_text>.
    Computer-encoded text that consists /only/ of a sequence of code
    points from a given standard, with no other formatting or
    structural information. Plain text interchange is commonly used
    between computer systems that do not share higher-level protocols.
    (See also /rich text/ <http://unicode.org/glossary/#rich_text>.)

I also relied on the definition of the "plain" subtype of "text" from 
RFC 2045/2046:

        4.1.3. Plain Subtype
        <http://tools.ietf.org/html/rfc2046#section-4.1.3>

    The simplest and most important subtype of "text" is "plain".  This
    indicates plain text that does not contain any formatting commands or
    directives. Plain text is intended to be displayed "as-is", that is,
    no interpretation of embedded formatting commands, font attribute
    specifications, processing instructions, interpretation directives,
    or content markup should be necessary for proper display.  The
    default media type of "text/plain; charset=us-ascii" for Internet
    mail describes existing Internet practice.  That is, it is the type
    of body defined by RFC 822 <http://tools.ietf.org/html/rfc822>.

Those definitions blur the distinction between a "character set" 
(which is, literally, a set of abstract characters; Unicode has 
/Character Set/ <http://unicode.org/glossary/#character_set>: "A 
collection of elements used to represent textual information.") and a 
"coded character set" [RFC 6365]. The sentence in the text-markdown 
draft only mirrors what is already out there. Note that RFC 6365 does 
not define "character set"; it defines only "coded character set". So 
the objection lacks basis.

If we in the software industry are going to quibble about definitions, 
it sure would be nice for the SDOs to come together and establish a 
uniform set of definitions for these basic constructs, worded exactly 
the same across the different SDOs.

>
>> That
>> paragraph goes on to say "Because they are non-printing, these
>> characters" (referring to "line breaks, page breaks, or other
>> control characters) "are also hard to enter with standard
>> keyboards."   At least for European writing systems, that is
>> plain silly unless one has a keyboard that lacks an "Enter" or
>> "Return" function or is using a _very_ strange input method
>> editor (IME).
>
> For line breaks, I'd argue that they are very easy to enter on pretty 
> much any keyboard, because pretty much any keyboard has an Enter key 
> (and because pretty much every modern language has line- or paragraph 
> breaks).
>
> For page breaks and other control characters, I'd argue that they are 
> difficult to enter with the average keyboard, in any language.
>
> So just remove "line breaks", and the problem should be fixed.

Actually line breaking is and remains a persistent issue. When "Return" 
or "Enter" is pressed on a keyboard, the keyboard generates scancodes, 
typically 5A (MAKE) and F0 5A (RELEASE) in PS/2 scan code set 2. It is 
up to the operating system to interpret these, and the result may 
variously be CR, LF, CRLF, or something else entirely (activating a 
button, or for the under-35 crowd, sending a text). If your *intent* on 
a Mac OS X machine is to generate the control character CR, you are in 
for a tough time.
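
As a rough illustration of why the result of "Enter" is not a single 
thing, here is a minimal Python sketch. The sample byte strings and the 
helper name split_lines are made up for this message; the only point is 
that the same two-line text arrives with three different line 
terminators, and that a consumer has to accept all of them.

```python
# Three byte sequences that one "Enter" keypress may end up as,
# depending on the platform and application that handled the key event.
samples = {
    "LF (Unix, OS X)": b"first\nsecond",
    "CRLF (Windows, Internet mail)": b"first\r\nsecond",
    "CR (classic Mac OS)": b"first\rsecond",
}


def split_lines(data: bytes) -> list[str]:
    """Decode ASCII octets and split on CR, LF, or CRLF alike."""
    # str.splitlines() treats each of CR, LF, and CRLF as one boundary.
    return data.decode("ascii").splitlines()


parsed = {label: split_lines(data) for label, data in samples.items()}

for label, lines in parsed.items():
    print(f"{label:30} {lines}")
```

All three inputs split into the same two lines, even though the octets 
between them differ.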

We may make a general distinction between control characters and 
non-control characters (which, variously, are referred to as printable 
characters or graphic characters). Markup languages encode formatting 
information in non-control characters, by overloading the meaning of 
some non-control characters beyond the meaning assigned by the character 
set.
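
To make the distinction concrete, here is a small Python sketch that 
classifies a handful of characters using the Unicode general category. 
Treating category "Cc" as "control" and everything else as "non-control" 
is an assumption made for this illustration, and the helper name 
is_control is hypothetical. The punctuation characters listed are ones 
that Markdown gives an additional meaning.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """True for C0/C1 control characters (Unicode general category Cc)."""
    return unicodedata.category(ch) == "Cc"


# CR, LF, and FF are control characters; the rest are punctuation or
# symbols that a markup language may overload with extra meaning.
for ch in ["\r", "\n", "\f", "*", "#", ">"]:
    kind = "control" if is_control(ch) else "non-control"
    # Control characters have no Unicode character name, so supply a
    # placeholder instead of letting unicodedata.name() raise.
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{ord(ch):04X}  {kind:12} {name}")
```

Note that the character database reports only the assigned name of each 
non-control character; it says nothing about "list item", "heading", or 
"close tag". Those meanings come from the markup language.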

>
>
>> The next paragraph goes on to make a suggestion
>> about "overload certain characters with additional meanings". At
>> least for SGML (and its descendents), that is not the way what
>> happens is described.  I'd suggest it is even less true of
>> LaTex, but YMMD.  What might be intended is something like
>> "certain characters or character sequences are treated as
>> reserved delimiters, with the strings they delimit acting as
>> processing, identification, or formatting directions".
>
> I'd agree that this isn't how SGML or LaTeX would describe it, but 
> it's not actually in any way wrong. In XML, depending on context, '>' 
> means itself or "close tag" (or any of a few more obscure meanings). 
> We are still in the introduction, after all.

Yes, exactly. According to Unicode, ISO 646, RFC 20, and USASI 
X3.4-1968, ">" (code point 3/14) means "GREATER THAN SIGN". That 
standard code has also been the policy of the U.S. Government since 
Lyndon B. Johnson's approval in 1968 (see 
http://www.presidency.ucsb.edu/ws/index.php?pid=28724 ). Any meaning 
other than "GREATER THAN SIGN", such as "close tag", is exactly that: an 
additional meaning.

>
>
>> Continuing with this theme, the "charset" portion of Section 2
>> of draft-ietf-appsawg-text-markdown-06 says:
>>
>>     "...will get along just fine by operating on character
>>     codes that lie in printable US-ASCII, blissfully
>>     oblivious to coded values outside of that range."
>>
>> I don't know what that means in spite of being regularly
>> mistaken for an expert in the area.  Given that you want to be
>> CCS-independent (see RFC6365), I think the first part probably
>> refers to "graphic characters in the ASCII repertoire", but I
>> don't know what "blissfully oblivious..." is trying to tell me.
>> Is it that each Markdown processor has, or assumes, a CCS and
>> encoding and, if anything is encountered outside that range or
>> is a non-graphic character in that range, it will be ignored?
>> Noting that set of exclusions would ignore the character known
>> as SP, I suggest that any such Markdown processor would be
>> seriously broken.  It is more likely that the sentence is wrong.
>
> I know what the sentence tries to refer to. It refers to the fact 
> that, as long as the syntax you are looking at only uses 7-bit bytes 
> values with ASCII semantics (including the usual control characters 
> and space), a processor will work for a wide range of character 
> encodings.
>
> The problem is that this applies to some encodings, but not to others. 
> The criterion is a certain kind of strong ASCII-compatibility, namely 
> that characters in the ASCII range (including C0) are always 
> represented directly as 7-bit bytes, and that these 7-bit bytes always 
> represent the corresponding characters and nothing else.
>
> Encodings that qualify are of course US-ASCII itself, straightforward 
> 8-bit encodings starting with iso-8859-1 and including vendor 
> encodings (Windows, Mac,...), some multibyte encodings such as GB2312 
> and EUC-JP, and UTF-8 (but beware of the BOM). However, it does not 
> include some other encodings, in particular not iso-2022-jp, 
> Shift_JIS, GBK, or GB18030, and of course not UTF-(16|32)(LE|BE).

Many prominent Markdown processors take a text stream as input and 
produce a text stream as output. By "text stream", I mean that the bits 
from disk (or memory or some other source) are interpreted by a software 
library or runtime, such as Perl, Python, the C++ standard library, 
etc., and are given to the Markdown processor in code units. A 
JavaScript-based Markdown interpreter, for example, almost never 
operates on raw octets; it takes input from some other source that has 
already been dumped into strings. Strings in JavaScript are 
(virtualized as) sequences of 16-bit UCS-2 or UTF-16 code units. It 
doesn't matter if the source is in ISO-2022-JP, GBK, ISO-8859-1, EBCDIC, 
or UTF-32BE. What matters is that the Markdown processor looks for 
things like "*", and when it finds "*" in certain configurations, it 
replaces it with HTML bulleted list tags; in other configurations, it 
replaces it with <em> tags; and in still other configurations, it does 
not change the output at all.
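
Here is a minimal Python sketch of that arrangement. The emphasize 
function is a toy stand-in that handles only *word* emphasis, not a real 
Markdown processor, and the sample text and names are made up for this 
message. The I/O layer decodes the octets; the "processor" only ever 
sees decoded strings, so the charset of the source makes no difference 
to it.

```python
import re


def emphasize(text: str) -> str:
    """Toy rule only: *word* becomes <em>word</em>."""
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", text)


SAMPLE = "some *emphasized* text"
# Python codec names; cp500 is one of the EBCDIC code pages.
CHARSETS = ["iso2022_jp", "gbk", "latin-1", "cp500", "utf-32-be"]


def render_from(charset: str) -> str:
    octets = SAMPLE.encode(charset)    # what sits on disk or on the wire
    decoded = octets.decode(charset)   # what the I/O library hands over
    return emphasize(decoded)


results = {name: render_from(name) for name in CHARSETS}
for name, html in results.items():
    print(f"{name:12} {html}")

# The octets themselves differ widely between charsets.
octets_differ = SAMPLE.encode("cp500") != SAMPLE.encode("latin-1")

# Scanning raw octets instead of decoded text is a different matter:
# in Shift_JIS, U+8868 is two bytes and the second one is 0x5C, the
# same value as an ASCII backslash.
trail_byte_hazard = 0x5C in "表".encode("shift_jis")
```

Every entry in results is the same HTML. The last two lines show the 
case Martin describes: the ASCII-compatibility of the encoding becomes 
an issue only for a processor that inspects octets before decoding.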

The common theme of Markdown processors is that Markdown operates on 
"punctuation" <http://daringfireball.net/projects/markdown/syntax>:

    To this end, Markdown's syntax is comprised entirely of punctuation
    characters, which punctuation characters have been carefully chosen
    so as to look like what they mean.

...which, by the way, is exactly what that part of the text-markdown 
draft says, right before the quoted part about being "blissfully oblivious".

Bear in mind that this discussion is being had in the context of the 
"charset" parameter, which /combines/ the abstract character set with 
the character encoding. It seems that folks are asking for separation in 
a context where separation does not exist.

>
> So the text should definitely be more careful here.

I suppose the text could be more careful, but when reading the whole 
paragraph, I do not see a problem. In fact, to the extent that a 
Markdown processor (or the input libraries upon which it relies) 
interprets content in a different (coded) character set than the author 
intended, it is very likely that the Markdown processor will not 
fail--it will just output something crazy. Markdown processors rarely 
fail; they are not designed to. That is why the text uses the phrase 
"blissfully oblivious".
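
A small Python sketch of that failure mode, again with a toy emphasize 
function standing in for a real Markdown processor and with a made-up 
sample sentence. The author writes UTF-8, the input layer wrongly 
assumes ISO-8859-1, and nothing raises an error.

```python
import re


def emphasize(text: str) -> str:
    """Toy rule only: *word* becomes <em>word</em>."""
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", text)


source = "Zürich is *très* chic"
octets = source.encode("utf-8")        # what the author actually saved

# Wrong guess by the input layer. ISO-8859-1 assigns a character to
# every byte value, so this decode can never fail.
misread = octets.decode("latin-1")

rendered = emphasize(misread)
print(rendered)
```

The accented letters come out as two-character garbage, but the 
asterisks are untouched, so the <em> tags still land in the right 
places. The processor never notices that anything went wrong.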

Sean
