[apps-discuss] Fwd: I-D Action: draft-klensin-ftpext-typeu-00.txt

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Mon, 02 April 2012 03:11 UTC

Message-ID: <4F7918C0.9020204@it.aoyama.ac.jp>
Date: Mon, 02 Apr 2012 12:10:56 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: John C Klensin <klensin@jck.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: "apps-discuss@ietf.org" <apps-discuss@ietf.org>
Subject: [apps-discuss] Fwd: I-D Action: draft-klensin-ftpext-typeu-00.txt
Precedence: list

Hello John, others,

Because there was a new version, I took a look, and I have a few 
comments that I don't want to withhold.

*For my main comment, please see the end of this mail.*



The history section should be relegated to an appendix or removed. [I 
enjoyed reading [RFC0373]; it's very interesting to compare to what we 
have now. It's not completely possible for me to judge how much of it 
was visionary at the time, and how much common sense, maybe in other 
circles than the ARPAnet].



       "So, with allowances for those line termination problems --
    which have been a large issue in many cases -- Image ("binary") and
    ASCII transfers were almost equivalent and the TYPE command became
    less-used."

I'm not a frequent user of FTP anymore, but I think this observation is 
correct. I therefore strongly wonder why we would need this new TYPE. 
For more, please see below.



    "several variations on UTF-16 (possibly with surrogate pairs)":

This is highly misleading. UTF-16 always includes the potential for 
surrogate pairs, by definition. [Of course, many actual data files don't 
include them, but that doesn't have to be called out.] The thing that 
doesn't allow surrogate pairs is called UCS-2 (ISO-10646-UCS-2 in 
http://www.iana.org/assignments/character-sets).
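
A minimal Python sketch of the distinction, not taken from the draft. The helper names are hypothetical, and the UCS-2 check simply assumes that UCS-2 covers only code points up to U+FFFF:

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units (big-endian, no BOM) for text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


def fits_in_ucs2(text: str) -> bool:
    """Assumption: UCS-2 can represent only code points up to U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_text = "D\u00fcrst"      # all characters inside the BMP
clef = "\U0001D11E"          # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_code_units(clef)])   # ['0xd834', '0xdd1e']
print(fits_in_ucs2(bmp_text), fits_in_ucs2(clef))  # True False
```

Any UTF-16 decoder has to handle the second case; whether a particular file happens to contain such characters is a property of the data, not of the encoding.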



                            "When those files are transferred to another
    system with Image type, the result may be completely uninterpretable
    on the target system."

This is of course possible, in particular for executables and the like. 
But I think there is much less variety in formats, and much more 
versatility in tools, so this is less an issue than it was, and even if 
it continues to be an issue, it's not something that can be solved with 
a simple type.



                                "by sending the data in a stream
    conformant to the Net-Unicode format specified in Section 3."

This is confusing, because Net-Unicode is defined in RFC 5198, not in 
section 3. Can probably be fixed with a small wording change.



   "This section specifies a profile of Net-Unicode [RFC5198] for use
    with FTP TYPE U."

In German, there's a saying "Meister, die Arbeit ist fertig, soll ich sie 
gleich flicken?" (Master, I completed my work, can I start to fix it?) 
It's used when something is created already broken. That may not be the 
case for Net-Unicode, but to me a sentence like the above just exudes 
the feeling that the original definition may be broken, sorry.



MAIN COMMENT STARTS HERE

   "Unicode characters must be transmitted in UTF-8 [RFC3629] as
    specified for Net-Unicode."

This brings me to my main point, which is that the proposal isn't really 
implementable. It assumes that, like in the old days, there is a single 
textual encoding per computer. This worked very well at a time when some 
computers were using (7-bit) US-ASCII, and others were using the basic 
version of EBCDIC (for those who might not know, there are lots of 
EBCDIC variants including double-byte variants for East Asia).

   "However, migration to Unicode has reintroduced many of the old
    issues.  When Unicode is used inside a system, it can be used with
    several different encodings (e.g., UTF-8 and several variations on
    UTF-16 (possibly with surrogate pairs), different assumptions about
    normalization (see "Terminology for Use in Internationalization"
    [i18n-terms] for more discussion) and even new variations on line
    termination conventions.  When those files are transferred to another
    system with Image type, the result may be completely uninterpretable
    on the target system."

This mostly has it wrong. The issue of many different character 
encodings has been around for a long time before Unicode. FTP hasn't 
done anything about this, and apparently has been fine (mostly because 
applications and users know how to deal with the problem: If you see 
garbage on screen, try another application or use another setting).

Unicode is helping in that it greatly *reduces* variation, but in the 
meantime, it adds variation because it leads to a net increase of 
character encodings (even if there were only one encoding form of 
Unicode). The fact that Unicode can come in different encoding forms and 
other variations can definitely be annoying, but a new TYPE is not a 
solution. Why? Because it's not like in the old days that one maker's 
products would use one encoding, and another would use another. There's 
overall probably more UTF-16, and more of the LE variety, on a Windows 
system than on a Linux or Mac system, but there's a lot of UTF-8 on all 
of them, and there's usually also quite a bit of legacy data (e.g. 
Shift_JIS or EUC-JP on a Japanese system) and on average, the OS doesn't 
have a clue about the encoding of the file.

Applications make guesses, and they often get it right. An FTP 
implementation could make guesses, but the problem is that the guess is 
too early; if it is wrong, then we get weird double encodings. That's 
different from a text editor making a guess; the user can try other 
encodings from a menu until the stuff is right, and the binary data 
isn't messed up.
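
To make the "double encoding" failure concrete, here is a small Python sketch. The function transcode_to_utf8 is a hypothetical stand-in for what a sending FTP implementation would have to do, not something the draft specifies:

```python
def transcode_to_utf8(raw: bytes, guessed_encoding: str) -> bytes:
    """Decode raw using a guessed encoding, then re-encode as UTF-8."""
    return raw.decode(guessed_encoding).encode("utf-8")


# The file on disk is already UTF-8.
original = "D\u00fcrst".encode("utf-8")            # b'D\xc3\xbcrst'

right = transcode_to_utf8(original, "utf-8")       # correct guess
wrong = transcode_to_utf8(original, "latin-1")     # wrong guess

print(right == original)       # True: bytes are unchanged
print(wrong)                   # b'D\xc3\x83\xc2\xbcrst'
print(wrong.decode("utf-8"))   # DÃ¼rst
```

With the wrong guess, the bytes that reach the receiver are themselves altered, so no choice of encoding in the receiving application shows the original text directly. A text editor that guesses wrong only displays garbage and leaves the bytes alone.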

The situation is even worse for normalization, in the sense that no OS 
has any clue about whether files are normalized or not.
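
As a small illustration in Python (3.8 or later for unicodedata.is_normalized; the helper name is hypothetical), the only way to learn the normalization state of a text is to decode it and inspect the content, which again presupposes knowing the encoding:

```python
import unicodedata


def normalization_state(text: str) -> dict:
    """Report whether text is already in NFC and/or NFD."""
    return {form: unicodedata.is_normalized(form, text) for form in ("NFC", "NFD")}


composed = "D\u00fcrst"       # u with diaeresis as one code point
decomposed = "Du\u0308rst"    # u followed by COMBINING DIAERESIS

print(normalization_state(composed))     # {'NFC': True, 'NFD': False}
print(normalization_state(decomposed))   # {'NFC': False, 'NFD': True}
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Both strings display identically, and nothing in the file system records which form a given file uses.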

The line ending issue is not as hopeless, in that it's easy to write a 
filter that converts all line endings e.g. to CRLF, even including the 
(quite rare in actual use) new Unicode ones. So the draft might make 
marginal sense if limited to line-ending issues for UTF-8 only. But then 
again, because that's an easy problem, there are lots of applications 
out there that deal with this already, including any serious text editor.
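
Such a filter could look roughly like the following Python sketch. The set of line terminators handled here (CR LF, CR, LF, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR) is an assumption about what "the new Unicode ones" should cover, and the function names are hypothetical:

```python
import re

# CR LF must come first so that it is treated as one terminator, not two.
# \x85 is NEL, \u2028 is LINE SEPARATOR, \u2029 is PARAGRAPH SEPARATOR.
_LINE_END = re.compile("\r\n|[\r\n\x85\u2028\u2029]")


def to_crlf(text: str) -> str:
    """Replace every recognized line terminator with CR LF."""
    return _LINE_END.sub("\r\n", text)


def to_crlf_utf8(raw: bytes) -> bytes:
    """Same filter for data already known to be UTF-8."""
    return to_crlf(raw.decode("utf-8")).encode("utf-8")


sample = "a\nb\rc\r\nd\u2028e"
print(repr(to_crlf(sample)))   # 'a\r\nb\r\nc\r\nd\r\ne'
```

Note that the filter only works once the encoding is known to be UTF-8, and that mapping PARAGRAPH SEPARATOR to CR LF discards the line/paragraph distinction.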


So overall, I don't think the Apps WG should spend time on this, 
unless we hear loud voices from actual FTP implementers that tell us 
that this is needed and will be implementable in a way that solves 
actual problems.

Regards,    Martin.
