Re: Volunteer needed to serve as IANA charset reviewer

"Mark Davis" <mark.davis@icu-project.org> Thu, 07 September 2006 17:40 UTC

Message-ID: <30b660a20609061644o56e49c9etfdf2f9f96e03d6fa@mail.gmail.com>
Date: Wed, 06 Sep 2006 16:44:42 -0700
From: Mark Davis <mark.davis@icu-project.org>
To: Ned Freed <ned.freed@mrochek.com>
Subject: Re: Volunteer needed to serve as IANA charset reviewer
In-Reply-To: <01M6VLE70BJ60008CX@mauve.mrochek.com>
References: <p06240600c124bdb12d16@10.0.1.2> <BDA09F0B9086491428F8F2FC@p3.JCK.COM> <01M6VLE70BJ60008CX@mauve.mrochek.com>
Cc: Ted Hardie <hardie@qualcomm.com>, ietf-charsets@iana.org, discuss@apps.ietf.org

If the registry provided an unambiguous, stable definition of each charset
identifier in terms of an explicit, available mapping to Unicode/10646
(whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just
a difference in format, not content), it would indeed be useful. However, I
suspect quite strongly that it is a futile task. There are a number of
problems with the current registry.

1. Poor registrations (minor)
Some registered charset names do not conform syntactically to the
registration spec.

2. Incomplete (more important)
There are many charsets (such as some windows charsets) that are not in the
registry, but that are in *far* more widespread use than the majority of the
charsets in the registry. Attempted registrations have simply been left
hanging; cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html

3. Ill-defined registrations (crucial)
  a) There are registered names with useless (inaccessible or unstable)
references; there is no practical way to determine what the charset
definition is.
  b) Other registrations are defined by reference to an available chart, but
when you test what the vendor's APIs map to, they *use* a different
definition: for example, the chart may say that 0x80 is undefined, yet the
implementation maps it to U+0080.
  c) The RFC itself does not settle important issues of identity among
charsets. If a new mapping is added to a charset converter, is that a
different charset (and thus needs a different registration) or not? Does
that go for any superset? And so on. We have raised these issues before, but
with no resolution (or even an attempt at one); cf.
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html
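To make (b) concrete, here is a small sketch using Python's bundled codecs
(the codec names are Python's, not necessarily the IANA labels). The
windows-1252 chart and real implementations visibly disagree about the
0x80-0x9F range:

```python
# 0x80 is the euro sign U+20AC under windows-1252 ...
assert b'\x80'.decode('cp1252') == '\u20ac'

# ... but the C1 control character U+0080 under iso-8859-1:
assert b'\x80'.decode('latin-1') == '\u0080'

# 0x81 is undefined in the windows-1252 chart; Python's strict decoder
# rejects it, while some other platforms silently pass it through as U+0081.
try:
    b'\x81'.decode('cp1252')
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```

Two implementations can thus both claim to be "windows-1252" yet convert the
same byte stream to different Unicode text.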

As a result of the above problems, the actual results obtained by using the
IANA charset names on any given platform* may vary wildly. For example,
among the IANA-registry-named charsets, we found over a million mapping
differences in total between Sun's and IBM's Java implementations.

* "platform" in the broad sense -- the results may vary by OS (Mac vs.
Windows vs. Linux...), by programming language (Java), by implementation of
the language runtime (IBM's vs. Sun's Java), or even by product (database
version).
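A miniature of this kind of comparison can be run between any two codecs a
single runtime exposes; the same loop, pointed at two different platforms'
converters for the "same" charset name, is how divergences like the
Sun-vs-IBM one surface:

```python
def mapping_diffs(codec_a: str, codec_b: str) -> int:
    """Count the single-byte inputs on which two codecs disagree."""
    diffs = 0
    for b in range(256):
        raw = bytes([b])
        try:
            a = raw.decode(codec_a)
        except UnicodeDecodeError:
            a = None  # treat "unmapped" as its own outcome
        try:
            c = raw.decode(codec_b)
        except UnicodeDecodeError:
            c = None
        if a != c:
            diffs += 1
    return diffs

# cp1252 and latin-1 disagree on exactly the 32 bytes 0x80-0x9F:
print(mapping_diffs('cp1252', 'latin-1'))  # -> 32
```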

In ICU, for example, our requirement was to be able to reproduce the actual,
observable character conversions in effect on any platform. With that goal,
we essentially had to give up on using the IANA registry at all. We build
mappings by scraping: calling the conversion APIs on each platform,
collecting the results, and assigning a distinct internal identifier to each
distinct mapping. We then maintain a separate name table that goes from each
platform's name for each charset to the unique identifier. Cf.
http://icu.sourceforge.net/charts/charset/.
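The scrape-and-fingerprint idea can be sketched in a few lines (the helper
name here is invented for illustration; this is not ICU's actual API). A
converter is identified by its observed byte-to-code-point table, so two
platform names share an identifier only when their behavior is identical:

```python
import hashlib

def mapping_signature(codec_name: str) -> str:
    """Fingerprint a single-byte codec by its full byte -> code point table."""
    table = []
    for b in range(256):
        try:
            table.append(bytes([b]).decode(codec_name))
        except UnicodeDecodeError:
            table.append(None)  # record unmapped bytes as a distinct outcome
    return hashlib.sha256(repr(table).encode()).hexdigest()[:16]

# Python's 'iso8859-1' is an alias of 'latin-1', so the fingerprints match;
# 'cp1252' behaves differently and therefore gets a different identifier.
assert mapping_signature('iso8859-1') == mapping_signature('latin-1')
assert mapping_signature('cp1252') != mapping_signature('latin-1')
```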

And based on work here at Google, it is pretty clear that -- at least in
terms of web pages -- little reliance can be placed on the charset
information. As imprecise as heuristic charset detection is, it is more
accurate than relying on the charset tags in the HTML meta element (and what
is in the HTML meta element is more accurate than what is communicated by
the HTTP protocol).
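One reason even a crude heuristic can beat a declared label: bytes that form
valid multi-byte UTF-8 almost never arise from legacy-encoded text by
accident. A minimal sketch of that single check (probably_utf8 is an
invented helper, not anyone's actual detector):

```python
def probably_utf8(data: bytes) -> bool:
    """True if data is valid UTF-8 and contains at least one multi-byte
    sequence (pure ASCII is consistent with nearly any label)."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)

declared = 'iso-8859-1'                  # hypothetical meta-tag value
body = 'na\u00efve caf\u00e9'.encode('utf-8')  # the page's actual bytes
effective = 'utf-8' if probably_utf8(body) else declared
assert effective == 'utf-8'              # the heuristic overrides the label
```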

So while I applaud your goal, I suspect it would be a huge amount of effort
for very little return.

Mark


> I agree that we've reached a point where "use UTF-8" is what we need to be
> pushing for in new protocol development. (Note that I said UTF-8 and not
> Unicode - given the existence of gb18030 [*] I don't regard a
> recommendation of "use Unicode" as even close to sufficient. The last
> thing we want is to see the development of specialized Unicode CESes for
> Korean, Japanese, Arabic, Hebrew, Thai, and who knows what else.) And if
> the reason there are new charset registrations was because of the
> perceived need to have new charsets for use in new protocols, I would be
> in total agreement that a change in focus for charset registration is in
> order.
>
> But that's not why we're seeing new registrations. The new registrations
> we're seeing are of legacy charsets used in legacy applications and
> protocols that for whatever reason never got registered previously. Given
> that these things are in use in various nooks and crannies around the
> world, it is critically important that when they are used they are
> labelled accurately and consistently.
>
> The plain fact of the matter is that we have done a miserable job of
> producing an accurate and useful charset registry, and considerable work
> needs to be done both to register various missing charsets and to clean
> up the existing registry, which contains many errors. I've seen no
> interest whatsoever in registering new charsets for new protocols, so to
> my mind pushing back on, say, the recent registration of iso-8859-11 is
> an overreaction to a non-problem. [**]
>
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.   More options
> > and possibilities for local codings that are not generally known
> > and supported do not help with interoperability; perhaps it is
> > time to start pushing back.
>
> Well, I have to say that to the extent we've pushed back on
> registrations, what we've ended up with is an ad-hoc mess of unregistered
> usage. I am therefore quite skeptical of any belief that pushing back on
> registrations is a useful tactic.
>
> > And that, of course, would dramatically change the work of the
> > charset reviewer by reducing the volume but increasing the
> > amount of evaluation to be done.
>
> Even if we closed the registry completely, there is still a bunch of work
> to do in terms of registry cleanup.
>
> Now, having said all this, I'm willing to take on the role of charset
> reviewer, but with the understanding that one of the things I will do is
> conduct a complete overhaul of the existing registry. [***] Such a
> substantive change will of course require some degree of oversight, which
> in turn means I'd like to see some commitment from the IESG of support
> for the effort.
>
> As for qualifications, I did write the charset registration
> specification, and I also wrote and continue to maintain a fairly
> full-featured charset conversion library. I can provide more detail if
> anyone cares.
>
>                                 Ned
>
> [*] - For those not fully up to speed on this stuff, gb18030 can be seen
> as an encoding of Unicode that is backwards compatible with the previous
> simplified Chinese charsets gb2312 and gbk.
>
> [**] - The less recent attempt to register ISO-2022-JP-2004 is a more
> interesting case. I believe this one needed to be pushed on, but not
> because of potential use in new applications or protocols.
>
> [***] - I have the advantage of being close enough to IANA that I can
> drive over there and have F2F meetings should the need arise - and I
> suspect it will.
>