Re: Problems (and non-problems) in charset registry (was: Re: Volunteer needed to serve as IANA charset reviewer)

Ned Freed <ned.freed@mrochek.com> Sat, 09 September 2006 14:19 UTC

To: Martin Duerst <duerst@it.aoyama.ac.jp>
Message-id: <01M6ZE60BOFW0008CX@mauve.mrochek.com>
Date: Sat, 09 Sep 2006 06:39:06 -0700
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Sat, 09 Sep 2006 15:23:39 +0900" <6.0.0.20.2.20060908190213.047611f0@localhost>
Cc: Ted Hardie <hardie@qualcomm.com>, discuss@apps.ietf.org, Ned Freed <ned.freed@mrochek.com>, ietf-charsets@iana.org, Mark Davis <mark.davis@icu-project.org>

> Hello Mark, others,

> I think it's good to have such a collection of problems in the registry.
> But I think it's also fair to say that what Mark lists as problems may
> not in all cases actually be problems.

I agree. I also think that there's a bunch of low-hanging fruit here: Many (but
certainly not all) of the registry problems can be fixed without a huge
investment of time and effort.

Once the obvious stuff is addressed we can discuss how far we want to go,
especially with regard to versioning, variant tagging, and so on. But let's
please not get bogged down in the hard stuff before dealing with the easy
stuff.

> > If the registry provided an unambiguous, stable definition of each charset
> > identifier in terms of an explicit, available mapping to Unicode/10646
> > (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is
> > just a difference in format, not content), it would indeed be useful.
> > However, I suspect quite strongly that it is a futile task. There are a
> > number of problems with the current registry.

> I think the request for an explicit, fixed mapping is a good one,
> but in some cases, it would come at a high cost. A typical example
> is Shift_JIS: We know there are many variants, on the one hand due
> to additions made by makers (or even privately), on the other hand
> also due to some changes in the underlying standard (which go back
> to 1983).

> For an example like Shift_JIS, the question becomes whether
> we want to use a single label, or whether we want to carefully
> label each variant.

Exactly so. And I would propose that we defer worrying about such tricky issues
until the obvious stuff is done. It is always important not to let the best
be the enemy of the good.
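
To make the Shift_JIS point concrete, here is a minimal sketch that decodes
the same bytes under two labels from that family. It assumes CPython's bundled
"shift_jis" and "cp932" codecs as stand-ins for "the standard" and "one vendor
variant"; the sample bytes and helper names are my own choices for
illustration, not anything taken from the registry.

```python
# Sketch: identical bytes, two Shift_JIS-family labels.
# Assumption: CPython's built-in "shift_jis" and "cp932" codecs.

def decode_or_none(data: bytes, label: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(label)
    except UnicodeDecodeError:
        return None

# Hypothetical sample set: one byte pair both codecs accept but map
# differently, and one that is a vendor (NEC) addition.
SAMPLES = {
    "wave dash (0x8160)": b"\x81\x60",
    "NEC circled one (0x8740)": b"\x87\x40",
}

def compare(labels=("shift_jis", "cp932")):
    """Map each sample name to a tuple of per-label decode results."""
    return {
        name: tuple(decode_or_none(data, label) for label in labels)
        for name, data in SAMPLES.items()
    }

if __name__ == "__main__":
    for name, results in compare().items():
        shown = ["rejected" if r is None else f"U+{ord(r):04X}" for r in results]
        print(f"{name}: {shown}")
```

Whether a difference like this deserves a separate label is exactly the
question being deferred; the sketch only shows that the difference is
observable.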

> > 2. Incomplete (more important)

> > There are many charsets (such as some windows charsets) that are not in the
> > registry, but that are in *far* more widespread use than the majority of the
> > charsets in the registry. Attempted registrations have just been left
> > hanging, cf. <http://mail.apps.ietf.org/ietf/charsets/msg01510.html>

> Some of this is due to the original rather strict management of the registry.
> Some of it is due to the current backlog. A lot of this is due to the fact
> that nobody cares enough to work through the registration process; many
> people think that 'the IETF' or 'the IANA' will just do it. The solution
> is easy: Don't complain, register.

And reviewing registration applications in a timely way might help encourage
more registration activity.

> > 2. Ill-defined registrations (crucial)

> >  a) There are registered names that have useless (inaccessible or unstable)
> > references; there is no practical way to figure out what the charset definition
> > is.

> This is possible; some of these registrations are probably irrelevant,
> for others, it's pretty clear what the charset means in general, even
> though there might be implementation differences for some codepoints.

Well said. If nobody can figure out anything about a charset in the registry,
that's a pretty good indication that it's irrelevant at least in terms of
current usage. I suspect quite a few of the incomplete entries fall into
this category.

> >  b) There are other registrations that are defined by reference to an
> > available chart, but when you actually test what the vendor's APIs map to, they
> > actually *use* a different definition: for example, the chart may say that 0x80
> > is undefined, but actually map it to U+0080.

> It's clear that for really faulty charts, the vendors should be blamed,
> and not the registry.

Well, to be fair, vendors sometimes add weird mappings in response to customer
demand. For example, I've seen codepoints specific to some Microsoft variant of
a national standard charset creep into usage of the standard charset. In such
cases what's a vendor to do when they have a bunch of customers saying "we
don't care what the standard is, either add these codepoints or we'll switch to
the competitor's product that does do this"?

> However, the difference between the published map and the
> actually used API may be due to the fact that 0x80 is indeed not
> part of the encoding as formally defined, and is mapped to U+0080
> just as part of error treatment. For most applications (not for
> all necessarily), it would be a mistake to include error processing
> in the formal definition of an encoding.

Yes, that's an issue too. I've observed wide variations in the handling
of unassigned code points by different converters.
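
As an illustration of how much the treatment of an unassigned byte can vary,
the following sketch runs one input through several policies. It assumes
CPython's "cp1252" codec, which leaves 0x81 undefined; the "passthrough"
policy is a hypothetical helper of mine that imitates converters mapping an
undefined byte 0xNN to U+00NN, and it only makes sense for single-byte
charsets.

```python
# Sketch: one unassigned byte, four different converter behaviours.
# Assumption: CPython's "cp1252" codec treats 0x81 as undefined.

def passthrough_decode(data: bytes, label: str) -> str:
    """Hypothetical policy: undefined byte 0xNN becomes U+00NN.

    Decodes one byte at a time, so it is only valid for single-byte charsets.
    """
    out = []
    for value in data:
        try:
            out.append(bytes([value]).decode(label))
        except UnicodeDecodeError:
            out.append(chr(value))
    return "".join(out)

def decode_variants(data: bytes, label: str = "cp1252"):
    """Decode data under several error policies; None means 'rejected'."""
    results = {}
    for policy in ("strict", "replace", "ignore"):
        try:
            results[policy] = data.decode(label, errors=policy)
        except UnicodeDecodeError:
            results[policy] = None
    results["passthrough"] = passthrough_decode(data, label)
    return results

if __name__ == "__main__":
    for policy, text in decode_variants(b"a\x81b").items():
        print(policy, repr(text))
```

None of these behaviours is part of the charset definition itself, which is
the point made above about keeping error processing out of the formal
definition.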

> >  c) The RFC itself does not settle important issues of identity among
> > charsets. If a new mapping is added to a charset converter, is that a different
> > charset (and thus needs a different registration) or not? Does that go for any
> > superset? etc. We've raised these issues before, but with no resolution (or
> > even attempt at one). Cf. <http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html>

> It seems that what you would want, for your purposes, is to use a new
> label if a new character gets added to a legacy encoding, but not use
> a new label e.g. for UTF-8 each time a character gets added.
> So things would be somewhat case-by-case.

I agree - I don't think it is possible to codify a single "best practice" to
handle every case here.

> > As a product of the above problems, the actual results obtained by using the
> > iana charset names on any given platform* may vary wildly. For example, among
> > the iana-registry-named charsets, there were over a million different mapping
> > differences between Sun's and IBM's Java, total.

> It would be better to express these numbers in terms of percentages:
> In the experiment made, how many codepoints were mapped, and for how many
> did you get differences?

> Even better would be to express these numbers in terms of percentages
> of actual average data. My guess is that this would be much lower than
> the percentage of code points.

It certainly had better be, or things must not be working very well somewhere!
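
The two measures suggested above (share of code points that differ versus
share of actual data affected) can be sketched for single-byte charsets as
follows. The choice of "latin-1" and "cp1252" from CPython's codecs and the
sample text are assumptions for illustration only; this is not the experiment
Mark described.

```python
# Sketch: difference rate by code point vs. difference rate by real data.
# Assumption: both labels name single-byte charsets known to CPython.

def single_byte_map(label: str):
    """Map each byte value 0..255 to its decoded character, or None if undefined."""
    table = {}
    for value in range(256):
        try:
            table[value] = bytes([value]).decode(label)
        except UnicodeDecodeError:
            table[value] = None
    return table

def differing_bytes(label_a: str, label_b: str):
    """Return the set of byte values the two converters treat differently."""
    map_a, map_b = single_byte_map(label_a), single_byte_map(label_b)
    return {value for value in range(256) if map_a[value] != map_b[value]}

def difference_rates(label_a: str, label_b: str, sample: bytes):
    """Return (fraction of byte values that differ, fraction of sample bytes affected)."""
    diff = differing_bytes(label_a, label_b)
    by_codepoint = len(diff) / 256
    by_data = sum(1 for value in sample if value in diff) / len(sample) if sample else 0.0
    return by_codepoint, by_data

if __name__ == "__main__":
    sample = "café au lait".encode("latin-1")
    by_codepoint, by_data = difference_rates("latin-1", "cp1252", sample)
    print(f"by code point: {by_codepoint:.1%}, by sample data: {by_data:.1%}")
```

A real comparison would need a representative corpus rather than a single
phrase, and multi-byte charsets would need a different enumeration of the
input space.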

> This is in no way to belittle the problems, just to make sure we put
> them in proportion.

Right.

> > In ICU, for example, our requirement was to be able to reproduce the actual,
> > observable, character conversions in effect on any platform. With that goal,
> > we basically had to give up trying to use the IANA registry at all.

> This is understandable. The start of the IANA registry was MIME, i.e.
> email. The goal was to be able to (ultimately visually) decode the
> characters at the other end.

And that tends to drive various tradeoffs in a particular direction. For
example, it tends to argue for fewer tags for minor charset variations because
that's easier to deploy and update.

> ...

> > And based on work here at Google, it is pretty clear that -- at least in
> > terms of web pages -- little reliance can be placed on the charset information.

> Yes, but this is a problem on a different level than the above.
> Above, you are speaking about small variant differences.
> Here, you are speaking about completely mislabeled pages.

I don't see any way we can fix the mislabelling problem (which also exists in
email but takes on a somewhat different form there) in any direct way. However,
to the extent that mislabelling has occurred due to people getting confused by
our current registry mess, cleaning our own house might help a little. But
probably not much - a site that labels every message they send as iso-8859-1 no
matter what's actually in the message (and I've seen some big ones that do
this) isn't going to be cured by our having a perfectly accurate and totally
comprehensive registry. Like it or not, we cannot fix everything here.

> The problems with small variants don't make the labels in
> the registry unsuitable for labeling Web pages. Most Web
> pages don't contain characters where the minor version
> differences matter, because private use and corporate
> characters don't work well on the Web, and because some
> of the transcoding differences are between e.g. full-width
> and half-width variants, which are mostly irrelevant for
> viewing and in particular for search.

I've observed similar behavior in email. People are pretty adaptable and tend
to figure out what works and what doesn't fairly quickly.

> > So while I applaud your goal, I would suspect that it would be a huge amount
> > of effort for very little return.

> There are other protocols, in particular email. My understanding is
> that for email, the situation is quite a bit better, because people
> use a dedicated tool (their MUA) to write emails, and emails rarely
> get transcoded on the character level, and there is no server involved,
> whereas users use whatever they can get their hands on to create
> Web pages, the server can mess things up (or help, in some cases),
> and pages may get transcoded.

To be honest, it is hard for me to judge whether the situation for email is
better or worse than for the web or other protocols. Since I work on email
systems a lot more than I work on other stuff, I tend to see reports of lots
more problems with email. But this is probably more the result of being on the
receiving end of bug reports for email while not being on the receiving end of
bug reports for, say, a web client.

In any case, the Internet is now such a big place with so many nooks
and crannies that I have no idea how you'd figure out where the worst problems
are.  And I'm not sure it matters either.

				Ned
