[Ltru] my technical position on extlang

Here are some thoughts on extlang. The more readable version is at:

http://docs.google.com/Doc?docid=dfqr8rd5_676kxxxjhd&hl=en

Copied here for the archive:

Extlang

The argument for extlang is that it gives superior results, and is thus
worth the complication of having some languages be unavailable as the
primary language subtag, and available only in a secondary position. I
believe that when extlang is examined carefully by people who have
implemented language tag lookup, it will be clear that on balance most
people will be worse off than if we retain the structure of RFC 4646. We
should not complicate the structure to the overall detriment of
implementations, and should not put encompassed languages in the inferior,
extlang position.

* Links*

   - This document: http://docs.google.com/Doc?id=dfqr8rd5_676kxxxjhd
   - BCP 47: http://www.rfc-editor.org/rfc/bcp/bcp47.txt
   - Current draft:
   http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-14.html
   - Macrolanguage mappings <http://www.sil.org/iso639-3/macrolanguages.asp>

 Process

I looked back over the emails, and people may not remember everything that
was discussed in the emails around the time that we came to (rough)
consensus on extlang. While Shawn claimed that the topic should be reopened
-- well after the last call went out -- as far as I can tell there is in
fact *no* new information being presented that was not available on the list
6 months ago.

   - everything mentioned recently about the use of 'zh' and 'cmn' was
   discussed long ago in the email of September through December last year.
   - the teleconferences were discussed at length during September (see "[Ltru]
   70th IETF - ltru session?"), and we ended up with a few sessions. We made
   efforts to have them at times that would work for all those interested in
   discussing the issue; we also had jabber available for those who couldn't
   phone in. Nobody objected to the time except for Peter, and we delayed a
   week so that we could then include him. A number of people from both 'camps'
   evidenced interest in having them. We took notes and distributed them. (Doug
   bailed out at one point: "Go ahead and include me out of future
   teleconferences, and feel free to move them back to a more comfortable time
   for all. ")
   - as a result of those telecons, we were able to come to rough consensus
   - by December, we were moving on to whether to remove extlang from the
   ABNF or not, and settled on doing that.
   - the results of the telecon were announced on the list. Nothing
   prevented people who might have felt left out from contesting the results at
   that point or any time since, including the last call.
   - there have been no new technical reasons provided since December that
   would give us any reason for overturning the result of the telecon, nor for
   believing that we would get consensus for having encompassed languages be
   extlangs in RFC4646bis
   - moreover, we now have workable text in draft 14 for handling the
   macrolanguage/encompassed language issues without recourse to a new,
   untested mechanism.

Since this issue seems to be reopening (for no good reason that I can see),
I put together some responses from past emails on the topic. They are not
wonderfully organized; I just tried to cast them as Q/A pairs to make them a
bit more readable. There is definitely some repetition that ought to be
edited away. The tone may seem too harsh at times -- sometimes that was in
the heat of the moment, and I haven't had time to moderate the text, so I
apologize in advance for any offense I may give.
 Q. Where would extlang make a difference?
 A. As RFC 4647 describes, there are two main processes for matching using
language tags: filtering and lookup. We need to look at how the two
different macrolanguage models affect RFC 4647. If the primary reason cited
for the extlang model is compatibility, we have to see what effect each
model has on commonly used matching functions, which is where the rubber
hits the road.

For filtering, extlang offers no particular advantage. Let's look at queries
for Moroccan Arabic, written as "ar-ary" (with extlang) vs "ary" (without).
In either case I need a way to match all and only Moroccan Arabic; I must
not fall back to "ar". If I fall back to include all Arabic, the actual
content that is in Moroccan Arabic would be completely and utterly swamped
by Standard Arabic. So for filtering, the extlang model just gives us a more
complicated syntax, with no benefit.
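
To make the filtering point concrete, here is a minimal sketch of RFC 4647
basic filtering (a range matches a tag that equals it or has it as a prefix
followed by "-"). The content tags are hypothetical, chosen only for
illustration; this is not code from any particular implementation.

```python
def basic_filter(language_range, tags):
    """RFC 4647 basic filtering: keep tags equal to the range or prefixed by it."""
    wanted = language_range.lower()
    if wanted == "*":
        return list(tags)
    return [t for t in tags
            if t.lower() == wanted or t.lower().startswith(wanted + "-")]


# Hypothetical content tags, one set per tagging model.
without_extlang = ["ar", "ar-EG", "ary", "ary-MA", "fr"]
with_extlang = ["ar", "ar-EG", "ar-ary", "ar-ary-MA", "fr"]

print(basic_filter("ary", without_extlang))   # ['ary', 'ary-MA']
print(basic_filter("ar-ary", with_extlang))   # ['ar-ary', 'ar-ary-MA']
# Widening the query to "ar" pulls in all Arabic content, Moroccan or not.
print(basic_filter("ar", with_extlang))
```

Either way the query has to name Moroccan Arabic explicitly; the extlang
form is only a longer spelling of the same filter.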

The only possible advantage of extlang would be in Lookup.
 Q. Isn't extlang better for lookup?
A. Take an example:

   - My site has support for zh, zh-Hans, and zh-Hant. zh has Mandarin
   content, since that is what 99.999% of web sites mean by "zh" currently. As
   is customary in fallback, my site's 'zh' content uses the predominant form
   (zh-Hans) [this is just for example; a TW site can use the opposite
   convention.]
   - A user comes in with different requests, listed below.

*Scenario 1.* The user's browser has the proposed "zh-yue-Hant-US". My
lookup falls back to zh, so I serve it up to the user. So even if the target
of the match (zh) is not Cantonese, you want a fallback to zh. I'm guessing
that you see this as better than if we defined the tag as "yue-Hant-US",
since it gets to some fallback that the user is likely to understand. But I
don't see this as much different than if we had fr-br-BE (meaning Breton,
but fall back to French), or ro-mo (meaning Moldavian, but falling back to
Romanian). *And note that in the fallback, the script and region are
completely lost.*

*Scenario 2.* The user's browser has zh-cmn-Hant-US. In matching, we fall
back to zh. Note that in the fallback, the script and region are completely
lost. *We have essentially just introduced a synonym for zh which causes
fallback to lose information, for no good reason.*
 Q. What's wrong with simply including extlang?
A. If we bake it in, then every simple algorithm will in practice
automatically fall back from Cantonese to Mandarin, fall back from Dari to
Persian, fall back from Khetrani to Lahnda -- and in doing so, strip the
script and country information. That is, unless implementers fix the
algorithm to pretend that the secondary language is in fact a primary
language. So we are forcing people into a model that is often, or mostly,
wrong.

*If we supply the information in the registry, then implementations can
choose whatever they think is appropriate, given the particular facts about
languages and the particular needs of their applications without having to
work around the **extlang** mechanism.*

I'm reminded again of a similar case with C++. The assignment operator gets
a default implementation. That must have seemed like a nice convenience for
the user, but *except for toy programs, it is always, always wrong. * So
supplying that default just means that people usually have to take extra
steps to disable it, and prevent it from causing bugs in their programs. I'm
worried about this being similar.

It is clear that companies like Google or Yahoo can work around the problems
with extlang-- what I'm worried about are the people who don't have a lot of
experience with these matters, and are just led down a garden path. We need
to look long and hard at the experience of people who have had detailed
implementation experience with filtering and matching these tags in
production environments.
 Q. Where are some cases where extlang works particularly badly?
A. Extlang plays especially badly in many cases. Suppose that we have
macrolanguage m1, and microlanguages x1 and x2. By the design of ISO 639, we
*can't* assume that a speaker of x1 can also speak x2 or vice versa. If a
user has as accept language the list <x1-Ssss-Rr, en, fr>, it works fine
without extlang: she gets the fallback

(A) Script fallback

   1. x1-Ssss-Rr
   2. x1-Ssss
   3. x1
   4. en
   5. fr

If she also speaks/reads x2, then she can specify <x1-Ssss-Rr, x2, en, fr>
or <x1-Ssss-Rr, en, fr, x2>; that is, putting x2 in the list in the position
she wants it. If x2 is the predominant microlanguage, meaning that m1 is
essentially always assumed to be x2, then the priority list can be
<x1-Ssss-Rr, en, fr, m1>, also wherever it belongs. Thus, the user gets
something she can understand, based on the list she supplied.

If we are using extlang, and x2 is the content for m1, then we get the
fallback

(B) Script + extlang fallback

   1. m1-x1-Ssss-Rr
   2. m1-x1-Ssss
   3. m1-x1
   4. m1
   5. en
   6. fr

That has two problems: first, the script and region are lost. That can be
fixed by hacking the fallback (although there is a *lot *of installed base
that won't do this), to

(C) Hacked Script + extlang fallback

   1. m1-x1-Ssss-Rr
   2. m1-x1-Ssss
   3. m1-x1
   4. m1-Ssss-Rr
   5. m1-Ssss
   6. m1
   7. en
   8. fr

But even more importantly, we are disabling the user's explicit choice. If
the user doesn't speak x2 (or whatever the content of m1 is), she's screwed.
There is no way that she can indicate that she wants x1 *but no other
version of m1, because she can't understand them.*

We'd have to change the fallback to be quite substantially different to get
around this, with

(D) More Hacked Script + extlang fallback

   1. m1-x1-Ssss-Rr
   2. m1-x1-Ssss
   3. m1-x1
   4. en
   5. fr
   6. m1-Ssss-Rr
   7. m1-Ssss
   8. m1

This is, however, still only appropriate if it is likely that a user of x1
speaks *whatever happens to be the content for m1*. That is an extremely
shaky assumption.
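
For anyone who wants to see what the "hacking" amounts to, here is a rough
sketch that generates chains (A) through (D) from the placeholder tags. It
assumes the implementation is handed the set of extlang subtags (here the
placeholders x1 and x2); the function names are hypothetical.

```python
def truncate(tag):
    """All right-truncations of a tag, longest first."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def plain_chain(priority_list):
    """Chains (A) and (B): truncate each entry of the priority list in turn."""
    return [c for tag in priority_list for c in truncate(tag)]


def split_extlang(tag, extlangs):
    """Split into (truncations keeping the extlang, macrolanguage truncations)."""
    parts = tag.split("-")
    if len(parts) < 2 or parts[1] not in extlangs:
        return truncate(tag), []
    own = truncate(tag)[:-1]  # drop the bare macrolanguage
    macro = truncate("-".join([parts[0]] + parts[2:]))
    return own, macro


def hacked_chain(priority_list, extlangs, defer_macro=False):
    """Chain (C); with defer_macro=True, chain (D)."""
    chain, deferred = [], []
    for tag in priority_list:
        own, macro = split_extlang(tag, extlangs)
        chain += own
        if defer_macro:
            deferred += macro  # tried only after the whole priority list
        else:
            chain += macro
    return chain + deferred


EXTLANGS = {"x1", "x2"}
print(plain_chain(["x1-Ssss-Rr", "en", "fr"]))                      # (A)
print(plain_chain(["m1-x1-Ssss-Rr", "en", "fr"]))                   # (B)
print(hacked_chain(["m1-x1-Ssss-Rr", "en", "fr"], EXTLANGS))        # (C)
print(hacked_chain(["m1-x1-Ssss-Rr", "en", "fr"], EXTLANGS, True))  # (D)
```

Note that (C) and (D) both need a table of extlang subtags and a
non-truncation step, which is exactly the work extlang was supposed to save.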

If Peter Constable said, for each and every macrolanguage on
http://www.sil.org/iso639-3/macrolanguages.asp, there is at least one
microlanguage that all speakers (or even most speakers) of each of the other
microlanguages would understand, I'd say: fine, let's do extlang and
incorporate that information into the registry, with the "default
microlanguage" for each macrolanguage. Then, for example, implementers would
know that, say, "fuf <http://www.sil.org/iso639-3/documentation.asp?id=fuf>
Pular" is understood by all the speakers of the microlanguages under
"ful <http://www.sil.org/iso639-3/documentation.asp?id=ful> Fulah", so we
can tell people to always have the content of "ful" be "fuf", and bake the
macrolanguages in as extlang.

The point of the suggested text is that if your application wants to use
macrolanguages to support extlang-equivalent fallback, there is nothing
stopping you from doing so. If there are particular environments where an
extlang-like fallback is right for a particular language community, it is
simple to do. But we don't need to bake shaky assumptions into the structure
of language tags.
 Q. Isn't extlang just like script fallback?
A. The problem with extlang is that the fallback from encompassed language
to macrolanguage is fundamentally different in kind than a fallback from
region to script to base language. In the case of script, like uz-Arab and
uz-Latn, or en-US vs en-GB, we really have variations on the same language,
and fallback makes sense. We ordered the subtags so that it works optimally
overall.

The encompassed languages, on the other hand, are not just dialects, not
just variants. *They are languages in their own right. *Trying to insert
them into the fallback process just screws things up, because they need a
"sideways" matching not just simple truncation fallback. If you want to do
any fallback with extlang, it would be to fall back from zh-yue-<other
stuff> to zh-<other stuff>. That means that in order to do reasonable
fallback, you can't just use truncation fallback anyway. So I see the
situation this way:

   1. The only reason for adding the complication of the extlang mechanism
   is to make truncation fallback work better.
   2. Truncation fallback with extlang doesn't work better.
   3. So there is no need to make encompassed languages be "secondary"
   languages by making them be "secondary" subtags.

The goals of extlang are good, to make matching work better, but in practice
it just makes things worse. [Speaking to those familiar with C++, it feels a
bit like the default assignment operator in C++. Nice in theory, but in
practice it gums things up more than it fixes, since once you are beyond
very simple (toy) classes, the default is almost always wrong -- but because
it is supplied behind your back you don't realize it.]

So instead of adding the extlang mechanism to RFC 4646, what we really need
to do is to point people to how to handle yue and other encompassed
languages along with mo/ro, tl/fil, and other edge cases in a reasonable
way, by augmenting matching.
 Q. Where might the macrolanguage be useful?
A. An implementation may choose to use that information in falling back
from some encompassed languages to macrolanguages. For example, given the
language priority list with Cantonese in Traditional script as used in Hong
Kong, followed by French ("yue-Hant-HK, fr"), the lookup could be the
following:

1. yue-Hant-HK
2. yue-Hant
3. yue
4. fr
5. implementation defined default:
  5a. zh-Hant-HK
  5b. zh-Hant
  5c. zh
  5d. en

Whether such fallback should be used -- and if so, the precise way in which
such a fallback is done -- is application-dependent.  Where it is very
likely that the audience requesting Cantonese (as above) will accept and
understand Mandarin (the predominant content for 'zh'), then this fallback
might be useful. Where there is risk that the audience requesting
Cantonese will not be conversant with Mandarin, and would prefer an
alternative in the language priority list, it should be avoided. (This might
be the case, for example, with audio using yue-Zxxx-US.)
 Q. Why not have extlang for using macrolanguages if the suggested text adds
macrolanguages back into the fallback chain?
A. The suggested text doesn't add it back in. It only says that *IF *an
application wants to do extlang-equivalent fallback, the text in BCP 47
already allows for that.

We really have *no* idea whether using macro languages in the fallback chain
is a good idea or not. Some people think it will be an advantage for a few
specific examples that are cited. (Cantonese comes up, but when I tested
some of the assumptions with Cantonese speakers, they didn't quite hold up.)
But nobody has substantiated that it will give better results for all or
most macrolanguages. Or any indication of an even rough list of those for
which it will be better. Nor has anyone effectively argued that the
situation between yue and zh is substantially different than the situation
between gsw and de, where we get along just fine without extlang.

So the suggested text just provides it as an option, and leaves it up to the
application.
 Q. How should we look at macrolanguages?
A. I think one of the things that we realized when looking at how this would
work in practice is that we are better off if we treat macrolanguage as a
piece of information that is *perhaps* useful for matching, but one that can
be enhanced (that is, changed) over time as more information becomes
available. Hard-coding it into extlang doesn't serve that purpose, and
causes other problems, notably that the other fields are lost in fallback:
if we had zh-yue-Hant, then by the time we get to zh in fallback, we've lost
the Hant.

So that is the origin of the text we are proposing.

There are a number of edge cases such as deprecated codes, closely related
languages, or practical fallbacks (eg if someone speaks X they are likely to
speak Y, even if X and Y are not linguistically related) that are simply
unsuited to hard-coding in the tag. If I hit a code like "iw-PL", I want to
match that with "he-PL", not depend on some kind of fallback between "iw"
and "he"; otherwise it loses information (the PL). We do provide that kind
of information in the deprecated field, and with the macrolanguage field we
would provide more (for example, it would make clear the relation between
no, nb, and nn and how that could be used in matching). Other useful
information is the scripts used with a language in practice; suppress script
supplies just a little information, but doesn't tell me that Uzbek is
customarily written with Arabic, Latin, or Cyrillic, but not with (say)
Tagalog.
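
As a small illustration of the "iw-PL" case, here is a sketch that replaces
a deprecated primary language subtag with its preferred value before
matching, so the rest of the tag survives. The three mappings shown are
registry entries; a real implementation would load the whole table from the
registry rather than hard-code it, and the function name is mine.

```python
# Deprecated primary language subtags and their registry Preferred-Value.
PREFERRED = {"iw": "he", "in": "id", "ji": "yi"}


def canonicalize(tag):
    """Swap a deprecated primary subtag for its preferred value, keep the rest."""
    primary, sep, rest = tag.partition("-")
    return PREFERRED.get(primary.lower(), primary) + sep + rest


print(canonicalize("iw-PL"))                           # he-PL
print(canonicalize("iw-PL") == canonicalize("he-PL"))  # True
print(canonicalize("fr-CA"))                           # fr-CA (unchanged)
```

The region is kept because the mapping is applied to the tag as a whole,
not reached by truncating "iw-PL" down to "iw" first.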

The more information there is available, whether it be in the language
subtag registry or somewhere else, the better a job people can do in dealing
with some of the edge cases that turn up in matching.

Q. Doesn't the macrolanguage relationship uniquely define the best fallback?

A. Certain languages are closely related, and the lookup process may take
that into account. Macrolanguage is just one factor that may (or might not)
be useful. For example, since the tag "gsw-CH" (for Swiss German as used
in Switzerland) only became available on 2006-12-08, Swiss German
("Schwyzerduetsch") text may have been tagged with "de-CH" instead. ISO 639
was not (and is still not) clear on whether "de" meant only High German or
also included variants such as Low German. Thus Swiss German material
may have been, and may still be, tagged with "de". Essentially all Swiss
German speakers are comfortable in High German, so where Swiss German is not
available, High German is a very good fallback. Thus when given the language
priority list "gsw-CH, fr-CH", an implementation using lookup may augment
the default values to also include the lookup of related values, such as the
following search order:

1. gsw-CH
2. gsw
3. fr-CH // next language
4. fr
5. implementation defined default:
    5a. de-CH // special fallback from gsw-CH
    5b. de
    5c. en // root

In this way, other likely possibilities are tried before the final fallback
to the root value. Note that typically the fallback to related languages
should include the script and region codes if available.

In this way, the lookup process may take into account what languages people
are likely to understand, given a language priority list. Similarly, the
close relations between Romanian and Moldavian, Tagalog and Filipino,
Serbo-Croatian and Croatian, and so on may all be useful in doing related
language lookup.

This is not restricted to related languages. For example, a Breton speaker
is very likely to also understand French, given the language priority list.
Thus the implementation may choose to use the following lookup for the
language priority list "br-FR, de":

1. br-FR
2. br
3. de
4. implementation defined default:
    4a. fr-FR // special fallback from br (Breton)
    4b. fr
    4c. en
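
Both search orders above can be produced by a sketch along these lines. The
RELATED table is a hypothetical, application-chosen mapping (it is not
registry data), and "en" as the root is simply what the examples use.

```python
def truncate(tag):
    """All right-truncations of a tag, longest first."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


# Hypothetical application-specific table: language -> likely understood language.
RELATED = {"gsw": "de", "br": "fr"}


def search_order(priority_list, related=RELATED, root="en"):
    """Priority list first, then related-language defaults, then the root."""
    order = []
    for tag in priority_list:
        order += truncate(tag)
    # Implementation-defined defaults: tried only after every listed language.
    for tag in priority_list:
        primary, sep, rest = tag.partition("-")
        if primary in related:
            # Carry the script/region over to the related language.
            order += truncate(related[primary] + sep + rest)
    order.append(root)
    return list(dict.fromkeys(order))  # drop repeats, keep first position


print(search_order(["gsw-CH", "fr-CH"]))
# ['gsw-CH', 'gsw', 'fr-CH', 'fr', 'de-CH', 'de', 'en']
print(search_order(["br-FR", "de"]))
# ['br-FR', 'br', 'de', 'fr-FR', 'fr', 'en']
```

The point of the design is that the related language never jumps ahead of
anything the user explicitly listed.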
Q. I see "zh" and "cmn", I have no way of telling that they're related
without looking at the registry, which basically means a hard-coded table.
If "zh" is preferred, then I may want to move from "zh" to "cmn" or
whatever.
A. *You can't tell that from the registry either!* Sometimes
microlanguages are related in such a way as to be a good fallback, but
*usually they are not.* The handful of actual cases where we think it might
be a good idea are listed in the text around Table 8, *not* in the registry.
Knowing that X is a macro/micro language does not necessarily mean -- *and
usually doesn't mean* -- that you want to use it in fallbacks. If there is
no predominant form, then it's a crapshoot as to whether the
macro/microlanguage is a good fallback.
There is no special runtime information in the registry. When a new
macro/microlanguage shows up in the registry *and* you support one of the
pair, then it may be useful to review whether or not you want to add a
fallback. Automatically updating fallbacks blindly according to the registry
macro/microlanguages is not a good idea. Nor is it complete, since it misses
tl/fil, ro/mo, and many others that are much higher frequency cases than
most of the macro/microlanguages.

If an implementation provides a UI for selecting language priority lists, it
may be better to give the user the option of having explicit fallbacks (such
as from Cantonese to Mandarin or Tagalog to Filipino), rather than trying to
guess the user's intent (and run the distinct risk of getting it wrong). For
that purpose, when a user adds a language to the priority list, the UI may
suggest macrolanguages, or other related languages, as additional fallbacks.

Q. In lookup, if there is a predominant form how is it best (in practice) to
deal with the macrolanguage?
A. For most programs, I believe that treating them as synonyms is the right
thing to do, and alternative approaches would be extremely
counterproductive. And this goes for any of the macrolanguage cases where
there is a predominant encompassed language with long usage in the computer
industry.

Let's take a scenario where this is not done (a la Ewell). Suppose that a
user picks Arabic as her Accept-Language in her browser. Any existing
browser will represent that with "ar". Then she goes to the BBC site. The
entire site is translated, not into standard Arabic, but into Sudanese
Creole Arabic. The user complains, since she can't understand it, and the
BBC responds that they are just following the standard to the letter and
spirit: "ar" means any kind of Arabic whatsoever, and so in the interests of
fairness, they pick a different encompassed language to serve up each day.
They inform the user that it is her fault for using 'ar' if she really only
wants Standard Arabic. So they have the following schedule:

   Day         Code   Language
   Monday      aao    Algerian Saharan Arabic
   Tuesday     abh    Tajiki Arabic
   Wednesday   abv    Baharna Arabic
   ...         acm    Mesopotamian Arabic
               acq    Ta'izzi-Adeni Arabic
               acw    Hijazi Arabic
               acx    Omani Arabic
               acy    Cypriot Arabic
               adf    Dhofari Arabic
               aeb    Tunisian Arabic
               aec    Saidi Arabic
               afb    Gulf Arabic
               ajp    South Levantine Arabic
               apc    North Levantine Arabic
               apd    Sudanese Arabic
               arb    Standard Arabic
               arq    Algerian Arabic
               ars    Najdi Arabic
               ary    Moroccan Arabic
               arz    Egyptian Arabic
               auz    Uzbeki Arabic
               avl    Eastern Egyptian Bedawi Arabic
               ayh    Hadrami Arabic
               ayl    Libyan Arabic
               ayn    Sanaani Arabic
               ayp    North Mesopotamian Arabic
               bbz    Babalia Creole Arabic
               pga    Sudanese Creole Arabic
               shu    Chadian Arabic
               ssh    Shihhi Arabic

(Each code is documented at
http://www.sil.org/iso639-3/documentation.asp?id=<code>.)

If 'ar' means any Arabic, without any preference, this would be a perfectly
reasonable thing to do. But for users, it would hardly be satisfactory. And
this would be a bizarrely stupid thing for the BBC to do.

Someone might respond that, well, everyone needs to convert over to cmn for
Mandarin since it is now the Right Thing to Do. Even if that magically
happened, it would take years, and during the transition we would get all
kinds of screwups with different programs transitioning at different paces.
And there isn't much magic around; it is very hard to get people to change
infrastructure that works just fine -- you have to give them a compelling
case for why users are served better by the change. And that would be a very
hard sell, since there isn't any real advantage.

The right approach for the BBC is to treat a request 'ar' as a request for
Standard Arabic*, just as they have always done*. Internally, that means
treating 'ar' and 'arb', and any language tag that starts with them, as a
request for Standard Arabic. That is, treating ar-EG and arb-EG, or ar-SA
and arb-SA, or other combinations, as synonyms for the purpose of lookup.
Now, this "treating as synonyms" could be done in different ways. One way is
to mash on input; the other is to have a fancier fallback, eg arb-EG =>
ar-EG => arb => ar (mutatis mutandis, when starting with ar-EG).
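
Both ways of "treating as synonyms" fit in a few lines. This is only a
sketch: the SYNONYMS table is a hypothetical, application-chosen list of
macrolanguages with a predominant encompassed language, and the function
names are mine.

```python
# Hypothetical table: predominant encompassed language -> macrolanguage.
SYNONYMS = {"arb": "ar", "cmn": "zh"}


def mash(tag):
    """Option 1: normalize on input, so only one spelling is ever looked up."""
    primary, sep, rest = tag.partition("-")
    return SYNONYMS.get(primary, primary) + sep + rest


def synonym_chain(tag):
    """Option 2: at each truncation step, try the tag and then its synonym."""
    pairs = dict(SYNONYMS)
    pairs.update({macro: enc for enc, macro in SYNONYMS.items()})
    parts = tag.split("-")
    chain = []
    for n in range(len(parts), 0, -1):
        chain.append("-".join(parts[:n]))
        if parts[0] in pairs:
            chain.append("-".join([pairs[parts[0]]] + parts[1:n]))
    return chain


print(mash("arb-EG"))           # ar-EG
print(synonym_chain("arb-EG"))  # ['arb-EG', 'ar-EG', 'arb', 'ar']
print(synonym_chain("ar-EG"))   # ['ar-EG', 'arb-EG', 'ar', 'arb']
```

Mashing on input is simpler but has to be applied to the content tags as
well as to the request; the fancier chain leaves stored tags untouched.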

Nor was this solved at all by extlang -- as we discussed at some length, it
discarded all script and region info when falling back, and produces worse
results in many cases, especially where the macrolanguage does not have a
predominant form.

The "treating as synonyms" strategy is almost always going to be the right
answer. There are undoubtedly scenarios where this strategy is not
necessary, although I can't think of any off the top of my head.

Moreover, I think one of the more productive things we can do is to push for
the incorporation of Language Priority Lists in any query-like protocols.
That way I could say I'd like "ary, fr, ar" if my preferred ordering is
Moroccan Arabic, then French, then as a last resort, Standard Arabic.
Q. How do I use fancier fallback for the predominant form?
A. Here is a more detailed case:

   1. cmn-Hant-HK
   2. zh-Hant-HK
   3. cmn-Hant
   4. zh-Hant
   5. yue-HK
   6. zh-HK
   7. yue
   8. zh

 Q. Haven't people always interpreted zh as meaning anything from Mandarin
to Hakka to Min?
A. That's very unclear. People usually choose 'zh' not literally, but
through a UI that shows a human readable form. So the question is, how many
people have looked at interfaces that say simply " 中文" and think

   1. "that could mean Mandarin but could also mean Hakka" vs
   2. "that means just Mandarin, they don't offer Hakka so I'll pick
   something else", vs
   3. "that means just Mandarin, they don't offer Hakka, but Mandarin is the
   closest to Hakka that is offered so I'll pick that."

In lookup on computer systems it is clear that nobody's expectation is that
by picking 'zh', they will get Hakka. And anything but Mandarin is a
vanishingly small percentage of tagged text; the same is true of 'ar';
anything but Standard Arabic is a vanishingly small percentage.

As Karen said on a related topic: *"My experience is that the users who need
to specify Cantonese most often make up an **illegal** tag. Not saying
that's what we should recommend, but I believe my experience does not
support the statement as worded. "*

Q. Isn't the written form of Cantonese the same as Mandarin?
A. No, no more than the "written form of Swiss German is the same as High
German". What is true is that when the Swiss write, they write in High
German; that's different. They are using different words than what is
spoken, and very different syntax: "I bi doo gsy." => "Ich war hier."

Here is what I have on the subject from John Jenkins:

I believe you said that a Mandarin speaker can read written Cantonese, but
will not understand everything (a bit like a Dane reading Swedish).

More like French and Spanish, actually.

Some characters would not (normally) be used in Mandarin.

Most famously U+4E5C, the Cantonese for "what".

Some characters would have different meanings than in Mandarin

Best illustrated with U+4FC2, which means "to bind" in Mandarin but is
frequently borrowed for the Cantonese word for "to be."

Some syntax would be different

Yes, but I can't think of any examples off the top of my head and the book
I've got that lists the differences is successfully playing hide-and-seek at
the moment.  There are actually not an awful lot of these.  The main
differences between the two are phonetic and lexical.  The grammars are very
similar.

Can you point me to some web pages with written Cantonese that would
demonstrate that to a Chinese reader?

Nothing better to start with than the Cantonese Wikipedia article on
Cantonese: <http://zh-yue.wikipedia.org/wiki/粵語>.
Similarly, <http://zh-yue.wikipedia.org/wiki/香港>,
<http://zh-yue.wikipedia.org/wiki/Unicode>, and pretty much everything else
in the Cantonese Wikipedia.

I can give you the relative character frequencies in the Cantonese and
traditional (or simplified Chinese) Wikipediae, if you like.  I've still got
that data around somewhere.

*The thing you have to emphasize is the difference between
what-Cantonese-speakers-generally-read-and-write, which is just Mandarin
with Cantonese phonetics, and
writing-down-what-Cantonese-speakers-actually-speak, which is what "written
Cantonese" should be used to mean. Unfortunately, not everybody groks this.
Fortunately, Wikipedia does.*

Meanwhile, I quote from Stephen Matthews and Virginia Yip, _Cantonese: A
Comprehensive Grammar_ (London: Routledge, 1994), pp. 5-6:

"Traditionally, Cantonese has been regarded as one of the many Chinese
dialects. It does not have a standardized written form on a par with
standard written Chinese.  No form of written Cantonese is taught in schools
or used in academic settings in any Cantonese-speaking community.  When it
comes to the written form, it is standard written Chinese that is taught and
learnt.  For educated Cantonese speakers, standard written Chinese is the
written form they use in most contexts.  However, in colloquial genres such
as novels, popular magazines, newspaper gossip columns, informal personal
communications, written Cantonese may be used.  When written Cantonese
contains too many exclusively Cantonese words and expressions,
non-Cantonese speakers may find it totally unintelligible."

Since they wrote, however, there's been a distinct upsurge in the use of
written Cantonese.  (It's tied in with a kind of Hong Kongese
pseudo-nationalism.)  It's still not exactly *common*, but it's a lot more
common than it used to be.

-- 
Mark