[Ietf-languages] Suppress-Script, assumptions, and guesses (was: RE: Suggestion to update Urdu Script Designation in the subtag registry)

Doug Ewell <doug@ewellic.org> Fri, 14 August 2020 16:32 UTC

From: Doug Ewell <doug@ewellic.org>
To: 'Richard Wordingham' <richard.wordingham=40ntlworld.com@dmarc.ietf.org>, ietf-languages@ietf.org
Date: Fri, 14 Aug 2020 10:32:07 -0600
Message-ID: <000201d67258$73cfb910$5b6f2b30$@ewellic.org>
Content-Language: en-us
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/b1seP96KhN3p_SJudQqBWINznjM>
Precedence: list

Richard Wordingham wrote:

>> But it does not follow from this that all English text written in the
>> Latin script should be tagged "en-Latn". Most of the time, the
>> assumption is correct and adequate. This is why the word SHOULD is
>> used, instead of MUST.
>
> That seems to answer the issue.  If the language tag is describing
> something that already exists, then if one may need to know the script
> from just the tag, one should not heed suppress-script - which is why
> the word 'SHOULD' is used.

If the content already exists, and one has retrieved it, and the language tag doesn't reveal the script, one can figure out what script it is written in by inspecting the content. This is just like the mechanic inspecting the car to figure out how many wheels it has.
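As a rough illustration of "inspecting the content", here is a minimal Python sketch. It assumes that the first word of each letter's Unicode character name ("LATIN", "RUNIC", "ARABIC") is a good enough stand-in for the script, which holds for these scripts but not for all of them (a name like "OLD ITALIC LETTER A" would come out as "OLD"). The function name dominant_script is made up for this example; a real tool would use the Unicode Script property instead.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script of `text` from Unicode character names (a heuristic)."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # Assumption: the first word of the name identifies the script,
            # e.g. "LATIN SMALL LETTER A" or "RUNIC LETTER FEHU FEOH FE F".
            counts[name.split()[0]] += 1
    # Return the most frequent label, or None if the text has no letters.
    return counts.most_common(1)[0][0] if counts else None
```

Calling dominant_script on ordinary English text yields "LATIN", and on a string of runes yields "RUNIC", without looking at the language tag at all.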

I confess that I don't understand this exercise.

> I was coming to the conclusion that the ambiguity between "en" for
> "en-Latn" and "en" for "English, but I won't tell you the script", was
> resolved by the tags Zxxx, Zyyy and Zzzz.  Then the latter meaning for
> text gets translated to en-Zyyy, and then the abbreviation of en-Latn
> to en is resolved by the suppress-script.

'Zxxx' is a different animal: "Code for unwritten documents" from ISO 15924 is interpreted here to mean that this content is not written. It doesn't necessarily mean that the language of this content is never written.

'Zzzz' is "Code for uncoded script," and that's pretty clearly intended by 15924 to apply to things like Voynich or Utopian that are, well, not coded.

Let me try again with the "undetermined" use case:

If I tag textual content as "en", I am not saying anything about the script. I am leaving it up to you, the recipient of the content, to figure out what the script is. Because it is English, the great likelihood is that the script is Latin. One way to find out would be to inspect the content.

Alternatively, you COULD go to the Registry and look at the Suppress-Script value of 'Latn' to give you a strong hint that the text is in Latin script. But that is not a guarantee; "en" is also a perfectly valid tag for English written in Runic. 

If I tag it as "en-Runr", I am explicitly saying the script is Runic.

If I tag it as "en-Latn", I am explicitly saying the script is Latin. I would hope that I am doing so for a reason, and not just adding a script subtag automatically because I didn't read Section 4.1 of RFC 5646. Maybe I am anticipating that readers will call into question whether the text is in Latin script, and need to know without looking at the content.

If I tag it as "en-Zyyy", that implies that I, the tagger, don't know what script it is in. It seems unlikely to me that I would know a text is in English but not know the script. For some historical languages, this is more plausible.
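The four cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SUPPRESS_SCRIPT table is a hand-copied, one-entry excerpt (a real tool would load the Suppress-Script fields from the Registry), the function name script_info is hypothetical, and the parsing is simplified to look for a script subtag directly after the language subtag, ignoring extlang, grandfathered, and private-use forms.

```python
# Hypothetical one-entry excerpt; the real values live in the Registry.
SUPPRESS_SCRIPT = {"en": "Latn"}

# Special ISO 15924 codes discussed in this thread.
SPECIAL_SCRIPTS = {
    "Zxxx": "unwritten content",
    "Zyyy": "undetermined script",
    "Zzzz": "uncoded script",
}


def script_info(tag):
    """Return (script, basis), where basis is 'explicit', 'hint', or 'unknown'."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    # Simplification: only check the subtag right after the language.
    if len(subtags) > 1:
        candidate = subtags[1]
        if len(candidate) == 4 and candidate.isascii() and candidate.isalpha():
            return (candidate.title(), "explicit")
    # No script subtag: Suppress-Script is a strong hint, not a guarantee.
    hint = SUPPRESS_SCRIPT.get(language)
    if hint is not None:
        return (hint, "hint")
    return (None, "unknown")


def describe(tag):
    """Turn the result of script_info into a one-line explanation."""
    script, basis = script_info(tag)
    if basis == "explicit":
        meaning = SPECIAL_SCRIPTS.get(script, "script stated by the tagger")
        return f"{tag}: {script} ({meaning})"
    if basis == "hint":
        return f"{tag}: probably {script}, but inspect the content to be sure"
    return f"{tag}: no script information; inspect the content"
```

The point of returning the basis alongside the script is that "en" and "en-Latn" both come back with 'Latn', but only the second is a statement by the tagger; the first is a guess drawn from the Registry.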

Does this help?

Languages are very often not a precise thing lending themselves to concrete rules. I know some have bemoaned the fact that BCP 47 has so many SHOULDs and so few MUSTs, but there are really very few bright lines when dealing with languages and language usage. Trying to attach guarantees to language tag selection is a Hard Problem. We haven't even solved the problem of people writing in I-Ds and on Stack Overflow that a language tag contains a 2-letter 639 code and a 2-letter 3166 code.
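To make the last point concrete, here is a small Python sketch of the "2-letter 639 code plus 2-letter 3166 code" misconception. The regular expression and the names NAIVE_PATTERN and fits_naive_pattern are invented for this example; the sample tags are all ordinary BCP 47 tags (a bare language, a 3-letter language subtag, a script subtag, a UN M.49 region, and a language-script-region combination).

```python
import re

# The misconception: a tag is always two lowercase letters, a hyphen,
# and two uppercase letters.
NAIVE_PATTERN = re.compile(r"^[a-z]{2}-[A-Z]{2}$")

SAMPLE_TAGS = ["en-US", "en", "haw", "en-Latn", "es-419", "zh-Hant-TW"]


def fits_naive_pattern(tag):
    """True if the tag matches the naive 'xx-YY' shape."""
    return bool(NAIVE_PATTERN.match(tag))


# Tags that the naive pattern would wrongly reject.
rejected = [tag for tag in SAMPLE_TAGS if not fits_naive_pattern(tag)]
```

Of the six sample tags, only "en-US" fits the naive shape; the other five end up in rejected.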

>> In your everyday life, it is not necessary under normal circumstances
>> to point out that your car has four wheels, because the overwhelming
>> majority of cars do.
>
> But this information could be vital for positioning it over an
> inspection pit.

Exactly so. Thank you for proving my point about "under normal circumstances."

>>> Would a rule that the script must be indicated somehow make a
>>> difference, e.g. by making plain "en" or "ur" imply that the script
>>> subtag had been suppressed?
>>
>> A rule that the script must be indicated would be the exact opposite
>> of what we were trying to accomplish in 2005, which is backward
>> compatibility with the huge volume of existing language-tagged data.
>
> I was thinking of a rule applying to that index of mine, not one for
> universal use.

For your personal use, where the probability that English text is in Latin script is much lower than normal, I might suggest using "en-Latn". That way you don't need to consult the Registry to guess the script.

I'm skipping over the remainder, which continues to claim that not using a script subtag for languages overwhelmingly written in a single script creates problematic ambiguity. We have heard very little in the past 15 years to suggest that either Suppress-Script or the overall philosophy of "tag content wisely" has greatly compromised the usefulness of BCP 47 tags. Maybe people just aren't telling us.

--
Doug Ewell | Thornton, CO, US | ewellic.org
