Re: [Ietf-languages] Suggestion to update Urdu Script Designation in the subtag registry

Richard Wordingham <richard.wordingham@ntlworld.com> Fri, 14 August 2020 12:32 UTC

Date: Fri, 14 Aug 2020 13:32:05 +0100
From: Richard Wordingham <richard.wordingham@ntlworld.com>
To: "ietf-languages@ietf.org" <ietf-languages@ietf.org>
Message-ID: <20200814133205.2f891466@JRWUBU2>
In-Reply-To: <000001d671fa$4618c740$d24a55c0$@ewellic.org>
References: <CY4PR0401MB36203305BEFEBF938B654E8FC6420@CY4PR0401MB3620.namprd04.prod.outlook.com> <000201d670e8$d25e7e60$771b7b20$@ewellic.org> <CY4PR0401MB362045E1E4D11D92E1F89443C6420@CY4PR0401MB3620.namprd04.prod.outlook.com> <001a01d670ed$9c868530$d5938f90$@ewellic.org> <f4fa9f5c-3bb6-6b27-f294-7df9e0afa3d4@w3.org> <MWHPR1301MB21120388068B8E68EB6C8DE586430@MWHPR1301MB2112.namprd13.prod.outlook.com> <000001d6719a$9c3c7b40$d4b571c0$@ewellic.org> <20200813202934.3b348a9d@JRWUBU2> <001c01d671b5$efad4e60$cf07eb20$@ewellic.org> <20200814012621.2c6a9b69@JRWUBU2> <000001d671fa$4618c740$d24a55c0$@ewellic.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/2-qMpF2BYfmRJs-MdyLiWQ6IYY0>
Subject: Re: [Ietf-languages] Suggestion to update Urdu Script Designation in the subtag registry

On Thu, 13 Aug 2020 23:17:57 -0600
"Doug Ewell" <doug@ewellic.org> wrote:

> Richard Wordingham wrote:
> 
> >>> Then, when I came to use the catalogue, I would know that those
> >>> labelled as "en-Thai" were in the Thai script, but for those
> >>> labelled "en", I should be unsure of the script; it would be
> >>> improper to assume that the script was the Latin script, though
> >>> that would be the best **guess**.  
> >  
> >> Maybe I'm unclear about the difference between "assume" and "guess"
> >> in this context.  
> >
> > If what I did with one of them depended on the script, then if I
> > "guess" I need to check the script: if I "assume", I don't check.  
> 
> Um, OK.
> 
> > Now, I can't rely on the suppress-script field to assume that "en"
> > means "en-Latn"; that would be an improper use of the field.  Is
> > that correct?  
> 
> I have the feeling I'm being set up for something, but here goes:

Well, I had the strong feeling that what you have been saying would
*mostly* work, so I'm constructing a plausible situation where it seems
not to work.  Beyond that, I'm just asking you to check my logic. 

> The Suppress-Script field on 'en' means that IF content tagged as
> "en" is written at all — it could be spoken, or even some other
> modality — it is PROBABLY written in the Latin script. You cannot be
> 100% certain that it is written, or that it is written in the Latin
> script. If you need 100% certainty, the text should be tagged as
> "en-Latn".

> But it does not follow from this that all English text written in the
> Latin script should be tagged "en-Latn". Most of the time, the
> assumption is correct and adequate. This is why the word SHOULD is
> used, instead of MUST.

That seems to answer the issue.  If the language tag is describing
something that already exists, then if one may need to know the script
from just the tag, one should not heed suppress-script - which is why
the word 'SHOULD' is used.
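
As a rough illustration of the guess/assume distinction, here is a
minimal Python sketch.  The table holds a single registry record
('en' with Suppress-Script 'Latn') copied by hand; the function name
script_of and the three certainty labels are hypothetical, and the
parsing is deliberately simplified (grandfathered tags are ignored).

```python
# Illustrative excerpt only: the real registry has many more records.
SUPPRESS_SCRIPT = {"en": "Latn"}


def script_of(tag):
    """Return (script, certainty) for a language tag.

    certainty is "stated" when the tag carries a script subtag,
    "guess" when the script is only inferred from Suppress-Script,
    and "unknown" when neither source applies (script is then None).
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence;
            # nothing after it can be a script subtag.
            break
        if len(sub) == 4 and sub.isalpha():
            # Four letters after the primary language: a script subtag.
            return sub.title(), "stated"
    default = SUPPRESS_SCRIPT.get(language)
    if default is not None:
        return default, "guess"
    return None, "unknown"
```

A caller whose behaviour depends on the script would treat "stated" as
something it may assume, and "guess" as something it still has to check
against the text itself.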

> > Surely an English text in your conscript Ewellic should be labelled
> > "en-Zzzz", for 'uncoded script'.  
> 
> I could do that, or I could use a script subtag in the private-use
> area (e.g. "en-Qabe") or a private-use subtag (e.g. "en-x-ewellic").
> Each of these approaches has its pros and cons.
> 
> > So are you saying that the script is only 'undetermined' if an
> > attempt to determine its script code has failed?  
> 
> That is my understanding. I welcome the opinions of others on this
> question. I don't think it is covered in RFC 5646, and I don't
> believe the subject has come up on this list before.

> Do you have a concrete use case surrounding this?

I was coming to the conclusion that the ambiguity between "en" for
"en-Latn" and "en" for "English, but I won't tell you the script", was
resolved by the tags Zxxx, Zyyy and Zzzz.  Then the latter meaning for
text gets translated to en-Zyyy, and then the abbreviation of en-Latn
to en is resolved by the suppress-script.
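
A sketch of that convention, as it might apply to a private index
rather than to BCP 47 in general.  The function names index_tag and
index_script are hypothetical, the Suppress-Script table is a one-entry
excerpt, and only tags of the form language or language-script are
handled.

```python
# Illustrative excerpt only: the real registry has many more records.
SUPPRESS_SCRIPT = {"en": "Latn"}


def index_tag(language, script=None):
    """Build the tag stored in the index for a (language, script) pair."""
    if script is None:
        # "I won't tell you the script" is recorded explicitly as Zyyy.
        return f"{language}-Zyyy"
    if SUPPRESS_SCRIPT.get(language) == script:
        # The default script is abbreviated away, as Suppress-Script asks.
        return language
    return f"{language}-{script}"


def index_script(tag):
    """Recover the script from a tag written under the index convention."""
    language, _, script = tag.partition("-")
    # Within this index a bare tag means "script suppressed"; a language
    # with no Suppress-Script entry falls back to Zyyy (undetermined).
    return script or SUPPRESS_SCRIPT.get(language, "Zyyy")
```

Under this rule a bare "en" in the index always reads back as Latn,
while English of unknown script is stored as "en-Zyyy".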

> > You seem to be saying that my index SHOULD NOT use a BCP 47 tag to
> > record whether a text in English is in the Latin script.  On the
> > other hand, it could be used to record the script of Northern Thai
> > texts.  
> 
> In your everyday life, it is not necessary under normal circumstances
> to point out that your car has four wheels, because the overwhelming
> majority of cars do.

But this information could be vital for positioning it over an
inspection pit.

> That is how I view Suppress-Script. The distinction is not between
> English per se and Northern Thai per se, nor between the Latin script
> and other scripts.

> > Would a rule that the script must be indicated somehow make a
> > difference, e.g. by making plain "en" or "ur" imply that the script
> > subtag had been suppressed?  

> A rule that the script must be indicated would be the exact opposite
> of what we were trying to accomplish in 2005, which is backward
> compatibility with the huge volume of existing language-tagged data.

I was thinking of a rule applying to that index of mine, not one for
universal use.

If we need to know the script of an item, then the pre-existing
language tagging simply lacks the information - we need to treat it as
undetermined.  I thought the more critical issue was old tag processors
that would choke if the script were provided.  The downside of the
solution was that for existing items, the script would need to be
determined if it was needed.  What can be silly is that absence of
information is intended to be the usual case.  Often, however, the
script needs to be determined on a character by character basis, so
tagging by script is unnecessary and can be ignored.  There can be a
big practical difference between saying, 'this paragraph is in the
Devanagari script' and saying, 'the interface language is to use the
Devanagari script'. (There are some odd effects around though - I've
seen Thai line-breaking applied to stuff in a word-spacing 'complex'
script because I hadn't overridden the default language for complex
scripts, and Thai has its own paragraph justification options.)

> Suppression of a script subtag isn't usually that kind of active
> thought process, like "I really want to indicate that this English
> text is in Latin script, but BCP 47 said I mustn't." It's more a
> matter of not stating that your car has four wheels, which might be
> obvious enough without saying so, or which might not even be relevant.

But it matters when choosing an inspection pit!  But a garage would
probably check how many wheels the car had, rather than rely on the
owner's statement.

Richard.