Re: [Ietf-languages] Suppress-Script, assumptions, and guesses (was: RE: Suggestion to update Urdu Script Designation in the subtag registry)

Richard Wordingham <richard.wordingham@ntlworld.com> Fri, 14 August 2020 19:27 UTC

Date: Fri, 14 Aug 2020 20:26:51 +0100
From: Richard Wordingham <richard.wordingham@ntlworld.com>
To: ietf-languages@ietf.org
Message-ID: <20200814202651.237b8c4f@JRWUBU2>
In-Reply-To: <000201d67258$73cfb910$5b6f2b30$@ewellic.org>
References: <000201d67258$73cfb910$5b6f2b30$@ewellic.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-languages/KwHMoqCT4ieA7ka554QNvapcur8>

On Fri, 14 Aug 2020 10:32:07 -0600
"Doug Ewell" <doug@ewellic.org> wrote:

> Richard Wordingham wrote:

> If the content already exists, and one has retrieved it, and the
> language tag doesn't reveal the script, one can figure out what
> script it is written in by inspecting the content. This is just like
> the mechanic inspecting the car to figure out how many wheels it has.

The "various texts cluttering my house" aren't all in the same room
as one another.  To inspect the text, I would have to get it out from
its location, which could require a sequential search through a
document wallet.  (Perhaps random access would be possible after the
indexing exercise.)

> I confess that I don't understand this exercise.

> > I was coming to the conclusion that the ambiguity between "en" for
> > "en-Latn" and "en" for "English, but I won't tell you the script",
> > was resolved by the tags Zxxx, Zyyy and Zzzz.  Then the latter
> > meaning for text gets translated to en-Zyyy, and then the
> > abbreviation of en-Latn to en is resolved by the suppress-script.  
> 
> 'Zxxx' is a different animal: "Code for unwritten documents" from ISO
> 15924 is interpreted here to mean that this content is not written.
> It doesn't necessarily mean that the language of this content is
> never written.

Yes, I'd figured that this would handle audio CDs if I were including
them in the index.  It would also cover pure pictures.

> If I tag textual content as "en", I am not saying anything about the
> script. I am leaving it up to you, the recipient of the content, to
> figure out what the script is. Because it is English, the great
> likelihood is that the script is Latin. One way to find out would be
> to inspect the content.

Not easy.  And even if the material were online, determining the script
of a bit map is not always straightforward.

> Alternatively, you COULD go to the Registry and look at the
> Suppress-Script value of 'Latn' to give you a strong hint that the
> text is in Latin script. But that is not a guarantee; "en" is also a
> perfectly valid tag for English written in Runic. 

I do have some en-Runr-NZ stuff!

> If I tag it as "en-Latn", I am explicitly saying the script is Latin.
> I would hope that I am doing so for a reason, and not just adding a
> script subtag automatically because I didn't read Section 4.1. Maybe
> I am anticipating that readers will call into question whether the
> text is in Latin script, and need to know without looking at the
> content.

I thought we'd already agreed that this was the correct solution in my
case.

> If I tag it as "en-Zyyy", that implies that I, the tagger, don't know
> what script it is in. It seems unlikely to me that I would know a
> text is in English but don't know the script.

The mechanism I had in mind is to read the cover and not look at the
content.  It's not very likely, but not impossible.  I also
deliberately didn't mention the complication of different paragraphs
having different language-script combinations.

> I'm skipping over the remainder, which continues to claim that not
> using a script subtag for languages overwhelmingly written in a
> single script creates problematic ambiguity. We have heard very
> little in the past 15 years that either Suppress-Script or the
> overall philosophy of "tag content wisely" has greatly compromised
> the usefulness of BCP 47 tags. Maybe people just aren't telling us.

I did point out that usually whatever process needs to know the script of
some text can look at the text itself, and wouldn't bother about any
label.  That knowledge is usually required at a lower granularity than
the labelling would naturally be applied at.

However, there may be cases where 'likely tags' are being used, so that
'en' just gets translated to 'en-Latn'.  If that process goes wrong,
the most likely response is advice to specify the script, implying that
'en' on its own would be wrong for 'en-Thai'.
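
To make the failure mode concrete, here is a minimal sketch in Python of
the kind of expansion meant above.  It is an illustration only, not any
real library's 'likely tags' algorithm: the function name expand_script
and the one-entry table are hypothetical, the only Suppress-Script value
used is the 'Latn' for 'en' mentioned earlier in this thread, and
extlang, private-use and grandfathered tags are deliberately ignored.

```python
# Hypothetical one-entry excerpt; the real values live in the IANA
# Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn"}


def expand_script(tag, table=SUPPRESS_SCRIPT):
    """Insert the suppressed script into a tag that has no script subtag.

    Simplifying assumption: the tag is language[-script][-region]...,
    with no extlang, private-use or grandfathered forms.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()

    # In this simplified shape, a four-letter alphabetic second subtag
    # is a script subtag, so the tag is already explicit.
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        return tag

    script = table.get(language)
    if script is None:
        # No Suppress-Script known for this language: leave the tag alone.
        return tag

    return "-".join([subtags[0], script] + subtags[1:])


if __name__ == "__main__":
    for tag in ("en", "en-NZ", "en-Runr-NZ", "en-Thai"):
        print(tag, "->", expand_script(tag))
```

The expansion never sees the content.  English text in Thai script that
was tagged plain 'en' comes out as 'en-Latn', which is the case where
the advice becomes "specify the script".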
  
Richard.