Re: [Ietf-languages] Forms for subtag kmpre20c

Élie Roux <> Sun, 01 December 2019 19:58 UTC

From: Élie Roux <>
Date: Sun, 1 Dec 2019 20:57:58 +0100
To: Richard Wordingham <>
Cc:, Trent Walker <>

> > Whatever is in Khmer script that's written in an old spelling. How
> > does it matter?
> You are not the only consumer of subtagging.  For example, etymologies
> on the English Wiktionary often tag words as 'Old Khmer'.

Hmm, sure, but in my mind this is a different issue: if the -kmpre20c
subtag were to exist, I can't really see why it couldn't be applied to
anything written in Khmer script using non-reformed spelling: Old,
Middle, or Modern Khmer, all sorts of dialects, Pali, Sanskrit, Chinese,
etc. That's what I meant; I'm not sure how you interpreted my point.

> So why does that need a tag?

To change the Lucene analyzer and for the UI (for each string in our
database, we have a tag, and there's a little popup saying what
language the string is in). Note that we're dealing mostly with
low-resource languages, and sometimes non-standardized scripts and
transliterations, so we had to come up with a very lengthy list of
private lang tags: and
some conventions to refer to the scripts (and script flavors) that are
not in Unicode:
(we didn't include the SEA scripts yet, but there are quite a few).
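
To make the "change the analyzer based on the tag" idea concrete, here
is a minimal Python sketch. It is only an illustration under stated
assumptions: the analyzer labels ("khmer_old_spelling", "standard"), the
lookup table, and the helper names are hypothetical and are not taken
from our actual code or from Lucene's API, and the tag splitting is a
rough approximation rather than a full RFC 5646 parser.

```python
from typing import NamedTuple, Optional, Tuple


class ParsedTag(NamedTuple):
    language: str
    script: Optional[str]
    variants: Tuple[str, ...]


def parse_tag(tag: str) -> ParsedTag:
    """Roughly split a BCP 47-style tag into language, script and variants.

    Simplified on purpose: region subtags are ignored, and parsing stops
    at the first singleton (extension or private-use section).
    """
    parts = tag.split("-")
    language = parts[0].lower()
    script: Optional[str] = None
    variants = []
    for part in parts[1:]:
        if len(part) == 1:
            break  # singleton such as "x" or "u": stop here
        if script is None and not variants and len(part) == 4 and part.isalpha():
            script = part.title()  # script subtags are four letters, e.g. "Khmr"
        elif 5 <= len(part) <= 8 and part.isalnum():
            variants.append(part.lower())  # variants are 5 to 8 alphanumerics
    return ParsedTag(language, script, tuple(variants))


# Hypothetical mapping from a variant subtag to an analyzer label.
ANALYZER_BY_VARIANT = {
    "kmpre20c": "khmer_old_spelling",
}


def choose_analyzer(tag: str, default: str = "standard") -> str:
    """Return the analyzer label for the first variant we know about."""
    for variant in parse_tag(tag).variants:
        if variant in ANALYZER_BY_VARIANT:
            return ANALYZER_BY_VARIANT[variant]
    return default
```

With this sketch, the choice depends only on the variant, so the same
old-spelling analyzer would be picked for "km-Khmr-kmpre20c" and for
"pi-Khmr-kmpre20c", which matches the point above that the subtag is
about spelling conventions rather than about a particular language.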

>  What of a spelling that used the glottal
> stop letter ('QA') when Chuon Nath prescribes a vowel-specific vowel
> letter?  Even though the original spelling is in a way more modern
> ('before its time'), would not that have two spellings recorded for it?

Well, you lost me here; my knowledge of Khmer is close to
non-existent, but I'm copying Trent Walker, who has a vast
knowledge of these matters!

> Is this database restricted to manuscripts?

Currently yes; in the future we could imagine cataloging modern publications.

> OT: How do you handle glyph differences that are close to being
> character differences?  One example, if I understand Antelme, is that
> from a lumper's perspective U+17BE used to look like *<U+17C1 E,
> U+17B7 I>, but this proscribed combination contrasts with U+17BE in
> Thai (perhaps in a wider sense).

Another question I won't pretend to be able to answer :)

> I hope this is for identification purposes.  Several systems seem to
> have been concurrent.

Several in the 20th century, and before that there doesn't seem to be
any study, so I'm not even sure we can confidently speak of concurrent
systems; it might just have been very wild.

> Your account hasn't made clear when the use of CVC orthographic
> syllables was largely abandoned.  That's not a phonetic v. etymological
> difference.  Are we looking at at least three distinguishable writing
> systems?

Another question that's not for me...

> That doesn't stop manuscripts being classified as 'Middle English',
> despite its extreme lack of homogeneity.

Well, I really have no clue about the evolution of the Khmer language;
in my mind the tag was only about the evolution of spelling
conventions. The database we have has titles in old spelling + new
spelling; it's just a mapping of characters and not a change of
language (if I understand correctly). So in my mind this is a
different situation from the Middle English / English situation, where
the language itself is different, not just the spelling.

> OT: 'Pāli written using old Khmer spelling' is an interesting concept.
> Would you care to educate me by elaborating?

Hmm, I would love to do so, but I'm not sure I'm the best person... I
can refer you to which
has some excellent scholarship and scans of leporellos, some of
which contain Pali written in the old spelling.

>  I've just been surprised
> by the form of some 150 year old printed Thai-script Pali.  It's
> different to the two living orthographies supported on Wiktionary.

Hmm, that could be an entry point to an interesting discussion
(perhaps to be had outside the list). Can you tell me what you're
referring to?