Re: [Ietf-languages] Forms for subtag kmpre20c

Richard Wordingham <> Mon, 02 December 2019 14:12 UTC

On Sun, 1 Dec 2019 20:57:58 +0100
Élie Roux <> wrote:

> > > Whatever is in Khmer script that's written in an old spelling. How
> > > does it matter?  
> >
> > You are not the only consumer of subtagging.  For example,
> > etymologies on the English Wiktionary often tag words as 'Old
> > Khmer'.  
> Hmm sure, but in my mind this is a different issue: if the -kmpre20c
> were to exist, I can't really see why it couldn't be applied to
> anything written in Khmer script using non-reformed spelling: Old,
> middle, modern Khmer, all sorts of dialects, Pali, Sanskrit, Chinese,
> etc. That's what I meant, I'm not sure how you interpreted my point
> though?

If that's truly the case, the proper tag is und-Khmr.  You then hit the
problem that language tagging doesn't handle exclusions.  At least,
Michael Everson said it doesn't and I have no reason to disbelieve
him.  It also makes sense to me as a policy.

> > So why does that need a tag?  
> To change the Lucene analyzer and for the UI (for each string in our
> database, we have a tag, and there's a little popup saying what
> language is the string). Note that we're dealing mostly with
> low-resource languages, and sometimes non-standardized scripts and
> transliterations; so we had to come up with a very lengthy list of
> private lang tags:
> and
> some conventions to refer to the scripts (and script flavors) that are
> not in Unicode:
> (we didn't include the SEA scripts yet, but there are quite a few).

And this immediately undermines the previous generality, as it includes
things like pi-Khmr.
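
For concreteness, here is a minimal sketch of how the tags mentioned in this thread decompose. It assumes only the simplified shape language[-Script][-variant...] in canonical casing, not the full BCP 47 grammar; the helper name split_tag is hypothetical, and kmpre20c is the proposed subtag under discussion, not a registered one. Note that there is no slot anywhere in the tag for an exclusion.

```python
import re

# Simplified shape only: language[-Script][-variant...].
# This is NOT the full BCP 47 grammar (no extlang, region, extensions,
# private use), and it assumes canonical casing although real tags are
# case-insensitive.
TAG_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$"
)


def split_tag(tag):
    """Split a simple language tag into language, script and variants."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError("not a simple language[-Script][-variant] tag: " + tag)
    # The variants group looks like "-kmpre20c"; drop the empty first piece.
    variants = [v for v in match.group("variants").split("-") if v]
    return {
        "language": match.group("language"),
        "script": match.group("script"),
        "variants": variants,
    }


if __name__ == "__main__":
    # 'und' = undetermined language; 'pi' = Pali; 'km' = Khmer.
    for tag in ("und-Khmr", "pi-Khmr", "km-kmpre20c"):
        print(tag, split_tag(tag))
```

A tag can only add positive information (a language, a script, variants); "Khmer script, but not the reformed spelling" has no representation in this structure.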

> > Is this database restricted to manuscripts?  
> Currently yes, in the future we could imagine cataloging modern
> publications.

What about printed Khmer-script missionary texts from 1893 printed in
Hong Kong?

> Several in the XXth century, and before that there doesn't seem to be
> any study; so I'm not even sure we can confidently speak of concurrent
> systems, it might have been just very wild.

> > That doesn't stop manuscripts being classified as 'Middle English',
> > despite its extreme lack of homogeneity.  
> Well, I really have no clue about the evolution of the Khmer language,
> in my mind the tag was only about the evolution of spelling
> conventions: the database we have has titles in old spelling + new
> spelling, it's just a mapping of characters and not a change of
> language (if I understand correctly). So in my mind this is a
> different situation from the Middle English / English situation where
> the language itself is different, not just the spelling.

Most living languages are continually changing, though I can't comment
on how hard it is to apply a knowledge of Modern Khmer to understand
Old Khmer and Middle Khmer.  Some sources talk of a change of script,
though what that entails I don't know.  Certainly vowel symbols have
been added (apparently based on Thai additions), and the distinction
between Indic b/v has been reinstated.  Wikipedia starts Modern Khmer
in the later 18th century.

The fact that Middle English as a whole certainly lacked a standard
orthography is no bar to its being classified as a language.  The lack
of standards is therefore not of itself a bar to adding variants for
Old Khmer (if you truly have such materials) or Middle Khmer and, I
suggest, is not necessarily a bar to 'pre 20th century' Modern Khmer.
It might be necessary to provide evidence of a time depth to the
variations - in which case we should try to get the experts involved. A
lack of agreement on classifying substantial documents could be a bar,
though one should expect borderline documents.  How far back the BCP 47
language 'km' should go is another matter, probably not within IETF

> > OT: 'Pāli written using old Khmer spelling' is an interesting
> > concept. Would you care to educate me by elaborating?  
> Hmm, I would love to do so but I'm not sure I'm the best person... I
> can refer you to which
> has some most excellent scholarship and scans of Leporellos, some of
> which contain Pali using the old spelling.
> >  I've just been surprised
> > by the form of some 150 year old printed Thai-script Pali.  It's
> > different to the two living orthographies supported on Wiktionary.  
> Hmm, that could be an entry point to an interesting discussion
> (perhaps to be had outside the list), can you tell me what you're
> referring to?

I was chasing up the references in ,
in particular to the book "Pali Phonetic Edition" "พระไตรปิฎกสัชฌายะ
ฉบับเสียงอ่านปาฬิ", available at  The Wikipedia
article currently only documents the academic, abugida writing of Pali
in the Thai script, with a very misleading reference to the
'alphabetical' writing found in most books of chants. The book shows a
third scheme in use in a 19th century publication.  The third scheme is
primarily also an abugida, but has (encoded) vowel killers (two!)
different to the one used as a vowel killer nowadays, and marks
short /a/ in closed syllables with mai hanakat in much the same way as
the corresponding Khmer symbol samyok sannya is used. Michell's
"Siamese-English dictionary" (1892) noted that Siamese grammarians
regarded this as a vowel symbol.  The third scheme seems to have a
name, 'karayut', so it seems almost ready for registration.  However, I
haven't found names for the modern abugidic and alphabetic systems of
writing Pali.  One practical application would be to distinguish the
variants in spelling dictionaries.  It's the modern, unnamed varieties
for which spelling dictionaries would have most use.