Re: [Idna-update] IDNA and combining sequences (was: Re: Expiration impending: <draft-klensin-idna-rfc5891bis-01.txt>)

"Asmus Freytag (c)" <asmusf@ix.netcom.com> Thu, 15 March 2018 21:01 UTC

To: John Levine <johnl@taugh.com>, idna-update@ietf.org
References: <20180315194256.CF68F22C3BA1@ary.local>
From: "Asmus Freytag (c)" <asmusf@ix.netcom.com>
Message-ID: <647d97cf-4b2a-c5f6-194b-c6887c5e4947@ix.netcom.com>
Date: Thu, 15 Mar 2018 14:01:28 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/idna-update/cxqOqiyjjyIPt4hHRn3mXOauU_g>

On 3/15/2018 12:42 PM, John Levine wrote:
> In article <1420573e-4d9d-7853-ffdc-7fc7c2290598@ix.netcom.com> you write:
>>> I think I can follow them OK.  The characters each are characterized
>>> as consonant, various types of vowels, tones, and diacritics.  The
>>> ordering rules would make more sense to me as regular expressions.
>> Check out the bottom of the HTML version of each scripts LGR file.
>>
>> Click on
>>
>> https://www.icann.org/sites/default/files/lgr/lgr-2-thai-script-01jun17-en.html
>>
>> and go to WLE Rules. You should see a table of regexes.
> I was more thinking of regexes for an entire valid string, not for
> each rule, but close enough.

John, the rules are designed this way for a purpose. The reason is that 
the most common scenario is the need to prevent some Y from following 
some X, where the Y is typically a combining mark.

Sometimes, script rendering will "ligate" certain combinations, but fail
if incompatible ones are encountered; in that case you may see right-hand
contexts, or both left and right contexts.
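
To make the shape of such rules concrete, here is a minimal sketch in
Python. The classes X and Y and the helper violates() are hypothetical,
chosen only for illustration; they are not taken from any published LGR.
A left context maps naturally onto a regex look-behind, a right context
onto a look-ahead.

```python
import re

# Hypothetical classes, for illustration only (not from any real LGR).
X = "ab"            # code points that must not be followed by Y
Y = "\u0301\u0308"  # combining acute accent, combining diaeresis

# Left context only: a Y is invalid when the preceding code point is in X.
y_after_x = re.compile(f"(?<=[{X}])[{Y}]")

# Left and right context: a Y is invalid only between two code points in X.
y_between_x = re.compile(f"(?<=[{X}])[{Y}](?=[{X}])")

def violates(rule, label):
    """Return True if the forbidden context occurs anywhere in the label."""
    return rule.search(label) is not None
```

Each rule stays small enough to review on its own, which is the point of
writing them this way.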

We originally thought that, for the Indic scripts, we could start with an
Akshar (syllable), for which BNF definitions exist. A label would
then be a concatenation of valid Akshars. The attempt resulted in
horrifically complex-looking rules that were impossible to review and
tended to be overly restrictive: there are many strings that are a bit
unusual/uncommon, but not "broken". Sometimes the available Akshar rules
were language-specific.

For all of these reasons, but mainly to be able to break down things 
into easily understood bits, we ended up with the system we have now.

Now, given that an LGR in the XML format of RFC 7940 is fully machine
parsable, it could be an exercise for the reader to compile a regex for
full labels. I think that should be possible; however, I've not attempted
to prove this - there may be some syntax construction that is
difficult/impossible to translate into a regex for the whole label. All I
know is that I can translate each of the context rules to regexes (that's
how the tool I wrote evaluates labels based on an LGR), but I apply them
iteratively at each character position in the string rather than relying
on the regex mechanism to scan the label.
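
As a rough illustration of that per-position evaluation (a sketch under
stated assumptions, not the actual tool): the table NOT_WHEN, the Context
type and first_invalid_position() below are hypothetical names, and the
repertoire is made up. Each code point carries an optional "not-when"
context; the left part is matched against the text before the position
and the right part against the text after it.

```python
import re
from typing import NamedTuple, Optional

class Context(NamedTuple):
    left: str   # regex for what immediately precedes the code point
    right: str  # regex for what immediately follows it

# Hypothetical repertoire: code point -> "not-when" context (None = no rule).
NOT_WHEN = {
    "a": None,
    "b": None,
    "c": None,
    # Combining acute: not at the start of the label, not after itself.
    "\u0301": Context(left=r"^|\u0301", right=r""),
}

def first_invalid_position(label: str) -> Optional[int]:
    """Walk the label; return the index of the first invalid code point."""
    for i, cp in enumerate(label):
        if cp not in NOT_WHEN:
            return i  # code point is not in the repertoire at all
        ctx = NOT_WHEN[cp]
        if ctx is None:
            continue
        # The left context must end exactly at position i.
        left_hit = re.search(rf"(?:{ctx.left})\Z", label[:i]) is not None
        # The right context must start exactly after position i.
        right_hit = re.match(ctx.right, label[i + 1:]) is not None
        if left_hit and right_hit:
            return i  # the forbidden context applies here
    return None
```

Evaluating position by position also makes it easy to report where a
label fails, which a single whole-label regex would not do by itself.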

A./