Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

John C Klensin <john-ietf@jck.com> Thu, 19 March 2015 19:29 UTC

Date: Thu, 19 Mar 2015 15:29:30 -0400
From: John C Klensin <john-ietf@jck.com>
To: Shawn Steele <Shawn.Steele@microsoft.com>, "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
Message-ID: <B26BCFD19372D553F50138A9@JcK-HP8200.jck.com>
In-Reply-To: <BLUPR03MB1378985F9780A98646E7B31B82010@BLUPR03MB1378.namprd03.prod.outlook.com>
References: <20150311013300.GC12479@dyn.com> <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com> <55008F97.8040701@ix.netcom.com> <CA+9kkMAcgSA1Ch0B9W1Np0LMn2udegZ=AzU1b26dAi+SDcbGgg@mail.gmail.com> <CY1PR0301MB07310C68F6CFDD46AE22086F82190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150311200941.GV15037@mx1.yitter.info> <CY1PR0301MB0731F4EBE5EB5C3340F7059282190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150319014018.GI5743@mx1.yitter.info> <BLUPR03MB1378184CE32E928A3086665582010@BLUPR03MB1378.namprd03.prod.outlook.com> <20150319023029.GA6046@mx1.yitter.info> <BLUPR03MB137886903F15000BB01E3F5882010@BLUPR03MB1378.namprd03.prod.outlook.com> <A62526FD387D08270363E96E@JcK-HP8200.jck.com> <550B0A32.8080704@ix.netcom.com> <BLUPR03MB1378985F9780A98646E7B31B82010@BLUPR03MB1378.namprd03.prod.outlook.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/zdDiZWgTLUfOa1Pu_g5rpvzecFg>
Cc: lucid@ietf.org, Andrew Sullivan <ajs@anvilwalrusden.com>
Subject: Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]


--On Thursday, March 19, 2015 18:29 +0000 Shawn Steele
<Shawn.Steele@microsoft.com> wrote:

>> Any of these forms, if they can be guaranteed to be stable,
>> will satisfy that condition, but the point about the
>> discussion is that identifiers that are "reasonably mnemonic"
>> as one could characterize IDNs, do occupy a space between
>> machine and human interaction with writing systems.
> 
> This discussion seems to be about lots of things, but
> "reasonably mnemonic" would limit the problem hugely.  That
> implies that I can see it and remember what it was to type it
> later.  However that scenario doesn't hit any of the security
> concerns being raised as I'm going to type it using the
> appropriate keyboard for my language.  Sure, if the domain had
> a Cyrillic l may have trouble, however that's not mnemonic,
> that's someone being sneaky.  (See fraud below)

Shawn, for better or worse, "reasonably mnemonic" might be an
attractive way to think about things, but it has never been a
requirement the community has chosen to impose on the DNS (even
though ICANN has chosen to impose variations on it on domains it
controls).  See RFC 2181 for some rather strong statements to
the contrary.  "No mnemonic requirement" is actually important
outside the IDN space because there are a number of applications
that use strings as owner names that, on inspection, we might
guess were hashes.  Lack of willingness to impose such a
restriction is also the main reason why IDNA has never
contained a "no mixed scripts" rule (there is also a technical
reason, but we could have gotten around that one).

>> Actually, Arabic even has potential "variants" among the
>> digits, so while you can make the identifier be based on the
>> numeric value of what the human wrote in whatever writing
>> system, you get into issues of recognition of identifiers
>> etc. once they are communicated in their human-readable form.
> 
> That's digressing from my point.  I was trying to say that, if
> we consider the problem without any human factors, we already
> have unique IDs. 

We also, as you point out, have no justification for IDNs.  The
community has, for better or worse, thoroughly rejected that
position, so I hope we can move on.

> There's no way that machines confuse any of
> these names, it's only the humans that do that.

Quite the contrary.  If the same string can be coded as
different sequences, depending on human-invisible distinctions
such as the choice of IME or host, and a comparison procedure
then treats those sequences as different, then, in practical
terms, it is the machines that are confused.  That is precisely
why we
normalize before comparing and one of the key reasons why
IDNA2008 DISALLOWs a lot of code points that might be
problematic for machine comparison purposes.   By contrast,
case-folding and DISALLOWing some other characters are about the
potential for human confusion because machines know perfectly
well that upper and lower case characters (no matter how coded
as long as they are coded distinctly) are different.   Probably
part of the present problem is that we may need to be a little
more precise about those distinctions than we have been in the
past but, while that would change our vocabulary, it would not
change the problem.
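
The distinction above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the sample
strings and helper names are hypothetical, and str.casefold()
stands in for whatever case mapping an application might apply
(it is not the IDNA2008 procedure itself).

```python
import unicodedata


def same_code_points(a: str, b: str) -> bool:
    """Raw machine comparison: equal only if the code point sequences match."""
    return a == b


def same_after_nfc(a: str, b: str) -> bool:
    """Compare after NFC, which removes purely coding-level differences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Three codings of what a reader sees as the same letter (illustrative choice).
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN, canonically equivalent to U+00C5

# Coding artifacts: without normalization the machine sees different strings.
coding_raw = [same_code_points(precomposed, s) for s in (decomposed, angstrom)]
coding_nfc = [same_after_nfc(precomposed, s) for s in (decomposed, angstrom)]

# Case is a different matter: the characters are coded distinctly and
# NFC keeps them distinct. Only an explicit case mapping merges them.
case_nfc = same_after_nfc("Example", "example")
case_folded = "Example".casefold() == "example".casefold()

print(coding_raw, coding_nfc, case_nfc, case_folded)
```

Normalization repairs the first kind of difference; the second is
unified only if a rule is added to say so.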

Some of the above might help with fraud detection or prevention,
but the requirements exist without any consideration of fraud.

Please read and try to understand the distinction I tried to
make between coding artifacts and character confusion in my
previous response to you and try to work your way through the
discussion of cases in draft-klensin-idna-5892upd-unicode70-04
-- they really are important to this.

>...
> That's an additional requirement to the "reasonably mnemonic"
> above.  However if this entire exercise is intended to
> counteract fraud, then we get my view that "trying to address
> these codepoints is way more costly than its benefit".  

It is not entirely or even largely to counteract fraud.  Things
like ICANN's Label Generation Rules might well be defined in
terms of fraud or at least confusion.  I encourage you to read
about them and comment.  But, these particular issues, as
described in draft-sullivan-lucid-prob-stmt and
draft-klensin-idna-5892upd-unicode70 (-04 and anything later),
are really about resolving different coding for the same
character, and not confusion or fraud except incidentally.
There is a big issue in that about what "same character" means,
but that still isn't about fraud or even about confusion between
pairs of characters that almost everyone agrees are different.

> The people in this room aren't going to click on
> "1stbank.trustme.com", but everyone else on the planet (or at
> least 90% of them) click that link.  So worrying about someone
> spelling it "lstbank.com" provides close to zero value.  At
> least in the right font 50% of the other humans might start
> suspecting that one.

And that, while interesting, seems to me completely irrelevant
to the current discussion.   If you want to make comments on the
fraud case, please explain how the "humans in this room" are
expected to distinguish between "nür" (with the middle
character coded as U+00FC) and "nür" (with the middle character
coded as U+0075 U+0308).  Applying NFC to both strings handles
that case perfectly well and that is why we do it.  Part of the
issue here is the cases that NFC does not handle as well.
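
As a minimal sketch of that point, using Python's standard
unicodedata module: the first pair is the "nür" example above.
The second pair (U+08A1 versus U+0628 U+0654) is an assumption
about which kind of case is meant, chosen as an illustration of a
pair that NFC leaves distinct; it is not named in this message.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC form of a string."""
    return unicodedata.normalize("NFC", s)


# The two codings of "nür" discussed above.
precomposed = "n\u00FCr"    # middle character coded as U+00FC
decomposed = "nu\u0308r"    # middle character coded as U+0075 U+0308

raw_equal = precomposed == decomposed              # code point comparison
nfc_equal = nfc(precomposed) == nfc(decomposed)    # comparison after NFC

# Assumed illustration of a case NFC does not unify: U+08A1 has no
# canonical decomposition, so it never matches U+0628 U+0654.
beh_with_hamza = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_plus_hamza = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

nfc_unifies_beh = nfc(beh_with_hamza) == nfc(beh_plus_hamza)

print(raw_equal, nfc_equal, nfc_unifies_beh)
```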

>...
> (This total obliviousness to security/spoofing/phishing isn't
> even limited to DNS, I've received calls because of billing
> glitches and the bank says "you can pay now, give us your CC
> or check routing info".  That's normal.  They they ask my
> secret code or whatever to verify it's me, and I point out
> that I can't even verify that they're who they say they are
> and they're stumped.  Well, you can call us back at
> 1-800-123-4567.  Seriously!?!?!  Even my bank has no clue what
> security is?)

Of course.  And I've had endless arguments with various
financial institutions, both about their procedures and their
barriers to letting me use even rudimentary self-protective
measures.  But it has nothing to do with the present question
whether or not I carry out my fantasy of posting all of those
examples to rants-are-us.example or, better yet, the more
elegant (or perverse, depending on one's perspective)
rants-я-us.example.

> Fix that, and then we can talk about whether it's worth fixing
> a couple esoteric code points.  There are FAR easier exploits
> for any serious attacker to attack.  It's certainly not worth
> risking breaking real strings and real mnemonics that real
> users might want to use.  (IMO its not even worth prohibiting
> I♥NY.com)

Sorry, but the community believes you are in the rough.
Incidentally, one of the main reasons why the second character
in I♥NY is DISALLOWED (and hence why the string is invalid as
a U-label) also has nothing to do with fraud but is associated
with difficulties in precise enough description of symbols in
screen readers and other text-to-speech applications that they
can be typed back in.  That is another human consideration of
course... too bad about those humans, the world would be so much
neater without them.
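
For readers who want to see where a symbol like that falls, here
is a rough sketch. It checks only the General_Category test behind
the LetterDigits set in RFC 5892, Section 2.1, and it is not the
full IDNA2008 derivation (which has further rules, including ones
that exclude uppercase letters). The function name is hypothetical.

```python
import unicodedata

# General categories that make up LetterDigits in RFC 5892, Section 2.1.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def outside_letter_digits(label):
    """List (code point, category) for characters outside LetterDigits.

    Simplification: characters that pass this test can still be
    DISALLOWED by other rules of the real algorithm.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in LETTER_DIGIT_CATEGORIES
    ]


flagged = outside_letter_digits("I\u2665NY")  # the heart is U+2665
print(flagged)
```

The heart is the only character flagged, because its category is
"So" (Symbol, other).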

>...

I've probably said too much already -- please try to forgive my
frustration after struggling with the real problem for more months than
I want to think about and being regularly distracted by people
who want to talk about confusion.  Again, please read the two
I-Ds which, at least IMO, actually do give a decent description
of what the problem is thought to be or at least parts of that
puzzle.   IMO, if either uses words like "confusion" or "fraud",
they are historical errors grounded in ways of thinking about
other problems.  

If you want to talk about confusion and fraud, they are really
interesting discussions on which a lot of good people have spent
a lot of time, but they are really not what this is about.

 best,
    john