Re: [idn] IDNA section 3.1 requirement 3

"Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord.cnri.reston.va.us> Sun, 27 March 2005 23:28 UTC

Date: Sun, 27 Mar 2005 23:25:12 +0000
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord.cnri.reston.va.us>
To: IETF idn working group <idn@ops.ietf.org>
Subject: Re: [idn] IDNA section 3.1 requirement 3

I have been persuaded that the recommendations in my proposal were too
narrow, and did not leave enough room for implementations to experiment
with alternative ways of presenting suspicious labels, like color and
decoration.  Here's a less restrictive proposal:

--begin--

[section 3.1]

     3) When a domain label occupying or obtained from a domain name
        slot is to be shown to a user, it SHOULD NOT simply be shown
        in whatever form it was found in; before being shown it SHOULD
        be forced into either ASCII form (which can be obtained by
        applying ToASCII) or non-ACE form (which can be obtained by
        applying ToUnicode, see section 4).  Implementors are encouraged
        to develop policies that balance the conflicting goals of not
        showing unintelligible ACE strings and not showing misleading
        Unicode strings.  See appendix A for suggestions.  When the user
        has explicitly requested to see one form or the other, that
        form SHOULD be shown.  When requirements 2 and 3 both apply,
        requirement 2 takes precedence.

[appendix A]

    This appendix offers suggestions only, not recommendations or
    requirements.

    For labels that are ACEs or have ACE forms, there are various
    factors that an application can consider when deciding how to
    display the label to a user.

    The ACE form is unsuitable for presentation to a user because it
    is unintelligible, unrecognizable, not very useful, and quite
    unfriendly.  Its one redeeming feature is that it is ASCII-only,
    and therefore has the best chance of being displayable and
    copy-and-pastable, and the least chance of containing misleading
    characters.

    The non-ACE form is intelligible to the user (and therefore much
    friendlier and more useful), if it is displayable.  But in an
    environment that cannot handle the characters, the non-ACE form
    could turn out to be even less useful than the ACE form.

    The non-ACE form can be misleading, by containing characters that
    look like delimiters (for example, U+2044 looks like a slash), or
    that look like characters in other scripts (for example, many Latin,
    Greek, and Cyrillic letters look alike).

    The misleading-label problem is the most complex to deal with.  A
    general approach is to identify suspicious characters, and then
    use some means to avoid displaying the suspicious characters, or
    to display them safely.  Below are a few (but certainly not all)
    possible methods.  Implementations are free to experiment and
    innovate.

    Identifying suspicious characters:

        Some characters could be considered suspicious in all contexts;
        for example, all characters outside Unicode categories L
        (letter), N (number), and M (mark), except U+002D hyphen-minus.

        Some characters could be considered suspicious in labels that
        are children of (or descendants of) certain domains.  For
        example, in a domain whose registration rules are believed to
        avoid confusion only for certain scripts, characters outside
        those scripts could be considered suspicious.  In a domain
        believed to have no restrictions on registered names, all
        non-ASCII characters could be considered suspicious.

        Some characters could be considered suspicious depending on what
        other characters appear in the same label.  For example, in a
        label containing Cyrillic, Greek, and Latin characters, one of
        those scripts could be chosen as the main script (possibly the
        one that appears first, or appears most often), and the others
        could be considered suspicious.

    Avoiding displaying suspicious characters:

        Showing the ACE form avoids displaying all non-ASCII characters,
        but see above for the disadvantages of the ACE form.

        Showing a replacement character for the suspicious characters,
        while displaying non-suspicious characters normally, will have a
        friendlier appearance than the ACE form, but is likely to break
        copy-and-paste.

        In some contexts, an escape mechanism is available that can be
        used to obscure characters.  For example, in an Internationalized
        Resource Identifier (IRI), any character can be represented by
        percent-encoded UTF-8.  This will interject some ugliness, but
        is still likely to have a friendlier appearance than the ACE
        form, and will not break copy-and-paste.

    Displaying suspicious characters safely:

        Color, highlighting, underlining, etc. could be used to flag
        suspicious characters.  One concern with this approach is
        whether the user will understand the significance of such
        markings.

--end--
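
To make the proposal concrete, here is a minimal sketch (not part of
the proposed text) of requirement 3 combined with the first heuristic
from appendix A.  It assumes Python's built-in "idna" codec, which
implements the RFC 3490 operations, as a stand-in for ToASCII and
ToUnicode.  The function names and the display policy itself are
hypothetical, chosen only for illustration.  Unlike a strict
ToUnicode, this sketch raises UnicodeError on malformed input instead
of returning the input unchanged.

```python
import unicodedata


def to_ascii(label: str) -> str:
    """ToASCII for a single label, via Python's built-in IDNA codec."""
    return label.encode("idna").decode("ascii")


def to_unicode(label: str) -> str:
    """ToUnicode for a single label; accepts ACE or non-ACE input."""
    return label.encode("idna").decode("idna")


def suspicious_in_all_contexts(ch: str) -> bool:
    """Appendix A, first heuristic: anything outside Unicode categories
    L (letter), N (number), and M (mark), except U+002D hyphen-minus."""
    if ch == "\u002d":
        return False
    return unicodedata.category(ch)[0] not in ("L", "N", "M")


def display_label(label: str, prefer_unicode: bool = True) -> str:
    """Hypothetical policy: never show the label as found.  Show the
    non-ACE form unless it contains a suspicious character or the
    caller asked for ASCII; otherwise show the ACE form."""
    ace = to_ascii(label)
    if not prefer_unicode:
        return ace
    non_ace = to_unicode(label)
    if any(suspicious_in_all_contexts(ch) for ch in non_ace):
        return ace
    return non_ace
```

The prefer_unicode flag stands in for "the user has explicitly
requested to see one form or the other".  A real implementation would
likely add the per-domain and mixed-script checks as well, and might
prefer decoration over falling back to the ACE form.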

The rest of this message contains a few responses to some criticisms
of my previous proposal.  The responses are probably moot, since that
proposal has now been replaced, but anyway...

Paul Hoffman <phoffman@imc.org> wrote:

> > As long as they don't see "paypal" or whatever, they're at least not
> > being misled.
>
> The example given at the start of your message was the
> homograph-slash.  A spoof domain name that has that character and
> "paypal" will *still* display "paypal" in the Punycode.

True.  My explanation was a bit sloppy.  Let me try again:  As long as
the user doesn't see the misleading characters, the user is not being
misled.  If the misleading character is a Cyrillic "a" in "paypal", then
it's enough to avoid displaying the Cyrillic "a".  If the misleading
character is a slash-homograph immediately following "paypal.com", it's
enough to avoid displaying the slash-homograph.
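
As a rough illustration of the Cyrillic "a" case, the sketch below
picks the most frequent script in a label as the main script, treats
letters from other scripts as suspicious, and shows only those
characters as percent-encoded UTF-8 (the IRI escape mentioned in
appendix A).  Python's standard library has no script property, so
the script is guessed from the first word of the Unicode character
name; that shortcut, like the function names, is an assumption made
for this example and would misfire on labels that legitimately mix
scripts (Japanese, for instance).

```python
import unicodedata
from urllib.parse import quote


def script_of(ch: str) -> str:
    """Rough script guess for letters: first word of the Unicode
    character name (e.g. "LATIN", "CYRILLIC").  Non-letters give ""."""
    if not unicodedata.category(ch).startswith("L"):
        return ""
    return unicodedata.name(ch, "").split(" ")[0]


def suspicious_indexes(label: str) -> list:
    """Indexes of letters whose script differs from the main script.
    The main script is the most frequent one; ties go to the script
    that appears first in the label."""
    counts = {}
    for ch in label:
        script = script_of(ch)
        if script:
            counts[script] = counts.get(script, 0) + 1
    if len(counts) < 2:
        return []
    main = max(counts, key=counts.get)
    return [i for i, ch in enumerate(label)
            if script_of(ch) not in ("", main)]


def escape_suspicious(label: str) -> str:
    """Show suspicious characters as percent-encoded UTF-8 and leave
    the remaining characters untouched."""
    bad = set(suspicious_indexes(label))
    return "".join(quote(ch, safe="") if i in bad else ch
                   for i, ch in enumerate(label))


# "paypal" spelled with U+0430 CYRILLIC SMALL LETTER A in second place
print(escape_suspicious("p\u0430ypal"))  # p%D0%B0ypal
```

The escaped form never shows the word "paypal", and in an IRI
context it can still be copied and pasted.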

> As you remember from a few years ago, cut-and-paste often doesn't work
> reliably with non-ASCII characters even under good conditions.  This
> seems like a red herring.

For me, the inability to copy and paste URIs into and out of a browser's
location field would be a major inconvenience, probably enough to
make me abandon the browser in favor of another.  Once IRIs become
widespread, I expect I'll feel the same way about them.

Mark Davis <mark.davis@jtcsv.com> wrote:

> Use of the raw punycode in place of the represented characters will
> cause more user confusion, not less.

> User sees:
>
> xn--tlralit-byabbe.fr versus xn--tlralit-byabb390f.fr
>
> Presented with a collection of apparently random letters, eyes quickly
> glaze over, and people really can't distinguish between two names
> in any sensible fashion.  Users are not going to memorize which
> gobbledygook is the one they want.

I agree, but that's beside the point.  I don't expect anyone to
recognize an ACE, or distinguish between one ACE and another.  I
expect people to consider all ACEs as unrecognized domains.  I expect
registrants who want recognizable domains to pick domain names that
won't display as ACE to the target audience.

AMC