Re: [I18ndir] Fwd: New Version Notification for draft-bray-unichars-04.txt

Asmus Freytag <asmusf@ix.netcom.com> Fri, 15 September 2023 21:01 UTC

To: Tim Bray <tbray@textuality.com>, ART Area <art@ietf.org>, i18ndir@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/-xgCeZ0Q4y6GpR4drYMqh6tdnLs>

This time I looked only at the diffs and sometimes a bit of adjacent text.

This first one is not major, but a small fix would avoid a contradiction 
in terms.

> The numbers assigned to Unicode characters are called “code points”;
This is backwards, as the existence of "unassigned" code points shows. Easy to fix:

> The numbers to which Unicode characters are assigned are called “code 
> points”;

That matches the way the Unicode Standard talks about this process and 
makes "unassigned" code points no longer something that seemingly 
contradicts their definition.
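
To illustrate the direction of assignment, here is a minimal Python sketch. The helper name `describe` is mine, and U+50000 is only an example of a code point that I assume is unassigned in the Unicode version shipped with your Python (planes 4 through 13 currently contain no assigned characters).

```python
import unicodedata

def describe(cp: int) -> str:
    """Report what, if anything, is assigned to a code point."""
    ch = chr(cp)                       # every integer in 0..0x10FFFF is a code point
    name = unicodedata.name(ch, None)  # None when no named character is assigned there
    cat = unicodedata.category(ch)     # "Cn" means unassigned (or noncharacter)
    return f"U+{cp:04X} category={cat} name={name}"

print(describe(0x0041))   # a code point with a character assigned to it
print(describe(0x50000))  # a code point that exists but has nothing assigned (assumption)
```

The code point exists in both cases; only in the first has a character been assigned to it.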

In the paragraph above, this instance also describes assignment 
backwards:

> each Unicode character is assigned an integer identifier in the range 
> U+0000-U+10FFFF.  These numbers are used to
Simple fix, also adding "unique" for precision:

> each Unicode character is assigned to a unique integer identifier in 
> the range U+0000-U+10FFFF.  These numbers are used to

---

New text on dealing with problematic code points:

I like the new section. However, I'm troubled by some of the wording 
related to RFC9413, which I believe can be misunderstood and even cited 
out of context to support some dangerous strategies.

RFC9413 states:

> However, ignoring faulty or ambiguous input is almost always the 
> incorrect solution to the problem.

Because silently ignoring individual code points can be used to evade 
detection of malicious input, this should not be understood as "ignoring 
faulty characters individually", but as "ignoring text fields with 
faulty characters".

The distinction is crucial, and is what gave rise to Unicode's 
recommendation on how to treat ill-formed UTF-8.

There are known attack techniques that rely on part of a string being 
discarded silently. For example, an attacker can add an unpaired 
surrogate to foil matching against known malicious content. If the 
surrogate is later discarded, the remaining string then represents an 
attack payload that escaped the defenses. This was the impetus for 
Unicode to add the recommendations you cite.
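
A minimal Python sketch of that kind of attack follows. The blocklist, the function names, and the `<script>` payload are all hypothetical; the point is only that a later stage which silently drops the unpaired surrogate reassembles what the filter was meant to catch.

```python
BLOCKLIST = ("<script>",)  # hypothetical list of known-bad substrings

def looks_malicious(text: str) -> bool:
    """Naive signature match, standing in for a real filter."""
    return any(sig in text for sig in BLOCKLIST)

def lossy_cleanup(text: str) -> str:
    """Anti-pattern: silently drop whatever cannot be encoded as UTF-8."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")

# An unpaired surrogate (U+D800) splits the signature in two.
payload = "<scr\ud800ipt>"

passed_filter = not looks_malicious(payload)  # the filter sees no match
after_cleanup = lossy_cleanup(payload)        # the surrogate is discarded silently

print(passed_filter)
print(after_cleanup)
```

Had the filter treated the whole field as ill-formed as soon as it saw the surrogate, the payload would never have reached the cleanup stage.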

I'm not suggesting that you necessarily delve into too much detail 
here, but that you introduce the concept that a single ill-formed part 
of a string makes the whole text field ill-formed, and that the 
recommendation of RFC9413 should therefore never apply to single 
characters or code points in isolation.

Here's suggested text:

> In applying the recommendations of RFC9413 for text fields containing 
> ill-formed UTF-8, for example, the recommendations must be applied to 
> the field as a whole, not on the character or byte level. In fact, 
> silently ignoring an ill-formed part of a string is a known security 
> risk. Responding to that risk, [UNICODE] section 3.2 ....

The last paragraph is overselling RFC9413, because the phrasing 
conceivably implies that it contains guidance specific to code points, 
when it is more generically concerned with problematic input. It also 
doesn't flow particularly well.

You could move it to the head of Section 5, with a tweak:

> Problematic code points are an example of problematic input. 
> [RFC9413], "Maintaining Robust Protocols", provides a thorough 
> discussion of error-handling options when choosing a strategy for 
> dealing with problematic input. Different types of problematic code 
> points cause different issues.
>
> Noncharacters....

(I'm also suggesting adding a sentence to make the transition.)

This way, you put RFC9413 in perspective before relying on it later in 
the text, and you also don't inadvertently set up a contrast between 
Unicode's recommendation and those of RFC9413. What Unicode does is 
give a specification for the case where you don't want to or cannot 
discard the whole text field. And it clarifies that on the character or 
code point level, silently discarding part of the text is a big security 
no-no.
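
The two field-level options can be sketched in Python as follows. The function names are mine, and the sample bytes are a hypothetical input; Python's built-in `errors="replace"` handler is used here as a stand-in for U+FFFD substitution, not as a claim about any particular protocol's required behavior.

```python
from typing import Optional

def decode_field_strict(raw: bytes) -> Optional[str]:
    """Option 1: one ill-formed sequence makes the whole field unusable."""
    try:
        return raw.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError:
        return None                 # caller rejects the field as a whole

def decode_field_replacing(raw: bytes) -> str:
    """Option 2: keep the field, but mark every ill-formed part with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

raw = b"<scr\xffipt>"  # 0xFF can never appear in well-formed UTF-8

print(decode_field_strict(raw))
print(decode_field_replacing(raw))
```

In neither case does the ill-formed byte vanish without a trace, so the two halves of the string are never silently joined.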

With these fixes, OK to ship it.

A./