Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-06.txt

Rob Sayre <sayrer@gmail.com> Tue, 03 October 2023 17:49 UTC

From: Rob Sayre <sayrer@gmail.com>
Date: Tue, 03 Oct 2023 10:49:16 -0700
Message-ID: <CAChr6SziQWhg9AA7k3vJtn3FrV4jMen7edpyXj7qguZ2uLgtew@mail.gmail.com>
To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>
Cc: Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/nnpmmwZJVjo14rWnrHgbpXtNRfI>

I also went and looked at the Fetch standard.* This is like XMLHttpRequest
but newer, and it was specified after I got out of the browser game. The
easiest thing to read is the Servo Rust version of this (all browsers are
amazing, but are usually written in pretty gnarly C++).

https://github.com/servo/servo/blob/bd9c17234c330e6a2f01fdc472a9dd81a0ea05c5/components/script/body.rs#L822

If you dig through the WHATWG document, you'll find that the error mode is
"replacement". So, you can see there that the browser gets a buffer,
decodes it as UTF-8, which may introduce replacement characters (the escape
sequences are not dealt with yet), converts that to UTF-16, and then calls
JSON.parse, where many of our not-so-favorite code points might occur via
escape sequences.

I think this is a good example, and it validates the approach in the draft.
Here, we see that not only do web browsers send questionable strings, they
also process them that way:

1) lossy UTF-8 decoding
2) convert to UTF-16
3) convert that to possibly questionable Unicode after processing escape
sequences
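
A minimal sketch of the three steps above, in Rust with only the standard
library. This is not Servo's code: the function names are hypothetical, and
step 3 handles a single \uXXXX escape rather than a full JSON parser. It only
shows where a lone surrogate can appear after the UTF-8 layer has already
been "cleaned".

```rust
// Hypothetical helper names for illustration; only the std calls are real APIs.

/// Step 1: lossy UTF-8 decoding. Invalid byte sequences become U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Step 2: convert the decoded text to UTF-16 code units.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// Step 3 (heavily simplified): turn one JSON `\uXXXX` escape into a
/// UTF-16 code unit. Returns None if the input is not exactly that shape.
fn unescape_unit(escape: &str) -> Option<u16> {
    let hex = escape.strip_prefix("\\u")?;
    if hex.len() != 4 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u16::from_str_radix(hex, 16).ok()
}
```

For example, unescape_unit("\\uD800") yields the code unit 0xD800, and
String::from_utf16 rejects a buffer holding only that unit, because a lone
surrogate is not a valid Unicode scalar value.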

Gross, but you can still get sensible Unicode through there by following
this draft. It is also more efficient to use correct UTF-8, because the
decoder won't allocate as much. If you avoid escape sequences, the JSON
parser will also go faster. I don't write this as if browsers are the last
word, but there sure are a lot of them, and it's tough to change every
website, so they tend to converge on convenient solutions. They can still
accommodate change (like HTTPS or QUIC), but it's a tall order.
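
As a rough illustration of the allocation point, here is how Rust's standard
library behaves: String::from_utf8_lossy returns a borrowed string when the
input is already valid UTF-8, and only allocates a new one when it has to
insert replacement characters. This is an assumption-laden analogy, not a
statement about the decoder any browser actually uses; the helper name is
hypothetical.

```rust
use std::borrow::Cow;

/// Returns true when lossy decoding had to build a new, owned string,
/// which happens only if the input contained invalid UTF-8.
fn lossy_decode_allocates(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Owned(_))
}
```

With valid input such as "héllo".as_bytes() the function returns false; with
b"h\xFFllo" it returns true.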

thanks,
Rob

* https://fetch.spec.whatwg.org/#ref-for-dom-body-json

On Mon, Oct 2, 2023 at 5:32 PM Rob Sayre <sayrer@gmail.com> wrote:

> On Mon, Oct 2, 2023 at 5:21 PM Manger, James <James.H.Manger=40team.telstra.com@dmarc.ietf.org> wrote:
>
>> draft-bray-unichars
>> <https://datatracker.ietf.org/doc/html/draft-bray-unichars> §3 “Dealing
>> with problematic code points” suggests “replacing problematic code points
>> with "�" (U+FFFD, REPLACEMENT CHARACTER)” (or signalling an error, but I’ll
>> only talk about the replacement option in this email).
>>
>
> It's probably not worth dwelling on. But I will get into this one, at the
> risk of sounding like a major dork. This issue is already decided, so we
> are only describing what already happens:
>
>
> https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy
>
> The unichars document should not unreservedly recommend a replacement
> character, because replacement is lossy.
>
> thanks,
> Rob
>
>
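
On the lossiness point in the quoted message, a small sketch of what "lossy"
means in practice with the from_utf8_lossy method linked there: two different
invalid inputs can decode to the same string, so the original bytes cannot be
recovered. The helper name is hypothetical.

```rust
/// Returns true when two distinct byte strings become indistinguishable
/// after lossy UTF-8 decoding.
fn collide_after_lossy(a: &[u8], b: &[u8]) -> bool {
    a != b && String::from_utf8_lossy(a) == String::from_utf8_lossy(b)
}
```

For instance, b"id-\xFF" and b"id-\xFE" differ as bytes, but each invalid byte
is replaced by U+FFFD, so both decode to the same text.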