Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Rob Sayre <sayrer@gmail.com> Mon, 11 September 2023 02:40 UTC

References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <ME3PR01MB59730B45D9339180AF00E941E5F3A@ME3PR01MB5973.ausprd01.prod.outlook.com> <CAHBU6ivc4W3KyYtbK2H7PQUa8C4+g=73nSTgBK+xLXnzH7V6GA@mail.gmail.com> <ME3PR01MB5973C8061732F354E5C7F242E5F2A@ME3PR01MB5973.ausprd01.prod.outlook.com>
In-Reply-To: <ME3PR01MB5973C8061732F354E5C7F242E5F2A@ME3PR01MB5973.ausprd01.prod.outlook.com>
From: Rob Sayre <sayrer@gmail.com>
Date: Sun, 10 Sep 2023 19:39:49 -0700
Message-ID: <CAChr6Sx9hMCDKy8eg_uHBmm2OYusQSkVC-OiME9h7E-WWj-ESQ@mail.gmail.com>
To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>
Cc: Tim Bray <tbray@textuality.com>, Asmus Freytag <asmusf@ix.netcom.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Content-Type: multipart/alternative; boundary="00000000000033bd7306050c3e09"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/Bc9HYgx0sksMy92nvQtwOhFxJpI>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On Sun, Sep 10, 2023 at 7:26 PM Manger, James
<James.H.Manger=40team.telstra.com@dmarc.ietf.org> wrote:

>
>
> Java correctly detects an attempt to UTF-8-encode a lone surrogate as
> wrong. CharsetEncoder.encode does throw a MalformedInputException. But Java
> helpfully offers 3 ways to handle malformed-input and unmappable-character
> errors: CodingErrorAction.IGNORE|REPLACE|REPORT. The REPLACE option drops
> the erroneous input, appending a replacement value – which is what
> String.getBytes(Charset) is defined to do.
>
>
>
> You can (with at least some libraries) round-trip JSON with lone
> surrogates in Java – but the output has to be JSON (with the lone surrogate
> again represented with an escape).
>

It is so much worse than that. See:
https://stackoverflow.com/a/7921064/5965814
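
For anyone following along, here is a minimal sketch of the two behaviours
James describes, assuming a stock JDK. The class and method names
(LoneSurrogateDemo, encodeStrict, encodeLenient) are made up for
illustration; only the java.nio.charset calls are real API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LoneSurrogateDemo {
    // Strict path (hypothetical helper): returns the UTF-8 bytes, or null
    // when the encoder reports malformed input such as a lone surrogate.
    static byte[] encodeStrict(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer out = enc.encode(CharBuffer.wrap(s));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return null;
        }
    }

    // Lenient path (hypothetical helper): String.getBytes(Charset) replaces
    // malformed input with the charset's replacement bytes instead of failing.
    static byte[] encodeLenient(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', a lone high surrogate (U+D800), 'b'
        String lone = "a" + (char) 0xD800 + "b";

        System.out.println("strict rejected: " + (encodeStrict(lone) == null));

        byte[] lenient = encodeLenient(lone);
        // The lone surrogate is silently replaced, so the original is lost.
        System.out.println("lenient length: " + lenient.length
                + ", middle byte: 0x" + Integer.toHexString(lenient[1] & 0xFF));
    }
}
```

The lenient path is the one that bites in practice: the caller gets bytes
back with no indication that the input was altered.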

I don't think we should document any of this, because there is an economic
incentive not to do anything of the sort: a server can avoid allocations
if everyone keeps the data in clean, correct UTF-8. I think newer JVM/Java
releases get this right, too.

I know this piece of annoying trivia because we measured UTF-8 to
UTF-16/UCS-2 processing costs in Mozilla Firefox, and they were
surprisingly high. At the time, the OS strings meant that we were going to
pay either way, so it was not worth refactoring Firefox strings to use
UTF-8 internally.

These days, the OS strings are getting better.

thanks,
Rob