Re: [I18ndir] draft-bray-unichars-01

Nico Williams <nico@cryptonector.com> Wed, 30 August 2023 17:40 UTC

Date: Wed, 30 Aug 2023 12:40:44 -0500
From: Nico Williams <nico@cryptonector.com>
To: Tim Bray <tbray@textuality.com>
Cc: Asmus Freytag <asmusf@ix.netcom.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/lNg3vNqHKKeaNxc82BapHsM2WYc>

On Wed, Aug 30, 2023 at 09:48:21AM -0700, Tim Bray wrote:
> On Aug 29, 2023 at 12:10:42 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> > This subset has the advantage of excluding surrogates, which can never add
> > any value and have the potential to cause problems.
> >
> > should be reworded a bit to:
> >
> > "This subset has the advantage of excluding surrogates, which are not
> > assigned to any characters, and thus can never add any value.
> > They have the potential to cause problems, for example it is not possible
> > to represent them individually in UTF-8."
> 
> The Unicode Standard says that surrogates can only be used in UTF-16 and
> RFC3629 explicitly says they are prohibited. Having said that, it’s easy to
> generate UTF-8 that includes them (start by truncating a Java “char” array
> incautiously, or just have a bad day while programming in C) and most
> (all?) language implementations per Postel’s law will happily give you that
> surrogate code point. That’s the problem; although it’s forbidden it can be
> done and it does happen.  So I think that instead of “not possible” we
> should say something like “… potential to cause problems; while it is
> possible to include them in UTF-8 strings, this is forbidden by the
> specification of UTF-8”.

You can always use a JSON escape sequence to represent an unpaired
surrogate codepoint.  And who knows, maybe on parsing it will be
unescaped but not substituted.
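
To illustrate, here is a minimal Python sketch of that case. It assumes CPython's standard json module, which happens to accept a lone surrogate escape; other parsers may reject or replace it, so treat this as one observed behavior rather than a rule.

```python
import json

# A JSON text whose string value is a single escaped high surrogate.
doc = '"\\ud800"'

# CPython's json module unescapes the lone surrogate as-is
# rather than substituting U+FFFD.
value = json.loads(doc)
assert len(value) == 1
assert ord(value) == 0xD800

# The resulting str cannot be encoded as well-formed UTF-8.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Re-serializing escapes it again (ensure_ascii=True is the default).
round_trip = json.dumps(value)
```

So the unpaired surrogate survives a parse/serialize round trip in escaped form, even though it can never appear in the UTF-8 encoding itself.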

Ideally no software should produce unpaired surrogates, even escaped
ones, and ideally all parsers should replace or escape invalid UTF-8,
and perhaps also replace escapes that do not denote valid Unicode.
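
A sketch of the "replace" option, again in Python and purely illustrative: the helper names replace_surrogates and sanitize are hypothetical, and using U+FFFD as the substitute is an assumption, not something any spec cited here requires.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER (assumed policy)

def replace_surrogates(s: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) in s."""
    return "".join(
        REPLACEMENT if 0xD800 <= ord(c) <= 0xDFFF else c for c in s
    )

def sanitize(value):
    """Walk a parsed JSON value and clean all strings, including keys."""
    if isinstance(value, str):
        return replace_surrogates(value)
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    if isinstance(value, dict):
        return {sanitize(k): sanitize(v) for k, v in value.items()}
    return value

# "name" holds an unpaired low surrogate; "ok" holds a valid escaped pair,
# which json.loads combines into a single code point (U+1F600).
parsed = sanitize(json.loads('{"name": "a\\udc00b", "ok": "\\ud83d\\ude00"}'))
```

After this pass every string in the parsed value can be encoded as UTF-8, at the cost of losing the original (invalid) input.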

(It's not even clear that JSON parsers should unescape escapes.  RFC
8259 is silent on this matter.  The text in section 8.2 seems to hint
that a parser should either unescape to the max or escape to the max,
but then section 8.3 more strongly implies no requirement.)
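
For reference, the string-comparison concern in section 8.3 can be sketched as follows. The two JSON texts are the "a\\b" / "a\u005Cb" pair that RFC 8259 itself uses as its example; the use of Python's json module to unescape them is my assumption for illustration.

```python
import json

# Two JSON texts that spell the same string with different escapes.
raw_a = r'"a\\b"'      # escaped backslash
raw_b = r'"a\u005Cb"'  # the same backslash as a \u escape

# Compared unconverted, the texts differ.
texts_equal = raw_a == raw_b

# Compared after unescaping, they denote the same three code points.
values_equal = json.loads(raw_a) == json.loads(raw_b)
```

A parser that leaves escapes unconverted would report these as different strings, which is the interoperability risk at issue.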

Nico
--