Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-06.txt

Steffen Nurpmeso <steffen@sdaoden.eu> Tue, 03 October 2023 16:04 UTC

Date: Tue, 03 Oct 2023 18:04:13 +0200
Author: Steffen Nurpmeso <steffen@sdaoden.eu>
From: Steffen Nurpmeso <steffen@sdaoden.eu>
To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>
Cc: Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>, Steffen Nurpmeso <steffen@sdaoden.eu>
Message-ID: <20231003160413.zxbWD%steffen@sdaoden.eu>
In-Reply-To: <SY4PR01MB59803C733B6B6A1C9D4E04F4E5C5A@SY4PR01MB5980.ausprd01.prod.outlook.com>
References: <169566019635.41806.9804796677919971070@ietfa.amsl.com> <CAHBU6is-wU2NLXNWL56nSJ4=nKvDzGv_Aw4qJN6N2O8CuM4-yw@mail.gmail.com> <SYBPR01MB59814B3448F5754AAEDA1740E5C7A@SYBPR01MB5981.ausprd01.prod.outlook.com> <CAHBU6iueqtd5T1T-ciYUMWvmo8XqBQqO5LkWbdRaoXQzPYSQOQ@mail.gmail.com> <SY4PR01MB5980D009F1623E3694B871B7E5C5A@SY4PR01MB5980.ausprd01.prod.outlook.com> <CAChr6SzMXqmEJvwQ0Vb0+CfchBn2kMueQJ-2Th1=4Oct8b9t6A@mail.gmail.com> <E1464943-EB11-4FA4-B933-4F138C6C34A0@tzi.org> <CAHBU6itgC07j0P5DcACDyHSjEOG6=j5kWE=eYF8E0NA3mm_b5A@mail.gmail.com> <SY4PR01MB59803C733B6B6A1C9D4E04F4E5C5A@SY4PR01MB5980.ausprd01.prod.outlook.com>
Mail-Followup-To: "Manger, James" <James.H.Manger=40team.telstra.com@dmarc.ietf.org>, Tim Bray <tbray@textuality.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>, Steffen Nurpmeso <steffen@sdaoden.eu>
User-Agent: s-nail v14.9.24-524-gd5f7c65f62
OpenPGP: id=EE19E1C1F2F7054F8D3954D8308964B51883A0DD; url=https://ftp.sdaoden.eu/steffen.asc; preference=signencrypt
BlahBlahBlah: Any stupid boy can crush a beetle. But all the professors in the world can make no bugs.
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/hWUcezo-NjAmwdreem0jAwwhp98>
Subject: Re: [I18ndir] [art] New Version Notification for draft-bray-unichars-06.txt

Manger, James wrote in
 <SY4PR01MB59803C733B6B6A1C9D4E04F4E5C5A@SY4PR01MB5980.ausprd01.prod.outlook.com>:
 |[1]draft-bray-unichars[/1] §3 “Dealing with problematic code points”
 |suggests “replacing problematic code points with "�" (U+FFFD,
 |REPLACEMENT CHARACTER)” (or signalling an error, but I’ll only talk
 |about the replacement option in this email).
 |
 |  [1] https://datatracker.ietf.org/doc/html/draft-bray-unichars
 |
 |* An ill-formed sequence of code units needs to be replaced. It is
 |far less obvious to me that “problematic” scalars should be replaced.
 |Even for noncharacters Unicode provides a good [2]FAQ[/2] and
 |[3]corrigendum #9 “Clarification about noncharacters”[/3] that
 |suggest passing them along (treating them like unassigned scalars)
 |is often the best policy (because the internal/interchange boundary
 |is blurry).
 |So §4.3 defining unicode-assignable that excludes noncharacters is
 |fine -- when to be lenient on receiving a supposed unicode-assignable
 |value is less obvious.
 |But §3 looks dodgy.
 |
 |* U+FFFD is an obvious choice to replace code units or scalars you
 |don’t want. But Unicode does allow choices. [4]Unicode ch3[/4] C10
 |only says “with a marker such as U+FFFD”. [5]Unicode TR36[/5] says
 |“where U+FFFD is not available, a common alternative is "?"”. Java,
 |for instance, uses “?” in some common circumstances. Unichars does
 |not admit such an option.

The situation on the iconv(3) (POSIX, Portable Operating System
Interface) front really is a disaster.  Let me quote a code
comment by the honourable Bruno Haible of GNU iconv, and add to it:

  /* Irix iconv() inserts a NUL byte if it cannot convert.
     NetBSD iconv() inserts a question mark if it cannot convert.

("Citrus" that is, also used on some other BSDs.)
The small musl C library (very neat, with lots of "impressive" code
snippets), as used for example by Alpine Linux, uses an asterisk (*).

     Only GNU libiconv and GNU libc are known to prefer to fail rather
     than doing a lossy conversion.  */

Irix made an incredibly bad choice.
It is all totally opaque to programmers anyhow, since you know
nothing about at least one of the character sets involved: neither
the "resolved name", nor whether it is multi-byte or "multi-word"
or "multi-multi-{word,multi}", nor whether it has "state", nor
whether ASCII NUL can be a legitimate part (of "unused" "word"s),
nothing.

Off-topic, but I am still hoping we get a dedicated, _addressable_
behaviour switch for iconv(3) that does _not_ overload the EILSEQ
error (in that case there is no "invalid input", just a missing
conversion to the output character set).
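
To make the ambiguity concrete, here is a minimal sketch of what
a caller can observe from iconv(3) today.  The helper name
convert_report() and the enum are hypothetical, invented for this
illustration; only iconv_open(), iconv(), iconv_close(), and EILSEQ
come from POSIX.  It assumes a single-shot conversion into a buffer
that is large enough, and it does not flush shift state.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of one conversion attempt. */
enum conv_result { CONV_OK, CONV_LOSSY, CONV_REJECTED, CONV_ERROR };

/* Convert the NUL-terminated string `in` from charset `from` to
 * charset `to`, writing a NUL-terminated result into `out`. */
static enum conv_result
convert_report(const char *to, const char *from, const char *in,
               char *out, size_t outsize)
{
    if (outsize == 0)
        return CONV_ERROR;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return CONV_ERROR;

    char *ip = (char *)in;      /* iconv() does not modify the input */
    size_t il = strlen(in);
    char *op = out;
    size_t ol = outsize - 1;    /* keep room for the terminator */

    size_t rv = iconv(cd, &ip, &il, &op, &ol);
    int saved = errno;          /* iconv_close() may clobber errno */
    iconv_close(cd);
    *op = '\0';

    if (rv == (size_t)-1)
        /* EILSEQ covers both "input is invalid" and, on some
         * implementations, "input is fine but not representable". */
        return saved == EILSEQ ? CONV_REJECTED : CONV_ERROR;

    /* A positive count means non-reversible substitutions happened,
     * e.g. a '?' or '*' was inserted by the implementation. */
    return rv > 0 ? CONV_LOSSY : CONV_OK;
}
```

Converting "\xC3\xA4" (U+00E4) from UTF-8 to ASCII should come back
as CONV_REJECTED on implementations that prefer to fail and as
CONV_LOSSY on those that substitute; which one you get is exactly
the non-portable part.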

Of course, all of this concerns the boundaries between different
character sets where non-Unicode ones are involved.

Just to add that the most widely used conversion library, GNU iconv,
has for many years done the following:

      UTF-8: Reject surrogates and out-of-range code points.
      * lib/utf8.h (utf8_mbtowc, utf8_wctomb): Reject code points in the
      range 0xD800..0xDFFF and >= 0x110000.
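
The rule in that changelog entry can be sketched in a few lines.
This is not the GNU code; is_unicode_scalar() and utf8_encode()
are hypothetical names for this illustration, and the encoder
simply refuses (returns 0) instead of reporting a specific error.

```c
#include <stddef.h>
#include <stdint.h>

/* A Unicode scalar value is any code point up to 0x10FFFF that is
 * not a surrogate (0xD800..0xDFFF). */
static int is_unicode_scalar(uint32_t cp)
{
    return cp < 0x110000u && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

/* Encode `cp` as UTF-8 into `out`; return the number of bytes
 * written (1..4), or 0 if `cp` is rejected. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (!is_unicode_scalar(cp))
        return 0;
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    out[0] = (unsigned char)(0xF0u | (cp >> 18));
    out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 4;
}
```

With this, utf8_encode(0xD800, buf) and utf8_encode(0x110000, buf)
both return 0, while utf8_encode(0x10FFFF, buf) yields F4 8F BF BF.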

As in, "IETF is one thing, but programmers have to deal with that
on the code side".

No no.  UTF-8 as in RFC 3629, and with _exactly_ the "well-formed"
check as defined by

      if(LIKELY(x <= 0x7Fu))
              c = x;
      /* 0xF8, but Unicode guarantees maximum of 0x10FFFFu -> F4 8F BF BF.
       * Unicode 9.0, 3.9, UTF-8, Table 3-7. Well-Formed UTF-8 Byte Sequences */
      else if(LIKELY(x > 0xC0u && x <= 0xF4u)){
      ...
      }else
              goto jerr;

I am a bit outdated.
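
For reference, the complete "well-formed" check of the Unicode
table of well-formed UTF-8 byte sequences can be sketched as
follows.  This is a standalone illustration, not the code elided
above; the names utf8_wellformed_len() and utf8_is_wellformed() are
hypothetical.  The byte ranges are the ones the table lists: lead
bytes C2..F4 only, with a restricted second byte after E0, ED, F0,
and F4.

```c
#include <stddef.h>

/* Return the length (1..4) of the well-formed UTF-8 sequence that
 * starts at s[0], or 0 if it is ill-formed or truncated. */
static size_t utf8_wellformed_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed second-byte range */
    size_t need, i;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0)
            lo = 0xA0;          /* excludes overlong forms */
        else if (s[0] == 0xED)
            hi = 0x9F;          /* excludes surrogates D800..DFFF */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0)
            lo = 0x90;          /* excludes overlong forms */
        else if (s[0] == 0xF4)
            hi = 0x8F;          /* excludes values above 10FFFF */
    } else {
        return 0;               /* 80..C1 and F5..FF never lead */
    }

    if (n < need || s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < need; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return need;
}

/* Return 1 if the whole buffer is well-formed UTF-8, else 0. */
static int utf8_is_wellformed(const unsigned char *s, size_t n)
{
    while (n > 0) {
        size_t k = utf8_wellformed_len(s, n);
        if (k == 0)
            return 0;
        s += k;
        n -= k;
    }
    return 1;
}
```

Note that C0 and C1 are rejected as lead bytes here, since they
could only start overlong encodings of ASCII.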

Some _senders_ push garbage through because they have to.  I have
had requests to disable iconv in the MUA I maintain so that
garbage can be sent out as mail; mind you, there are
administrators who want the mail report to be sent out no matter
what garbage it contains!  And let me tell you, this was requested
by a Dr. aka Bachelor++, even a Master-of-the-Universe; these are
the real-life guys, heaven!  Furthermore, that over forty year old
protocol with its almost thirty year old MIME extensions actually
CAN deliver this s..t!!  So if the IETF is about to define that
network protocols should use UTF-8, then it should also define
that the one and only Unicode character which was specifically
designed for that purpose -- to indicate that something
non-representable had to be represented -- should be used.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)