UTF-8 [was Re: New Version Notification - draft-sgtatham-secsh-iutf8-05.txt]

Mouse <mouse@Rodents-Montreal.ORG> Fri, 16 December 2016 19:04 UTC

Date: Fri, 16 Dec 2016 07:38:38 -0500
From: Mouse <mouse@Rodents-Montreal.ORG>
Message-Id: <201612161238.HAA15001@Stone.Rodents-Montreal.ORG>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Erik-Conspiracy: There is no Conspiracy - and if there were I wouldn't be part of it anyway.
X-Message-Flag: Microsoft: the company who gave us the botnet zombies.
X-Composition-Start-Date: Fri, 16 Dec 2016 07:12:38 -0500 (EST)
To: ietf-ssh@NetBSD.org
Subject: UTF-8 [was Re: New Version Notification - draft-sgtatham-secsh-iutf8-05.txt]
In-Reply-To: <E9F740043A344B90B98AE856C98002C6@Khan>
References: <148166458750.29362.18378725891208955198.idtracker@ietfa.amsl.com><2DD56D786E600F45AC6BDE7DA4E8A8C117FF354E@eusaamb107.ericsson.se> <201612150224.VAA24159@Stone.Rodents-Montreal.ORG> <E9F740043A344B90B98AE856C98002C6@Khan>
Sender: ietf-ssh-owner@NetBSD.org
List-Id: ietf-ssh.NetBSD.org
Precedence: list

> What’s inherently broken in using UTF-8...?

Different characters occupy different amounts of space.

(Some) characters are larger than one addressing unit (on most machines).
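
To make the two points above concrete, here is a minimal Python sketch
that counts the octets UTF-8 needs for a few characters.  The sample
characters and the names `samples` and `widths` are chosen purely for
illustration; they are assumptions, not anything from the thread.

```python
# Four sample code points, written as escapes so the source stays ASCII.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of octets in its UTF-8 encoding.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in widths.items():
    print(f"U+{ord(ch):04X} takes {n} octet(s) in UTF-8")
```

Only the ASCII character fits in a single octet; the others take two,
three, and four octets respectively.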

There are octet sequences which are not valid UTF-8 character
sequences.  This results in text tools that break on small amounts of
non-UTF-8 text mixed into the text they're handling.  (This is not
really a problem with UTF-8 proper - there are also octets that are not
valid 8859-1 text, for example - but a problem with how it's
implemented; in my experience UTF-8 text tools break when faced with
non-UTF-8 octet sequences, whereas single-octet text tools usually
don't break when faced with invalid octets.)
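
A small Python sketch of that difference, assuming a hypothetical input
consisting of 8859-1 text handed to code that expects UTF-8.  The helper
`try_utf8` and the sample bytes are made up for illustration; the codec
names and error handlers are the standard Python ones.

```python
# "cafe au lait" with the e-acute stored as the single 8859-1 octet 0xE9.
data = b"caf\xe9 au lait"

def try_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

strict = try_utf8(data)                           # strict decoding rejects the input
lenient = data.decode("utf-8", errors="replace")  # bad octet becomes U+FFFD
latin = data.decode("latin-1")                    # every octet maps to a character

# One way a UTF-8 tool can pass unknown octets through unchanged.
preserved = data.decode("utf-8", errors="surrogateescape")
restored = preserved.encode("utf-8", errors="surrogateescape")
```

Whether a given tool breaks therefore depends on which of these
strategies it picks, which is consistent with this being an
implementation problem rather than one in UTF-8 proper.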

Some characters have multiple distinct encodings.  (Okay, that too is
not really UTF-8 proper - it's actually Unicode.)
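
As an illustration, a short Python sketch using the standard
`unicodedata` module.  The choice of e-acute as the example, and the
variable names, are assumptions made here for demonstration.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but differ as code points and as octets.
same_text = composed == decomposed
composed_octets = composed.encode("utf-8")
decomposed_octets = decomposed.encode("utf-8")

# Normalizing to a common form makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
```

The ambiguity sits at the Unicode level: each code point sequence still
has exactly one UTF-8 encoding.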

I've seen it said (by the git documentation) that transcoding from some
character sets like 8859-1 to UTF-8 is not a reversible operation.
This seems dubious to me, but, if true, it would be another, and fairly
strong, strike against UTF-8 in my opinion.
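
One way to check the 8859-1 case is sketched below in Python.  It relies
on Python's `latin-1` codec, which maps all 256 octet values to U+0000
through U+00FF; it says nothing about other character sets or about what
the git documentation had in mind.  The variable names are illustrative.

```python
# Every possible octet value, 0x00 through 0xFF.
all_octets = bytes(range(256))

# 8859-1 octets -> text -> UTF-8 octets.
as_text = all_octets.decode("latin-1")
as_utf8 = as_text.encode("utf-8")

# UTF-8 octets -> text -> 8859-1 octets again.
round_tripped = as_utf8.decode("utf-8").encode("latin-1")

print(round_tripped == all_octets)
```

Under these assumptions the round trip is lossless, although the UTF-8
form is longer because the upper 128 values take two octets each.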

That's just what comes to mind immediately.  I don't use UTF-8 myself if
I can help it (when I run into something using it my major concern is
how to make it stop doing so), so it's entirely possible there are
others I'm just not aware of.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B