Re: [I18nrp] Confusion among characters and strings

John C Klensin <john-ietf@jck.com> Fri, 12 October 2018 07:12 UTC

Date: Fri, 12 Oct 2018 03:11:58 -0400
From: John C Klensin <john-ietf@jck.com>
To: Larry Masinter <LMM@acm.org>
cc: i18nrp@ietf.org

Larry,

Asmus can almost certainly speak to this with greater authority
and better examples than I can, but it seems to me that there
are several different subcategories of confusion.  For the
purpose of looking at character-grapheme relationships (OCR-like
methods included), the first two on the list below have tended
to dominate discussions, but I think we do the Internet a real
disservice by focusing only on them:

(1) Within Latin script, e.g., the g00g1e examples.
(2) Between Latin script and other scripts (especially Greek
and/or Cyrillic), whether with mixed-script strings or complete
substitutions.
(3) Within non-Latin (and non-Greek and non-Cyrillic) scripts.
(4) Within and among non-Latin scripts, especially ones with
rich traditions of type style design and/or calligraphy as a
design art.

Your suggestion risks falling into a trap we seem to fall into
regularly: looking at the first category or two (if only because
it is easy to incorporate examples that almost everyone on IETF
lists will understand) and assuming that whatever observations
one makes also apply to scripts other than Greek, Latin, and
Cyrillic.

best,
  john


--On Thursday, October 11, 2018 23:04 -0700 Larry Masinter
<LMM@acm.org> wrote:

> This 
> 
> An experiment:
> Given a string, convert the string to an image, OCR the image,
> and see whether you get back the same string, code point by
> code point. Vary the font to cover the repertoire on common
> platforms (Android, iOS, Windows, Mac).
> 
> Note this section contains lots of puns
> 
> G00GLE.com OCRs to GOOGLE.com consistently.
> Larrч turns into Larry and Зcom into 3com.
> Toys-Я-Us.com turns into Toys-A-Us.com, even when language is
> Russian.
> 
> I was using OpenOffice to turn text into an image
> soffice --convert-to jpg test.txt
> and https://ocr.space/compare-ocr-software for OCR.
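
For anyone who wants to repeat the quoted experiment, the comparison
step (did the string come back the same, code point by code point?)
can be sketched as follows. This is only an illustration under
stated assumptions: the names codepoint_diff, describe, round_trip,
and fake_ocr are hypothetical, and fake_ocr is a stand-in that simply
replays the substitutions reported in the quoted message. It is not
a real rendering or OCR pipeline; a real run would pass in a function
that renders the text in a given font and calls an OCR engine.

```python
import unicodedata
from typing import Callable, List, Tuple


def codepoint_diff(original: str, recognized: str) -> List[Tuple[int, str, str]]:
    """Return (index, original_char, recognized_char) for each differing position.

    Python strings are indexed by code point, so this is a code-point
    comparison. A missing character is reported as the empty string.
    """
    diffs = []
    for i in range(max(len(original), len(recognized))):
        a = original[i] if i < len(original) else ""
        b = recognized[i] if i < len(recognized) else ""
        if a != b:
            diffs.append((i, a, b))
    return diffs


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' for a readable report."""
    if not ch:
        return "<missing>"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def round_trip(text: str, render_and_ocr: Callable[[str], str]) -> List[Tuple[int, str, str]]:
    """Run text through a caller-supplied render+OCR function and diff the result."""
    return codepoint_diff(text, render_and_ocr(text))


# ASSUMPTION: stand-in for a real render+OCR step. It only replays the
# substitutions reported in the quoted message (0 -> O, Cyrillic che -> y,
# Cyrillic ze -> 3, Cyrillic ya -> A).
_FAKE_TABLE = str.maketrans({"0": "O", "\u0447": "y", "\u0417": "3", "\u042F": "A"})


def fake_ocr(text: str) -> str:
    return text.translate(_FAKE_TABLE)


if __name__ == "__main__":
    for sample in ["G00GLE.com", "Larr\u0447", "\u0417com", "Toys-\u042F-Us.com"]:
        for index, before, after in round_trip(sample, fake_ocr):
            print(f"{sample!r} @ {index}: {describe(before)} -> {describe(after)}")
```

Note that a positional comparison like this says nothing about scripts
where OCR inserts, drops, or reorders characters, which is one reason
results from Latin, Greek, and Cyrillic samples may not carry over to
other scripts.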