Re: [I18nrp] Confusion among characters and strings

Larry Masinter <LMM@acm.org> Fri, 12 October 2018 06:04 UTC

Sender: Larry Masinter <masinter@gmail.com>
From: Larry Masinter <LMM@acm.org>
X-Google-Original-From: "Larry Masinter" <lmm@acm.org>
To: "'John C Klensin'" <john-ietf@jck.com>, <i18nrp@ietf.org>
References: <145D45F77511A9B1281FE35D@PSB>
In-Reply-To: <145D45F77511A9B1281FE35D@PSB>
Date: Thu, 11 Oct 2018 23:04:50 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/UIWgX0wj37SDzPC0UWVTKEbkmVE>
Subject: Re: [I18nrp] Confusion among characters and strings

This 

An experiment:
Given a string, convert the string to an image, OCR the image, and see whether
you get back the same string, code point by code point.
Vary the font to cover the repertoire on common platforms (Android, iOS, Windows, Mac).

Note this section contains lots of puns.

G00GLE.com OCRs to GOOGLE.com consistently.
Larrч turns into Larry and Зcom into 3com.
Toys-Я-Us.com turns into Toys-A-Us.com, even when the language is Russian.

I was using OpenOffice to turn text into an image:
soffice --convert-to jpg test.txt
and https://ocr.space/compare-ocr-software for OCR.
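
The comparison step of the experiment could be sketched roughly as below. This is only an illustration, not the procedure actually used: the function names are made up for the example, and the render-and-OCR step is passed in as a callable. The stand-in here does no OCR at all; it just replays the substitutions reported above, so a real run would swap in something that renders the text and calls an OCR engine.

```python
import unicodedata
from typing import Callable, List, Tuple


def codepoint_diff(original: str, recognized: str) -> List[Tuple[int, str, str]]:
    """Return (index, original_char, recognized_char) for each differing position.

    Python strings are sequences of code points, so indexing compares
    code point by code point. A missing position is reported as "".
    """
    diffs = []
    for i in range(max(len(original), len(recognized))):
        a = original[i] if i < len(original) else ""
        b = recognized[i] if i < len(recognized) else ""
        if a != b:
            diffs.append((i, a, b))
    return diffs


def describe(ch: str) -> str:
    """Format a single code point as 'U+XXXX NAME'."""
    if not ch:
        return "<missing>"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def round_trip_report(text: str, ocr_round_trip: Callable[[str], str]) -> List[str]:
    """Run text through the supplied round trip and list every code point that changed."""
    recognized = ocr_round_trip(text)
    return [
        f"{i}: {describe(a)} -> {describe(b)}"
        for i, a, b in codepoint_diff(text, recognized)
    ]


# Hypothetical stand-in for "render to image, then OCR". It only replays
# the substitutions reported in this message; it is not an OCR engine.
_REPORTED = str.maketrans({"0": "O", "\u0447": "y", "\u0417": "3", "\u042f": "A"})


def replay_reported_substitutions(text: str) -> str:
    return text.translate(_REPORTED)


if __name__ == "__main__":
    for sample in ["G00GLE.com", "Larr\u0447", "\u0417com", "Toys-\u042f-Us.com"]:
        print(sample, "->", replay_reported_substitutions(sample))
        for line in round_trip_report(sample, replay_reported_substitutions):
            print("   ", line)
```

Reporting the Unicode name of each changed code point makes it obvious when a Cyrillic letter has come back as a Latin letter or a digit, which is hard to see from the glyphs alone.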