Re: [I18nrp] [Ext] Re: Confusion among characters and strings
Sarmad Hussain <sarmad.hussain@icann.org> Fri, 12 October 2018 19:16 UTC
From: Sarmad Hussain <sarmad.hussain@icann.org>
To: Larry Masinter <LMM@acm.org>, 'John C Klensin' <john-ietf@jck.com>,
"i18nrp@ietf.org" <i18nrp@ietf.org>
Date: Fri, 12 Oct 2018 19:16:45 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18nrp/kd2w3o6mSY5zIAHT0-8N8cYdfJs>

Dear all,

>> G00GLE.com OCRs to GOOGLE.com consistently

Though I am not familiar with the algorithms behind these specific OCRs, OCRs generally perform a language modelling task after the image classification/recognition task to formulate the final output (this can be seen as a corrective measure). This could mean that "G00GLE" is actually recognized as either "G00GLE" or "GOOGLE", with the former possibly even receiving the higher likelihood value. However, the subsequent language model will score "GOOGLE" as much more likely than "G00GLE", so the cumulative probability score is higher for "GOOGLE", which becomes the eventual output.

Though effective for OCR, such language-based corrective analysis makes OCRs a less effective tool for gauging string confusion. If one turns off the language model in an OCR, the results would be "closer" to a string confusion task, though there are other problems with a simple image classification task. In some recent work on Arabic script, native readers were found to weigh certain strokes (those which form a character) more than others (those which join the characters) in character recognition tasks. This is a purely perceptual preference for this cursive script, something a pure image classification/recognition task does not consider (as it would give each stroke an equal weight).

Regards,
Sarmad

-----Original Message-----
From: i18nRP <i18nrp-bounces@ietf.org> On Behalf Of Larry Masinter
Sent: Friday, October 12, 2018 11:05 AM
To: 'John C Klensin' <john-ietf@jck.com>; i18nrp@ietf.org
Subject: [Ext] Re: [I18nrp] Confusion among characters and strings

An experiment: Given a string, convert the string to an image, OCR the image, and see if you get back the same string, code-point by code-point. Vary the font to cover the repertoire on common platforms (Android, iOS, Windows, Mac).

Note this section contains lots of puns.

G00GLE.com OCRs to GOOGLE.com consistently. Larrч turns into Larry and Зcom into 3com. Toys-Я-Us.com turns into Toys-A-Us.com, even when language is Russian.

I was using OpenOffice to turn text into an image

    soffice --convert-to jpg test.txt

and https://ocr.space/compare-ocr-software for OCR.
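
The two ideas in this exchange can be illustrated with a small sketch. It is not the code either author used. The first part performs the "code-point by code-point" comparison on the input/output pairs reported above, and the second shows how a language-model score can override an image classifier's preference. The function names `codepoint_diff`, `describe` and `rescore` are hypothetical, and the probability values are invented purely for illustration; no real OCR engine is called.

```python
import unicodedata

def codepoint_diff(sent, got):
    """Return (index, sent_char, got_char) for every position where the strings differ."""
    diffs = []
    for i in range(max(len(sent), len(got))):
        a = sent[i] if i < len(sent) else ""
        b = got[i] if i < len(got) else ""
        if a != b:
            diffs.append((i, a, b))
    return diffs

def describe(ch):
    """Format a character as 'U+XXXX NAME', or '(none)' past the end of a string."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" if ch else "(none)"

def rescore(image_scores, lm_scores):
    """Pick the candidate with the highest product of image score and language-model score."""
    return max(image_scores, key=lambda s: image_scores[s] * lm_scores[s])

# (input string, OCR output) pairs as reported in the quoted message.
pairs = [
    ("G00GLE.com", "GOOGLE.com"),
    ("Larrч", "Larry"),
    ("Зcom", "3com"),
    ("Toys-Я-Us.com", "Toys-A-Us.com"),
]
for sent, got in pairs:
    for i, a, b in codepoint_diff(sent, got):
        print(f"{sent!r}[{i}]: {describe(a)} -> {describe(b)}")

# Invented numbers: the image classifier alone slightly prefers "G00GLE",
# but the language model considers "GOOGLE" far more plausible.
image_scores = {"G00GLE": 0.6, "GOOGLE": 0.4}
lm_scores = {"G00GLE": 0.001, "GOOGLE": 0.2}

image_only = max(image_scores, key=image_scores.get)
combined = rescore(image_scores, lm_scores)
print("image classifier alone:", image_only)
print("with language model:   ", combined)
```

With these assumed scores the image classifier alone picks "G00GLE" while the combined score picks "GOOGLE", which is why an OCR with its language model enabled says little about how confusable the two strings are visually.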