Re: [precis] Precis Java Implementation

Christian Schudt <christian.schudt@gmx.de> Wed, 23 December 2015 09:16 UTC

Return-Path: <christian.schudt@gmx.de>
X-Original-To: precis@ietfa.amsl.com
Delivered-To: precis@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id F0CE71ACDFD for <precis@ietfa.amsl.com>; Wed, 23 Dec 2015 01:16:28 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 0.7
X-Spam-Level:
X-Spam-Status: No, score=0.7 tagged_above=-999 required=5 tests=[BAYES_50=0.8, FREEMAIL_FROM=0.001, J_CHICKENPOX_62=0.6, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id RfrAsKEwPK9y for <precis@ietfa.amsl.com>; Wed, 23 Dec 2015 01:16:26 -0800 (PST)
Received: from mout.gmx.net (mout.gmx.net [212.227.17.22]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 56C751ACDFA for <precis@ietf.org>; Wed, 23 Dec 2015 01:16:25 -0800 (PST)
Received: from christihudtsmbp.fritz.box ([95.117.202.179]) by mail.gmx.com (mrgmx102) with ESMTPSA (Nemesis) id 0MbaS9-1ZvNAJ2byv-00J4AG; Wed, 23 Dec 2015 10:16:22 +0100
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.1 \(3096.5\))
From: Christian Schudt <christian.schudt@gmx.de>
In-Reply-To: <CAHbk4R++se1zCX2tP3rdp_gQZ7eBMmc-7s+m9u3r-+U0ziCvMA@mail.gmail.com>
Date: Wed, 23 Dec 2015 10:16:20 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <8581502F-058F-4A21-A1A4-454BC0734DE8@gmx.de>
References: <3012FFC8-1FAD-4C30-8D85-175F4180BC02@gmx.de> <CAHbk4RLPP4LeLgM=aigNLXnbUJ0gQ+Lcsji66huFXpu5qu2fQQ@mail.gmail.com> <A295CC65-6A47-45DB-8539-A25E9A8361B7@gmx.de> <CAHbk4R++se1zCX2tP3rdp_gQZ7eBMmc-7s+m9u3r-+U0ziCvMA@mail.gmail.com>
To: Sam Whited <sam@samwhited.com>
X-Mailer: Apple Mail (2.3096.5)
X-Provags-ID: V03:K0:OBnlFokzSFttKFOCx9n39UEEPVsZJqMl3+QrwyuOFP8Jf+d4GMg R1lQJuwhiy6Zal9IEYpZFx3kGzrroIkBtk5v34SPCQnegZ2PIV1BXErUkznaL3Svpx0vxZf MK88dKLGvSrUGUUwZhsT7alKZiNQnmwgujfY0uvhDzxVoaTUfA+fmjadXiHWoxj5B+bSPTh jI4mGK+JFlmMBS6pdSk4Q==
X-UI-Out-Filterresults: notjunk:1;V01:K0:o7b6XVq+1Co=:O1HrhHriQ/gmvBO0hvGRJ2 49OR5EquwFtcVpH2JFNnUuFvFERmBsn1QiLnkZkA58XiLZje+J5tzmpFE0Rniq/kSNre/SKut Kt+3qBazE5dBvrBqT2KaASysLYHGxjlSYGNaVzh8J8KSsnQCm2SMrTntb4kMg85Ir4cXjreQp wrAbaNsZmWPdvx+Nba8Za5cUmeIgYZGMS/atq4VwmldkVQsDwvbUyHZvgC6KRFt27fdjZ71yA NQq3RclaSFbTyRW+OWgkXLOar/oPSrEtBo46QHV4vjSKgJ2eX2gBvGKVMcB6OLa6OTFm2aXTC mnOJ7yOpLKH02z/8l7LiDDzxiyqbSyioeFIGPQiWyLxrcOPK9L0puMr8gZAzd8mEQWH+sTD12 4reS+XIJ4zUtbKun4NqXBVrQHOmrG7aOHNXjq9hwMvX/WhRh1ZVeCXWc3BDCHTxIIWVHKCnR6 Kj7lwpHvXyrqxtcY++AvTgCf48qXwP/BLkFui2Ii9NwnwkzEB90deBS5UedKjBE9DQ2l3UY2R tFXJjW7nvU3/agc7TfvhMS86mq6/DERzrGVFuiOUzhnged9TFzf/Gb1XS/JLb8XN//1FKpsam S6SjJRghTWokCZygwhYt2qnjUHHCyn1NwpCmQIlUyBWnSlxVoIhwjyXKIMxQHrBvE4/JWq2fe 9TR0hqWGFsH65HNTONSf8ZhTckekiWm1wCNUWj/GIJJdDfOEp5H53uKilV4vbBQt61My/letW WKGTqd6BmrQZMWih/kLjVHoTvh9L3nVADyYAflHJDvbNIZ3CvApEZ3N+JYzwDMOLlPAER57Ow lOfPFOx
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/e1nmvO6qrudMkMFiNIZPyYcLUj4>
Cc: precis@ietf.org
Subject: Re: [precis] Precis Java Implementation
X-BeenThere: precis@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Preparation and Comparison of Internationalized Strings <precis.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/precis>, <mailto:precis-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/precis/>
List-Post: <mailto:precis@ietf.org>
List-Help: <mailto:precis-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/precis>, <mailto:precis-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 23 Dec 2015 09:16:29 -0000

If you are interested, in Java getting the character type (property) is even a bit slower, but it’s of course machine dependent:

java.lang.Character.getType(codepoint) is constantly 39 ns/op on 300000000 iterations on my machine, no matter if it’s an ascii or a fullwidth character. (Oracle JDK 8 SE).
Java uses huge Unicode arrays internally to lookup the code point, I doubt that there’s much room for optimization and even if, probably nobody cares when saving a few nano seconds.

— Christian


> Am 23.12.2015 um 03:04 schrieb Sam Whited <sam@samwhited.com>:
> 
> On Mon, Dec 21, 2015 at 4:08 PM, Christian Schudt
> <christian.schudt@gmx.de> wrote:
>> If you mean having a huge code point table, like in your tables.go file: I think Java already has such tables internally.
>> What could be improved here, is that Character.getType(cp) could only be invoked once. I haven’t done any benchmark for this, but I don’t expect a significant performance benefit.
> 
> Out of curiosity, I answered my own question here. I'm using Go, which
> also has lots of Unicode tables in the standard library, so I
> benchmarked running the algorithm (I modified it slightly from the
> version in my generator to remove the NFKC step, which is very slow,
> this way it more closely resembles your algorithm), and looking up a
> value in the large pre-generated trie. I have no idea where the
> bottlenecks / optimizations in Java would be, so these results may be
> meaningless to you, but, at least in Go, the single Trie lookup was
> much faster:
> 
> $ go test -bench . -benchmem
> PASS
> BenchmarkAsciiLookup-4          300000000                3.85 ns/op
>        0 B/op          0 allocs/op
> BenchmarkFullwidthLookup-4      200000000                9.21 ns/op
>        0 B/op          0 allocs/op
> BenchmarkAsciiCalculate-4       100000000               17.4 ns/op
>        0 B/op          0 allocs/op
> BenchmarkFullwidthCalculate-4   20000000                71.4 ns/op
>        0 B/op          0 allocs/op
> ok      _/home/sam/Projects/golang-x-text/unicode/precis        7.632s
> 
> Each test here is looking up or calculating the derived properties for
> a single character (the ASCII tests are looking up 'u' and the Unicode
> tests are looking up 'u' [full width] which was chosen very
> scientifically, I assure you), the second column is the number of
> tests that were run until the timings reached equilibrium.
> 
> For the worst case, there's a pretty good speed difference, whether
> that difference is worth pre-generating the data is another matter, of
> course ☺
> 
> Best,
> Sam
> 
> 
> -- 
> Sam Whited
> pub 4096R/54083AE104EA7AD3
> https://blog.samwhited.com