Re: [precis] Precis Java Implementation

Sam Whited <sam@samwhited.com> Wed, 23 December 2015 02:05 UTC


On Mon, Dec 21, 2015 at 4:08 PM, Christian Schudt
<christian.schudt@gmx.de> wrote:
> If you mean having a huge code point table, like in your tables.go file: I think Java already has such tables internally.
> What could be improved here, is that Character.getType(cp) could only be invoked once. I haven’t done any benchmark for this, but I don’t expect a significant performance benefit.

Out of curiosity, I answered my own question here. I'm using Go, which
also has lots of Unicode tables in the standard library, so I
benchmarked two things: running the algorithm, and looking up a value
in the large pre-generated trie. (I modified the algorithm slightly
from the version in my generator to remove the NFKC step, which is
very slow; this way it more closely resembles your algorithm.) I have
no idea where the bottlenecks / optimizations in Java would be, so
these results may be meaningless to you, but, at least in Go, the
single trie lookup was much faster:

$ go test -bench . -benchmem
PASS
BenchmarkAsciiLookup-4          300000000     3.85 ns/op    0 B/op    0 allocs/op
BenchmarkFullwidthLookup-4      200000000     9.21 ns/op    0 B/op    0 allocs/op
BenchmarkAsciiCalculate-4       100000000     17.4 ns/op    0 B/op    0 allocs/op
BenchmarkFullwidthCalculate-4   20000000      71.4 ns/op    0 B/op    0 allocs/op
ok      _/home/sam/Projects/golang-x-text/unicode/precis        7.632s

Each test here looks up or calculates the derived properties for a
single character (the ASCII tests use 'u' and the Unicode tests use
'u' [full width], which was chosen very scientifically, I assure you).
The second column is the number of iterations the benchmark ran; Go
keeps increasing it until the run is long enough to give a stable
timing.
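
For illustration, here is a minimal sketch of the kind of comparison
described above. It is not the actual benchmark code: the `class` type
and the `calculate` and `lookup` functions are hypothetical stand-ins,
the classification is a toy one rather than the real PRECIS derived
property algorithm, and a flat array covering the Basic Multilingual
Plane is used in place of the trie. U+FF55 is assumed to be the
full-width character.

```go
package main

import (
	"fmt"
	"testing"
	"unicode"
)

// class is a hypothetical stand-in for a PRECIS derived property.
type class uint8

const (
	other class = iota
	letterDigit
	space
)

// calculate derives the class on every call by consulting the
// standard library's Unicode range tables (a toy classification,
// not the real PRECIS algorithm).
func calculate(r rune) class {
	switch {
	case unicode.IsLetter(r), unicode.IsDigit(r):
		return letterDigit
	case unicode.IsSpace(r):
		return space
	}
	return other
}

// table holds the precomputed class of every code point in the
// Basic Multilingual Plane. A real implementation would more likely
// use a trie to keep the data small; a flat array is used here only
// to keep the sketch short.
var table = func() []class {
	t := make([]class, 0x10000)
	for i := range t {
		t[i] = calculate(rune(i))
	}
	return t
}()

// lookup returns the precomputed class, falling back to calculate
// for code points outside the table.
func lookup(r rune) class {
	if r >= 0 && int(r) < len(table) {
		return table[r]
	}
	return calculate(r)
}

// sink keeps the compiler from optimizing the benchmarked calls away.
var sink class

func main() {
	// 'u' is ASCII; U+FF55 is the full-width form (an assumption here).
	for _, r := range []rune{'u', '\uFF55'} {
		calc := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink = calculate(r)
			}
		})
		look := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink = lookup(r)
			}
		})
		fmt.Printf("%U calculate: %s\n", r, calc)
		fmt.Printf("%U lookup:    %s\n", r, look)
	}
}
```

The absolute numbers from a sketch like this will not match the ones
above, since both the classification and the table layout differ; it
only shows the shape of the two code paths being compared.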

In the worst case (the full-width character), there's a pretty good
speed difference: 9.21 ns for the lookup versus 71.4 ns for the
calculation, nearly 8x. Whether that difference is worth
pre-generating the data is another matter, of course ☺

Best,
Sam


-- 
Sam Whited
pub 4096R/54083AE104EA7AD3
https://blog.samwhited.com