Re: [Cbor] Regular expressions

Joe Hildebrand <hildjj@cursive.net> Sun, 28 February 2021 23:34 UTC


> On Feb 28, 2021, at 4:24 PM, Carsten Bormann <cabo@tzi.org> wrote:
> 
>> In ECMAScript land 'gimsuy' are all valid.
> 
> But g, for instance, is not an RE modifier.  It modifies how the operations are using it, but not the RE itself.  Similar for y.  u is really enabling syntax for decoding Unicode, so it should not be visible during interchange (of the decoded RE).

Hm.  I understand your point, but round-tripping the objects intact is more important to me than which part is technically part of the regex.
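
To make the round-tripping concern concrete, here is a minimal sketch.  The helper names and the [source, flags] pair are assumptions made up for illustration, not a proposed wire format:

```javascript
// Hypothetical helpers: the names and the [source, flags] shape are
// illustrative assumptions, not part of any CBOR tag definition.
function regexpToPair(re) {
  // `source` is the pattern text; `flags` is the flag string (e.g. "giy").
  return [re.source, re.flags];
}

function pairToRegexp([source, flags]) {
  return new RegExp(source, flags);
}

const original = /a+b/giy;
const copy = pairToRegexp(regexpToPair(original));

console.log(copy.source === original.source); // true
console.log(copy.flags === original.flags);   // true

// Dropping g and y on the wire changes how the object behaves after decoding:
const lossy = new RegExp(original.source, "i");
console.log(lossy.global, lossy.sticky);      // false false
```

Note that lastIndex is per-instance state and is not carried by this sketch.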

>> Nod.  We'd probably need a small registry then, with the names or a code.  I would expect the semantics are "use this if it's a type you know about, otherwise, keep the tagged version and punt to the application layer".
> 
> Right.  But why not use tags for those?  We already have that registry.
> 
> (We could write a common document that we expect new RE tag registrations to reference, so there is some structure to this.)

That works for me.  So, there would be an "ECMAScript RegExp" tag then, right?
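
A rough sketch of the "use it if you know it, otherwise keep the tagged version" behavior with one tag per dialect.  The tag number, the Tagged class, and the [source, flags] content are all placeholders invented for this example; nothing here is registered or taken from an existing library:

```javascript
// Placeholder tag number: NOT a registered CBOR tag, purely illustrative.
const HYPOTHETICAL_ECMASCRIPT_REGEXP_TAG = 999999;

// Hypothetical stand-in for whatever a decoder hands back for a tagged item.
class Tagged {
  constructor(tag, value) {
    this.tag = tag;
    this.value = value;
  }
}

// Converters for the tags this decoder understands.
const converters = new Map([
  [HYPOTHETICAL_ECMASCRIPT_REGEXP_TAG,
    ([source, flags]) => new RegExp(source, flags)],
]);

// Convert known tags; leave unknown ones intact for the application layer.
function convertTagged(tagged) {
  const convert = converters.get(tagged.tag);
  return convert ? convert(tagged.value) : tagged;
}

const known = convertTagged(
  new Tagged(HYPOTHETICAL_ECMASCRIPT_REGEXP_TAG, ["a+b", "gi"]));
const unknown = convertTagged(new Tagged(123456, ["a+b", "x"]));

console.log(known instanceof RegExp);   // true
console.log(unknown instanceof Tagged); // true
```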

>> No argument, but I use regexes every day, and ABNF or a full PEG grammar just when I need to get out the big hammer.
> 
> REs are certainly more amenable to interchange as such.
> 
> I need to fix up my ABNF to RE compiler…
> (Which is almost trivial – as long as the ABNF is not recursive – but the current version generates way too much noise that a manual RE writer would know how to avoid.)
> 
> I’m also on the lookout for a toolkit for translating between the various RE dialects.

I bet the regex101 folks have a starting point for that.  Does anyone have a contact there?

BTW, one of the reasons all of this has come up for me is I've been tinkering with Basura (https://github.com/hildjj/basura/), a library/CLI for generating random trash JS that is syntactically valid, but otherwise useless.  I just released it this morning, and I've learned a few interesting things about my CBOR implementation by tying the two together.

-- 
Joe Hildebrand