Re: [secdir] [Last-Call] [Jsonpath] Secdir last call review of draft-ietf-jsonpath-iregexp-06

Rob Sayre <sayrer@gmail.com> Wed, 17 May 2023 20:48 UTC

Return-Path: <sayrer@gmail.com>
X-Original-To: secdir@ietfa.amsl.com
Delivered-To: secdir@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2510FC14F748; Wed, 17 May 2023 13:48:31 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.094
X-Spam-Level:
X-Spam-Status: No, score=-2.094 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id uUX6JEBMP9eX; Wed, 17 May 2023 13:48:26 -0700 (PDT)
Received: from mail-ej1-x634.google.com (mail-ej1-x634.google.com [IPv6:2a00:1450:4864:20::634]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id C4CCCC14CE2C; Wed, 17 May 2023 13:48:26 -0700 (PDT)
Received: by mail-ej1-x634.google.com with SMTP id a640c23a62f3a-965b8a969b3so93125166b.1; Wed, 17 May 2023 13:48:26 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1684356505; x=1686948505; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=KJIMWtEwy6/fy9QfLJtqBlfWx8Xk1y+l0neNOX8Bsas=; b=gY/pNvjA3t5HHEAX3a7maNgQfIp0EDQDZIcOKmHO5V0AVecCuKEPWh5OPbhHImE/1g zxAB65ZC8mTQSKiZf6eAlyqfREcb5V17sl/l3rQ0yToqWZQmTSqMFoi+UIorkNKl7wO+ Odw/GltoE740dTPkdh+YDG6U1E1CiO3xVOYwfv42lfBmP4tz3vFPQnlxtzCVgZ9yePYs abDGvDW9AW0Wuvaeks/5/BMtcCAq7wpk8GLVlGyzn6AclES6/kjWCRFDg+uKiqvuE+yo SAShIG9rJPWjtJZoZ4sHA9ta51XXXC+Pbs+6H83vPwshGJedFcJjPT/6tc6m35O0wooR WYGg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684356505; x=1686948505; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=KJIMWtEwy6/fy9QfLJtqBlfWx8Xk1y+l0neNOX8Bsas=; b=bXwFQvg62oZgyXrsQFg0zMqG1tFrNlgvl7N8KF0bwyXiDKrDJAHs4XpbU208W3yuDq NqR6kti5PqAJ7ydRhVcoI0FQqdLAJCuHL1/fpqyhq0OSQCLN4GkNrogYshmg3VzpaEw3 LqKSkGOtlC8rbjYH1tQHtseRlETIGgUSyOgaHKOpM4VK+64PmkthZGmbqXsudZigeWJp Aqp8YX+zdGmflNX1MCpf834f8jvDqfolOvCj7ssDLL1SlFXBYL07RXtVz4ghH8BUqOfz EgS+ahVfTjksqbJB7SPtVJB9H43dKOXe1f/GBYdCYP6pDB+OPbtShwV0kh4Lvy4lhryh bILg==
X-Gm-Message-State: AC+VfDx3E/D7/sAu2TIMEInx5zGNsA30GCrE47Qti4sexqvj83FDeSE0 1eEhpF5eWV4ijqqbTypywJO2uQsz9glyPQhEpEk=
X-Google-Smtp-Source: ACHHUZ5HHakn7wwpUMuKEQ2l713SGOkBJLtLQTnxyqJoDBUzEta08xs9WQEqhjv37fSvV07DfLVB/Rq9PZZoWzFq/Ls=
X-Received: by 2002:a17:907:1c9b:b0:966:5fac:2e52 with SMTP id nb27-20020a1709071c9b00b009665fac2e52mr3395265ejc.9.1684356504471; Wed, 17 May 2023 13:48:24 -0700 (PDT)
MIME-Version: 1.0
References: <168416383998.50512.953102690552943438@ietfa.amsl.com> <CAHBU6iuKKp3g_HbhgaZT8CcStQBKoaHOcdf9ogku=bftYt5wgA@mail.gmail.com> <f69ab710-34eb-1697-be29-cc0298c88da8@it.aoyama.ac.jp>
In-Reply-To: <f69ab710-34eb-1697-be29-cc0298c88da8@it.aoyama.ac.jp>
From: Rob Sayre <sayrer@gmail.com>
Date: Wed, 17 May 2023 13:48:13 -0700
Message-ID: <CAChr6SwqnKoG+MkBkCupKme1Y8aMxosBwHYkVWYy=yfAiwMCZw@mail.gmail.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Tim Bray <tbray@textuality.com>, Mike Ounsworth <mike.ounsworth@entrust.com>, secdir@ietf.org, draft-ietf-jsonpath-iregexp.all@ietf.org, jsonpath@ietf.org, last-call@ietf.org
Content-Type: multipart/alternative; boundary="0000000000001fcca705fbe9cf26"
Archived-At: <https://mailarchive.ietf.org/arch/msg/secdir/ApXmxVOzCPfAElPLkbF5YC3dzhA>
Subject: Re: [secdir] [Last-Call] [Jsonpath] Secdir last call review of draft-ietf-jsonpath-iregexp-06
X-BeenThere: secdir@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Security Area Directorate <secdir.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/secdir>, <mailto:secdir-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/secdir/>
List-Post: <mailto:secdir@ietf.org>
List-Help: <mailto:secdir-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/secdir>, <mailto:secdir-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 17 May 2023 20:48:31 -0000

Hi, pardon the top quote,

There's one more thing I think you should add to the security
considerations: handling of invalid Unicode in the input. Here is an
example:
https://github.com/nst/JSONTestSuite/blob/master/test_parsing/i_string_1st_valid_surrogate_2nd_invalid.json

That string is allowed in ECMAScript and Python, but not Ruby or Rust. I
have actually hit this problem in real life, and we had to write some
little extensions that used Björn Höhrmann's UTF-8 decoder*. Basically, you
have to validate the input to ensure interoperability. I know RFC8259
covers this (although it contains some warnings ECMA-404 does not), and so
does JSONPath, but neither are referenced here.

I could see covering this issue in the Security Considerations, or
somewhere else, as the editors please. There's a big table covering this
ugliness at the bottom of this page:
https://github.com/nst/JSONTestSuite/

thanks,
Rob

* https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

On Tue, May 16, 2023 at 1:20 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
wrote:

> Hello Tim, Mike, others,
>
> Many thanks to Mike for bringing up this issue. I have the following
> comments:
>
> 1) The syntax (and what's more important, but completely missing in the
> text, the semantics) is equivalent to theoretical regular expressions,
> i.e. regular expressions that define regular languages (Chomsky Type 3).
> They can therefore be implemented to run in time linear to the length of
> the string input, and if so implemented, they don't offer any
> possibilities for ReDoS (but see (4) below).
>
> 2) I-Regexps will be executed on all kinds of different regexp
> implementations, as explicitly envisioned by the draft. Some of these
> implementations use backtracking because they need it for more complex
> features, and may (or let's say will) exhibit quadratic, cubic, or
> exponential behavior on some I-Regexp inputs. The examples of nested
> quantifiers given by Mike are a typical case.
>
> 3) Even for the same programming language/regular expression engine,
> different versions may perform differently. As an example, Ruby just
> recently implemented some changes to its regexp engine, implementing a
> cache that makes a lot of cases linear in time and uses space
> proportional to the product of the length of the regexp and the input.
> (see https://bugs.ruby-lang.org/issues/19104). In addition, we
> introduced a predicate to check whether a regexp can be run in linear
> time (https://bugs.ruby-lang.org/issues/19194) and a feature to set a
> timeout (https://bugs.ruby-lang.org/issues/17837). So a regexp that
> might blow up on Ruby versions of 3.1 and lower may behave nicely on
> Ruby 3.2 and higher.
>
> 4) Repetitions with lower and upper bounds are a bit of a special case.
> In theory, α{m,n} (where α is any suitable subregexp) can always be
> expanded to m times α followed by (n-m) times α?. As an example, a{3,7}
> would expand to aaaa?a?a?a?, which can be implemented efficiently (e.g.
> using conversion from a nondeterministic finite automaton (NFA) to a
> deterministic finite automaton (DFA)). However, it's easy to spoof this
> with e.g. a regular expression of the form a{10,1000000}. And if
> 1,000,000 isn't enough, it's easy for the attacker to add a few zeros,
> they may get a 10-fold memory increase for each zero they add.
> It is possible to avoid this memory increase by using some kind of lazy
> construction, which would only be triggered on lengthy inputs, and be
> linear in the length of the input. Nested quantifier ranges will need
> special care.
> [It could be that "Programming Techniques: Regular expression search
> algorithm" (Ken Thomson, https://dl.acm.org/doi/10.1145/363347.363387)
> includes such a technique; unfortunately, I have not yet been able to
> understand that paper because I don't understand the assembly code it
> uses :-(.]
> Ruby produces a "too big number for repeat range" error for ranges
> greater than 100000. I have no idea how the number 100000 was found.
> Ruby 3.2 also returns false for
> Regexp.linear_time?(/(a?){20,200}{20,200}/) (and runs a veeery long time
> on some specific input), which I think is a limitation in the
> implementation, not in principle.) Many engines simply treat stacked
> range quantifiers as errors (see https://regex101.com).
>
> On 2023-05-16 00:52, Tim Bray wrote:
> > I was reading your note and agreeing that yes, it remains possible to
> > devise regexps that are going to cause combinatorial nasties in almost
> any
> > conceivable implementation, but unconvinced about the conclusion that it
> is
> > "still not advisable to run arbitrary user-provided regular expressions
> on
> > your hardware", because it seems to me that the only way to find out if
> the
> > regexp is evil is to run it.
>
> Not exactly. For I-Regexps, it's only the implementation that is
> relevant; all I-Regexps *can* be implemented safely.
>
> Even for much wider classes of regexps, most of them can be very quickly
> classified into safe or potentially dangerous, with rather simple
> heuristics.
>
> So in conclusion, please make sure that the security considerations
> mention all of the following:
>
> - I-Regexps may be implemented to run in linear time in the input,
>    but this requires really great care.
> - Existing regexp engines may be used to run I-Regexps, but may run in
>    quadratic, cubic, or exponential time of the input length for some
>    I-Regexps.
> - Existing regexp engines should be able to easily handle most
>    I-Regexps (after the adjustments discussed in Section 5), but
>    may reject some types of syntactically legal I-Regexps because they
>    cannot guarantee execution in reasonable time (a typical example
>    might be a{2,4}{2,4})
> - Range quantifiers (example: {2,4}) provide specific challenges
>    for implementation and may therefore be limited in composability
>    (see previous example) or range (e.g. disallowing very large ranges
>    such as {20,200000}). Such large ranges may also allow specific
>    attacks on some implementations.
> - Different versions of the same regexp library may be more or less
>    vulnerable to some attacks. In particular, newer libraries may
>    behave better or may offer additional features to check regular
>    expressions or guard against high resource consumption.
>
> Regards,   Martin.
>
> P.S.: The syntax currently has
>  >>>>
> quantifier = ( %x2A-2B ; '*'-'+'
>   / "?" ) / ( "{" quantity "}" )
>  >>>>
> The doubly nested alternatives and the extremely short range provide
> more confusion than clarity to the reader. Also, there's absolutely no
> need to use hex for * and + (see the char-val production in RFC 5234).
> So please rewrite this as:
>  >>>>
> quantifier = "*" / "+" / "?" / ( "{" quantity "}" )
>  >>>>
>
> Because we want to refer to it in the security discussion, it may also
> be a good idea to introduce a rule name for "{" quantity "}":
>  >>>>
> quantifier = "*" / "+" / "?" / range-quantifier
> range-quantifier = "{" quantity "}"
>  >>>>
>
> P.P.S.: For additional reading, I suggest starting at
>
> https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865.
>
> All of James Davis' work on ReDoS is well worth reading.
>
> > But I think your closing paragraph provides a solution.
> >
> > On Mon, May 15, 2023 at 8:17 AM Mike Ounsworth via Datatracker <
> > noreply@ietf.org> wrote:
> > …
> >
> >>   I wonder if this
> >> document could recommend that implementations include some sort of
> >> configurable
> >> limit on nesting level or on recursion / backtracking depth.
> >
> >
> > That sounds like a good direction, but pretty complex. A simpler option
> > would be that implementations impose a limit on time and/or memory costs
> > and error out when those are breached. Do you think that a recommendation
> > along those lines would address your concerns?
> >
> >
>
> --
> last-call mailing list
> last-call@ietf.org
> https://www.ietf.org/mailman/listinfo/last-call
>