Re: [secdir] [Last-Call] [Jsonpath] Secdir last call review of draft-ietf-jsonpath-iregexp-06

Hi, pardon the top quote,

There's one more thing I think you should add to the security
considerations: handling of invalid Unicode in the input. Here is an
example:
https://github.com/nst/JSONTestSuite/blob/master/test_parsing/i_string_1st_valid_surrogate_2nd_invalid.json

That string is allowed in ECMAScript and Python, but not Ruby or Rust. I
have actually hit this problem in real life, and we had to write some
little extensions that used Björn Höhrmann's UTF-8 decoder*. Basically, you
have to validate the input to ensure interoperability. I know RFC8259
covers this (although it contains some warnings ECMA-404 does not), and so
does JSONPath, but neither are referenced here.

I could see covering this issue in the Security Considerations, or
somewhere else, as the editors please. There's a big table covering this
ugliness at the bottom of this page:
https://github.com/nst/JSONTestSuite/

thanks,
Rob

* https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

On Tue, May 16, 2023 at 1:20 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
wrote:

> Hello Tim, Mike, others,
>
> Many thanks to Mike for bringing up this issue. I have the following
> comments:
>
> 1) The syntax (and what's more important, but completely missing in the
> text, the semantics) is equivalent to theoretical regular expressions,
> i.e. regular expressions that define regular languages (Chomsky Type 3).
> They can therefore be implemented to run in time linear to the length of
> the string input, and if so implemented, they don't offer any
> possibilities for ReDoS (but see (4) below).
>
> 2) I-Regexps will be executed on all kinds of different regexp
> implementations, as explicitly envisioned by the draft. Some of these
> implementations use backtracking because they need it for more complex
> features, and may (or let's say will) exhibit quadratic, cubic, or
> exponential behavior on some I-Regexp inputs. The examples of nested
> quantifiers given by Mike are a typical case.
>
> 3) Even for the same programming language/regular expression engine,
> different versions may perform differently. As an example, Ruby just
> recently implemented some changes to its regexp engine, implementing a
> cache that makes a lot of cases linear in time and uses space
> proportional to the product of the length of the regexp and the input.
> (see https://bugs.ruby-lang.org/issues/19104). In addition, we
> introduced a predicate to check whether a regexp can be run in linear
> time (https://bugs.ruby-lang.org/issues/19194) and a feature to set a
> timeout (https://bugs.ruby-lang.org/issues/17837). So a regexp that
> might blow up on Ruby versions of 3.1 and lower may behave nicely on
> Ruby 3.2 and higher.
>
> 4) Repetitions with lower and upper bounds are a bit of a special case.
> In theory, α{m,n} (where α is any suitable subregexp) can always be
> expanded to m times α followed by (n-m) times α?. As an example, a{3,7}
> would expand to aaaa?a?a?a?, which can be implemented efficiently (e.g.
> using conversion from a nondeterministic finite automaton (NFA) to a
> deterministic finite automaton (DFA)). However, it's easy to spoof this
> with e.g. a regular expression of the form a{10,1000000}. And if
> 1,000,000 isn't enough, it's easy for the attacker to add a few zeros,
> they may get a 10-fold memory increase for each zero they add.
> It is possible to avoid this memory increase by using some kind of lazy
> construction, which would only be triggered on lengthy inputs, and be
> linear in the length of the input. Nested quantifier ranges will need
> special care.
> [It could be that "Programming Techniques: Regular expression search
> algorithm" (Ken Thomson, https://dl.acm.org/doi/10.1145/363347.363387)
> includes such a technique; unfortunately, I have not yet been able to
> understand that paper because I don't understand the assembly code it
> uses :-(.]
> Ruby produces a "too big number for repeat range" error for ranges
> greater than 100000. I have no idea how the number 100000 was found.
> Ruby 3.2 also returns false for
> Regexp.linear_time?(/(a?){20,200}{20,200}/) (and runs a veeery long time
> on some specific input), which I think is a limitation in the
> implementation, not in principle.) Many engines simply treat stacked
> range quantifiers as errors (see https://regex101.com).
>
> On 2023-05-16 00:52, Tim Bray wrote:
> > I was reading your note and agreeing that yes, it remains possible to
> > devise regexps that are going to cause combinatorial nasties in almost
> any
> > conceivable implementation, but unconvinced about the conclusion that it
> is
> > "still not advisable to run arbitrary user-provided regular expressions
> on
> > your hardware", because it seems to me that the only way to find out if
> the
> > regexp is evil is to run it.
>
> Not exactly. For I-Regexps, it's only the implementation that is
> relevant; all I-Regexps *can* be implemented safely.
>
> Even for much wider classes of regexps, most of them can be very quickly
> classified into safe or potentially dangerous, with rather simple
> heuristics.
>
> So in conclusion, please make sure that the security considerations
> mention all of the following:
>
> - I-Regexps may be implemented to run in linear time in the input,
>    but this requires really great care.
> - Existing regexp engines may be used to run I-Regexps, but may run in
>    quadratic, cubic, or exponential time of the input length for some
>    I-Regexps.
> - Existing regexp engines should be able to easily handle most
>    I-Regexps (after the adjustments discussed in Section 5), but
>    may reject some types of syntactically legal I-Regexps because they
>    cannot guarantee execution in reasonable time (a typical example
>    might be a{2,4}{2,4})
> - Range quantifiers (example: {2,4}) provide specific challenges
>    for implementation and may therefore be limited in composability
>    (see previous example) or range (e.g. disallowing very large ranges
>    such as {20,200000}). Such large ranges may also allow specific
>    attacks on some implementations.
> - Different versions of the same regexp library may be more or less
>    vulnerable to some attacks. In particular, newer libraries may
>    behave better or may offer additional features to check regular
>    expressions or guard against high resource consumption.
>
> Regards,   Martin.
>
> P.S.: The syntax currently has
>  >>>>
> quantifier = ( %x2A-2B ; '*'-'+'
>   / "?" ) / ( "{" quantity "}" )
>  >>>>
> The doubly nested alternatives and the extremely short range provide
> more confusion than clarity to the reader. Also, there's absolutely no
> need to use hex for * and + (see the char-val production in RFC 5234).
> So please rewrite this as:
>  >>>>
> quantifier = "*" / "+" / "?" / ( "{" quantity "}" )
>  >>>>
>
> Because we want to refer to it in the security discussion, it may also
> be a good idea to introduce a rule name for "{" quantity "}":
>  >>>>
> quantifier = "*" / "+" / "?" / range-quantifier
> range-quantifier = "{" quantity "}"
>  >>>>
>
> P.P.S.: For additional reading, I suggest starting at
>
> https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865.
>
> All of James Davis' work on ReDoS is well worth reading.
>
> > But I think your closing paragraph provides a solution.
> >
> > On Mon, May 15, 2023 at 8:17 AM Mike Ounsworth via Datatracker <
> > noreply@ietf.org> wrote:
> > …
> >
> >>   I wonder if this
> >> document could recommend that implementations include some sort of
> >> configurable
> >> limit on nesting level or on recursion / backtracking depth.
> >
> >
> > That sounds like a good direction, but pretty complex. A simpler option
> > would be that implementations impose a limit on time and/or memory costs
> > and error out when those are breached. Do you think that a recommendation
> > along those lines would address your concerns?
> >
> >
>
> --
> last-call mailing list
> last-call@ietf.org
> https://www.ietf.org/mailman/listinfo/last-call
>