Re: [secdir] [Last-Call] [Jsonpath] Secdir last call review of draft-ietf-jsonpath-iregexp-06

On Wed, May 17, 2023 at 1:48 PM Rob Sayre <sayrer@gmail.com> wrote:

> There's one more thing I think you should add to the security
> considerations: handling of invalid Unicode in the input. Here is an
> example:
>
> https://github.com/nst/JSONTestSuite/blob/master/test_parsing/i_string_1st_valid_surrogate_2nd_invalid.json
>
> That string is allowed in ECMAScript and Python, but not Ruby or Rust.
>

I went back and read:
https://github.com/ietf-wg-jsonpath/iregexp/issues/22#issuecomment-1504004958

I think the issue is that JSON, .js, or .py source files are almost always
valid UTF-8, so if you just interpret them as a text file, it's fine.

But I-Regexp operates on the text produced after the escape sequence
processing phase, which happens after UTF decoding. Some languages enforce
valid Unicode at this point, but some don't. So it's a bit separate from
the encoding of the source file. I think that's what I-JSON was going
after, too.

thanks,
Rob

> * https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
>
> On Tue, May 16, 2023 at 1:20 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
> wrote:
>
>> Hello Tim, Mike, others,
>>
>> Many thanks to Mike for bringing up this issue. I have the following
>> comments:
>>
>> 1) The syntax (and what's more important, but completely missing in the
>> text, the semantics) is equivalent to theoretical regular expressions,
>> i.e. regular expressions that define regular languages (Chomsky Type 3).
>> They can therefore be implemented to run in time linear to the length of
>> the string input, and if so implemented, they don't offer any
>> possibilities for ReDoS (but see (4) below).
>>
>> 2) I-Regexps will be executed on all kinds of different regexp
>> implementations, as explicitly envisioned by the draft. Some of these
>> implementations use backtracking because they need it for more complex
>> features, and may (or let's say will) exhibit quadratic, cubic, or
>> exponential behavior on some I-Regexp inputs. The examples of nested
>> quantifiers given by Mike are a typical case.
>>
>> 3) Even for the same programming language/regular expression engine,
>> different versions may perform differently. As an example, Ruby just
>> recently implemented some changes to its regexp engine, implementing a
>> cache that makes a lot of cases linear in time and uses space
>> proportional to the product of the length of the regexp and the input.
>> (see https://bugs.ruby-lang.org/issues/19104). In addition, we
>> introduced a predicate to check whether a regexp can be run in linear
>> time (https://bugs.ruby-lang.org/issues/19194) and a feature to set a
>> timeout (https://bugs.ruby-lang.org/issues/17837). So a regexp that
>> might blow up on Ruby versions of 3.1 and lower may behave nicely on
>> Ruby 3.2 and higher.
>>
>> 4) Repetitions with lower and upper bounds are a bit of a special case.
>> In theory, α{m,n} (where α is any suitable subregexp) can always be
>> expanded to m times α followed by (n-m) times α?. As an example, a{3,7}
>> would expand to aaaa?a?a?a?, which can be implemented efficiently (e.g.
>> using conversion from a nondeterministic finite automaton (NFA) to a
>> deterministic finite automaton (DFA)). However, it's easy to spoof this
>> with e.g. a regular expression of the form a{10,1000000}. And if
>> 1,000,000 isn't enough, it's easy for the attacker to add a few zeros,
>> they may get a 10-fold memory increase for each zero they add.
>> It is possible to avoid this memory increase by using some kind of lazy
>> construction, which would only be triggered on lengthy inputs, and be
>> linear in the length of the input. Nested quantifier ranges will need
>> special care.
>> [It could be that "Programming Techniques: Regular expression search
>> algorithm" (Ken Thomson, https://dl.acm.org/doi/10.1145/363347.363387)
>> includes such a technique; unfortunately, I have not yet been able to
>> understand that paper because I don't understand the assembly code it
>> uses :-(.]
>> Ruby produces a "too big number for repeat range" error for ranges
>> greater than 100000. I have no idea how the number 100000 was found.
>> Ruby 3.2 also returns false for
>> Regexp.linear_time?(/(a?){20,200}{20,200}/) (and runs a veeery long time
>> on some specific input), which I think is a limitation in the
>> implementation, not in principle.) Many engines simply treat stacked
>> range quantifiers as errors (see https://regex101.com).
>>
>> On 2023-05-16 00:52, Tim Bray wrote:
>> > I was reading your note and agreeing that yes, it remains possible to
>> > devise regexps that are going to cause combinatorial nasties in almost
>> any
>> > conceivable implementation, but unconvinced about the conclusion that
>> it is
>> > "still not advisable to run arbitrary user-provided regular expressions
>> on
>> > your hardware", because it seems to me that the only way to find out if
>> the
>> > regexp is evil is to run it.
>>
>> Not exactly. For I-Regexps, it's only the implementation that is
>> relevant; all I-Regexps *can* be implemented safely.
>>
>> Even for much wider classes of regexps, most of them can be very quickly
>> classified into safe or potentially dangerous, with rather simple
>> heuristics.
>>
>> So in conclusion, please make sure that the security considerations
>> mention all of the following:
>>
>> - I-Regexps may be implemented to run in linear time in the input,
>>    but this requires really great care.
>> - Existing regexp engines may be used to run I-Regexps, but may run in
>>    quadratic, cubic, or exponential time of the input length for some
>>    I-Regexps.
>> - Existing regexp engines should be able to easily handle most
>>    I-Regexps (after the adjustments discussed in Section 5), but
>>    may reject some types of syntactically legal I-Regexps because they
>>    cannot guarantee execution in reasonable time (a typical example
>>    might be a{2,4}{2,4})
>> - Range quantifiers (example: {2,4}) provide specific challenges
>>    for implementation and may therefore be limited in composability
>>    (see previous example) or range (e.g. disallowing very large ranges
>>    such as {20,200000}). Such large ranges may also allow specific
>>    attacks on some implementations.
>> - Different versions of the same regexp library may be more or less
>>    vulnerable to some attacks. In particular, newer libraries may
>>    behave better or may offer additional features to check regular
>>    expressions or guard against high resource consumption.
>>
>> Regards,   Martin.
>>
>> P.S.: The syntax currently has
>>  >>>>
>> quantifier = ( %x2A-2B ; '*'-'+'
>>   / "?" ) / ( "{" quantity "}" )
>>  >>>>
>> The doubly nested alternatives and the extremely short range provide
>> more confusion than clarity to the reader. Also, there's absolutely no
>> need to use hex for * and + (see the char-val production in RFC 5234).
>> So please rewrite this as:
>>  >>>>
>> quantifier = "*" / "+" / "?" / ( "{" quantity "}" )
>>  >>>>
>>
>> Because we want to refer to it in the security discussion, it may also
>> be a good idea to introduce a rule name for "{" quantity "}":
>>  >>>>
>> quantifier = "*" / "+" / "?" / range-quantifier
>> range-quantifier = "{" quantity "}"
>>  >>>>
>>
>> P.P.S.: For additional reading, I suggest starting at
>>
>> https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865.
>>
>> All of James Davis' work on ReDoS is well worth reading.
>>
>> > But I think your closing paragraph provides a solution.
>> >
>> > On Mon, May 15, 2023 at 8:17 AM Mike Ounsworth via Datatracker <
>> > noreply@ietf.org> wrote:
>> > …
>> >
>> >>   I wonder if this
>> >> document could recommend that implementations include some sort of
>> >> configurable
>> >> limit on nesting level or on recursion / backtracking depth.
>> >
>> >
>> > That sounds like a good direction, but pretty complex. A simpler option
>> > would be that implementations impose a limit on time and/or memory costs
>> > and error out when those are breached. Do you think that a
>> recommendation
>> > along those lines would address your concerns?
>> >
>> >
>>
>> --
>> last-call mailing list
>> last-call@ietf.org
>> https://www.ietf.org/mailman/listinfo/last-call
>>
>