Re: [secdir] [Last-Call] [Jsonpath] Secdir last call review of draft-ietf-jsonpath-iregexp-06

Rob Sayre <sayrer@gmail.com> Thu, 18 May 2023 20:31 UTC

Return-Path: <sayrer@gmail.com>
X-Original-To: secdir@ietfa.amsl.com
Delivered-To: secdir@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 5C6BAC14CE4F; Thu, 18 May 2023 13:31:30 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.094
X-Spam-Level:
X-Spam-Status: No, score=-2.094 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Yo1LW9huAfYP; Thu, 18 May 2023 13:31:26 -0700 (PDT)
Received: from mail-ed1-x52e.google.com (mail-ed1-x52e.google.com [IPv6:2a00:1450:4864:20::52e]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 0BEC1C14F74E; Thu, 18 May 2023 13:31:26 -0700 (PDT)
Received: by mail-ed1-x52e.google.com with SMTP id 4fb4d7f45d1cf-510ea8d0bb5so1396548a12.0; Thu, 18 May 2023 13:31:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1684441884; x=1687033884; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=vtIWOjDk4Tzk/Wj/rE1qH7G8DNw575Qe78HCHlDh0uY=; b=XMK+8AarBxtUwVGeEX72jjrK25VyUzHNpUhtTkzQRV0X7cYaURcDdxGRDWZjwl48ql LZ1JHdNpkzWsBDSCrqA1HP+VJjtXsENpYKx/9Ni1IhM7J7lzkiBBxEncR4paXf2apMu2 BbNfuw0SYoJ/gnJwWgVO9RpV7qfckE5JQjeRF3eWoHEZMsjkjtQu1wogdZ1TMJdV1e/6 j39BFT6RuP/dl+LjZ+LYWZo43/rqSzppq9lwxNPi1rk2xjoYocIyeJAviv4WrDG5YBAn AO8yrOO8mys60055W21Xgz+ZVErSCEV13Lj3XnnXCWDy5KZ1kGcMS0KGGaxhGxga0Myn dc+A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684441884; x=1687033884; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=vtIWOjDk4Tzk/Wj/rE1qH7G8DNw575Qe78HCHlDh0uY=; b=JNIzJRtK7g2zryDIeTKQ7gc9s40/uTy3u4L57+uBMGc/TgRyKAH2u/cx4u+XCireRf XH08hxeJuGqCSZ1YOlqjtVMszzRFEdzH+XAWSV7ghvS79fFOLmuiHBD2xcwC1srQm/xD v9kwB3aGqi10if7+Ycefw7YSt0PqQe06Fc3YN9DVcaAOeiSs0UArfT6msLRrFRTDnSxt fbwqA9Sb7L6ZJpoigrOBDejS7fd2K5nPlCSSypuISCh6ybjh1EKiLLnAfgf2iqblA3u2 m4nD07t9vTh0LArOP1+ac2qrKZI1CH6VxnZuAEpMEnsTLRK7iw2yUVV5gBIqWNGy7q+Y 33nw==
X-Gm-Message-State: AC+VfDySqh/OpAhCyopwu9iXaGVn3WTNjszNAOv0GvQNWPDRacu+v0q6 lXxJuJVasE8j50w7s9nKI/zDyMK2C5cEVOKizvfO3ietYWjGUA==
X-Google-Smtp-Source: ACHHUZ4HHMcfPB0XlhqQbr68GL5Nr3fVtV55DrANSDSNPSbXgGzeRTn8I8BH8o2XL4VTv+vaO9S9i5heKMa+uvNzSas=
X-Received: by 2002:aa7:c048:0:b0:510:f3a6:d50 with SMTP id k8-20020aa7c048000000b00510f3a60d50mr676364edo.36.1684441884049; Thu, 18 May 2023 13:31:24 -0700 (PDT)
MIME-Version: 1.0
References: <168416383998.50512.953102690552943438@ietfa.amsl.com> <CAHBU6iuKKp3g_HbhgaZT8CcStQBKoaHOcdf9ogku=bftYt5wgA@mail.gmail.com> <f69ab710-34eb-1697-be29-cc0298c88da8@it.aoyama.ac.jp> <CAChr6SwqnKoG+MkBkCupKme1Y8aMxosBwHYkVWYy=yfAiwMCZw@mail.gmail.com>
In-Reply-To: <CAChr6SwqnKoG+MkBkCupKme1Y8aMxosBwHYkVWYy=yfAiwMCZw@mail.gmail.com>
From: Rob Sayre <sayrer@gmail.com>
Date: Thu, 18 May 2023 13:31:12 -0700
Message-ID: <CAChr6SyZfLNDXNDkWMS5xu=YvxcPEr=bvCKLVcubs_-BWxn8KA@mail.gmail.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Tim Bray <tbray@textuality.com>, Mike Ounsworth <mike.ounsworth@entrust.com>, secdir@ietf.org, draft-ietf-jsonpath-iregexp.all@ietf.org, jsonpath@ietf.org, last-call@ietf.org
Content-Type: multipart/alternative; boundary="00000000000024ca1705fbfdb01d"
Archived-At: <https://mailarchive.ietf.org/arch/msg/secdir/gHiFQWTdchk3ahh3670cU_CciI8>
Subject: Re: [secdir] [Last-Call] [Jsonpath] Secdir last call review of draft-ietf-jsonpath-iregexp-06
X-BeenThere: secdir@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Security Area Directorate <secdir.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/secdir>, <mailto:secdir-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/secdir/>
List-Post: <mailto:secdir@ietf.org>
List-Help: <mailto:secdir-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/secdir>, <mailto:secdir-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 18 May 2023 20:31:30 -0000

On Wed, May 17, 2023 at 1:48 PM Rob Sayre <sayrer@gmail.com> wrote:

> There's one more thing I think you should add to the security
> considerations: handling of invalid Unicode in the input. Here is an
> example:
>
> https://github.com/nst/JSONTestSuite/blob/master/test_parsing/i_string_1st_valid_surrogate_2nd_invalid.json
>
> That string is allowed in ECMAScript and Python, but not Ruby or Rust.
>

I went back and read:
https://github.com/ietf-wg-jsonpath/iregexp/issues/22#issuecomment-1504004958

I think the issue is that JSON, .js, or .py source files are almost always
valid UTF-8, so if you just interpret them as a text file, it's fine.

But I-Regexp operates on the text produced after the escape sequence
processing phase, which happens after UTF decoding. Some languages enforce
valid Unicode at this point, but some don't. So it's a bit separate from
the encoding of the source file. I think that's what I-JSON was going
after, too.

thanks,
Rob




> * https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
>
> On Tue, May 16, 2023 at 1:20 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
> wrote:
>
>> Hello Tim, Mike, others,
>>
>> Many thanks to Mike for bringing up this issue. I have the following
>> comments:
>>
>> 1) The syntax (and what's more important, but completely missing in the
>> text, the semantics) is equivalent to theoretical regular expressions,
>> i.e. regular expressions that define regular languages (Chomsky Type 3).
>> They can therefore be implemented to run in time linear to the length of
>> the string input, and if so implemented, they don't offer any
>> possibilities for ReDoS (but see (4) below).
>>
>> 2) I-Regexps will be executed on all kinds of different regexp
>> implementations, as explicitly envisioned by the draft. Some of these
>> implementations use backtracking because they need it for more complex
>> features, and may (or let's say will) exhibit quadratic, cubic, or
>> exponential behavior on some I-Regexp inputs. The examples of nested
>> quantifiers given by Mike are a typical case.
>>
>> 3) Even for the same programming language/regular expression engine,
>> different versions may perform differently. As an example, Ruby just
>> recently implemented some changes to its regexp engine, implementing a
>> cache that makes a lot of cases linear in time and uses space
>> proportional to the product of the length of the regexp and the input.
>> (see https://bugs.ruby-lang.org/issues/19104). In addition, we
>> introduced a predicate to check whether a regexp can be run in linear
>> time (https://bugs.ruby-lang.org/issues/19194) and a feature to set a
>> timeout (https://bugs.ruby-lang.org/issues/17837). So a regexp that
>> might blow up on Ruby versions of 3.1 and lower may behave nicely on
>> Ruby 3.2 and higher.
>>
>> 4) Repetitions with lower and upper bounds are a bit of a special case.
>> In theory, α{m,n} (where α is any suitable subregexp) can always be
>> expanded to m times α followed by (n-m) times α?. As an example, a{3,7}
>> would expand to aaaa?a?a?a?, which can be implemented efficiently (e.g.
>> using conversion from a nondeterministic finite automaton (NFA) to a
>> deterministic finite automaton (DFA)). However, it's easy to spoof this
>> with e.g. a regular expression of the form a{10,1000000}. And if
>> 1,000,000 isn't enough, it's easy for the attacker to add a few zeros,
>> they may get a 10-fold memory increase for each zero they add.
>> It is possible to avoid this memory increase by using some kind of lazy
>> construction, which would only be triggered on lengthy inputs, and be
>> linear in the length of the input. Nested quantifier ranges will need
>> special care.
>> [It could be that "Programming Techniques: Regular expression search
>> algorithm" (Ken Thomson, https://dl.acm.org/doi/10.1145/363347.363387)
>> includes such a technique; unfortunately, I have not yet been able to
>> understand that paper because I don't understand the assembly code it
>> uses :-(.]
>> Ruby produces a "too big number for repeat range" error for ranges
>> greater than 100000. I have no idea how the number 100000 was found.
>> Ruby 3.2 also returns false for
>> Regexp.linear_time?(/(a?){20,200}{20,200}/) (and runs a veeery long time
>> on some specific input), which I think is a limitation in the
>> implementation, not in principle.) Many engines simply treat stacked
>> range quantifiers as errors (see https://regex101.com).
>>
>> On 2023-05-16 00:52, Tim Bray wrote:
>> > I was reading your note and agreeing that yes, it remains possible to
>> > devise regexps that are going to cause combinatorial nasties in almost
>> any
>> > conceivable implementation, but unconvinced about the conclusion that
>> it is
>> > "still not advisable to run arbitrary user-provided regular expressions
>> on
>> > your hardware", because it seems to me that the only way to find out if
>> the
>> > regexp is evil is to run it.
>>
>> Not exactly. For I-Regexps, it's only the implementation that is
>> relevant; all I-Regexps *can* be implemented safely.
>>
>> Even for much wider classes of regexps, most of them can be very quickly
>> classified into safe or potentially dangerous, with rather simple
>> heuristics.
>>
>> So in conclusion, please make sure that the security considerations
>> mention all of the following:
>>
>> - I-Regexps may be implemented to run in linear time in the input,
>>    but this requires really great care.
>> - Existing regexp engines may be used to run I-Regexps, but may run in
>>    quadratic, cubic, or exponential time of the input length for some
>>    I-Regexps.
>> - Existing regexp engines should be able to easily handle most
>>    I-Regexps (after the adjustments discussed in Section 5), but
>>    may reject some types of syntactically legal I-Regexps because they
>>    cannot guarantee execution in reasonable time (a typical example
>>    might be a{2,4}{2,4})
>> - Range quantifiers (example: {2,4}) provide specific challenges
>>    for implementation and may therefore be limited in composability
>>    (see previous example) or range (e.g. disallowing very large ranges
>>    such as {20,200000}). Such large ranges may also allow specific
>>    attacks on some implementations.
>> - Different versions of the same regexp library may be more or less
>>    vulnerable to some attacks. In particular, newer libraries may
>>    behave better or may offer additional features to check regular
>>    expressions or guard against high resource consumption.
>>
>> Regards,   Martin.
>>
>> P.S.: The syntax currently has
>>  >>>>
>> quantifier = ( %x2A-2B ; '*'-'+'
>>   / "?" ) / ( "{" quantity "}" )
>>  >>>>
>> The doubly nested alternatives and the extremely short range provide
>> more confusion than clarity to the reader. Also, there's absolutely no
>> need to use hex for * and + (see the char-val production in RFC 5234).
>> So please rewrite this as:
>>  >>>>
>> quantifier = "*" / "+" / "?" / ( "{" quantity "}" )
>>  >>>>
>>
>> Because we want to refer to it in the security discussion, it may also
>> be a good idea to introduce a rule name for "{" quantity "}":
>>  >>>>
>> quantifier = "*" / "+" / "?" / range-quantifier
>> range-quantifier = "{" quantity "}"
>>  >>>>
>>
>> P.P.S.: For additional reading, I suggest starting at
>>
>> https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865.
>>
>> All of James Davis' work on ReDoS is well worth reading.
>>
>> > But I think your closing paragraph provides a solution.
>> >
>> > On Mon, May 15, 2023 at 8:17 AM Mike Ounsworth via Datatracker <
>> > noreply@ietf.org> wrote:
>> > …
>> >
>> >>   I wonder if this
>> >> document could recommend that implementations include some sort of
>> >> configurable
>> >> limit on nesting level or on recursion / backtracking depth.
>> >
>> >
>> > That sounds like a good direction, but pretty complex. A simpler option
>> > would be that implementations impose a limit on time and/or memory costs
>> > and error out when those are breached. Do you think that a
>> recommendation
>> > along those lines would address your concerns?
>> >
>> >
>>
>> --
>> last-call mailing list
>> last-call@ietf.org
>> https://www.ietf.org/mailman/listinfo/last-call
>>
>