Re: [Cbor] Handling duplicate map keys

Jeffrey Yasskin <jyasskin@chromium.org> Sat, 23 November 2019 09:11 UTC

Return-Path: <jyasskin@google.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id CCD4812090A for <cbor@ietfa.amsl.com>; Sat, 23 Nov 2019 01:11:36 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -9.251
X-Spam-Level:
X-Spam-Status: No, score=-9.251 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HEADER_FROM_DIFFERENT_DOMAINS=0.249, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, USER_IN_DEF_SPF_WL=-7.5] autolearn=no autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (1024-bit key) header.d=chromium.org
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id hJUmli4o5aCj for <cbor@ietfa.amsl.com>; Sat, 23 Nov 2019 01:11:35 -0800 (PST)
Received: from mail-qt1-x82f.google.com (mail-qt1-x82f.google.com [IPv6:2607:f8b0:4864:20::82f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 96DE7120908 for <cbor@ietf.org>; Sat, 23 Nov 2019 01:11:35 -0800 (PST)
Received: by mail-qt1-x82f.google.com with SMTP id r20so10827661qtp.13 for <cbor@ietf.org>; Sat, 23 Nov 2019 01:11:35 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=JlrYV7ve1IiQjUyiUtJlA6/QuP1+8de94x6p4lzuQU8=; b=gBMybBuTPEVYwZV8P2pEDIfZ1sLs1tdxhj7ZlFeR7jzptHc3DliUdg6IkGrTU5Ae9L 1i19APyTCHSC10+DEb3dHPS4ZLapJ4VmLyJLiX1tdttZS1H2b1ceF+VzPZ62/CSmpXzm dPuPa2lBrwUUycy2yGaOGdonPkdysVePvTZVI=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=JlrYV7ve1IiQjUyiUtJlA6/QuP1+8de94x6p4lzuQU8=; b=Ygdx/3lEd9Zt/OBvZv0HzPIKCyXQRXZz8fM8dSoBL75apXe7TeW9xQsnk43+jltQZj UfJ6o3zg5zHBwWU7C6E7CuFZOaxiyACsspj3nilMTKgA1YaXz1skJTtNIW+aAb+XXRqX +uFY0MBOxQ0bRgU1kJA/2WYgkvvoHc+mvxuZa69TvpoJS3e0JaHoKs745dMGW/Wt4LW4 FrxURPEBzt/WHHjTUw7Q0ux0IPmcuCJlz+bsdPC+j7w8RdqZQZjk6NRTdCc+raQxTsRt 3MBTL3I9ruL3sh3eD6Ue/aR7Trt9fJxEfJP7ZrYKlS4dWWRWRUqm+kvnly23UadB8wqZ I7CA==
X-Gm-Message-State: APjAAAXKaHzSlua6cjjVt8HM+oZ/WsKeXRmAwHP7TrVRvcWmwtXKmyI0 qoxBiha5QbxhHC5dJqGG0F9pxa+v51SH0KV0vNsAPUDOOFJlOjCM
X-Google-Smtp-Source: APXvYqxRcsfjyNVa9E6uwtlmWL4yA4Y+LZpxR9GqCW7fXvxnT3Go01NH2aN92iSS8jWpnkBXniGPrDGSfTzSwvdRJU8=
X-Received: by 2002:aed:37c6:: with SMTP id j64mr2161577qtb.364.1574500293957; Sat, 23 Nov 2019 01:11:33 -0800 (PST)
MIME-Version: 1.0
References: <CANh-dX=EVDa4EwrgWKof4kCD3nkfV3BvH0Cg5ZKOivmXJ_Dm8g@mail.gmail.com> <E5C74E1F-A5F4-4AB8-9787-3999C4697C3B@island-resort.com>
In-Reply-To: <E5C74E1F-A5F4-4AB8-9787-3999C4697C3B@island-resort.com>
From: Jeffrey Yasskin <jyasskin@chromium.org>
Date: Sat, 23 Nov 2019 17:11:22 +0800
Message-ID: <CANh-dX=1gkAtfSG-yzCsVAkk=oLM-=dN_JLCr1kQK3d6Jb0fSw@mail.gmail.com>
To: Laurence Lundblade <lgl@island-resort.com>
Cc: Jeffrey Yasskin <jyasskin=40google.com@dmarc.ietf.org>, cbor@ietf.org
Content-Type: multipart/alternative; boundary="000000000000b8db0c0597ffea77"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/WTc7D5Eiohu21pK7SkqVJhXEtv8>
Subject: Re: [Cbor] Handling duplicate map keys
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sat, 23 Nov 2019 09:11:37 -0000

On Sat, Nov 23, 2019 at 10:00 AM Laurence Lundblade <lgl@island-resort.com>
wrote:

> Here’s a rough proposed text:
>
> The protocol designer should make a choice for maps as to whether
> duplicates are allowed or not, particularly as to whether duplicates would
> cause security or functionality problems.
>
> The protocol designer should only require duplicate detection when
> necessary as it can have the following implications:
>
>    - Some generic decoders do not support duplicate detection because the
>    underlying facilities in their programming environment to represent maps
>    can’t detect duplicates
>    - Some generic decoders do not support duplicate detection because it
>    requires more code and is not required
>
>
>    - It requires more resources to implement: 1) memory to store all the
>    keys in the map, 2) more code, 3) more CPU time, as much as O(n^2). This is
>    particularly an issue with big maps in constrained environments.
>
> It is suggested that protocol designers require duplicate detection only
> no the particular maps for which there is an issue.
>
> Decoders will typically fall into one of these categories:
>
>    - Full duplicate detection
>    - Pass all items up to caller, allowing the caller to implement
>    duplicate detection or not
>    - No duplicate detection
>
> A generic decoder should identify which it is. Some may support more than
> one, selectable by configuration.
>
>
I don't think generic decoders are relevant here. Protocol implementations
often accumulate their input into a data structure even if they use a
streaming decoder to read the input. Any such protocol can detect
duplicates (of the keys it uses at all) for at most the cost of an extra
bit per field.

Here’s why.
>
> I see no value in take first or take last over duplicate detection. In all
> three all, the keys encountered must be recorded and processed. If the map
> is large this will take a lot of memory. In all cases there is some extra
> code and CPU. CPU time is worse than O(n), perhaps O(n^2). It doesn’t
> matter if the use case is streaming or not. It is the memory, code and CPU
> time that matter.  (Take last can only be implemented by decoding the whole
> map, as it is only after decoding the whole map that it can know it has
> encountered the last; in a way you cannot do the last for a streaming
> decoder).
>
> This is a guess — I don’t think map libraries that don’t support duplicate
> detection commonly support  take first. I suspect some do take first, some
> do take last and some do take random. To me that makes any mention of take
> first or take last in the specification not very valuable especially as it
> doesn’t have much memory, CPU or code advantage over duplicate detection.
>

I think you make a good point that if a protocol is going to specify "take
first", it can probably specify "error on duplicate key" for the same cost.
That might imply some much simpler text:

"Protocols based on CBOR SHOULD fail with an error if a map contains a
> duplicate key, except that if the key isn't used at all, they MAY ignore it
> instead. Protocols that do not reject duplicate keys MUST (?) document why
> the cost of rejecting duplicates is too high and why accepting them will
> not lead to security vulnerabilities. An array might be a better choice for
> such protocols."


Do we have any concrete examples of protocols or formats built on CBOR that
don't store anything while parsing their input? One of those might provide
a good example of a protocol that shouldn't reject duplicates.

Jeffrey