Re: [Cbor] Handling duplicate map keys

Laurence Lundblade <lgl@island-resort.com> Sat, 23 November 2019 02:00 UTC

Return-Path: <lgl@island-resort.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 17480120180 for <cbor@ietfa.amsl.com>; Fri, 22 Nov 2019 18:00:05 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.897
X-Spam-Level:
X-Spam-Status: No, score=-1.897 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_NONE=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 6kMDvmCoPVNV for <cbor@ietfa.amsl.com>; Fri, 22 Nov 2019 18:00:03 -0800 (PST)
Received: from p3plsmtpa07-04.prod.phx3.secureserver.net (p3plsmtpa07-04.prod.phx3.secureserver.net [173.201.192.233]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id A3B93120168 for <cbor@ietf.org>; Fri, 22 Nov 2019 18:00:03 -0800 (PST)
Received: from [172.19.249.135] ([38.98.37.136]) by :SMTPAUTH: with ESMTPA id YKiHi3su1Q4JfYKiNij4Ak; Fri, 22 Nov 2019 19:00:02 -0700
From: Laurence Lundblade <lgl@island-resort.com>
Message-Id: <E5C74E1F-A5F4-4AB8-9787-3999C4697C3B@island-resort.com>
Content-Type: multipart/alternative; boundary="Apple-Mail=_4E416485-5BF0-4C6D-B865-31F33140BE6F"
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
Date: Sat, 23 Nov 2019 09:59:52 +0800
In-Reply-To: <CANh-dX=EVDa4EwrgWKof4kCD3nkfV3BvH0Cg5ZKOivmXJ_Dm8g@mail.gmail.com>
Cc: cbor@ietf.org
To: Jeffrey Yasskin <jyasskin=40google.com@dmarc.ietf.org>
References: <CANh-dX=EVDa4EwrgWKof4kCD3nkfV3BvH0Cg5ZKOivmXJ_Dm8g@mail.gmail.com>
X-Mailer: Apple Mail (2.3445.9.1)
X-CMAE-Envelope: MS4wfBoQ5/ea9wICwsr+RBCCe+Seo/P0WLvm2TcSKQNZo9s7ntqCO8eMnbQF28lv9ZkNKKOCATUTwpSK2pYtmjDBgXBkUUdeIEL5bTjCw8IkN1cVusx8L6qN C22DdQ6zHWS9NJw7628pT480w7F0ce61MxUdFeDadPMauQP9lTumInNBHakG1kx6ithCy+/QsrTnqPRG0Il7Db81TGX4CzzlzeW1LtnQEb/Oks3gbYrV339b
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/y9EwKVL9Sz0GuuRl6rZ5GAplDxk>
Subject: Re: [Cbor] Handling duplicate map keys
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sat, 23 Nov 2019 02:00:05 -0000

Here’s a rough proposed text:

The protocol designer should make a choice for maps as to whether duplicates are allowed or not, particularly as to whether duplicates would cause security or functionality problems.

The protocol designer should only require duplicate detection when necessary as it can have the following implications:
Some generic decoders do not support duplicate detection because the underlying facilities in their programming environment to represent maps can’t detect duplicates
Some generic decoders do not support duplicate detection because it requires more code and is not required
It requires more resources to implement: 1) memory to store all the keys in the map, 2) more code, 3) more CPU time, as much as O(n^2). This is particularly an issue with big maps in constrained environments.
It is suggested that protocol designers require duplicate detection only no the particular maps for which there is an issue.

Decoders will typically fall into one of these categories:
Full duplicate detection
Pass all items up to caller, allowing the caller to implement duplicate detection or not
No duplicate detection
A generic decoder should identify which it is. Some may support more than one, selectable by configuration.

Here’s why.

I see no value in take first or take last over duplicate detection. In all three all, the keys encountered must be recorded and processed. If the map is large this will take a lot of memory. In all cases there is some extra code and CPU. CPU time is worse than O(n), perhaps O(n^2). It doesn’t matter if the use case is streaming or not. It is the memory, code and CPU time that matter.  (Take last can only be implemented by decoding the whole map, as it is only after decoding the whole map that it can know it has encountered the last; in a way you cannot do the last for a streaming decoder).

This is a guess — I don’t think map libraries that don’t support duplicate detection commonly support  take first. I suspect some do take first, some do take last and some do take random. To me that makes any mention of take first or take last in the specification not very valuable especially as it doesn’t have much memory, CPU or code advantage over duplicate detection.

LL