[Cbor] draft-bormann-cbor-sequence-00 and Big Data

Burt Harris <burt_harris@hotmail.com> Sun, 23 June 2019 02:12 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/3S9X8E1_evNd10QTZQHW_feinXM>

draft-bormann-cbor-sequence-00 has me thinking; it's a great first draft.

It is unquestionably good to logically distinguish between a single CBOR data item and a (potentially large) stream of CBOR data items. A MIME type makes very good sense, but we might also back it up by allocating a unique tag (similar to the self-describing CBOR tag number 55799 in RFC 7049, section 2.4.5) as a NO-OP "magic number" at the beginning of a stream.
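
To make the "magic number" idea concrete, here is a minimal sketch of how a tag head is encoded and how a reader could check for it as a prefix. It uses the existing self-describe tag 55799 purely as an illustration; no stream-specific tag has been allocated, and the helper names (encode_head, strip_magic) are my own, not from any CBOR library.

```python
import struct

def encode_head(major_type: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (RFC 7049, section 2.1)."""
    if value < 0:
        raise ValueError("argument must be non-negative")
    if value < 24:
        # Small arguments live directly in the low 5 bits of the initial byte.
        return bytes([(major_type << 5) | value])
    # Otherwise additional info 24..27 selects a 1, 2, 4 or 8 byte argument.
    for info, fmt, limit in ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
                             (26, ">I", 1 << 32), (27, ">Q", 1 << 64)):
        if value < limit:
            return bytes([(major_type << 5) | info]) + struct.pack(fmt, value)
    raise ValueError("argument too large for CBOR")

# Major type 6 is "tag". 55799 is the existing self-describe tag; a tag for
# streams would be a different, not-yet-allocated number.
SELF_DESCRIBE = encode_head(6, 55799)

def strip_magic(stream: bytes, magic: bytes = SELF_DESCRIBE) -> bytes:
    """Return the stream without a leading magic prefix, if one is present."""
    return stream[len(magic):] if stream.startswith(magic) else stream

print(SELF_DESCRIBE.hex())  # d9d9f7
```

The three bytes 0xd9 0xd9 0xf7 are what RFC 7049 gives for tag 55799, and a stream tag would work as a file signature in the same way.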

I think the current name and content of the draft imply that the order of the data items in the stream might be logically significant. I suggest that for this media type we consider the overall sequence an implementation detail (like the order of keys in a map). In particular, to facilitate building scalable big-data systems using CBOR, it would be valuable to recognize that if the order of data items is important, a CBOR array is probably more appropriate.
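
As a small illustration of the structural difference, assuming pre-encoded items and a hypothetical helper of my own (array_header): a definite-length array states its item count up front, whereas a stream is just the items laid end to end, so independently written streams can be merged by plain concatenation.

```python
def array_header(count: int) -> bytes:
    """Header for a definite-length CBOR array (major type 4), small counts only."""
    if not 0 <= count < 24:
        raise ValueError("this sketch only handles counts below 24")
    return bytes([0x80 | count])

# Two shards, each holding already-encoded items (the unsigned integers 1..4).
shard_a = [b"\x01", b"\x02"]
shard_b = [b"\x03", b"\x04"]

# As streams: concatenating the shards yields another valid stream.
stream = b"".join(shard_a) + b"".join(shard_b)

# As an array: the merged array needs a new header carrying the total count,
# so the writer must know how many items there are before emitting any.
merged_array = array_header(len(shard_a) + len(shard_b)) + stream

print(stream.hex())        # 01020304
print(merged_array.hex())  # 8401020304
```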

Perhaps s/sequence/stream/g might be appropriate in the document because a "sequence" (to me) implies order preservation.

For example, to collect data from a few million IoT devices, I would use a scale-out architecture where several identical front-end machines are load balanced, with each one buffering the little CBOR messages into a large buffer. The front-end machines flush their buffers asynchronously, either when the buffer is full or after a period expires. Due to the asynchronous distributed buffering, the order of the data items may not reflect the actual sequence of events. This non-sequential logging can be significant even in multicore processors with sophisticated logging infrastructure.
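
A minimal sketch of that front-end buffering, under assumptions of my own: the class name, the thresholds, and the sink callback are all hypothetical, and the age check here only runs when a new item arrives (a real front end would also flush from a timer).

```python
import time

class FrontEndBuffer:
    """Collects encoded CBOR items and flushes them to a shared sink in bulk."""

    def __init__(self, sink, max_bytes=64 * 1024, max_age=5.0, clock=time.monotonic):
        self._sink = sink            # callable that receives one bytes blob per flush
        self._max_bytes = max_bytes  # flush when the buffer reaches this size
        self._max_age = max_age      # flush when the oldest item is this old (seconds)
        self._clock = clock
        self._chunks = []
        self._size = 0
        self._first_at = 0.0

    def add(self, item: bytes) -> None:
        if not self._chunks:
            self._first_at = self._clock()
        self._chunks.append(item)
        self._size += len(item)
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._chunks:
            self._sink(b"".join(self._chunks))
            self._chunks = []
            self._size = 0

def demo() -> bytes:
    """Five events arrive in order 1..5 at two front ends with different buffer sizes."""
    store = []
    a = FrontEndBuffer(store.append, max_bytes=2)
    b = FrontEndBuffer(store.append, max_bytes=3)
    a.add(b"\x01")  # event 1
    b.add(b"\x02")  # event 2
    b.add(b"\x03")  # event 3
    a.add(b"\x04")  # event 4: front end a is full and flushes 1, 4
    b.add(b"\x05")  # event 5: front end b is full and flushes 2, 3, 5
    return b"".join(store)

print(demo().hex())  # 0104020305
```

The stored stream holds the items in the order 1, 4, 2, 3, 5 even though each front end preserved its own local order, which is the effect described above.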

Similarly, in big-data distributed processing architectures (e.g., the map/reduce algorithm), there should be no assumption of global ordering. But local ordering (within a block) can sometimes be very useful. For that reason, we might want to consider allocating a single-byte CBOR special (MT 7) value for delimiting portions of a stream where the sequence can be considered meaningful. For applications not sensitive to the order of the values, it would effectively be a NoOp, but when compression approaches are used in a CBOR stream, it should probably reset the compression state. A 1-byte NoOp in Major Type 7 can also be useful in padding a CBOR stream into a "splittable" storage format, which aids in distributed analysis processing using frameworks like Hadoop.
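
To sketch the padding use, assume (hypothetically) that the NoOp were assigned the byte 0xe0, which is the currently unassigned simple value 0 in Major Type 7. Nothing of the sort is allocated today, and the function name and block size below are likewise my own. A writer could then pad every block to a fixed size so that no data item straddles a block boundary:

```python
# ASSUMPTION: 0xe0 stands in for the proposed one-byte NoOp. It is an
# unassigned Major Type 7 simple value, not an allocated NoOp.
HYPOTHETICAL_NOOP = b"\xe0"

def pack_blocks(items, block_size: int):
    """Pack encoded CBOR items into fixed-size blocks, padding with the NoOp."""
    blocks = []
    current = b""
    for item in items:
        if len(item) > block_size:
            raise ValueError("item does not fit in a single block")
        if len(current) + len(item) > block_size:
            # Item would straddle the boundary: pad out this block, start a new one.
            blocks.append(current + HYPOTHETICAL_NOOP * (block_size - len(current)))
            current = b""
        current += item
    if current:
        blocks.append(current + HYPOTHETICAL_NOOP * (block_size - len(current)))
    return blocks

# Items: the integer 1, the array [2, 3], and the text string "abc".
items = [b"\x01", b"\x82\x02\x03", b"\x63abc"]
for block in pack_blocks(items, block_size=5):
    print(block.hex())
# 01820203e0
# 63616263e0
```

Because every block starts on an item boundary, each block can be handed to a separate worker and decoded without reading any of its neighbours.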

Finally, by testing whether a NoOp appears as the last element in a stream, it may be possible to detect whether the application writing to the stream paused at that point. For those familiar with YAML, consider the "..." token.

P.S.  I wanted to thank the members of the WG who participated in moving CDDL forward to become RFC8610, great work!