Re: [apps-discuss] Gen-ART review of draft-bormann-cbor-04

Tim Bray <tbray@textuality.com> Mon, 05 August 2013 17:51 UTC

Date: Mon, 05 Aug 2013 10:51:13 -0700
From: Tim Bray <tbray@textuality.com>
To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
Cc: "draft-bormann-cbor-04.all@tools.ietf.org" <draft-bormann-cbor-04.all@tools.ietf.org>, "gen-art@ietf.org" <gen-art@ietf.org>, IETF Apps Discuss <apps-discuss@ietf.org>

I like the design of CBOR, but my big question is how much code size,
memory size, and performance it actually buys you.  My experience is that
the performance of software that parses XML & JSON is often dominated by
memory allocation as it builds the structures to hold the parsed data.  Do
we have any data? -T


On Mon, Aug 5, 2013 at 10:43 AM, Joe Hildebrand (jhildebr) <
jhildebr@cisco.com> wrote:

> Sorry, my response is also correspondingly long.  There are some original
> comments at the end that are not just responses to Martin.
>
> On 7/30/13 9:05 AM, "Martin Thomson" <martin.thomson@gmail.com> wrote:
>
> >I'm glad that I held this review until Paul's appsarea presentation.
> >This made it very clear to me that the types of concerns I have are
> >considered basically irrelevant by the authors because they aren't
> >interested in changing the design goals.  I don't find the specific
> >design goals to be interesting and am of the opinion that the goals
> >are significant as a matter of general application.  I hope that is
> >clear from my review.
>
> I don't think you made it clear which of the design goals you thought
> weren't interesting.  I'll give my take on each of them here:
>
> > 1.  The representation must be able to unambiguously encode most
> > common data formats used in Internet standards.
>
>
> Seems reasonable.  For example, dates *do* come up quite often (e.g. MIME
> headers).
>
> > 2.  The code for an encoder or parser must be able to be compact in
> > order to support systems with very limited memory and processor
> > power and instruction sets.
>
>
> Seems reasonable, particularly if you're targeting embedded devices and
> sensors.  One of the other corollaries they ought to mention here is that
> devices that have small implementations should be able to receive a subset
> of the value of the format, implementing only the parts that they need to
> and skipping everything they don't understand.
>
> > 3.  Data must be able to be parsed without a schema description.
>
>
> I'd put this first.  It's an absolute must-have for future extensibility.
>
>
> > 4.  The serialization must be reasonably compact, but data
> > compactness is secondary to code compactness for the encoder and
> > parser.
>
>
> This is one of the things that people don't like about XML, which leads
> them to come up with aggressively complicated approaches like EXI.  I
> might argue that it's a little more important than code compactness for my
> use cases, but I think CBOR currently strikes an adequate balance.
>
> > 5.  The format must be applicable to both constrained nodes and high-
> > volume applications.
>
> In IoT applications, both will be true at the same time, so I like this
> requirement.  Low CPU usually means less battery usage anyway.
>
>
> > 6.  The format must support all JSON data types for conversion to and
> > from JSON.
>
>
> Why not?  There are some bits that don't convert quite cleanly enough to
> JSON, but it's pretty close.
>
> > 7.  The format must be extensible, with the extended data being able
> > to be parsed by earlier parsers.
>
>
> This is interesting, and I would be willing to give up several kinds of
> extensibility to get more simplicity.  Perhaps this is your greatest
> issue, and I don't think you are necessarily wrong.
>
>
>
> >I have reviewed the mailing list feedback, and it's not clear to me
> >that there is consensus to publish this.  It might be that the dissent
> >that I have observed is not significant in Barry's learned judgment,
> >or that this is merely dissent on design goals and therefore
> >irrelevant.  The fact that this work isn't a product of a working
> >group still concerns me.  I'm actually interested in why this is
> >AD-sponsored rather than a working group product.
>
> I think this draft is interesting, and might benefit from a short-lived
> working group to ensure that enough people have reviewed it.  As well, I
> think there might need to be a BCP70-style document that explores how to
> design protocols using CBOR, after we've got some implementation
> experience.
>
> I do think that rather a lot of the dissent from the list was heard and
> incorporated, including streaming, which I personally think is an
> over-complication, but others really seemed to want.
>
> >Major issues:
> >My major concerns with this document might be viewed as disagreements
> >with particular design choices.  And, I consider it likely that the
> >authors will conclude that the document is still worth publishing as
> >is, or perhaps with some minor changes.  In the end, I have no issue
> >with that, but expect that the end result will be that the resulting
> >RFC is ignored.
>
> I don't agree that CBOR will be ignored if published in the current form,
> but I do believe it could use a little more review and polish.
>
> >What would cause this to be tragic, is if publication of this were
> >used to prevent other work in this area from subsequently being
> >published.  (For those drawing less-than-charitable inferences from
> >this, I have no desire to throw my hat into this particular ring,
> >except perhaps in jest [1].)
>
> There will be other formats, even if this is published.
>
>
> >This design is far too complex and large.  Regardless of how
> >well-considered it might be, or how well this meets the stated design
> >goals, I can't see anything but failure in this document's future.
> >JSON succeeds largely because it doesn't attempt to address so many
> >needs at once, but I could even make a case for why JSON contains too
> >many features.
>
> I've implemented it twice now, using two different approaches.  The second
> time, I separated the lexing and generation using a SAX-style eventing
> model.  There were a total of eight events fired:
>
> - value
> - array start/stop
> - map start/stop
> - stream start/stop
> - end
>
> That seems roughly minimal for data shaped similarly to JSON (except for
> streams), and significantly less complex than the equivalent SAX2 set for
> XML.
>
> >In comparison with JSON, this document does one major thing wrong: it
> >has more options than JSON in several dimensions.  There are more
> >types, there are several more dimensions for extensibility than JSON:
> >types extensions, values extensions (values of 28-30 in the lower bits
> >of the type byte), plus the ability to apply arbitrary tags to any
> >value.  I believe all of these to be major problems that will cause
> >them to be ignored, poorly implemented, and therefore useless.
>
> I agree that not everyone will implement all of the semantic tags, but the
> fact that you can continue parsing even if you receive a tag you don't
> implement is an interesting and useful property.
>
> I do agree that the extensibility of the patterns 28-30 is scary; those
> came about when streaming was added.
>
> >In part, this complexity produces implementations that are far more
> >complex than they might need to be, unless additional standardization
> >is undertaken.  That idea is something I'm uncomfortable with.
>
> I didn't find either of those properties to be challenging to code around.
>  I raise an error on bit patterns 28-30, capture but ignore tags in the
> eventing layer, and in the generation layer am able to implement new tags
> with code that looks like:
>
> tag_DATE_STRING: (val)->
>   new Date(val)
>
> Which doesn't seem like an onerous burden for the upside of being able to
> round-trip all of the built-in JavaScript types.
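[A plain-JavaScript equivalent of the handler above might look like the
sketch below.  The dispatch-table shape and function names are my own
invention, not the draft's API; tags 0 and 1 are the draft's date/time
tags.]

```javascript
// Sketch of a tag-dispatch layer: known tags get a converter,
// unknown tags fall through to the raw value so parsing can continue.
const tagHandlers = {
  0: (val) => new Date(val),         // tag 0: RFC 3339 date/time string
  1: (val) => new Date(val * 1000),  // tag 1: numeric seconds since the epoch
};

function applyTag(tag, val) {
  const handler = tagHandlers[tag];
  return handler ? handler(val) : val; // unknown tag: keep the value, keep going
}
```

So `applyTag(1, 0)` yields a Date at the epoch, while an unregistered tag
such as `applyTag(999, "x")` just passes `"x"` through.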
>
>
> I guess I disagree with your characterization of overcomplexity, based on
> implementation experience.
>
> >Design issue: extensibility:
> >This document avoids discussion of issues regarding schema-less
> >document formats that I believe to be fundamental.  These issues are
> >critical when considering the creation of a new interchange format.
> >By choosing this specific design it makes a number of trade-offs that
> >in my opinion are ill-chosen.  This may be in part because the
> >document is unclear about how applications intend to use the documents
> >it describes.
> >
> >You may conclude after reading this review that this is simply because
> >the document does not explain the rationale for selecting the approach
> >it takes.  I hope that isn't the conclusion you reach, but appreciate
> >the reasons why you might do so.
> >
> >I believe the fundamental problem to be one that arises from a
> >misunderstanding about what it means to have no schema.  Aside from
> >formats that require detailed contextual knowledge to interpret, there
> >are several steps toward the impossible, platonic ideal of a perfectly
> >self-describing format.  It's impossible because ultimately the entity
> >that consumes the data is required at some level to understand the
> >semantics that are being conveyed.  In practice, no generic format can
> >effectively self-describe to the level of semantics.
> >
> >This draft describes a format that is more capable at self-description
> >than JSON.  I believe that to not just be unnecessary, but
> >counterproductive.  At best, it might provide implementations with a
> >way to avoid an occasional extra line of code for type conversion.
>
> As an example, the binary string type would be quite useful for JOSE-style
> documents that need to reproduce a set of octets exactly.  Today, Base64
> is required for that in JSON, which is causing the JOSE working group to
> limit the size (and therefore the ease of understanding) of value names,
> as multiple layers of Base64 can expand their output size past the
> external limits placed on applications by browsers.
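[To put rough numbers on the expansion: each base64 layer inflates the
payload by about 4/3 plus padding, while a CBOR byte string would carry
the same octets behind a two-byte head.  A sketch using Node's Buffer;
the 48-octet size is an arbitrary choice.]

```javascript
// Repeated base64 encoding compounds the ~4/3 size penalty.
const raw = Buffer.alloc(48, 0xab);                    // 48 arbitrary octets
const once = raw.toString('base64');                   // 64 chars
const twice = Buffer.from(once).toString('base64');    // 88 chars
const thrice = Buffer.from(twice).toString('base64');  // 120 chars
console.log(once.length, twice.length, thrice.length); // 64 88 120
```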
>
> Secondly, numeric types in JSON are pretty woefully underspecified, and
> cannot be fixed. Transmitting both integers and floats with precise
> understanding of how they will be received is often quite useful.
>
> I don't see either of these as simple type conversion issues.
>
> >Extensibility as it relates to types:
> >The use of extensive typing in CBOR implies an assumption of a major
> >role for generic processing.  XML schema and XQuery demonstrate that
> >this desire is not new, but they also demonstrate the folly of
> >pursuing those goals.
>
> XML schema keeps these bits in a (quite complex) separate stream, which
> requires significant extra processing to add type information into the
> parse stream.  I agree that we've learned that this approach doesn't work
> well.  I don't think that either XSD or XQuery prove anything about
> putting some type information into the data stream itself.
>
>
> >JSON relies on a single mechanism for extensibility. JSON maps that
> >contain unknown or unsupported keys are (usually) ignored.  This
> >allows new values to be added to documents without destroying the
> >ability of an old processor to extract the values that it supports.
> >The limited type information JSON carries leaks out, but it's unclear
> >what value this has to a generic processor.  All of the generic uses
> >I've seen merely carry that type information, no specific use is made
> >of the knowledge it provides.
>
> I don't see how this is relevant.  Most CBOR extensibility for a given
> protocol will follow this model exactly.
>
> >ASN.1 extensibility, as encoded in PER, leads to no type information
> >leaking.  Unsupported extensions are skipped based on a length field.
>
> I'd say that CBOR has a slight advantage here, because you can still get
> some interesting information out of many tag types, even if you don't
> support that tag type.  That allows a non-extensible parsing system to be
> used, if desired.
>
> >(As an aside, PER is omitted from the analysis in the appendix, which
> >I note from the mailing lists is due to its dependency on schema.
> >Interestingly, I believe it to be possible - though not trivial - to
> >create an ASN.1 description with all the properties described in CBOR
> >that would have roughly equivalent, if not fully equivalent,
> >properties to CBOR when serialized.)
>
> Let's add PER to the analysis.  I don't think it will have the properties
> desired, since the decoder has to have access to the same schema the
> sender had.  I agree that it might be possible to create CBOR Encoding
> Rules or the moral equivalent, but I'm not sure I agree that it would be
> worth the effort.
>
> >By defining an extensibility scheme for types, CBOR effectively
> >acknowledges that a generic processor doesn't need type information
> >(just delineation information), but it then creates an extensive type
> >system.  That seems wasteful.
>
> In a given system using CBOR, the protocol designers will decide which
> types to use.  All of the tooling in that ecosystem will parse and encode
> the selected types.
>
> Think of the extensibility here being more for generating lots of
> different protocols on top of CBOR, not lots of different types in one
> protocol.
>
> >Design issue: types:
> >The addition of the ability to carry uninterpreted binary data is a
> >valuable and important feature.  If that was all this document did,
> >then that might have been enough.  But instead it adds numerous
> >different types.
>
> I agree that binary is the most important.  However, unambiguous,
> unescaped, UTF8-only text is also a big win over JSON.  Precise floats are
> a distant third for me, but seem to be the most important for folks that
> do remote sensors.
>
> >I can understand why multiple integer encoding sizes are desirable,
> >and maybe even floating point representations, but this describes
> >bignums in both base 2 and 10,
>
> I agree that the bignums are overkill.  I implemented them the second
> time, but I made sure they were slow. :)
>
> Seriously, I don't see using these in any applications, unless the
> security folks decide they are a good fit for an RSA modulus.
>
> >embedded CBOR documents in three forms,
> >URIs, base64 encoded strings, regexes, MIME bodies, date and times in
> >two different forms, and potentially more.
>
> I think I may be responsible for having suggested URIs and regexes, and
> would fully support removing them from the base draft, along with all of
> the baseX and the (underspecified) MIME stuff.
>
> Please keep the dates.
>
> >I also challenge the assertion made where the code required for
> >parsing a data type produces larger code sizes if performed outside of
> >a common shared library.  That's arguably provably true, but last time
> >I checked a few extra procedure calls (or equivalent) weren't the
> >issue for code size.  Sheer number of options on the other hand might
> >be.
>
> That assertion is also unconvincing to me.  However, I'd really like my
> docs to round-trip, and dates are a big hole in that.  I personally want
> precisely one date type (rather than two), and don't care which one is
> selected.  Javascript floating point milliseconds before/after the epoch
> seems like a reasonable starting point, for example.
>
> >Half-precision floating point numbers are a good example of excessive
> >exuberance.  They are not available in many languages for good reason:
> >they aren't good for much.
>
> They're also relatively new.
>
> >They actually tend to cause errors in
> >software in the same way that threading libraries do: it's not that
> >it's hard to use them, it's that it's harder than people think.  And
> >requiring that implementations parse these creates unnecessary
> >complexity.
>
> I know that it took me an hour or two with the IEEE754 docs to implement
> them in JavaScript, and that code passes a ton of unit tests, including
> subnormals, +/- infinity, and NaNs.  That said, I don't need halfs for my
> use cases, but if sensor folks feel strongly about them, I'm willing to
> keep that code in place.
>
> It's *really* nice to be able to encode NaN and +/- Infinity as separate
> from null, which JSON does not allow.
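[For what it's worth, the decode half of that code really is compact.
A sketch of an IEEE 754 half-precision decoder in JavaScript, covering
subnormals, infinities, and NaN; the function name is my own.]

```javascript
// Decode an IEEE 754 half-precision value from its 16-bit pattern.
function decodeHalf(u16) {
  const sign = (u16 & 0x8000) ? -1 : 1;
  const exp  = (u16 >> 10) & 0x1f;
  const frac = u16 & 0x03ff;
  if (exp === 0)  return sign * frac * Math.pow(2, -24);  // subnormal (or zero)
  if (exp === 31) return frac ? NaN : sign * Infinity;    // NaN / +-Infinity
  return sign * (1024 + frac) * Math.pow(2, exp - 25);    // normal
}
```

For example, `decodeHalf(0x0001)` gives 5.960464477539063e-8 and
`decodeHalf(0x7c00)` gives Infinity.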
>
> >I do not believe that for the very small subset of cases
> >where half precision is actually useful, the cost of transmitting the
> >extra 2 bytes of a single-precision number is not going to be a
> >burden.  However, the cost of carrying the code required to decode
> >them is not as trivial as this makes out.  The fact that this requires
> >an appendix would seem to indicate that this is special enough that
> >inclusion should have been very carefully considered.  To be honest,
> >if it were my choice, I would have excluded single-precision floating
> >point numbers as well, they too create more trouble than they are
> >worth.
>
> I'd be fine with that for my use cases, too, but realize there are other
> use cases where people care about both half- and single-precision, and am
> willing to write a couple of extra lines of code to get them onboard.
>
> >Design issue: optionality
> >CBOR embraces the idea that support for types is optional.  Given the
> >extensive nature of the type system, it's almost certain that
> >implementations will choose to avoid implementation of some subset of
> >the types.  The document makes no statements about what types are
> >mandatory for implementations, so I'm not sure how it is possible to
> >provide interoperable implementations.
>
> I think it should say that *none* of the types is mandatory, which I think
> is implied by the text in section 2.4.
>
> >If published in its current form, I predict that only a small subset
> >of types will be implemented and become interoperable.
>
> I'd say let's agree on what those types are, and move all of the others to
> separate drafts.  My list:
>
> - Date (one type)
> - CBOR (sometimes parsed, other times kept intact for processing as a byte
> stream)
>
> >Design issue: tagging
> >The tagging feature has a wonderful property: the ability to create
> >emergency complexity.  Given that a tag itself can be arbitrarily
> >complex, I'm almost certain that this is a feature you do not want.
>
> I'm not sure I understand fully what your issue is.  Protocol designers
> can abuse any format, and when they do, implementors curse them down the
> ages.  Don't do that.
>
> >Minor issues:
> >Design issue: negative numbers
> >Obviously, the authors will be well-prepared for arguments that
> >describe as silly the separation of integer types into distinct
> >positive and negative types.  But it's true, this is a strange choice,
> >and a very strange design.
>
> I think it came from some of the other things in this space (e.g.
> zigzag-encoding).  I think another option would be to have both signed and
> unsigned types.  This choice didn't bother me too much in implementation,
> except when it was repeated in bigints.
>
> >The fact that this format is capable of describing 64-bit negative
> >numbers creates a problem for implementations that I'm surprised
> >hasn't been raised already.  In most languages I use, there is no
> >native type that is capable of carrying the most negative value that
> >can be expressed in this format.  -2^64 is twice as large as a 64-bit
> >twos-complement integer can store.
>
> Yes, that should be mentioned in the doc, probably with a MUST NOT.
>
> >It almost looks as though CBOR is defining a 65-bit, 33-bit or 17-bit
> >twos complement integer format, with the most significant bit isolated
> >from the others, except that the negative expression doesn't even have
> >the good sense to be properly sortable.  Given that and the fact that
> >bignums are also defined, I find this choice to be baffling.
>
> This is a quite valid concern.  Let's get it fixed.
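[To make the range problem concrete: decoding major type 1 in JavaScript
needs BigInt, because the most negative encodable value, -1 - (2^64 - 1)
= -2^64, overflows every native 64-bit signed type.  A sketch, not a
full decoder; it only handles the head forms shown.]

```javascript
// Decode a CBOR negative integer (major type 1): the head carries an
// unsigned n, and the decoded value is -1 - n.
function decodeNegative(bytes) {
  const info = bytes[0] & 0x1f;                // additional information
  let n = 0n;
  if (info < 24) {
    n = BigInt(info);
  } else {
    const len = { 24: 1, 25: 2, 26: 4, 27: 8 }[info];
    for (let i = 1; i <= len; i++) n = (n << 8n) | BigInt(bytes[i]);
  }
  return -1n - n;
}
```

So `0x20` decodes to -1, `0x38 0x63` to -100, and `0x3b` followed by
eight 0xff bytes to -2^64, which no int64 can hold.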
>
> >Document issue: Canonicalization
> >Please remove Section 3.6.  c14n is hard, and this format actually
> >makes it impossible to standardize a c14n scheme, that says a lot
> >about it.  In comparison, JSON is almost trivial to canonicalize.
>
> Agree with removing section 3.6, and moving to another doc if really
> desired.
>
> STRONGLY disagree that JSON is trivial to canonicalize.  The Unicode
> issues notwithstanding (escaping, etc), the numeric format is a mess to
> get into canonical form.
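[A quick illustration of the numeric mess, in plain JavaScript: a JSON
canonicalizer has to collapse all of these spellings to one, and the
choice interacts with double-precision rounding.]

```javascript
// Four spellings of the same JSON number; a canonical form must pick one.
const spellings = ['1e10', '1E+10', '10000000000', '1.0e10'];
const values = spellings.map((s) => JSON.parse(s));
console.log(values.every((v) => v === 10000000000)); // true
// JSON.stringify picks one rendering, but only after parsing to a double:
console.log(JSON.stringify(JSON.parse('1.0'))); // "1" -- the fraction is gone
```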
>
> >If the intent of this section is to describe some of the possible
> >gotchas, such as those described in the last paragraph, then that
> >would be good.  Changing the focus to "Canonicalization
> >Considerations" might help.
> >
> >I believe that there are several issues that this section would still
> >need to consider.  For instance, the use of the types that contain
> >additional JSON encoding hints carry additional semantics that might
> >not be significant to the application protocol.
> >
> >Extension based on minor values 28-30 (the "additional information"
> >space):
> >...is impossible as defined.  Section 5.1 seems to imply otherwise.
> >I'm not sure how that would ever happen without breaking existing
> >parsers.  Section 5.2 actually makes this worse by making a
> >wishy-washy commitment to size for 28 and 29, but no commitment at all
> >for 30.
>
> +1.  I'd prefer we remove streaming and tighten the additional information
> bits back up to all being used as in previous versions of the draft.
>
> >Nits:
> >Section 3.7 uses the terms "well-formed" and "valid" in a sense that I
> >believe to be consistent with their use in XML and XML Schema.  I
> >found the definition of "valid" to be a little difficult to parse;
> >specifically, it's not clear whether invalid is the logical inverse of
> >valid.
>
> Agree that language needs to be massaged.
>
> >Appendix B/Table 4 has a TBD on it.  Can this be checked?
>
> Sure.
>
> >Table 4 keeps getting forward references, but it's hidden in an
> >appendix.  I found that frustrating as a reader because the forward
> >references imply that there is something important there.  And that
> >implication was completely right, this needs promotion.  I know why
> >it's hidden, but that reason just supports my earlier theses.
>
> I think it should not be referred to earlier.  It's an implementation
> choice, and not one that I (for example) made.  Thinking about that table
> made it more difficult for me to get started.
>
> >Section 5.1 says "An IANA registry is appropriate here.".  Why not
> >reference Section 7.1?
>
> Yes, that needs to be fixed.
>
> >[1] https://github.com/martinthomson/aweson
>
> Add binary, and remove the string quoting requirements, please. :)
>
>
>
> Other things that ought to be discussed:
>
> I would like to see another design goal: "May be implemented in modern web
> browsers".  That should be possible with the new binary types.
>
>
> I still don't see the need for non-string map keys.  JSON mapping would be
> easier without them, as would uniqueness checking.  If they are to be
> retained, they should have some motivation in the spec, describing how and
> why they might be used.
>
> I wish I could think of another good simple value that we might register
> one day.  The only one I've come up with is "no-op", which I might use in
> a streaming application as a trivial keep-alive or a marker between
> records to ensure parser state sync.  I wouldn't classify that as "good"
> however.
>
> Nested tags ought to be forbidden until we come up with a strong use case
> for them.
>
> I could use some implementation guidance on how to generate the most
> compact floating-point type for a given number, assuming we keep all of
> the floating-point types.
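[One possible approach, sketched below: inspect the double's bit pattern
and pick the narrowest width that round-trips exactly.  This is my own
sketch, not anything from the draft; for simplicity it conservatively
reports subnormal halves as not fitting.]

```javascript
// Does x survive a round trip through half precision? (Normal halves only.)
function fitsHalf(x) {
  if (!isFinite(x)) return true;            // Inf and NaN encode in 16 bits
  if (x === 0) return true;
  if (Math.fround(x) !== x) return false;   // must at least fit in single
  const dv = new DataView(new ArrayBuffer(8));
  dv.setFloat64(0, x);
  const hi = dv.getUint32(0), lo = dv.getUint32(4);
  const exp = ((hi >> 20) & 0x7ff) - 1023;
  // half keeps 10 mantissa bits: the double's lower 42 must be zero
  if (lo !== 0 || (hi & 0x3ff) !== 0) return false;
  return exp >= -14 && exp <= 15;           // normal half exponent range
}

// Narrowest CBOR float encoding width in bytes: 2, 4, or 8.
function floatWidth(x) {
  if (fitsHalf(x)) return 2;
  if (Math.fround(x) === x) return 4;
  return 8;
}
```

For example 1.5 and 65504 fit in a half, `Math.fround(0.1)` needs a
single, and 0.1 itself needs a full double.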
>
> I don't like 2.4.4.2 "Expected Later Encoding for CBOR to JSON
> Converters".  Having a sender care about the encoding of a second format
> at the same time seems unnecessarily complex.  I'd like to see this
> section and the corresponding tags moved to another spec, or just removed.
>
> Similar for tags 33 and 34.  Just send the raw bytes as a byte string;
> there's no need to actually base64 encode.
>
>
> Section 3.2.3, we should call out the heresy of including UTF-16
> surrogates encoded as UTF-8 for those that can't read the UTF-8 spec.
>
> Overall in section 3.2, "should probably issue an error but might take
> some other action" seems like it will cause some interop surprises in
> practice.
>
> Section 3.3, "all number representations are equivalent" is unclear, even
> with the clarifying phrase afterward.
>
> If section 3.6 stays, the numerics need more work.  +/- Infinity should be
> treated like NaN.  "if the result is the same value" would benefit from
> some more clarity.  The tags section is also somewhat unclear.
>
> Section 4.2, what about numbers with uninteresting fractional parts, like
> 1.0?  What about numbers in exponential format without fractional parts,
> like 1e10?  I would recommend against even suggesting encoding in place.
> It's likely to cause a security issue for the reason mentioned.
>
> Section 6, including the diagnostic notation is a little strange.  I would
> at least like "it is not meant to be parsed" to be strengthened to "MUST
> NOT be parsed".
>
> Section 7.1, simple values 0..15 can never be used with this construction.
>  If that's intentional, then give a good reason.
>
> Section 7.2, do we have language we can use about how to reclaim
> first-come-first-serve tags that aren't being used anymore?  e.g. the web
> page is down and the requestor may be dead.
>
> In section 7.3, yes, we should make "application/mmmmm+cbor" valid.
>
> In section 8, perhaps mention a stack attack like:
>
> 0x818181818181818181...
>
> I implemented depth counting with a maximum as an approach to avoid this.
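[For that specific byte pattern, even a toy check shows the idea.  A
sketch, with a made-up limit: every 0x81 here is the head of a
one-element array, so counting heads approximates depth; a real decoder
would also decrement as each element completes.]

```javascript
// Reject pathologically deep nesting before recursing into it.
const MAX_DEPTH = 64;                     // assumption: application-chosen limit

function checkDepth(bytes) {
  let depth = 0;
  for (const b of bytes) {
    const major = b >> 5;
    if (major === 4 || major === 5) {     // array or map head
      depth += 1;
      if (depth > MAX_DEPTH) throw new RangeError('CBOR nesting too deep');
    }
  }
  return depth;
}
```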
>
> There are likely to be lots of other security concerns.
>
> I have checked all of the examples in Appendix A.  I would have expected
>
> 2(h'010000000000000000') | 0xc249010000000000000000
>
> Not 18446744073709551616, since in the diagnostic, I don't necessarily
> support bignums.
>
> Same with 0xc349010000000000000000.
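[Those two payloads decode mechanically, which is why I'd rather see the
diagnostic keep the tagged form.  A sketch using BigInt: tag 2 is
unsigned, tag 3 negative, and the byte string is a big-endian magnitude.]

```javascript
// Decode a CBOR bignum payload: tag 2 -> n, tag 3 -> -1 - n.
function decodeBignum(tag, payload) {
  let n = 0n;
  for (const b of payload) n = (n << 8n) | BigInt(b);
  return tag === 3 ? -1n - n : n;
}
```

So the 0xc249... payload decodes to 2^64 = 18446744073709551616, and the
0xc349... payload to -1 - 2^64 = -18446744073709551617.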
>
> For numbers, section 9.8.1 of ECMA-262 (JSON.stringify) is relevant.  So:
>
> 5.960464477539063e-8 | 0xf90001   (not e-08)
> 0.00006103515625 | 0xf90400       (not e-05)
>
>
>
> In Appendix B, there are holes in the jump table.  If you're going to have
> a table, call out the invalid values, such as 0x1c-0x1f.
>
> In Appendix C, the use of the variable "breakable" is unclear, and it's
> not obvious how you'll get to the "no enclosing indefinite" case.
>
> In Appendix D, shouldn't the input be unsigned or an array of bytes?
>
> Appendix E.1, aren't there some cases of DER where you need the schema to
> parse?  Add PER.
>
> Appendix E, we should mention Smile, since the first two octets are so
> cute.
>
> Overall, I think this doc shows a lot of promise, and I'm looking forward
> to having something on standards track that has these properties.
>
> --
> Joe Hildebrand
>
> _______________________________________________
> apps-discuss mailing list
> apps-discuss@ietf.org
> https://www.ietf.org/mailman/listinfo/apps-discuss
>