Re: [Cbor] tag 24 and 55799 (was Re: my (WGLC re-)views on error processing in RFC7049bis and future-proofing)

Carsten Bormann <cabo@tzi.org> Fri, 29 May 2020 21:16 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.80.23.2.2\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <674CEEEB-4C75-4E8D-A4F1-13058507D558@island-resort.com>
Date: Fri, 29 May 2020 23:16:46 +0200
Cc: Michael Richardson <mcr+ietf@sandelman.ca>, cbor@ietf.org
X-Mao-Original-Outgoing-Id: 612479806.393633-ab5e7e31359dfad0b54e2cc281bf6f9b
Content-Transfer-Encoding: quoted-printable
Message-Id: <96520350-30A2-4048-A8C5-967887B341EA@tzi.org>
References: <17300.1588779159@localhost> <38BB6FFF-737F-4C11-AD7A-DA3F28A9F570@tzi.org> <CANh-dXkdjMyO=WFUxrF06OfP+RE9v11unKJXL8P3UtEe+prV1w@mail.gmail.com> <13690.1588894939@localhost> <CANh-dXmjD=RCwh7ExjSvFx+5ciew+eqHoVS88OommQ2xVnX5=Q@mail.gmail.com> <2963.1589473899@localhost> <BC0EC9BE-4202-4EED-A619-CDEB9BF312CE@tzi.org> <26665.1589593222@localhost> <589BF33E-9A41-400B-A91B-F45F85062269@island-resort.com> <AD183B67-2B49-4CB3-B81D-BB024B4317E7@tzi.org> <674CEEEB-4C75-4E8D-A4F1-13058507D558@island-resort.com>
To: Laurence Lundblade <lgl@island-resort.com>
X-Mailer: Apple Mail (2.3608.80.23.2.2)
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/Yc-AY_uGZEm72QoHKq55LQzTnyo>

On 2020-05-29, at 21:51, Laurence Lundblade <lgl@island-resort.com> wrote:
> 
> Really appreciate the discussion here (despite my incorrect statements).
> 
>> On May 26, 2020, at 2:43 PM, Carsten Bormann <cabo@tzi.org> wrote:
>> 
>>> On 2020-05-26, at 23:04, Laurence Lundblade <lgl@island-resort.com> wrote:
>>> 
>>>> On May 15, 2020, at 6:40 PM, Michael Richardson <mcr+ietf@sandelman.ca> wrote:
>>>> 
>>>>>> I note that AFAIK, we do not use tag#24 (Encoded CBOR data item) for
>>>>>> the signed object, in COSE.  Should we?  What's the difference between
>>>>>> #24 and #55799?
>>>> 
>>>>> 55799 is a tag that can have any CBOR data item as tag content; 24 is a
>>>>> tag that can only be on byte strings.  The byte string then *encodes*
>>>>> another CBOR data item.  (The main use here is to keep the decoder from
>>>>> decoding, to provide easy skip-ability or because we need exact bytes
>>>>> as in COSE.)  As often with tags, there is no need for tag 24 on a byte
>>>>> string when it is clear from context that the byte string contains
>>>>> encoded CBOR; this is the case in COSE.
>>>> 
>>>> Understood.
>>> 
>>> My answer on the difference is that you use 55799 when the surrounding data / file / protocol is not CBOR and 24 when it is CBOR. 55799 is intended to work as a magic number, 24 is not because it is not unique enough.
>>> 
>>> From a decoder point of view, they should be handled exactly the same
>> 
>> Actually, no.
>> 
>> 55799 has any CBOR data item as tag content and essentially has the semantics of that CBOR data item.
> 
> This is well-formed and valid CBOR by RFC 7049 & CBORbis:
> 
> D9 D9F7          # tag(55799)
>    81            # array(1)
>       D9 D9F7    # tag(55799)
>          D9 D9F7 # tag(55799)
>             01   # unsigned(1)

Certainly.  It becomes interesting if you add tags that do have structural requirements:

C1                # tag(1)
   D9 D9F7        # tag(55799)
      1A 5ED17501 # unsigned(1590785281)

is well-formed, but not valid.

It would probably be valid if we had moved from structural to semantic validity for the tags defined in RFC 7049; we decided not to.
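
To make the structural point concrete, here is a minimal Python sketch (not taken from any existing decoder; the names read_head and tag1_content_ok are made up for illustration).  It assumes my reading of RFC 7049 Section 2.4.1, namely that tag 1 content must be an untagged integer or floating-point number, and it only handles definite-length heads.

```python
def read_head(data, pos=0):
    """Return (major type, argument, next position) for the head at pos."""
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    if info < 24:
        return major, info, pos + 1
    if info > 27:
        raise ValueError("indefinite/reserved additional info not handled here")
    size = 1 << (info - 24)  # 1, 2, 4 or 8 argument bytes
    arg = int.from_bytes(data[pos + 1:pos + 1 + size], "big")
    return major, arg, pos + 1 + size


def tag1_content_ok(data):
    """Structural check only: is the tag 1 content an integer or a float?"""
    major, tag, pos = read_head(data)
    if (major, tag) != (6, 1):
        raise ValueError("not a tag 1 item")
    inner_major, _, _ = read_head(data, pos)
    is_integer = inner_major in (0, 1)
    is_float = data[pos] in (0xF9, 0xFA, 0xFB)  # major type 7, info 25..27
    return is_integer or is_float


wrapped = bytes.fromhex("C1D9D9F71A5ED17501")  # the example above
direct = bytes.fromhex("C11A5ED17501")         # same time, no 55799 in between
```

With this check, tag1_content_ok(wrapped) is False because the content is itself a tag (major type 6), while tag1_content_ok(direct) is True.  A semantic notion of validity would have to look through 55799 first, which is exactly the move we did not make.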

> One is tempted to say that 55799, as a true CBOR no-op, must always be ignored by generic decoders.

Tempted, yes, but not supported by 7049 or 7049bis.

> However, I think most implementations of CBOR protocols (e.g., CWT, COSE) today would not tolerate 55799’s showing up anywhere in the encoded CBOR like in my example above because ignoring it universally has never been the rule.

Right.

> So you really need to mean it when you put tag 55799 in, the same as any other tag. Its use needs to be mandated as part of the protocol specification so implementors know it will be occurring and are ready to ignore it.

You can use it without a protocol specification as well.
But if you have one, it had better tell you where 55799 is “expected”.

> Probably it is not even the CBOR protocol specification that says to use 55799. It is the protocol or file format that carries the encoded CBOR that would indicate it is in use.

55799 is a bit like a BOM on UTF-8 (except that the latter is damaged by BOMs because it loses its ASCII compatibility, but let’s ignore that for now).

Definitely not something you want to use unless there is value in determining the format from a “magic number”.
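
As a rough illustration of the “magic number” use, a container format or file sniffer could do something like the following Python sketch.  The function names are hypothetical, and the decision to strip a single leading 55799 is an application choice assumed here, not something 7049 or 7049bis requires.

```python
# Head of tag 55799: major type 6, two-byte argument 0xD9F7.
SELF_DESCRIBE_CBOR = bytes.fromhex("D9D9F7")


def looks_like_cbor(blob):
    """Guess the format from the first three bytes only."""
    return blob.startswith(SELF_DESCRIBE_CBOR)


def strip_magic(blob):
    """Drop one leading 55799 head, leaving the tag content."""
    return blob[len(SELF_DESCRIBE_CBOR):] if looks_like_cbor(blob) else blob
```

This only looks at the very beginning of the data; it says nothing about 55799 occurring deeper inside a data item.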

> Probably 55799 should never occur except at the beginning of some encoded CBOR.

Probably.

>> 24 has a byte string as tag content.  That byte string is identified by the tag as encoded CBOR. 
>> A byte string with embedded encoded CBOR is a data item, but it is different from the data item that was encoded and embedded. Decoders will differ in their tag 24 handling: they might go ahead and decode that CBOR or just hand the byte string and tag to the application.  In either case, the decoded CBOR is not simply “in place”, making the tag and the byte string vanish.
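
For comparison, here is a small Python sketch of what tag 24 looks like on the wire (encode_head and wrap_tag24 are illustrative names, not from any library).  It wraps an already encoded data item in a byte string and puts tag 24 in front.

```python
def encode_head(major, arg):
    """Encode a CBOR head with the shortest argument form."""
    if arg < 24:
        return bytes([major << 5 | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([major << 5 | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def wrap_tag24(encoded_item):
    """Tag 24 + byte string head + the embedded encoded CBOR."""
    return encode_head(6, 24) + encode_head(2, len(encoded_item)) + encoded_item


inner = bytes.fromhex("1A5ED17501")  # unsigned(1590785281)
outer = wrap_tag24(inner)            # D8 18 45 1A 5E D1 75 01
```

The outer item is a tagged byte string of length 5; the unsigned integer only appears once something decides to decode the content of that byte string.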
> 
> Yes of course (I should have known that).
> 
> A generic decoder could have a mode that skips the tag24-plus-byte-string the same way it skips 55799,

The semantics are quite different, though.

> but it shouldn’t be the default because protocols usually assign some semantics to it such as having it hashed or distinguishing it from JSON (protocols should never assign such semantics to 55799).

Yes.


>> This is kind of twisted, but seems legal.
>>>  Encoding: 
>>>    - start with an RFC 3339 date string
>>>    - base 64 encode it and tag it as so with tag 34
>> 
>> Why would one want to do that?
>> 
>>>    - tag it with tag 24 (or 55799) to say it is CBOR
>> 
>> Ignoring the 55799 case, you are missing one encoding step (tag 24 requires a byte string with encoded CBOR, not a tagged text string).
>> 
>>>    - tag it with tag 0 to say it is a date string
>> 
>> Not allowed.  This only takes major type 3, no tags.
>> 
>>>    - tag it with 22 to say it should be b64’d if re-encoded later
> 
> I’m after a deeper understanding of how base64, tag 24 and 55799 interact or don’t interact. My date string example was poor (sorry).
> 
> Base64 is really a content transfer encoding, not a data type.  The original data before base64 encoding is of some type or other. Whoever processes it after the base64 encoding is removed has to come to know what type it is.

We don’t have transfer encodings in CBOR, so we can only model them as types.

The tags 32-36 are really “type tags” — they tell you what’s in there but don’t give that a new meaning.

Why anyone would want to send base64-encoded data in CBOR instead of transferring the byte string (then possibly tagged by 21/22 so it looks the same again after a JSON conversion) is not discussed in RFC 7049.  There may be a need to preserve the actual text string for some reason; base64 is deterministic only if you know all the choices.

> However, CBOR generally treats base64 as a data type itself and, as it is, has no means for indicating the type of the original data before base64 encoding was added (by contrast, MIME can describe both type and transfer encoding).

Right, the semantics of a “decoded” 33/34 is that of a byte string.
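
A Python sketch of the two ways to carry the same three bytes may help here (head, as_tag22_bstr and as_tag34_text are hypothetical helper names; the sketch uses standard base64 with padding, which is only one of the possible base64 choices).

```python
import base64


def head(major, arg):
    """Encode a CBOR head with the shortest argument form."""
    if arg < 24:
        return bytes([major << 5 | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([major << 5 | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def as_tag22_bstr(raw):
    """Byte string with tag 22: base64 is expected on a later JSON conversion."""
    return head(6, 22) + head(2, len(raw)) + raw


def as_tag34_text(raw):
    """Text string that already holds base64, identified by tag 34."""
    text = base64.b64encode(raw)
    return head(6, 34) + head(3, len(text)) + text


raw = b"\x01\x02\x03"
binary_form = as_tag22_bstr(raw)  # D6 43 01 02 03
text_form = as_tag34_text(raw)    # D8 22 64 41 51 49 44 ("AQID")
```

Both forms end up as the text "AQID" after a JSON conversion, and both mean the byte string 01 02 03; the tag 34 form is two bytes longer here and fixes the exact text chosen by the sender.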

> So how can the receiver of base64-tagged data know the type of original data? It pretty much has to be in the definition of the CBOR protocol.

Yes, byte string.

If you need a “base 64 encoded MAC-48 address”, you can go ahead and define a tag for that.

> It might say it is always an X.509 cert or an elf executable or such. It might say it is looked up by magic number (in which case 55799 might apply). Lots more options...

The tag, or a data definition (CDDL etc.), are the obvious two approaches.

> It is also possible to define tags whose content type can be bstr, base64, base64-url. If it was really common to base64 encode X.509 certs (maybe because of JOSE) and a tag was defined to indicate an X.509 cert, it might be useful to say the X.509 tag content could be bstr, base64 or base64-url.

Yes.  I think we don’t have a tag for X.509 certs at this time; where these occur, they are defined in the protocols (e.g., draft-ietf-cose-x509, which actually defines useful constructions around X.509 certs, which anyway really should be called RFC 5280 certs), ultimately supported by the COSE tags.

> No suggestion for change and nothing at issue so far, just thinking through how it fits together.
> 
> What would really be helpful to me is some detailed example of when you’d use the base64 tags. Why would you ever use them when CBOR carries binary data just fine? Why wouldn’t every CBOR-based protocol just say you always strip the base64 before CBOR encoding? And why wouldn’t every CBOR bstr always be base64 encoded when translating to JSON?

These are all good guidelines; see above for a possible reason, but I don’t know a good one either.

>> 
>> Well, not “it”, but any byte string in the CBOR data item.
>> So, here it would base-64 encode for JSON conversion the byte string in tag 24 (the tag 24 itself would presumably be stripped in any JSON conversion, but I’m already confused — you can do “JSON in JSON” as well, just not tag it).
>> 
>>> Decoding:
>>>    - remove the base 64 encoding because of the tag 34
>> 
>> That is inside, so you see it last.
>> 
>>>    - feed it back to the CBOR encoder because of tag 24
>>>    - interpret it as an RFC 3339 date because of tag 0
>> 
>> Lost track.
>> 
>>> Base64 encoding / decoding is not that much code or that difficult, so a generic decoder might actually do this.
>> 
>> I wouldn’t touch that with a ten-foot pole — it might convert that pole into base64 without asking me.
> 
> I wasn’t suggesting any base64 decoding conversion that wasn’t indicated by a tag in an input stream or any base64 encoding that wasn’t requested through an explicit API request. Section 3.4.5.3 seems to suggest the decoding is a good thing to do:
> 
> 3.4.5.3.  Encoded Text
> 
>    Some text strings hold data that have formats widely used on the
>    Internet, and sometimes those formats can be validated and presented
>    to the application in appropriate form by the decoder.  There are
>    tags for some of these formats.  As with tag numbers 21 to 23, if
>    these tags are applied to an item other than a text string, they
>    apply to all text string data items it contains.
> 
> Is there anything wrong with a generic decoder validating and removing base64 encoding when encountering tag 34?

As long as the application knows this is happening — it may no longer be able to validate what it got on the structural level, though.

> Also see this issue I filed against this text and Table 5.

(Where “issue” was a link to https://github.com/cbor-wg/CBORbis/issues/194 .)
Yes, I think we do have a genuine bug there.  The specific text is from pull request #18, with the change dated 2018-01-29.  Thinking about a good resolution now.

Grüße, Carsten