[core] John Scudder's Discuss on draft-ietf-core-new-block-11: (with DISCUSS and COMMENT)

John Scudder via Datatracker <noreply@ietf.org> Thu, 06 May 2021 00:41 UTC

From: John Scudder via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-core-new-block@ietf.org, core-chairs@ietf.org, core@ietf.org, marco.tiloca@ri.se
Auto-Submitted: auto-generated
Precedence: bulk
Reply-To: John Scudder <jgs@juniper.net>
Message-ID: <162026169267.30008.8195219304146866165@ietfa.amsl.com>
Date: Wed, 05 May 2021 17:41:32 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/core/yB3vwuumabvv-ZepbU4AW9DP5Y8>

John Scudder has entered the following ballot position for
draft-ietf-core-new-block-11: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-core-new-block/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

For the most part I found this document relatively easy to follow, considering
my complete lack of background in CoAP. However, despite a concerted effort I
have not been able to nail down with confidence what the intended semantics of
several of your timeouts are, notably NON_RECEIVE_TIMEOUT. Some of the text
(for example, §4.4) implies that the timeout is an upper bound on how long an
implementation should wait before declaring a block to have been lost (“The
client SHOULD wait for up to NON_RECEIVE_TIMEOUT”). At the very least, this is
imprecise because the timeout increases exponentially with repeated timeouts —
but this is a relatively minor matter, discussed further in my comments.

Later, in §7.2, you say that expiry of the timeout is not the only trigger for
a 4.08 response:

   It is likely that the client will start transmitting the next set of
   MAX_PAYLOADS payloads before the server times out on waiting for the
   last of the previous MAX_PAYLOADS payloads.  On receipt of the first
   payload from the new set of MAX_PAYLOADS payloads, the server SHOULD
   send a 4.08 (Request Entity Incomplete) Response Code indicating any
   missing payloads from any previous MAX_PAYLOADS payloads.

It makes sense to me that you use this additional trigger. At this point in my
reading of the spec, my understanding of the retransmission algorithm was that
a 4.08 should be sent when either a payload is received from a new set of
MAX_PAYLOADS, or NON_RECEIVE_TIMEOUT expires. But then I got to the example in
10.2.3, which shows the client waiting for the expiration of
NON_RECEIVE_TIMEOUT even though it has received the first of a new set of
MAX_PAYLOADS, and I concluded that either I’ve missed something basic, or the
document is internally inconsistent.

As an aside, I’m also unclear as to why the only trigger you specify for
sending a 4.08 is the arrival of the first of a new MAX_PAYLOADS flight. Other
possible triggers I noticed include a gap in the sequence, and reception of a
payload with More=0.
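
To make the candidate triggers above concrete, here is a minimal Python sketch.
It is an illustration only, not behavior specified by the draft: the
ReceiverState class, the candidate_triggers and missing_before functions, and
the trigger labels are all hypothetical names, and the loss pattern is loosely
modeled on the case raised in comment 13 (blocks 1, 9 and 10 lost, block 11
received). Timer expiry is deliberately left out; only receipt-driven triggers
are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiverState:
    """Hypothetical receiver bookkeeping; not a structure defined by the draft."""
    max_payloads: int = 10
    received: set = field(default_factory=set)


def missing_before(state, block_num):
    """Block numbers below block_num that have not been received."""
    return [n for n in range(block_num) if n not in state.received]


def candidate_triggers(state, block_num, more):
    """Record one payload and return the candidate 4.08 triggers it fires."""
    fired = []
    prev_highest = max(state.received, default=-1)
    this_set = block_num // state.max_payloads
    set_start = this_set * state.max_payloads

    # Trigger quoted from Section 7.2: the first payload of a new MAX_PAYLOADS
    # set arrives while an earlier set still has holes.
    if this_set > prev_highest // state.max_payloads and missing_before(state, set_start):
        fired.append("new-set")

    # Possible extra trigger: the block number jumped past one or more blocks.
    if block_num > prev_highest + 1:
        fired.append("gap")

    # Possible extra trigger: final block (More=0) seen with earlier blocks missing.
    if not more and missing_before(state, block_num):
        fired.append("more=0")

    state.received.add(block_num)
    return fired


state = ReceiverState(max_payloads=10)
for num in [0, 2, 3, 4, 5, 6, 7, 8]:  # blocks 1 and 9 lost in the first set
    candidate_triggers(state, num, more=True)

print(missing_before(state, 11))                 # holes visible when block 11 arrives
print(candidate_triggers(state, 11, more=True))  # block 10 lost, block 11 received
```

In this sketch the arrival of block 11 fires both the new-set trigger and the
gap trigger, and all three holes (1, 9, 10) are already visible to the receiver
at that moment.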

Some of these issues are repeated in my comments below; I’ve flagged the
relevant ones there. Possibly in addressing this DISCUSS we’ll clear up some of
those comments too.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Comments:

(draft-ietf-core-new-block-11)

1. Section 3.2

   This mechanism is not intended for general CoAP usage, and any use
   outside the intended use case should be carefully weighed against the
   loss of interoperability with generic CoAP applications.

I’m curious: is the only reason the mechanism isn’t intended for general usage
that some implementations won’t support it? Or does it have other
deficiencies that also make it unsuitable?

2. Section 4.1

   Q-Block2 Option is useful with GET, POST, PUT, FETCH, PATCH, and
   iPATCH requests and their payload-bearing responses (2.01, 2.02,
   2.03, 2.04, and 2.05) (Section 5.5 of [RFC7252]).

I found the list of codes incomprehensible on first encountering it, since the
concept of response codes hadn’t been introduced yet. I do understand that the
document assumes familiarity with CoAP; nonetheless for basic clarity I think
this should say “(response codes 2.01, 2.02…”. Additionally, the reference to
RFC 7252 §5.5 doesn’t seem to be especially germane?

By the way, is 2.03 indeed a payload-bearing response? The only other place the
spec touches on it is in §4.4, which says “the server could respond with a 2.03
(Valid) response with no payload”.

3. Section 4.1

   To indicate support for Q-Block2 responses, the CoAP client MUST
   include the Q-Block2 Option in a GET or similar request (FETCH, for
   example), the Q-Block2 Option in a PUT or similar request, or the
   Q-Block1 Option in a PUT or similar request so that the server knows
   that the client supports this Q-Block functionality should it need to
   send back a body that spans multiple payloads.  Otherwise, the server
   would use the Block2 Option (if supported) to send back a message
   body that is too large to fit into a single IP packet [RFC7959].

Is this paragraph really supposed to mention both Q-Block2 and Q-Block1? In
particular, I’m confused by the mention of both of these in relation to PUT.

4. Section 4.1

   The Q-Block1 and Q-Block2 Options are unsafe to forward.  That is, a
   CoAP proxy that does not understand the Q-Block1 (or Q-Block2) Option
   MUST reject the request or response that uses either option.

Presumably (hopefully) this is simply describing the behavior of existing
spec-compliant proxies when processing the new messages. As such, is the MUST
appropriate? I would think not.

5. Section 4.3

      body.  Note that the last received payload may not be the one with
      the highest block number.

“Might not” would be less ambiguous than “may not”.

6. Section 4.4 (also two places in §4.3)

(This comment rehashes, in more detail, the difficulty explained in my DISCUSS.
You may want to skip over it until we’ve resolved the DISCUSS, after which this
may, or may not, be relevant.)

   The client SHOULD wait for up to NON_RECEIVE_TIMEOUT (Section 7.2)

I read this as meaning the client should wait for as little as zero, or as long
as NON_RECEIVE_TIMEOUT — that’s my understanding of “up to”. Is that the
intended meaning? If it is, I think it’s worth writing out as I’ve done, for
clarity. If it’s not, it definitely needs to be fixed.

There’s a similar issue with “up to NON_PARTIAL_TIMEOUT” later in the section.

Referring ahead to Section 7.2 muddies the waters further. Even though the text
quoted above says NON_RECEIVE_TIMEOUT is an upper limit on how long to wait,
§7.2 says it’s a lower limit instead... maybe? From §7.2:

   NON_RECEIVE_TIMEOUT is the initial maximum time to wait for a missing

“Maximum”, ok great, that means “upper bound” and so lines up with §4.4
although the “initial” is surprising since §4.4 doesn’t say anything about the
upper limit increasing. It continues:

   payload before requesting retransmission for the first time.  Every
   time the missing payload is re-requested, the time to wait value
   doubles.  The time to wait is calculated as:

      Time-to-Wait = NON_RECEIVE_TIMEOUT * (2 ** (Re-Request-Count - 1))

But this part says it’s (a) an exact time-to-wait, not a “maximum”, and (b) it
says it increases exponentially, so NON_RECEIVE_TIMEOUT isn’t a maximum at all,
but a minimum.
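
A short worked example of the quoted formula may help show why the value reads
as a minimum. This is a sketch under two assumptions that are not taken from
the draft text quoted here: Re-Request-Count starts at 1 for the first wait,
and the 4.0 second value for NON_RECEIVE_TIMEOUT is a placeholder chosen only
for illustration. The function name time_to_wait is likewise hypothetical.

```python
def time_to_wait(non_receive_timeout, re_request_count):
    """Time-to-Wait = NON_RECEIVE_TIMEOUT * (2 ** (Re-Request-Count - 1))."""
    if re_request_count < 1:
        # Assumption: the count is 1 for the first wait, so lower values are invalid.
        raise ValueError("re_request_count must be at least 1")
    return non_receive_timeout * (2 ** (re_request_count - 1))


NON_RECEIVE_TIMEOUT = 4.0  # placeholder value in seconds, for illustration only

for count in range(1, 5):
    print(count, time_to_wait(NON_RECEIVE_TIMEOUT, count))
```

With these assumptions the successive waits are 4, 8, 16 and 32 seconds, so no
wait computed by the formula is ever shorter than NON_RECEIVE_TIMEOUT itself.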

This later text in §7.2 implies that perhaps the problem in the above passages
is the word “maximum”, and it should simply be deleted:

   For the server receiving NON Q-Block1 requests, it SHOULD send back a
   2.31 (Continue) Response Code on receipt of all of the MAX_PAYLOADS
   payloads to prevent the client unnecessarily delaying.  If not all of
   the MAX_PAYLOADS payloads were received, the server SHOULD delay for
   NON_RECEIVE_TIMEOUT (exponentially scaled based on the repeat request
   count for a payload) before sending the 4.08 (Request Entity
   Incomplete) Response Code for the missing payload(s).

Similarly “up to” in the quote that began this comment should be “at least”.

Whether you adopt those suggestions or not, it seems as though all this needs
to be rewritten with careful attention to conveying what the desired behavior
is.

But the plot thickens. Later in §7.2 we have

   It is likely that the client will start transmitting the next set of
   MAX_PAYLOADS payloads before the server times out on waiting for the
   last of the previous MAX_PAYLOADS payloads.  On receipt of the first
   payload from the new set of MAX_PAYLOADS payloads, the server SHOULD
   send a 4.08 (Request Entity Incomplete) Response Code indicating any
   missing payloads from any previous MAX_PAYLOADS payloads.

The point being that the retransmission request can be triggered by an event
other than timer expiration. So in that sense, “maximum” is right — it provides
an upper bound on how long to wait before requesting a retransmission — but in
another sense it’s wrong because the exponential increase is applied to it. I
think the word “maximum” is trying to do too much work, and more words are
probably required in order to make this clear. I also think the problem is
exacerbated by the fact both §4.4 and §7.2 are talking normatively about how to
use NON_RECEIVE_TIMEOUT. It seems as though the main description is found in
§7.2, and some confusion would be avoided by making §4.4 less specific, and
simply referring forward to §7.2.

And, as noted in my DISCUSS, example 10.2.3 muddies the waters still further
since it illustrates yet another behavior.

7. Section 4.4

   The client SHOULD wait for up to NON_RECEIVE_TIMEOUT (Section 7.2)
   after the last received payload for NON payloads before issuing a
   GET, POST, PUT, FETCH, PATCH, or iPATCH request that contains one or
   more Q-Block2 Options that define the missing blocks with the M bit
   unset.  The client MAY set the M bit to request this and later blocks
   from this MAX_PAYLOADS set.  Further considerations related to the
   transmission timing for missing requests are discussed in
   Section 7.2.

I find this whole paragraph pretty confusing with the dueling SHOULD and MAY,
where it appears the SHOULD might be doing two jobs at once. I *think* your
intent is something like the following?

“The client SHOULD wait as specified in Section 7.2 for NON payloads before
requesting retransmission of any missing blocks. Retransmission is requested by
issuing a GET, POST, PUT, FETCH, PATCH, or iPATCH request that contains one or
more Q-Block2 Options that define the missing block(s). Generally the M bit on
the Q-Block option(s) SHOULD be unset, although the M bit MAY be set to request
this and later blocks from this MAX_PAYLOADS set, see Section 10.2.4 for an
example of this in operation.”

8. Section 5

   If the size of the 4.08 (Request Entity Incomplete) response packet
   is larger than that defined by Section 4.6 [RFC7252], then the number
   of missing blocks MUST be limited so that the response can fit into a
   single packet.  If this is the case, then the server can send

Suggestion: “then the number of missing blocks reported MUST...” (The thing
being limited is not the actual number of missing blocks. You’re limiting the
number you report on.)

9. Section 7.1

   It is implementation specific as to whether there should be any
   further requests for missing data as there will have been significant
   transmission failure as individual payloads will have failed after
   MAX_TRANSMIT_SPAN.

This paragraph seems as though it’s a non-sequitur. It just doesn’t make sense
to me. :-(

10. Section 7.2

(This comment relates to the difficulty explained in my DISCUSS. You may want
to skip over it until we’ve resolved the DISCUSS, after which this may, or may
not, be relevant.)

   NON_TIMEOUT is the maximum period of delay between sending sets of
   MAX_PAYLOADS payloads for the same body.  By default, NON_TIMEOUT has
   the same value as ACK_TIMEOUT (Section 4.8 of [RFC7252]).

Presumably the use of “maximum” means it’s fine to delay zero seconds (or any
value lower than NON_TIMEOUT).

11. General

By the way, none of the timers specify jitter (and indeed, if read literally,
jitter would be forbidden). Is this intentional?
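
For comparison, RFC 7252 randomizes its initial retransmission timeout by
picking a value between ACK_TIMEOUT and ACK_TIMEOUT * ACK_RANDOM_FACTOR. The
sketch below shows what applying the same style of jitter to one of this
draft's timers could look like. It is an assumption for illustration, not
something the draft specifies; the function name jittered_timeout, its
parameters, and the 2.0 second base value are hypothetical.

```python
import random


def jittered_timeout(base, random_factor=1.5, rng=random):
    """Pick a timeout uniformly between base and base * random_factor.

    Mirrors the ACK_TIMEOUT / ACK_RANDOM_FACTOR approach of RFC 7252;
    applying it to the timers in this draft is an assumption.
    """
    if random_factor < 1.0:
        raise ValueError("random_factor must be at least 1.0")
    return rng.uniform(base, base * random_factor)


# Example: a 2.0 second base timer with a reproducible generator.
rng = random.Random(2021)
print(jittered_timeout(2.0, rng=rng))
```

A random_factor of 1.0 disables jitter and returns the base value exactly,
which corresponds to the literal reading of the current text.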

12. Section 7.2

   If the CoAP peer reports at least one payload has not arrived for
   each body for at least a 24 hour period and it is known that there
   are no other network issues over that period, then the value of
   MAX_PAYLOADS can be reduced by 1 at a time (to a minimum of 1) and
   the situation re-evaluated for another 24 hour period until there is
   no report of missing payloads under normal operating conditions.  The
   newly derived value for MAX_PAYLOADS should be used for both ends of
   this particular CoAP peer link.  Note that the CoAP peer will not
   know about the MAX_PAYLOADS change until it is reconfigured.  As a
   consequence of the two peers having different MAX_PAYLOADS values, a
   peer may continue indicate that there are some missing payloads as
   all of its MAX_PAYLOADS set may not have arrived.  How the two peer
   values for MAX_PAYLOADS are synchronized is out of the scope.

I take it this is just thrown in here as an operational suggestion? It’s not
specifying protocol, right? It seems a little misplaced, if so.

13. Section 10.1.3

(This comment relates to the aside in my DISCUSS. You may want to skip over it
until we’ve resolved the DISCUSS, after which this may, or may not, be
relevant.)

Why doesn’t the server request 1,9,10 in one go? Since its retransmission
request is triggered by receipt of 11, one would think it could infer 10 had
been lost.

14. Section 10.1.4 (also 10.3.3)

(This comment relates to the aside in my DISCUSS. You may want to skip over it
until we’ve resolved the DISCUSS, after which this may, or may not, be
relevant.)

Why doesn’t reception of a message with More=0 trigger the server to request
retransmission of the missing block? Why does it have to wait for timeout?

15. Section 10.2.3

(This comment relates to my DISCUSS. You may want to skip over it until we’ve
resolved the DISCUSS, after which this may, or may not, be relevant.)

Why doesn’t reception of QB2:10/0/1024 trigger the client to request
retransmission? Why does it have to wait for timeout? Similarly reception of
QB2:9/1/1024 later in the example.

16. Section 10.2.4

Since MAX_PAYLOADS is 10, why does the example say “MAX_PAYLOADS has been
reached” after payloads 2-9 have been retransmitted? That’s only 8 payloads.
