Re: [6lo] Mirja Kühlewind's Discuss on draft-ietf-6lo-fragment-recovery-13: (with DISCUSS and COMMENT)

"Pascal Thubert (pthubert)" <pthubert@cisco.com> Fri, 06 March 2020 18:58 UTC

From: "Pascal Thubert (pthubert)" <pthubert@cisco.com>
To: Mirja Kühlewind <ietf@kuehlewind.net>, The IESG <iesg@ietf.org>, Benjamin Kaduk <kaduk@mit.edu>
CC: "draft-ietf-6lo-fragment-recovery@ietf.org" <draft-ietf-6lo-fragment-recovery@ietf.org>, Carles Gomez <carlesgo@entel.upc.edu>, "6lo-chairs@ietf.org" <6lo-chairs@ietf.org>, "6lo@ietf.org" <6lo@ietf.org>
Date: Fri, 06 Mar 2020 18:57:48 +0000
Message-ID: <MN2PR11MB356540107CC4FC7F9CB9F412D8E30@MN2PR11MB3565.namprd11.prod.outlook.com>
References: <158212059997.17584.9409485384556514167.idtracker@ietfa.amsl.com>
In-Reply-To: <158212059997.17584.9409485384556514167.idtracker@ietfa.amsl.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/6lo/lbTwbRw4GdTMLAiALDcFC9H_CC8>
Subject: Re: [6lo] Mirja Kühlewind's Discuss on draft-ietf-6lo-fragment-recovery-13: (with DISCUSS and COMMENT)

Hello Mirja

A great many thanks for your really deep review; this is both really appreciated and incredibly useful.

If that's OK with you let's make a round to clear the DISCUSSes separately like I did for Benjamin's review.

Also considering the breadth of the discuss alone, I'd rather publish the proposed changes so you and Benjamin can review the changes I made for your DISCUSSes.

Please find the proposed changes discussed below in https://www.ietf.org/rfcdiff?url2=draft-ietf-6lo-fragment-recovery-14 

> 
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
> 
> Thanks for this well written document, however, I have a couple points
> below that need further clarification, all mostly related to congestion
> control. From an editorial point of view most of this is discussed either in the
> intro text of section 6, then some part in 7.1, and some in the appendix C. I
> would really recommend you to instead have a separate section that much
> clearer states what should be done by default (probably no dynamically
> window but a small fixed window with maybe size of 1) and what could be
> don as further optimisation, and also to discuss the parameter/variables
> there before the algorithms are discussed.
> 

A size of 1 is probably not acceptable for the general LLN case, considering the cost of an ack. I'd rather leave that to config.
Otherwise, I agree. What about adding a subsection in section 4 as follows (including changes that cover your comments below):

"
4.3.  Flow Control

   The inter-frame gap is the only protection that [FRAG-FWD] imposes by
   default.  This document enables to group fragments in windows and
   request intermediate acknowledgements so the number of in-flight
   fragments can be bounded.  This document also adds an ECN mechanism
   that can be used to adapt the size of the window, the size of the
   fragments, and/or the inter-frame gap to protect the network.

   This specification enables the source endpoint to apply a flow
   control mechanism to tune those parameters, but the mechanism itself
   is out of scope.  In most cases, the expectation is that most
   datagrams will represent only a few fragments, and that only the last
   fragment will be acknowledged.  A basic implementation of the source
   endpoint is NOT REQUIRED to variate the size of the window, the
   duration of the inter-frame gap or the size of a fragment in the
   middle of the transmission of a datagram, and it MAY ignore the ECN
   signal or simply reset the window to 1 (see Appendix C for more) till
   the end of this datagram upon detecting a congestion.

   The size of the fragments is typically computed from the Link MTU to
   maximize the size of the resulting frames.  The size of the window
   and the duration of the inter-frame gap SHOULD be configurable, to
   roughly adapt the size of the window to the number of hops in an
   average path, and to follow the general recommendations in
   [FRAG-FWD], respectively.
"
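As an illustration only (the flow control mechanism itself stays out of scope), the fallback behavior described in 4.3, where a basic sender resets the window to 1 until the end of the datagram upon a congestion signal, could be sketched as follows. The names `WindowState` and `on_rfrag_ack` are hypothetical and do not come from the draft.

```python
from dataclasses import dataclass


@dataclass
class WindowState:
    """Per-datagram sender state (hypothetical names, not from the draft)."""
    window_size: int               # current Window_Size, in fragments
    congestion_seen: bool = False  # True once an ECN echo was received


def on_rfrag_ack(state: WindowState, ecn_set: bool) -> WindowState:
    """Apply the 'reset to 1 for the rest of this datagram' fallback."""
    if ecn_set or state.congestion_seen:
        # Once congestion is signalled, stay at 1 until the datagram ends.
        return WindowState(window_size=1, congestion_seen=True)
    # No congestion signalled so far: keep the configured window.
    return state


state = WindowState(window_size=8)
state = on_rfrag_ack(state, ecn_set=False)  # no congestion: unchanged
state = on_rfrag_ack(state, ecn_set=True)   # ECN echoed: drop to 1
state = on_rfrag_ack(state, ecn_set=False)  # stays at 1 for this datagram
```

A sender that chooses to ignore the ECN signal, which the text also allows, would simply never call such a handler.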


> And a bit of a provoking question: wouldn't it be easier to just use a reliable
> transport protocol on top?

Just that the classical transports I'm aware of will:
- not support the inter-frame gap, and that's the basic requirement in 6lo
- not be able to vary the inter-frame gap or the fragment size
- use the return path excessively for acks; 6lo is very much about saving energy and bandwidth
- be generally overkill / too complex for an LLN node, see the text we just added

Also this spec enables a flow control mechanism but that mechanism is out of scope. 
It is internal to the sending endpoint and does not affect the interoperability that this specification enables.
A bit like there's an ECN in IP but the behavior belongs to the various transport protocols. 
Just that on top of signaling ECN we provide tools that the flow control mechanism may play with, e.g., the window size.

> If this mechanism is intended to be used over a
> short path with a few hops only (in a local network), I think this should be
> stated more clearly at the beginning of the document. 

This is very true and implicit since we are talking about a contiguous 6LoWPAN route-over mesh.
Propose to tweak the last sentence in the introduction to
"
   This specification provides a method to forward fragments over
   typically a few hops in a route-over 6LoWPAN mesh, and a selective
   acknowledgment to recover individual fragments between 6LoWPAN
   endpoints.  The method is designed to limit congestion loss in the
   network and addresses the requirements that are detailed in
   Appendix B.

"


> In the appendix you state
> this: " In addition, deploying such a mechanism requires
>    that the end-to-end transport is aware of the delivery properties of
>    the underlying LLN,..."
> But I'm not sure what you mean...? Can you further explain?

"Requires" might be exaggerated since TCP was shown to work fine in LLNs. 
But things like the default RTO of 1s are really unsafe for an endpoint in the Internet that communicates with the LLN device (e.g., using HTTP instead of CoAP).
But it's probably better to just remove that text. Instead it is probably good to mention the extra acks on the return path.
For one thing, though, we do not want to discard packets that traversed the LLN to indicate congestion to the source. Ideally things like slow start should be really smooth, and the window size should remain very small to cope with the memory available in the LN node without the need to drop a packet in the LLN.

The end of appendix A becomes:

"

   Mechanisms such as TCP or application-layer segmentation could be
   used to support end-to-end reliable transport.  One option to support
   bulk data transfer over a frame-size-constrained LLN is to set the
   Maximum Segment Size to fit within the link maximum frame size.
   Doing so, however, can add significant header overhead to each
   802.15.4 frame and cause extraneous acknowledgements across the LLN
   compared to the method in this specification.

"

> 
> 1) Sec 6:
> "Upon exhaustion of the retries the
>    sender may either abort the transmission of the datagram or retry the
>    datagram from the first fragment with an 'X' flag set in order to
>    reestablish a path and discover which fragments were received over
>    the old path in the acknowledgment bitmap. "
> I'm not sure about this "or". Why should the first fragment be more
> successful than any other which requests an ACK? Also if you really want to
> keep this condition, you need to specify it better. How often do you retry? I
> guess you need to set the PTO again...? Further the RTO should also
> implement an exponential back-off.

The first fragment draws a new path, so it may avoid the problem, though there is no guarantee.
Once the new path is established, the next fragments will follow it, and the segments of the old path that are no longer used time out.

Proposed updated text:

"
   This automatic repeat request (ARQ) process MUST be protected by a
   Retransmission Time Out (RTO) timer, and the fragment that carries
   the 'X' flag MAY be retried upon a time out for a configurable number
   of times (see Section 7.1) with an exponential backoff.  Upon
   exhaustion of the retries the sender may either abort the
   transmission of the datagram or resend the first fragment with an 'X'
   flag set in order to establish a new path for the datagram and obtain
   the list of fragments that were received over the old path in the
   acknowledgment bitmap.

"
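To make the retry behavior in the proposed text concrete, here is a minimal sketch of an RTO with exponential backoff. The doubling factor, the function names and the `send` / `wait_for_ack` callbacks are assumptions made for the example; the draft only asks for a configurable number of retries with an exponential backoff.

```python
def rto_schedule(initial_rto, max_rto, max_retries, factor=2.0):
    """Return the time-out for the first attempt and for each retry."""
    timeouts = []
    rto = initial_rto
    for _ in range(max_retries + 1):
        # Cap the backed-off value at the configured maximum.
        timeouts.append(min(rto, max_rto))
        rto *= factor
    return timeouts


def send_with_retries(send, wait_for_ack, initial_rto, max_rto, max_retries):
    """Send the fragment carrying the 'X' flag until acked or retries run out.

    `send()` transmits the fragment; `wait_for_ack(timeout)` returns True if
    an RFRAG Acknowledgment arrived within `timeout` seconds. Both are
    hypothetical callbacks supplied by the caller.
    """
    for timeout in rto_schedule(initial_rto, max_rto, max_retries):
        send()
        if wait_for_ack(timeout):
            return True
    # Retries exhausted: the caller may abort the datagram or resend the
    # first fragment with the 'X' flag to establish a new path.
    return False


print(rto_schedule(1.0, 5.0, 3))  # [1.0, 2.0, 4.0, 5.0]
```

What happens after `send_with_retries` returns False is deliberately left to the caller, since the text lets the sender choose between aborting and re-establishing a path.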


> 
> 2) sec 6.3:
> "Upon an acknowledgment with a NULL bitmap, the sender endpoint
>    MUST abort the transmission of the fragmented datagram with one
>    exception: In the particular case of the first fragment, it MAY
>    decide to retry via an alternate next hop instead."
> What's mean with "In the particular case of the first fragment"? And does
> this mean it should retry only with the first fragment or the whole
> transmission. 
> However, if this signal is from the receiving endpoint why should that
> endpoint change it mind only if a different path is used? If the assumption is
> that this NULL bitmap is sent by an intermediate node? However, then it
> would make sense to  rather signal this information explicitly (e.g. using a
> flag).

This is also linked to the fact that the first fragment draws the path as in the case above. As you figured, the expectation is that a node in the middle experiences an issue and cannot do the FF operation for that datagram, so it aborts with a NULL bitmap.

Yes, the problem could be in the receiving endpoint in which case rerouting does not help. But then, it is probably temporary, e.g., if the receiving endpoint has a single reassembly buffer, which is quite common, and is already receiving a datagram from another source. There is a variety of use cases and which is most probable depends on the use case. 

So let the source endpoint decide. 



> 3) Sec 7.1 (and to some extend sec 6)
> "   OptWindowSize:  The OptWindowSize is the value for the Window_Size
>       that the sender should use to start with.  It is greater than or
>       equal to MinWindowSize.  It is less than or equal to
>       MaxWindowSize.  The Window_Size should be maintained below the
>       number of hops in the path of the fragment to avoid stacking
>       fragments at the bottleneck on the path.  If an inter-frame gap is
>       used to avoid interference between fragments then the Window_Size
>       should be at most on the order of the estimation of the trip time
>       divided by the inter-frame gap."
> This needs normative language and more explanation. 

Well, this was not intended to be normative but just a rule of thumb. Some (many, I expect) people will want to ack only the last fragment, and that's a tradeoff between the cost of the ack back and the chances of congestion loss.


> I recommend to even
> say that if no congestion control (as discussed in the appendix) is applied, the
> Window MUST be set to 1. 

This makes full sense in other cases but is too expensive here. The default that people go for is a single ack at the end. Note that this spec compares to the state of the art in RFC 4944, where all the fragments are pushed into the network like hot potatoes, without any feedback. Apparently that did not work too well in some cases, thus this work. But people still love the fact that there's no traffic back, and that is why the original work (https://tools.ietf.org/html/draft-thubert-6lowpan-simple-fragment-recovery-01) was split into 3 docs: this one, the overarching minimal-fragments, and the LWIG draft that forwards fragments in the RFC 4944 format with no ack.



> Further, the assumption that the window can or
> should be set to (at maximum) the number of hop does seem correctly to
> me. No matter how many hops there are packets are only queued at the
> bottleneck (the link where the current rate is smaller than the sending rate)
> and it depends on the sending rate of the bottleneck link how many packets
> need to be queued. This is completely independent of the number of hops.

The rationale here comes from the inter-frame gap. In normal conditions, it ensures that a fragment progresses before the next one comes in. So there's at most one fragment per hop, and there's no point having a window bigger than that. There's less than that actually, because frag 0 reaches node 2 before node 0 can send frag 1 to node 1, so we could divide the recommendation by 2. But then we need to keep the network busy while the ack comes back.

I agree that this does not provide the optimal window but rather a reasonable upper bound. If there was no inter-frame gap I'd say (please correct me) that the lower bound to keep the bottleneck busy (in fragments, not bytes) is (Bottleneck Speed / Bottleneck MTU) * RTT. The average low-power network is mostly sleeping and does not experience congestion in normal operation, so there is usually no bottleneck. If the LLN is homogeneous we get a lower bound of (PHY Speed / MTU) * RTT. But then, some people could be conservative and use a window of 1 as you recommend.
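As a worked example of the (PHY Speed / MTU) * RTT bound, here is a small sketch. The function name is hypothetical; the 250 kbit/s rate and 127-byte frame are the usual IEEE 802.15.4 2.4 GHz figures, and the 100 ms RTT is purely an assumed value for illustration. The speed is taken in bits per second and the MTU in bytes, so the code converts between the two.

```python
def window_lower_bound(speed_bps, mtu_bytes, rtt_s):
    """Fragments in flight needed to keep the bottleneck busy for one RTT."""
    # Frames per second the link can carry, assuming full-size frames.
    frames_per_second = speed_bps / (mtu_bytes * 8)
    return frames_per_second * rtt_s


# Assumed values: 250 kbit/s PHY, 127-byte frames, 100 ms RTT.
bound = window_lower_bound(250_000, 127, 0.1)
print(round(bound, 1))  # 24.6
```

The result ignores the inter-frame gap, MAC overhead and duty cycling, so it is an idealized figure rather than a recommendation.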

All in all it appears that the text creates more confusion than help, and dives into the sender flow control, which is out of scope.
I'd rather remove the recommendation on the runtime window size altogether.

So we'd get:
"
   OptWindowSize:  The OptWindowSize is the value for the Window_Size
      that the sender should use to start with.  It is greater than or
      equal to MinWindowSize.  It is less than or equal to
      MaxWindowSize.  A rule of a thumb for OptWindowSize could be an
      estimation of the trip time divided by the inter-frame gap to keep
      the network busy.
"
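A minimal sketch of that rule of thumb, for illustration only: the estimate is the trip time divided by the inter-frame gap, clamped between MinWindowSize and MaxWindowSize. The function name and the millisecond units are assumptions for the example; the default upper bound of 32 reflects the size of the acknowledgment bitmap mentioned later in this message.

```python
def opt_window_size(trip_time_ms, inter_frame_gap_ms, min_window=1, max_window=32):
    """Estimate OptWindowSize as trip time / inter-frame gap, within bounds."""
    if inter_frame_gap_ms <= 0:
        raise ValueError("inter-frame gap must be positive")
    estimate = round(trip_time_ms / inter_frame_gap_ms)
    # Keep the estimate between MinWindowSize and MaxWindowSize.
    return max(min_window, min(max_window, estimate))


print(opt_window_size(200, 50))   # 4
print(opt_window_size(2000, 50))  # 32 (clamped to the maximum)
print(opt_window_size(10, 50))    # 1 (clamped to the minimum)
```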

> Further, even if that would be true, as long as this document does not discuss
> also away to estimate or know the number of hops, this advise would
> unfortunately be useless... 

Yes, better remove it

> Further I don't think pointing to rfc6298 for RTT
> calculation is sufficient (as done in the appendix). rfc6298 assume frequent
> ACKs and a reasonably large window, which is both not the case here. All in

TCP has been successfully used on LLNs, though, so it cannot be that bad a recommendation.
Note that there are probably fewer fragments than hops, so there's probably no chance to even measure the RTT before all fragments are out, and if there is, not many chances to update the initial reading before the datagram is fully sent. I agree there may be better ways, so we need to remove the RECOMMENDED. What about:

"
                                                        For the lack of a more adapted
   technique, the method detailed in "Computing TCP's Retransmission
   Timer" [RFC6298] may be used for that computation. 

"

Earlier in 6.0 I also suggest to change:

"
                                                                                                      The sender
   protects the transmission over the LLN mesh with a retry timer that
   is configured for a use case and may be adapted dynamically, e.g.,
   according to the method detailed in [RFC6298].  

"


> all, any window adjustments itself are not described at all. What should be
> done when a congestion marking is received? How does the window need to
> be adjusted based on an RTO? When should the window be increased again?
> And how much?

This is out of scope.

The goal of the draft is to specify what goes over the air to recover fragments. The flow control operation is a decision internal to the sender endpoint, and can be adapted to the use case without causing an interoperation issue with the other endpoints. We do not have enough experience to enforce something, and there can be very different use cases and variations, so we only provide non-normative hints.

We want to allow implementations to try their own stuff, including slow start and fast recovery for a device that can afford it in a use case that justifies it. I hope we'll see future spec(s) that specify flow control mechanisms but, as I said earlier, this is out of scope here; we just provide the controls.

Following your earlier recommendation, we could suggest that upon an ECN we set W=1 and stay there for the rest of that datagram, as a rule of thumb in the absence of a more intelligent / adapted flow control operation.

To clarify let me change
"
4.  Extending draft-ietf-6lo-minimal-fragment

   This specification implements the generic FF technique defined in
   "LLN Minimal Fragment Forwarding" [FRAG-FWD], provides end-to-end
   fragment recovery and mechanisms that can be used for flow control.


"

Also see the new section 4.3


> 
> 4) Sec 7.1.: Inline with the TSV-ART review (Thanks Collin!), the parameters
> need more guidance. Especially for he number of retries it should be possible
> to recommend a default value (e.g. 3) and it would be good to also give an
> upper limits (MUST NOT be larger than X). Similar for the window size: there
> should be also at least a default value (see comment above). And further the
> RTO needs further explanation about how to find a reasonable value. If the
> RTO is configured (and not estimated dynamically) e.g. it could be set to 3x
> the maximum expected RTT in the respective network. And it would be even
> better to provide a minimum default (initial) value. Not that TCP is also
> designed to work on a large variety of timescales and a minimum initial value
> of 1s is seen as safe for all Internet scenarios. It's really important to also
> provide some recommendations like this here.

Makes sense. 

The number of retries is really bounded by the upper layer protocol. 

It is actually the time allowed to transfer the datagram that must remain below whatever the upper layer protocol expects.
We could give a rule of thumb, and yes, your 3 looks good.

The window size can be anything; I expect many will only ask for an ack on the last fragment.
But by construction it is bounded to 32 by the bitmap.

All in all we get:

"
   MinWindowSize:  The minimum value of Window_Size that the sender can
      use.  A value of 1 is RECOMMENDED.

   OptWindowSize:  The OptWindowSize is the value for the Window_Size
      that the sender should use to start with.  It is greater than or
      equal to MinWindowSize.  It is less than or equal to
      MaxWindowSize.  A rule of a thumb for OptWindowSize could be an
      estimation of the one-way trip time divided by the inter-frame
      gap.  If the acknowledgement back is too costly, it is possible to
      set this to 32, meaning that only the last Fragment is
      acknowledged in the first round.

   MaxWindowSize:  The maximum value of Window_Size that the sender can
      use.  The value MUST be strictly less than 33.

   An implementation may perform its estimate of the RTO or use a
   configured one.  The ARQ process is controlled by the following
   parameters:

   MinARQTimeOut:  The minimum amount of time a node should wait for an
      RFRAG Acknowledgment before it takes the next action.  It MUST be
      more than the maximum expected round-trip time in the respective
      network.

   OptARQTimeOut:  The initial value of the RTO, which is the amount of
      time that a sender should wait for an RFRAG Acknowledgment before
      it takes the next action.  It is greater than or equal to
      MinARQTimeOut.  It is less than or equal to MaxARQTimeOut.  See
      Appendix C for recommendations on computing the round-trip time.
      By default a value of 3 times the maximum expected round-trip
      time in the respective network is RECOMMENDED.

   MaxARQTimeOut:  The maximum amount of time a node should wait for the
      RFRAG Acknowledgment before it takes the next action.  It must
      cover the longest expected round-trip time, and be several times
      less than the time-out that covers the recomposition buffer at the
      receiver, which is typically on the order of the minute.  An upper
      bound can be estimated to ensure that the datagram is either fully
      transmitted or dropped before an upper layer decides to retry it.

   MaxFragRetries:  The maximum number of retries for a particular
      fragment.  A default value of 3 is RECOMMENDED.  An upper bound
      can be estimated to ensure that the datagram is either fully
      transmitted or dropped before an upper layer decides to retry it.

   MaxDatagramRetries:  The maximum number of retries from scratch for a
      particular datagram.  A default value of 1 is RECOMMENDED.  An
      upper bound can be estimated to ensure that the datagram is either
      fully transmitted or dropped before an upper layer decides to
      retry it.




"
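For readers who prefer to see the constraints side by side, the bounds in the proposed parameter text could be checked as in the sketch below. This is only an illustration: the class and function names are hypothetical, the 10% margin used for MinARQTimeOut is an assumption (the text only requires it to exceed the maximum expected round-trip time), and the OptWindowSize default of 1 is a placeholder rather than a proposed value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryParams:
    # Time-outs have no default here: they depend on the deployment.
    min_arq_timeout_s: float
    opt_arq_timeout_s: float
    max_arq_timeout_s: float
    min_window_size: int = 1       # RECOMMENDED value in the proposed text
    opt_window_size: int = 1       # placeholder, not a proposed default
    max_window_size: int = 32      # MUST be strictly less than 33
    max_frag_retries: int = 3      # RECOMMENDED default
    max_datagram_retries: int = 1  # RECOMMENDED default

    def validate(self, max_expected_rtt_s: float) -> None:
        """Raise ValueError if a bound from the proposed text is violated."""
        if self.max_window_size >= 33:
            raise ValueError("MaxWindowSize must be strictly less than 33")
        if not (self.min_window_size <= self.opt_window_size
                <= self.max_window_size):
            raise ValueError("need MinWindowSize <= OptWindowSize <= MaxWindowSize")
        if self.min_arq_timeout_s <= max_expected_rtt_s:
            raise ValueError("MinARQTimeOut must exceed the maximum expected RTT")
        if not (self.min_arq_timeout_s <= self.opt_arq_timeout_s
                <= self.max_arq_timeout_s):
            raise ValueError("need MinARQTimeOut <= OptARQTimeOut <= MaxARQTimeOut")


def params_for(max_expected_rtt_s: float, max_arq_timeout_s: float) -> RecoveryParams:
    """Build parameters using the RECOMMENDED 3 x max RTT initial time-out."""
    params = RecoveryParams(
        min_arq_timeout_s=1.1 * max_expected_rtt_s,  # assumption: 10% margin
        opt_arq_timeout_s=3.0 * max_expected_rtt_s,
        max_arq_timeout_s=max_arq_timeout_s,
    )
    params.validate(max_expected_rtt_s)
    return params
```

For example, `params_for(1.0, 10.0)` yields an initial RTO of 3 seconds with the default retry counts, while a configuration with `max_window_size=33` is rejected by `validate`.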

> 
> 5) Sec 7.2:
> "The management system should monitor the number of retries and of ECN
>    settings that can be observed from the perspective of both the sender
>    and the receiver, and may tune the optimum size of Fragment_Size and
>    of Window_Size, OptFragmentSize, and OptWindowSize, respectively, at
>    the sender."
> This does not see seem correct, as OptFragmentSize and OptWindowSize are
> the initial values which are configured and therefore should not be changed
> dynamically. Only Fragment_Size and Window_Size are changes. Further the
> network should also normatively state somewhere that Fragment_Size and
> Window_Size MUST not grow above the configured max value. That seems
> obvious but it's better to be explicit and use normative language respectively.

This is meant to change the starting values to be applied to the next datagrams.
Note that talking to the management system can take a very long time. 
We do not expect the kind of reactivity that would affect the current datagram.

Proposed changes:
"
7.1.  Protocol Parameters

   The management system SHOULD be capable of providing the parameters
   listed in this section and an implementation MUST abide by those
   parameters and in particular never exceed the minimum and maximum
   configured boundaries.

"

And

"
7.2.  Observing the network

   The management system should monitor the number of retries and of ECN
   settings that can be observed from the perspective of both the sender
   and the receiver with regards to the other endpoint.  It may then
   tune the optimum size of Fragment_Size and of Window_Size,
   OptFragmentSize, and OptWindowSize, respectively, at the sender
   towards a particular receiver, applicable to the next datagrams.
"



> 6) Further sec 7.2 says:
> "The inter-frame gap is another tool that can be
>    used to increase the spacing between fragments of the same datagram
>    and reduce the ratio of time when a particular intermediate node
>    holds a fragment of that datagram."
> However, inter-frame gap is a configuration parameter and this is the first
> time that adapting it dynamically is mentioned here. If you want to adapt it
> dynamically you need to add more information.

This is now discussed in 4.3, though not in great detail, since the flow control mechanism is out of scope.

Again many thanks!

Pascal
