Re: [Tsv-art] Tsvart telechat review of draft-ietf-pim-source-discovery-bsr-08

Stewart Bryant <stewart.bryant@gmail.com> Wed, 24 January 2018 16:45 UTC

To: "Black, David" <David.Black@dell.com>, Stig Venaas <stig@venaas.com>
Cc: "tsv-art@ietf.org" <tsv-art@ietf.org>, "ietf@ietf.org" <ietf@ietf.org>, "pim@ietf.org" <pim@ietf.org>, "draft-ietf-pim-source-discovery-bsr.all@ietf.org" <draft-ietf-pim-source-discovery-bsr.all@ietf.org>
References: <151675081688.15722.801207813861297527@ietfa.amsl.com> <CAHANBt+a1eoMGNJs5tKvNOtLKKBbW5CHE00ZaUGS62OP5goviw@mail.gmail.com> <CE03DB3D7B45C245BCA0D243277949362FFAB39B@MX307CL04.corp.emc.com>
From: Stewart Bryant <stewart.bryant@gmail.com>
Message-ID: <622506ad-8fb6-f194-3b70-403c26f67d02@gmail.com>
Date: Wed, 24 Jan 2018 16:45:08 +0000
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2
MIME-Version: 1.0
In-Reply-To: <CE03DB3D7B45C245BCA0D243277949362FFAB39B@MX307CL04.corp.emc.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Content-Language: en-GB
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsv-art/RZNCDnMHEJ39lTIRGf814cNxV-g>
Subject: Re: [Tsv-art] Tsvart telechat review of draft-ietf-pim-source-discovery-bsr-08
Precedence: list

The problem with complex processing under error conditions is that this 
is where all the software bugs hang out: such code paths are hard to test 
and the bugs don't show up until you hit the problem they are trying to fix.

This is a case where you want the simplest possible process, such as a 
small burst followed by your 60s interval, which seems unlikely to stress 
any sensibly designed implementation on a reasonably sized network.
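
A minimal sketch of the "small burst followed by the 60s interval" idea, for illustration only. The class name, the burst size of 3, and the method name are assumptions made for this example; only the 60 second interval comes from the thread.

```python
from dataclasses import dataclass


@dataclass
class BurstThenInterval:
    """Allow a few triggered messages at once, then defer to the periodic slot."""

    burst_size: int = 3      # assumed value; the thread does not specify one
    interval: float = 60.0   # periodic sending interval in seconds
    _window_start: float = float("-inf")
    _sent_in_window: int = 0

    def next_send_time(self, now: float) -> float:
        """Return when a message triggered at `now` may be sent."""
        if now - self._window_start >= self.interval:
            # A full interval has passed: start a new window and
            # restore the burst allowance.
            self._window_start = now
            self._sent_in_window = 0
        if self._sent_in_window < self.burst_size:
            self._sent_in_window += 1
            return now
        # Burst exhausted: the change waits for the next periodic message.
        return self._window_start + self.interval
```

With `burst_size=2`, triggers at t=0 and t=1 go out immediately, and a trigger at t=2 is deferred to t=60. There is no state beyond one counter and one timestamp, which keeps the error path easy to test.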

- Stewart


On 24/01/2018 16:30, Black, David wrote:
> Hi Stig,
>
>> I agree with all you wrote and will update the document. However,
>> there is one slight issue with the minimum time between origination of
>> each message. When a new source is detected, we would like to
>> originate a message ASAP so that receivers can start receiving the
>> multicast without much delay. A 10s delay would be a rather long time
>> if a source was detected right after the previous message was
>> originated. I think some delay would be warranted though, in
>> particular in a case where perhaps a router starts up and a large
>> number of directly connected sources could be detected within a short
>> time frame. I think an exponential back-off could make sense here.
>> E.g., if it is just one new source, maybe trigger a message ASAP. If a
>> new source is detected right after the previous one, wait a bit
>> longer, which also allows for aggregation of multiple sources in one
>> message if several are detected later. In extreme cases one could
>> over time keep increasing the delay until the next update.
>> If sufficient we could maybe have a fixed minimum delay of 1s or not,
>> but that is probably too short in those extreme cases. Hence maybe an
>> exponential back-off.
> Exponential back-off sounds like a very good idea - I'd suggest adding something starting from RFC 5059's back-off functionality.
>
>> I would appreciate some further guidance on what you think is reasonable
>> here, and perhaps whether I can borrow something here from other
>> protocols/drafts. Part of the experiment here might be to find out
>> what minimum values, or how rapid back-off, is needed based on the
>> size of the network, the number of sources, the types of links etc.
> In addition to burst scenarios (e.g., router starts up, lots of new sources detected quickly as a result), I strongly suggest thinking about chaos scenarios where links and/or routers are coming and going so rapidly that the source population is in a constant state of flux. If things are really bad, the best thing to do may be to shut up and hope that the chaos settles out, as not much useful will happen until it does, and sending messages about observed changes risks making things worse. Again, exponential back-off makes sense, possibly quite aggressive, e.g., back off from 10 seconds by a small factor a few times, and if things still look bad, wait at least a minute or two with further back-off from that longer time until things stabilize. This needs more thought on how to adjust the back-off factor, as that off-the-top-of-my-head example probably exhibits peculiar behavior in scenarios that are just on the edge of tripping the long delay. Some thinking about what stability means and how to get there may help in figuring out the relative merits and applicability of backing off further vs. some kind of dramatic reset, analogous to TCP's congestion window reset on timeout.
>
> As this is intended to be an experimental RFC, I don’t think a completely worked-out solution is expected or required - a good discussion of the problems and explanation of areas that need investigation as part of the experiment ought to suffice, as suggested in the last sentence quoted above.  I would add some initial exponential back-off functionality as a starting point.
>
>> Also note that the general mechanism can be used for many types of
>> information. It depends on the information how urgent it is to
>> distribute it. Source discovery in particular is fairly urgent.
> And that should be discussed, perhaps in Section 3 somewhere.
>
> Thanks, --David
>
>
>> -----Original Message-----
>> From: Stig Venaas [mailto:stig@venaas.com]
>> Sent: Tuesday, January 23, 2018 7:44 PM
>> To: Black, David <david.black@emc.com>
>> Cc: tsv-art@ietf.org; draft-ietf-pim-source-discovery-bsr.all@ietf.org;
>> ietf@ietf.org; pim@ietf.org
>> Subject: Re: Tsvart telechat review of draft-ietf-pim-source-discovery-bsr-08
>>
>> Hi, thanks for the great comments.
>>
>> I agree with all you wrote and will update the document. However,
>> there is one slight issue with the minimum time between origination of
>> each message. When a new source is detected, we would like to
>> originate a message ASAP so that receivers can start receiving the
>> multicast without much delay. A 10s delay would be a rather long time
>> if a source was detected right after the previous message was
>> originated. I think some delay would be warranted though, in
>> particular in a case where perhaps a router starts up and a large
>> number of directly connected sources could be detected within a short
>> time frame. I think an exponential back-off could make sense here.
>> E.g., if it is just one new source, maybe trigger a message ASAP. If a
>> new source is detected right after the previous one, wait a bit
>> longer, which also allows for aggregation of multiple sources in one
>> message if several are detected later. In extreme cases one could
>> over time keep increasing the delay until the next update.
>> If sufficient we could maybe have a fixed minimum delay of 1s or not,
>> but that is probably too short in those extreme cases. Hence maybe an
>> exponential back-off.
>>
>> I would appreciate some further guidance on what you think is reasonable
>> here, and perhaps whether I can borrow something here from other
>> protocols/drafts. Part of the experiment here might be to find out
>> what minimum values, or how rapid back-off, is needed based on the
>> size of the network, the number of sources, the types of links etc.
>>
>> Also note that the general mechanism can be used for many types of
>> information. It depends on the information how urgent it is to
>> distribute it. Source discovery in particular is fairly urgent.
>>
>> Stig
>>
>>
>> On Tue, Jan 23, 2018 at 3:40 PM, David Black <david.black@dell.com> wrote:
>>> Reviewer: David Black
>>> Review result: Ready with Issues
>>>
>>> I've reviewed this document as part of TSV-ART's ongoing effort to review key
>>> IETF documents. These comments were written primarily for the transport area
>>> directors, but are copied to the document's authors for their information and
>>> to allow them to address any issues raised.  When done at the time of IETF Last
>>> Call, the authors should consider this review together with any other last-call
>>> comments they receive. Please always CC tsv-art@ietf.org if you reply to or
>>> forward this review.
>>>
>>> This draft describes an experimental PIM Flooding Mechanism (PFM) for
>>> flooding PIM information among multicast routers that is a generalized form of
>>> the RFC 5059 PIM BSR (BootStrap Router) mechanism, and applies this mechanism
>>> to distribution of source group mappings (PFM-SD).
>>>
>>> Early implementation experience with PFM-SD on low bandwidth radio links
>>> (described in Section 2) suggests that the mechanism is able to work better than
>>> PIM-SM without starving other traffic in the fashion that PIM-DM may. This is
>>> promising and (in this reviewer's opinion) justifies experimentation at larger
>>> scale and in other network environments.  In general, this is a well-written
>>> document and the authors should be commended for including the "running code"
>>> implementation experience report in Section 2.
>>>
>>> Flooding mechanisms are very useful, but the time periods that govern sending
>>> of flooding messages are crucial to avoid excessive consumption of network
>>> resources.  Section 5 of RFC 5059 has a solid discussion of the time periods
>>> that apply to use of flooding by the BSR mechanism.   The discussion in this
>>> draft is somewhat weaker, raising a couple of minor issues:
>>>
>>> 1) For PFM-SD, Section 4.2 provides a reasonable discussion of time periods
>>> that apply, but appears to be missing a minimum time period between sending
>>> messages.   Section 5 of RFC 5059 recommends a default of 10 seconds for that
>>> minimum time period by comparison to a default PIM BSR sending interval of 60
>>> seconds.  That 10 second minimum default should be added to this draft, as the
>>> same default sending interval of 60 seconds is used.
>>>
>>> 2) For future use of PFM for other purposes, Section 3.3 provides the following
>>> guidance:
>>>
>>>     Each TLV definition will need to define when a triggered PFM message needs
>>>     to be originated, and also whether to send periodic messages, and how
>>>     frequent.
>>>
>>> That guidance is correct as far as it goes, but it's not particularly helpful
>>> to future protocol designers.   Text should be added to at least point to the
>>> examples in section 4.2 of this draft and/or part of Section 5 of RFC 5059 to
>>> suggest the sorts of values that have proven to be workable, and perhaps also
>>> strongly encourage (SHOULD use) a default minimum time between messages of at
>>> least 10 seconds.
>>>
>>> Understanding this draft requires that the reader be familiar with multicast
>>> and PIM, which is reasonable.  In addition, an understanding of PIM BSR is also
>>> required, which is perhaps somewhat less reasonable.  An example that this
>>> reviewer tripped over is that Section 3 of this draft states that "Like BSR,
>>> messages are forwarded hop by hop."  There is no further explanation or
>>> definition of "forwarded hop by hop," making it necessary to consult RFC 5059
>>> to understand that term, e.g., this has nothing to do with IPv6 hop-by-hop
>>> options.  A sentence or two of explanation of this hop by hop forwarding
>>> concept ought to be copied and adapted from RFC 5059, and it would be good to
>>> check for other concepts that rely on RFC 5059 for definitions.
>>>
>>>
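
The exponential back-off discussed in the quoted thread (first triggered message sent ASAP, each closely following trigger delayed longer, reset once things settle) might be sketched roughly as follows. This is an illustration, not text from the draft or from RFC 5059: the class name, the 1 second initial delay, the factor of 2, and the 60 second cap used as the quiet period are all assumed values.

```python
class TriggeredBackoff:
    """Schedule triggered messages with exponentially growing spacing."""

    def __init__(self, initial=1.0, factor=2.0, maximum=60.0):
        # All three defaults are assumptions for this sketch.
        self.initial = initial
        self.factor = factor
        self.maximum = maximum
        self.delay = 0.0        # 0.0 means the next trigger goes out at once
        self.last_sent = None   # time of the most recently scheduled message

    def schedule(self, now):
        """Return the time at which a message triggered at `now` may be sent."""
        if self.last_sent is None or now - self.last_sent >= self.maximum:
            # Nothing sent yet, or the network has been quiet for a full
            # maximum period: reset and send immediately.
            self.delay = 0.0
            send_at = now
        else:
            # A trigger soon after the previous message: lengthen the delay.
            if self.delay == 0.0:
                self.delay = self.initial
            else:
                self.delay = min(self.delay * self.factor, self.maximum)
            send_at = max(now, self.last_sent + self.delay)
        self.last_sent = send_at
        return send_at
```

Triggers at t=0, 0.1, 0.2 and 0.3 are scheduled at t=0, 1, 3 and 7, and a trigger at t=100 is sent immediately because the quiet period has elapsed. A real implementation would presumably aggregate all sources detected during a pending delay into the one message sent when it expires, rather than scheduling one message per trigger.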
