Re: [MBONED] WGLC for draft-ietf-mboned-dc-deploy

Hi Mike,

Sorry for taking so long.

This draft seems borderline to me, mostly on editorial grounds.  I support moving
forward, but I'd prefer to see some tuning of the text first.

There are some excellent technical insights in here worth publishing, but as an
informational doc I think it needs to be easier to read and give advice that's
less tentative, caveat-filled, and speculative.

Overall, I think it's about 35% too wordy, and as it stands I'm not sure the
people trying to set up their data center will bother reading it deeply enough to
extract its valuable insights, whereas more straightforward prose that gets to the
point would make it useful to them.

I was trying to do a by-section walkthrough with suggestions, but it was taking me
way too long, might never have been finished, and involved too many judgement calls.
So I'll just throw in a few examples of what I'm talking about, and hope the authors
can use them as a guide for where I'd suggest they focus some of their efforts.

Overall, I think this doc should go forward, and provides some value even as-is, but
I think it would be more than twice as useful if the text were revised with an eye toward
being concise and decisive, with a specific target audience in mind.  And so I urge the
authors to consider doing so.

----
Editorial:

1. With respect, this bit from 2.2 reads to me like 3 lines of awful word salad that
would be better said as "Overlays provide":
  "The
   often fervent and arguably partisan debate about the relative merits
   of these overlay technologies belies the fact that, conceptually, it
   may be said that these overlays mainly simply provide"

This is one of the worst examples I saw, but the overwhelming bulk of my editorial
objections are about text that's similar to this.  It's gotta be tighter; nobody I
know can read that kind of stuff for long.  Text like this is the main thing that
I'd like to see changed.

I'm not giving a complete list of detailed examples in this review, but when I said
"35% too wordy overall" in the intro, I mean to suggest that it's probably possible to
say the same thing more effectively by cutting or rephrasing the least essential 35%
of the words.

For the particular snippet above, I was able to suggest about a 93% cut (30 words down
to 2).  Most of the rest of the text is much less severe, but has similar opportunities
distributed liberally throughout, IMO.

2.  Every sentence with "likely" or "future" in it seems speculative, and usually reads
like it's trying to justify why someone would bother reading this doc.

I suggest assuming instead that whoever got as far as trying to read this doc already
strongly suspects they want to roll out multicast in a datacenter, and wants to know
how to do it, what to watch out for, where they have to make tricky choices, and what
the important factors in those choices are.  I think they won't care whether things
looked likely when the doc was first written, and will be annoyed at having to wade
through that kind of speculation.

3. The "widely available" deployment guides and best practices in 3.4 should include
example references, IMO.  Searching for "PIM best practices" gives a bunch of "Project
Information Management" junk.

4. North/South and East/West should get a definition and maybe a reference; I don't
think these terms have a well-established usage in the RFC series yet.  Probably
leaf/spine also.

5. The "Applications" section would be better split into subsections.  It's sort of a
wall of text that changes subjects a lot.

6. I think 4.3 is far too abstract.  Phrases like "enticing possibility" and "novel
algorithms and concepts" elide the problem being discussed to the point that I can't
tell what the section is talking about from reading it.

The reference to [Shabaz19] is a good step in the right direction, but I'd recommend
pulling in some of the references it contains in its "comprehensive overview of other
approaches" and describing the problems they're solving, along with the pros and cons
(especially since an ACM reference comes with a paywall), and trimming most of the
abstract description of the solution space in the first 3 paragraphs.

---
Technical:

Though my feedback is mainly about editorial issues, I'll also suggest adding one new
technical section about gotchas to watch out for.

I don't insist it be added, especially if it's all well-covered in the references for
the deployment guides and best practices mentioned in 3.4, but I thought I'd offer a
few particulars as suggestions to include in such a section.  It's likely there are
some others I haven't encountered, but below are a few of the most obnoxious that have
bitten me or that I've heard of.

I think what ties these together as nasty gotchas is that you think your network is
working fine, but then it suddenly stops and you have to debug it.  I think these are
probably the failure modes that are most important to highlight.

These are the ones I know of offhand:

- it's important to get redundancy in your IGMP/MLD querier setup, because snooping
relies on seeing the membership reports.  It's easy to accidentally end up with traffic
that works for 60 or 120 seconds after the unsolicited report from the initial join and
then stops, because nothing is sending the queries that trigger re-sending of the
report.  Alternatively, when the snooping info expires, traffic starts flooding
everywhere in the layer 2 LAN instead of going only to the members of the joined
groups.  Either one can cause disruptions in service.

- it's important to disable IGMPv2 everywhere if you rely on SSM, because seeing IGMPv2
messages can put the devices on a LAN into compatibility mode.  This can even happen
spontaneously if the right sequence of IGMPv3 messages is dropped, and it can persist
once it happens, since the devices on the LAN then continue sending the v2 messages.
The result can be service disruptions when using PIM-SSM or otherwise relying on SSM
for specific (S,G)s, since the older IGMP versions don't carry the necessary source
info.  (With a reference to section 7 of RFC 3376, and probably similar for MLDv2.)

- there's a failure mode from having too many joined groups to rebuild the membership
state in the RPF tree before the membership expires.  This can cause a persistent
service disruption after a single link failure on an otherwise functional network
that has redundant paths but not a redundant forwarding tree, even if the same network
recovers successfully with fewer groups joined.  So it can be a nasty surprise that
gets worse with the scale of multicast usage, with a threshold that depends on the
timers.  (I raise this more tentatively because it hasn't hit me, but I've heard of it
happening.)
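
To make the timer dependence in the items above a bit more concrete, here is a minimal
sketch in Python.  It is only an illustration under stated assumptions: the defaults
are the RFC 3376 ones (Robustness Variable 2, Query Interval 125 s, Query Response
Interval 10 s), real snooping switches often use their own shorter timeouts (such as
the 60 or 120 seconds mentioned above), and the function names and the fixed
joins-per-second rebuild rate are hypothetical, not taken from any implementation.

```python
def group_membership_interval(robustness=2, query_interval_s=125,
                              query_response_interval_s=10):
    """Group Membership Interval per RFC 3376 section 8 (defaults assumed)."""
    return robustness * query_interval_s + query_response_interval_s


def snooping_entry_alive(now_s, report_times_s, timeout_s):
    """True if a snooping entry refreshed by reports at report_times_s is
    still unexpired at now_s.  Hypothetical model: each report restarts
    the timeout, and no report at all means no entry."""
    seen = [t for t in report_times_s if t <= now_s]
    return bool(seen) and (now_s - max(seen)) < timeout_s


def max_groups_rebuildable(joins_per_second, expiry_s):
    """Rough upper bound on groups whose state can be rebuilt before expiry,
    assuming (hypothetically) a constant rebuild rate after a failure."""
    return int(joins_per_second * expiry_s)


if __name__ == "__main__":
    gmi = group_membership_interval()
    # No querier: only the unsolicited report at t=0 is ever seen.
    print("no querier, t=300:", snooping_entry_alive(300, [0], gmi))
    # Working querier: reports are re-sent roughly every query interval.
    print("querier,    t=300:", snooping_entry_alive(300, [0, 125, 250], gmi))
    # Assumed rebuild rate of 50 joins/second, purely for illustration.
    print("rebuild threshold:", max_groups_rebuildable(50, gmi))
```

In this model the entry survives as long as something keeps triggering reports, and
ages out otherwise; the rebuild threshold scales linearly with both the timer and the
rate at which state can be re-established.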

---

I guess I'll leave it at that in the interest of actually sending a review out this
time (I started and got stuck on this response about 3 times, starting in October).

I hope these comments are helpful, and I do think the doc is worth publishing, though
I'd ideally like to see it become easier to read first.

Thanks and regards,
Jake

On 2/27/20, 3:13 PM, "Mike McBride" <mmcbride7@gmail.com> wrote:

    mboned crew,

    Only one response to the wglc. One more day. These types of drafts are
    what this wg are chartered to produce. Please give it a quick read and
    respond either way. If it's not useful we will drop it. But if you
    find it at all useful please respond so we can finally be done and
    move to iesg.

    thanks!
    mike

    On Thu, Feb 6, 2020 at 12:27 PM Leonard Giuliano
    <lenny=40juniper.net@dmarc.ietf.org> wrote:
    >
    >
    > We would like to begin working group last call on Multicast in the Data
    > Center Overview.  This draft has been recently updated based on feedback
    > from last year's WGLC, where there was some support, but not enough
    > responses to advance the draft.  Please post whether you support/oppose
    > the advancement of the drafts as well as any comments you may have to the
    > list by Feb 28.
    >
    > Most recent version of the draft can be found here:
    > https://datatracker.ietf.org/doc/draft-ietf-mboned-dc-deploy/
    >
    > -Chairs
    >
    > _______________________________________________
    > MBONED mailing list
    > MBONED@ietf.org
    > https://www.ietf.org/mailman/listinfo/mboned
