Re: [v6ops] Interesting problems with using IPv6

> With so little apparent proper evidence collecting and proper troubleshooting and diagnosis, how can the blog author be sure IPv6 MLD with the RA option is the specific cause of their problems?

I'm pretty sure that the Router Alert is a red herring and the real problem
was MLD snooping (i.e. a layer violation by L2 switches). Even so,
the MLD Router Alert issue seems to be a real one, even if it is not the cause in this scenario.

Regards
   Brian

On 15/09/2014 13:03, Mark ZZZ Smith wrote:
> Hi Brian,
> 
> 
> ----- Original Message -----
>> From: Brian E Carpenter <brian.e.carpenter@gmail.com>
>> To: Mark ZZZ Smith <markzzzsmith@yahoo.com.au>
>> Cc: Dale W. Carder <dwcarder@wisc.edu>; IPv6 Operations <v6ops@ietf.org>; "l.wood@surrey.ac.uk" <l.wood@surrey.ac.uk>
>> Sent: Tuesday, 9 September 2014, 12:14
>> Subject: Re: [v6ops] Interesting problems with using IPv6
>>
>> Mark,
>>
>> My point is that it's worth *understanding* such problems and then
>> perhaps writing up operational or implementation recommendations.
> 
> 
> Sure.
> 
> The trouble with that blog post is that it doesn't provide conclusive evidence that the Router Alert option in MLD messages is the only cause of their networking problems.
> 
> The actual blog post doesn't really provide much technical evidence of anything. The evidence it does provide, however, is that they haven't been troubleshooting well, nor maintaining their network very well:
> 
> - they took a 'suck-it-and-see' approach to trying to fix the first issue, and in particular, changed multiple things at once. If the problem then disappears, which change resolved it? Would they now perform disruptive firmware upgrades even though that change may not have been the one that fixed the problem?
> 
> - they suspect that 'bridge loops' are forming and overloading the control plane. Yet they don't say they have spent any time confirming this speculation, or any effort identifying the cause of these 'bridge loops'. They also speculate about what the consequences of these 'bridge loops' are (STP messages being dropped because of control plane overload). They're actively taking actions to remedy this, without any evidence their speculation is true. They're jumping to conclusions.
> 
> With so little apparent proper evidence collecting and proper troubleshooting and diagnosis, how can the blog author be sure IPv6 MLD with the RA option is the specific cause of their problems? Consequently, how can this group then accept that as conclusive and therefore spend time speculating on changes to MLD that may also be completely ineffective on that network - because MLD may not be the cause at all? There isn't even enough detail in that blog post to reproduce the scenario independently - what hosts/host OSes are they using, how many are there, are they running other multicast applications, and so on?
> 
> I think it is possible that the RA option in MLD messages might be one of the causes of, or contributors to, their problems. However, there are many other possible causes that don't seem to have been investigated.
> 
> 
> One reason I'm sceptical about MLD being the cause is that versions of Windows since Vista (released January 2007) have been IPv6-enabled by default. They issue solicited-node MLDv2 reports for their link-local addresses and, if global or ULA prefixes are available, generate privacy addresses and corresponding solicited-node MLDv2 reports for those too. So since at least 2007, many MLD messages have been sent onto many different switched networks. If the RA option in MLDv2 reports were such a problem for switched networks, we should have heard about it by now and seen other networks suffer from it. This problem seems to be unique to this network, which I think is a sign that something else is going on, in addition to the other signs such as these 'bridge loops' and unexplained software and perhaps hardware faults.
> 
> I've proposed using these solicited-node MLD reports to further help mitigate the ND cache DoS (draft-smith-v6ops-mitigate-rtr-dos-mld-slctd-node). People were concerned that hosts weren't sending them, so I've collected some packet captures of an individual host booting on a (virtual) network with a single IPv6 router (OpenWRT 14.07 rc1) announcing a single ULA prefix. If people want to see what is typical, here are the Windows captures (Windows XP after IPv6 was enabled). Note the captures were taken at the router's interface rather than the host's, so they're showing what the router receives from the hosts or sends towards them, rather than what the hosts were sending or receiving.
> 
> http://www.users.on.net/~markachy/windows-xp-sp3-boot-rtr-cap.pcap
> 
> http://www.users.on.net/~markachy/windows-vista-boot-rtr-cap.pcap
> 
> http://www.users.on.net/~markachy/windows-7-boot-rtr-cap.pcap
> 
> http://www.users.on.net/~markachy/windows-8.1-boot-rtr-cap.pcap
> 
> Regards,
> Mark.
> 
> 
>> Exactly what the v6ops charter says we should do, in fact.
>>
>> "1. Solicit input from network operators and users to identify
>> operational issues with the IPv4/IPv6 Internet, and
>> determine solutions or workarounds to those issues. These issues
>> will be documented in Informational or BCP RFCs, or in
>> Internet-Drafts."
>>
>> Or recommend protocol changes if they might help, which is why I
>> want to understand the value of Router Alert in MLD messages for
>> Solicited-Node multicast addresses. Or should we revive
>> draft-pashby-magma-simplify-mld-snooping-01?
>>
>>     Brian
>>
>>
>>
>> On 09/09/2014 13:55, Mark ZZZ Smith wrote:
>>>
>>>
>>>  ----- Original Message -----
>>>>  From: Brian E Carpenter <brian.e.carpenter@gmail.com>
>>>>  To: Dale W. Carder <dwcarder@wisc.edu>
>>>>  Cc: IPv6 Operations <v6ops@ietf.org>; l.wood@surrey.ac.uk
>>>>  Sent: Tuesday, 9 September 2014, 7:59
>>>>  Subject: Re: [v6ops] Interesting problems with using IPv6
>>>>
>>>>  I switched to the relevant list.
>>>>
>>>>  On 09/09/2014 06:33, Dale W. Carder wrote:
>>>>>   Thus spake Brian E Carpenter (brian.e.carpenter@gmail.com) on Mon, Sep 08, 2014 at 07:50:26AM +1200:
>>>>>>   If they really are interesting problems, it might be more to the
>>>>>>   point to analyse them over on v6ops. Given the number of large
>>>>>>   IPv6 deployments that don't have such problems, it seems like
>>>>>>   this particular deployment hit an unfortunate combination of
>>>>>>   implementation issues.
>>>
>>>  If this email is referring to this blog post:
>>>
>>>
>>>  http://blog.bimajority.org/2014/09/05/the-network-nightmare-that-ate-my-week/
>>>
>>>  then I think this particular network was already on the verge of catastrophic failure, and some of IPv6's ND differences were just the trigger to push it over that threshold.
>>>
>>>  According to the blog post, they might have:
>>>
>>>  - hardware errors
>>>  - software errors
>>>  - 'bridge loops' causing the control plane to overload
>>>
>>>  In some of these cases they've taken actions to try to resolve them, however the actions taken seem to be based on speculating what is happening rather than finding evidence to support the hypothesised cause before taking action.
>>>  For example, there isn't any evidence that they've actually determined that 'bridge loops' are occurring, found out what is causing them, and then taken measures to prevent them or at least reduce the chances of them occurring. If a 'bridge loop' can cause the control plane to overload, then adding anything to this network like IPv6 is likely to only make the problem worse.
>>>  I'd like to see them get to the bottom of and then fix these 'bridge loop' and other problems first before any value is placed on addressing their criticisms of IPv6. Other people have deployed IPv6 without these problems, so what is different between this network and everybody else's that doesn't have these sorts of problems?
>>>
>>>
>>>>>>   That is worth understanding (for example,
>>>>>>   how large is the layer 2 network that leads to the MLD listener
>>>>>>   report overload?).
>>>>>   Implementing MLD snooping for Solicited-Node multicast addresses 
>>>>>   is probably a bad idea.
>>>>>
>>>>>   See: draft-pashby-magma-simplify-mld-snooping-01
>>>>  OK, but I would also like to understand why we require
>>>>  MLD messages for a Solicited-Node multicast address to
>>>>  set Router Alert.
>>>>
>>>>      Brian
>>>>
>>>>  _______________________________________________
>>>>  v6ops mailing list
>>>>  v6ops@ietf.org
>>>>  https://www.ietf.org/mailman/listinfo/v6ops
>>>>
>
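
For reference on the solicited-node groups discussed in this thread: under RFC 4291, a host derives one solicited-node multicast address per unicast address by combining the low-order 24 bits of that address with the prefix ff02::1:ff00:0/104, and it reports each distinct group via MLD. The following is a minimal Python sketch of that mapping only. The function name solicited_node_group and the sample addresses are illustrative assumptions, not taken from the blog post or from the packet captures linked above.

```python
import ipaddress

# Fixed 104-bit prefix of all solicited-node multicast addresses (RFC 4291).
SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node_group(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for a unicast IPv6 address.

    Hypothetical helper for illustration: it keeps the low-order 24 bits
    of the unicast address and places them under ff02::1:ff00:0/104.
    """
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


if __name__ == "__main__":
    # Example addresses are made up: one link-local, one ULA-style address.
    for unicast in ("fe80::1", "fd00::abcd:1234:5678"):
        print(unicast, "->", solicited_node_group(unicast))
```

Because a link-local address and a privacy address usually differ in their low-order 24 bits, a host with both would normally join two different solicited-node groups, which is consistent with the separate MLDv2 reports Mark describes for each address.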