Re: IPv6 Anycast has been killed by LINUX patch in 2016 - who cares?

Toerless Eckert <tte@cs.fau.de> Sat, 07 August 2021 01:47 UTC

Date: Sat, 07 Aug 2021 03:47:30 +0200
From: Toerless Eckert <tte@cs.fau.de>
To: Brian Carpenter <brian.e.carpenter@gmail.com>
Cc: Michael Tuexen <michael.tuexen@lurchi.franken.de>, IETF discussion list <ietf@ietf.org>, ipv6@ietf.org, vasilenko.eduard@huawei.com
Subject: Re: IPv6 Anycast has been killed by LINUX patch in 2016 - who cares?
Message-ID: <20210807014730.GA28901@faui48f.informatik.uni-erlangen.de>
References: <db8c1a5534e9412ebcfa37682d75f862@huawei.com> <C23D7023-B5B7-47C6-8AC5-65A98822A724@lurchi.franken.de> <CANMZLAZGawUjRhSSE_rA8AyqMx=mx1WFeJ_tZq0KVEXJd2XBfQ@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ipv6/DD-9hQP3VGPXkOoA9EkG_hBCm_o>

[bitching]
I apologize for attempting to respond to the original post topic instead
of derailing the thread into my pet side topic without changing subject,
which seems to be expected behavior on ietf@ietf.org.
[/bitching]

Adding ipv6@ietf.org as that seems to be the closest WG list for the topic.

Brian reminded us that we have ample RFCs elaborating on the fact that you cannot
reasonably expect connections to an anycast address to keep working when you
persistently use the anycast address.

Christian pointed out how QUIC does the right thing. Great! Maybe we should
have an anycast support hall of fame and shame for protocols: does a protocol
support single round-trip resolution of an anycast to a unicast address, or not?
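
To make the hall-of-fame criterion concrete, here is a toy Python model of a
protocol that uses the anycast address only for the first exchange and then
pins the session to the unicast address it learned. It is not QUIC's actual
mechanism and is not taken from any RFC. The Site class, the addresses from
the 2001:db8::/32 documentation prefix, and route_anycast() as a stand-in for
the network are all hypothetical.

```python
class Site:
    """One instance of an anycast service, also reachable at its own unicast address."""

    def __init__(self, unicast):
        self.unicast = unicast
        self.sessions = set()

    def open(self, client):
        # First exchange: create session state and tell the client our unicast address.
        self.sessions.add(client)
        return self.unicast

    def send(self, client):
        # Later packets only work at the instance that holds the session state.
        if client not in self.sessions:
            raise ConnectionResetError(f"{self.unicast} has no state for {client}")
        return "ok"


SITES = [Site("2001:db8:a::1"), Site("2001:db8:b::1")]  # hypothetical instances
BY_UNICAST = {site.unicast: site for site in SITES}


def route_anycast(flow_label):
    """Stand-in for the network: the flow label decides which instance is reached."""
    return SITES[flow_label % len(SITES)]


def run(pin_to_unicast):
    client = "client-1"
    unicast = route_anycast(flow_label=0).open(client)
    # Assume the flow label now changes from 0 to 1, e.g. after a retransmission timeout.
    if pin_to_unicast:
        return BY_UNICAST[unicast].send(client)   # the session follows the unicast address
    return route_anycast(flow_label=1).send(client)  # may reach an instance without state


if __name__ == "__main__":
    print("pinned to unicast:", run(pin_to_unicast=True))
    try:
        run(pin_to_unicast=False)
    except ConnectionResetError as err:
        print("persistently anycast:", err)
```

In this model the pinned session survives the flow label change, while the
session that keeps using the anycast address reaches an instance with no state.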

But back to what seems to be the root cause, which isn't anycast, but IPv6
flow label "abuse" ?!

I specifically had not heard of this Linux "hack" to change the flow label
mid-connection after a TCP RTO to overcome a seemingly broken path, in the hope
that the new flow label picks another, working path (most likely in a data center).
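
As a rough illustration of why a new flow label can move a connection to
another path, the following Python sketch models an ECMP router that hashes
the flow label together with addresses and ports. The hash function, the
next-hop names and the addresses are assumptions for illustration only; real
routers use their own vendor-specific hashes and field selections.

```python
import hashlib

NEXT_HOPS = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical equal-cost next hops


def ecmp_next_hop(src, dst, sport, dport, flow_label, next_hops=NEXT_HOPS):
    """Pick a next hop from a hash over the flow identifiers, flow label included."""
    flow_label &= 0xFFFFF  # the IPv6 flow label is a 20-bit field
    key = f"{src}|{dst}|{sport}|{dport}|{flow_label}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[bucket % len(next_hops)]


def hops_for_labels(flow_labels):
    """Next hops chosen for one TCP connection as only its flow label varies."""
    return {
        ecmp_next_hop("2001:db8:1::10", "2001:db8:2::20", 40000, 443, label)
        for label in flow_labels
    }


if __name__ == "__main__":
    print("stable label:  ", hops_for_labels([0x12345] * 5))
    print("changing label:", hops_for_labels(range(64)))
```

With a fixed label every packet hashes to the same next hop. Once the label
changes, the same addresses and ports can land on a different next hop, which
helps around a broken path but also changes which anycast instance is reached
if the paths lead to different instances.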

I do not think that this endpoint behavior is endorsed by RFC6437. Given the absence
in RFC6437 of any description of the circumstances under which an endpoint could or
should change the flow label for an in-progress transport connection, I would conclude
that this Linux behavior is not in compliance with how the normative RFC6437
describes proper host assignment of the flow label.

Could someone who sleeps with RFC6437 under his pillow comment on whether
or not my assessment is accurate? I seem to remember Bob mentioning that he just
carefully read through all those flow label RFCs...

Of course, pointing out to Linux that what it does is a hack would not make
TCP to an anycast address any less of a hack.

Ultimately, I do not have a lot of sympathy for the Linux behavior, even if it
were blessed by RFC6437, because I think good networks should fix broken paths
fast enough for this hack to be unnecessary...

Cheers
    Toerless

On Tue, Aug 03, 2021 at 10:45:29PM +1200, Brian Carpenter wrote:
> The issue of anycast and unstable routes is hardly a new discovery; this
> Linux feature is not creating a new problem. I suggest reading RFC7094 and
> RFC4786 before continuing this conversation.
> 
> I certainly wouldn't design a protocol that relied on stable transport
> connections to an anycast address.
> 
> Regards,
>     Brian Carpenter
>     (via tiny screen & keyboard)
> 
> On Tue, 3 Aug 2021, 22:10 Michael Tuexen, <michael.tuexen@lurchi.franken.de>
> wrote:
> 
> > > On 3. Aug 2021, at 11:44, Vasilenko Eduard <vasilenko.eduard@huawei.com>
> > wrote:
> > >
> > > Hi all,
> > > I am writing to this alias because I do not know the proper one for such
> > type of a problem (OS/LINUX/BSD).
> > > The history of how Alexander Azimov (Yandex) has found the problem is
> > below.
> > >
> > > In short: if TCP loses connectivity for 200ms (or 1s in SYN stage) then
> > TCP changes IPv6 flow label (for the active TCP session!) to push traffic
> > to a different path.
> > > Current networks use ECMP extensively; if intermediate nodes use the
> > flow label for hash calculation, there is a high probability that the path would
> > be changed.
> > > LINUX/BSD does not want to wait until the network fixes its problem.
> > As far as I know, Linux implements something like what you describe, but I'm not
> > aware of this behaviour being
> > implemented in *BSD, at least not in FreeBSD.
> > >
> > > If the final destination was anycast then the final destination would be
> > changed too by the same hash calculation.
> > > The stateful session would be broken as a result (see the second part of
> > Alexander’s presentation below).
> > >
> > > Ever since LINUX made RTO flow label recalculation the default
> > (2016), IPv6 Anycast has been broken.
> > > People would have one more reason not to migrate to IPv6. Flow label
> > does not exist in IPv4 – an OS cannot break IPv4 Anycast in a similar way.
> > >
> > > Would anybody like to spend his/her karma to save IPv6 Anycast OR let
> > it die?
> > > It has been broken for 5 years already and nobody has spotted it up to now. Is
> > it needed?
> > > (I have seen a few drafts heavily dependent on IPv6 anycast)
> > >
> > > What is the proper WG for such a problem?
> > At IETF 110 Alexander gave a presentation on this in TCPM and V6OPS. See
> > the Minutes and the corresponding slides at
> > https://datatracker.ietf.org/meeting/110/proceedings
> >
> > At least at the TCPM meeting, it was suggested that an ID would be written.
> >
> > However, the behaviour you are describing is implementation-specific to
> > Linux; it is not described or
> > recommended by an RFC.
> >
> > Best regards
> > Michael
> > >
> > > I am concerned that Anycast has been killed, it is not an easily
> > replaceable tool.
> > > Maybe somebody would propose something better but if not
> > > then LINUX should be returned to 2015 when flow label change on RTO was
> > a non-default configuration.
> > > Such LINUX behavior could be valuable in some restricted domains (see
> > below) when the administrator is sure that Anycast is not possible on the
> > traffic path.
> > >
> > > Eduard
> > > From: Vasilenko Eduard
> > > Sent: Tuesday, August 3, 2021 12:05 PM
> > > To: 'Jeff Tantsura' <jefftant.ietf@gmail.com>; Alexander Azimov <
> > a.e.azimov@gmail.com>
> > > Cc: Alexander Azimov <mitradir@yandex-team.ru>; routing WG <
> > rtgwg@ietf.org>
> > > Subject: RE: Self-healing Networking with Flow Label
> > >
> > > Hi all,
> > > Not many people worldwide read this alias and understand
> > > that RTO could be leveraged to fight “silent drops” in the DC
> > environment.
> > > It is a good use case to publish/document (with more details than there were
> > in the presentation).
> > > I hope that in the future OAM would be used for this purpose – it is
> > better from an architectural point of view.
> > > Eduard
> > > From: Jeff Tantsura [mailto:jefftant.ietf@gmail.com]
> > > Sent: Tuesday, August 3, 2021 1:08 AM
> > > To: Alexander Azimov <a.e.azimov@gmail.com>
> > > Cc: Vasilenko Eduard <vasilenko.eduard@huawei.com>; Alexander Azimov <
> > mitradir@yandex-team.ru>; routing WG <rtgwg@ietf.org>
> > > Subject: Re: Self-healing Networking with Flow Label
> > >
> > > Eduard,
> > >
> > > The idea of the draft to come is to explain what to do - when and how.
> > > The goal is not to regulate (we really don’t) but to provide, similarly
> > to RFC7938 a set of guidelines that community can use to build better and
> > more resilient networks.
> > >
> > > Cheers,
> > > Jeff
> > >
> > >
> > > On Aug 2, 2021, at 04:01, Alexander Azimov <a.e.azimov@gmail.com> wrote:
> > >
> > >
> > > Eduard,
> > >
> > > On Mon, 2 Aug 2021 at 13:45, Vasilenko Eduard <
> > vasilenko.eduard@huawei.com> wrote:
> > > It is the key in this presentation “This behavior MUST be switched off
> > by default”
> > > It has been shown on slides 7-10 that flow label change on RTO is
> > enabled by default for BSD and LINUX.
> > > It needs regulation. It needs a standard RFC. Because it kills Anycast
> > otherwise.
> > > As I'm partially responsible for the key points of the presentation, I
> > can stress that it is a bit different.
> > >       • We have an opportunity for self-healing TCP on top of IPv6, it
> > should be preserved;
> > >       • The Linux defaults should be changed to a safe mode to prevent
> > session timeouts;
> > >       • The hash recalculation behavior should be documented;
> > > I'm not sure what you mean by the term 'regulation'.
> > >
> > > The story of how to use RTO to work around “silent drop” vendor bugs
> > could be a good informational RFC.
> > > Maybe people developing iOAM would pay more attention to this use case.
> > >
> > > IMHO: these are 2 separate drafts.
> > > I'm not sure about it, we'll try to provide -00 before the next IETF
> > meeting, let's see how it progresses.
> > >
> > > Eduard
> > > From: Alexander Azimov [mailto:mitradir@yandex-team.ru]
> > > Sent: Monday, August 2, 2021 1:20 PM
> > > To: Vasilenko Eduard <vasilenko.eduard@huawei.com>; Jeff Tantsura <
> > jefftant.ietf@gmail.com>
> > > Cc: routing WG <rtgwg@ietf.org>
> > > Subject: Re: Self-healing Networking with Flow Label
> > >
> > > Eduard,
> > >
> > > Please see the quote from the slide 28. My suggestion was:
> > >
> > > Client – sends SYN, Server – responds with SYN&ACK
> > >       • In case of SYN_RTO or RTO events Server SHOULD recalculate its
> > TCP socket hash, thus change Flow Label. This behavior MAY be switched on
> > by default;
> > >       • In case of SYN_RTO or RTO events Client MAY recalculate its TCP
> > socket hash, thus change Flow Label. This behavior MUST be switched off by
> > default;
> > > This looks like a safe default behavior, that saves the part of the
> > improvements, but makes the work with stateful anycast services safe.
> > >
> > > And yes, IMO it's ok to have a knob to enable it in the controlled
> > environment. If you ask how to enable it in the presence of internal
> > anycast services - there was also a suggestion in the slides: eBPF. It
> > gives a good way to make this kind of separation.
> > >
> > > 02.08.2021, 11:48, "Vasilenko Eduard" <vasilenko.eduard@huawei.com>:
> > > Hi Jeff,
> > > The situation when the Control Plane does not understand what the Forwarding
> > Plane is doing is a bug.
> > > Yes, RTO in TCP helps to find a work-around for this bug. And yes,
> > Anycast is typically absent inside DC – it does not create the problem in
> > the DC environment.
> > >
> > > But the same LINUX is used outside DC. RTO Flow Label change here would
> > create even more problems if Anycast would happen on the traffic path (not
> > very predictable for the client).
> > > Do we need a separate LINUX distribution for DC and a separate distribution
> > for other environments?
> > > Or should we rely on the proper non-default configuration for different
> > environments? (Admin should not forget to change)
> > > What if Anycast would become needed in DC?
> > >
> > > RTO flow label recalculation is mutually exclusive with Anycast on the
> > traffic path.
> > > What is more valuable for the public?
> > >
> > > IMHO: It is better to fight the problem of such type of a bug with iOAM
> > than to cancel Anycast.
> > >
> > > IMHO: It is better to have Flow Label recalculation on RTO as “off” by
> > default.
> > > DC Admin has the higher qualification to activate it in a controlled
> > environment than every client worldwide that should not forget to disable
> > it.
> > >
> > > Eduard
> > > From: Jeff Tantsura [mailto:jefftant.ietf@gmail.com]
> > > Sent: Monday, August 2, 2021 6:56 AM
> > > To: Vasilenko Eduard <vasilenko.eduard@huawei.com>
> > > Cc: mitradir@yandex-team.ru; routing WG <rtgwg@ietf.org>
> > > Subject: Re: Self-healing Networking with Flow Label
> > >
> > > Eduard,
> > >
> > > The issue is not present in the link/device failure case: if well implemented, fast
> > rehash takes care of updating forwarding within a number of ms. The problem
> > is with “gray” failures, when the link in question is UP from a
> > routing/forwarding perspective but drops traffic (mostly occasionally, and
> > when a HW bug occurs it has distinct flow attributes).
> > >
> > > In many large DC fabrics, the majority of the traffic is east-west,
> > between end-points that aren’t anycast. In such deployments the solution
> > solves issues rather elegantly and without any intervention from the
> > operator.
> > > The issues/side effects are well understood and will be documented.
> > >
> > > The best way to receive RTGWG emails is well… to subscribe to RTGWG ;-)
> > >
> > > Cheers,
> > > Jeff
> > >
> > >
> > > On Aug 1, 2021, at 09:47, Vasilenko Eduard <vasilenko.eduard@huawei.com>
> > wrote:
> > >
> > >
> > > Hi Alexander,
> > >
> > > Have I understood your presentation right?
> > > The client SHOULD change IPv6 flow label after SYN RTO to have a chance
> > to be moved to the working path inside DC fabric (if DC fabric supports
> > flow label for hash calculation)
> > > But at the same time
> > > The client SHOULD NOT change the IPv6 flow label after SYN RTO to avoid
> > being switched to a different TCP proxy engine.
> > >
> > > Looks like a deadlock, especially if both things should happen for the
> > same traffic:
> > > it should reach DC fabric
> > > and
> > > it should be hash load-balanced between different TCP proxy engines (or
> > applications) inside DC Fabric.
> > >
> > > I see one bad solution (“Disable Flow Label”):
> > > Routers up to TCP proxy engine SHOULD be configured not to use flow
> > label (by the way these are all routers on the Internet),
> > > TCP flow engines SHOULD be outside of the DC Fabric (CLOS) – probably in
> > front of it.
> > > Routers/Switches inside DC Fabric SHOULD use flow labels.
> > >
> > > I see another bad solution (“Disable Anycast”):
> > > Disable anycast on routers in principle, use only stateful LB.
> > >
> > >
> > > It has been commented in the chat that Anycast is not possible in
> > principle for stateful connection. It is too general a statement.
> > > Anycast is just not compatible with Flow Label. It is not a problem for
> > IPv4 anycast even if the connection is stateful (TCP) because 5-tuple for
> > hash would not change.
> > > Hence, IPv6 anycast has become dead at the time when Flow Label change
> > has been added in LINUX for active TCP session.
> > >
> > > Among 3 things:
> > > -          Anycast
> > > -          Flow Label load balancing (basic Flow Label functionality)
> > > -          Flow Label change on the active session for application to be
> > more active in new path search
> > > You have to choose which one to kill – all 3 are not compatible with
> > each other at the same time.
> > > I vote to disable Flow Label change in LINUX. Then wait till the network
> > would fix itself.
> > > We have so many fancy TE tools these days. A broken link or a broken node
> > could be excluded from routing within 50ms.
> > >
> > > PS: I am not subscribed to the RTGWG alias, please keep me on a copy of
> > this thread.
> > > Best Regards
> > > Eduard Vasilenko
> > > Senior Architect
> > > Europe Standardization & Industry Development Department
> > > Tel: +7(985) 910-1105, +7(916) 800-5506
> > >
> > > _______________________________________________
> > > rtgwg mailing list
> > > rtgwg@ietf.org
> > > https://www.ietf.org/mailman/listinfo/rtgwg
> > >
> > >
> > > --
> > > Best regards,
> > > Alexander Azimov
> >
> >

-- 
---
tte@cs.fau.de