Re: [babel] Restarting nodes and seqno requests

Toke Høiland-Jørgensen <toke@toke.dk> Mon, 30 April 2018 13:08 UTC

From: Toke Høiland-Jørgensen <toke@toke.dk>
To: Juliusz Chroboczek <jch@irif.fr>
Cc: babel@ietf.org
In-Reply-To: <874ljszz54.wl-jch@irif.fr>
References: <87po2h2b31.fsf@toke.dk> <874ljszz54.wl-jch@irif.fr>
Date: Mon, 30 Apr 2018 15:08:07 +0200
X-Clacks-Overhead: GNU Terry Pratchett
Message-ID: <87k1so3kqw.fsf@toke.dk>
MIME-Version: 1.0
Content-Type: text/plain
Archived-At: <https://mailarchive.ietf.org/arch/msg/babel/bURcu76w1JE3hlYQGOj7oA5WvNU>
Subject: Re: [babel] Restarting nodes and seqno requests

Juliusz Chroboczek <jch@irif.fr> writes:

>> - Node B restarts (i.e. shuts down, loses its transient state such as
>>   seqnos and comes back up either immediately or after a relatively
>>   short time). It will then start announcing prefix P again, but now
>>   with seqno S' < S [which is unfeasible for A].
>
> Yes.  The loop avoidance mechanism in Babel is stateful (the source table
> contains the state), and if you lose your state, you're in trouble.  That
> is why babeld saves its seqno into persistent storage at shutdown and
> restores it at startup.
>
> If the seqno is lost (either because a node has crashed or because you
> have no persistent storage), then you need to time out the source table entry:
>
>   - first, the route times out, so its metric becomes infinite;
>   - you start sending retractions, and retractions don't update the source
>     table;
>   - at some point, the source table GC timer triggers, and the source
>     entry gets updated.
>
> Note that the total time you wait for the route to become feasible again
> is the sum of the route hold time and the source GC time, so it's on the
> order of a few minutes -- Babel is optimised for the case where links go
> up and down, but routers do not reboot often.  If that's not acceptable,
> you can work around the issue by changing router ids at every boot, so
> that the old and new state don't interact.  Babeld implements this with
> the "random-id" option, and it is useful in environments where routers
> reboot often without saving their seqno.

Right, that's what I thought. I don't think there's a good way to store
persistent state in Bird, but it may be possible to try harder to do
graceful restarts... I'll look into the options.
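
For illustration, here is a minimal Python sketch of the feasibility
check and the source-table timeout discussed above. It is not babeld or
Bird code: the SourceTable class, the gc_time value and the prefix and
router-id strings are all hypothetical, and expiry is modelled with a
simple timestamp rather than a real GC timer.

```python
INFINITY = 0xFFFF  # infinite metric, i.e. a retraction


def seqno_less(a, b):
    """True if seqno a is older than seqno b, modulo 2**16."""
    return 0 < ((b - a) & 0xFFFF) < 0x8000


class SourceTable:
    """Hypothetical source table keyed by (prefix, router_id)."""

    def __init__(self, gc_time):
        self.gc_time = gc_time  # assumed GC interval, in seconds
        self.entries = {}       # key -> (seqno, metric, last_refresh)

    def record(self, prefix, router_id, seqno, metric, now):
        # Called when a non-retraction update is sent; retractions
        # do not refresh the entry.
        self.entries[(prefix, router_id)] = (seqno, metric, now)

    def is_feasible(self, prefix, router_id, seqno, metric, now):
        if metric == INFINITY:
            return True  # retractions are always feasible
        entry = self.entries.get((prefix, router_id))
        if entry is None or now - entry[2] >= self.gc_time:
            return True  # no entry, or the entry has been collected
        old_seqno, old_metric, _ = entry
        return seqno_less(old_seqno, seqno) or (
            seqno == old_seqno and metric < old_metric
        )


table = SourceTable(gc_time=180.0)
table.record("P", "B", seqno=100, metric=96, now=0.0)

# B restarts and announces P with a lower seqno.
print(table.is_feasible("P", "B", 3, 96, now=10.0))
# The same update once the source entry has timed out.
print(table.is_feasible("P", "B", 3, 96, now=200.0))
# A fresh router-id has no source entry, so the old state is ignored.
print(table.is_feasible("P", "B-new", 3, 96, now=10.0))
```

The last call mirrors the "random-id" workaround: a new router-id means
the stale source entry is never consulted.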

>> The question is, how is this supposed to be resolved?
>
> There's no good solution.  The loop avoidance mechanism is stateful, and
> that's a fact of nature.  (BGP avoids the statefulness by putting the
> whole state into each update, which causes updates to have an unbounded
> size.)
>
> If you have an idea for a good mechanism to avoid the issue, I'm
> listening.

Not off the top of my head.

>> Should A keep resending the seqno requests each time it gets a new
>> unfeasible update - and if so, is there any limit to the frequency?
>
> Only the usual delay on sending out updates (Section 4, "a Babel node
> SHOULD buffer every TLV and delay sending a packet by a small,
> randomly chosen delay"). Is that a problem in practice?

No, I don't think so; it's just an implementation issue: Bird currently
won't resend the same request until it expires after two seconds. So
when the triggered update arrives with S'' (which is still too low), it
won't ask for another seqno increase. But that is a straightforward
fix, and with it Bird converges reasonably quickly (as long as the seqno
is not too high, I guess...)
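
A rough Python sketch of the difference this makes. It is not Bird
code: RequestTracker, rounds_to_converge and the 100 ms round-trip time
are hypothetical, seqno wraparound is ignored, and it assumes the
restarted node raises its seqno by one per request it receives.

```python
class RequestTracker:
    """Hypothetical table of outstanding seqno requests."""

    def __init__(self, expiry_ms=2000, clear_on_update=True):
        self.expiry_ms = expiry_ms
        self.clear_on_update = clear_on_update
        self.pending = {}  # (prefix, router_id, seqno) -> time sent (ms)

    def should_send(self, key, now_ms):
        sent_at = self.pending.get(key)
        if sent_at is not None and now_ms - sent_at < self.expiry_ms:
            return False  # identical request still outstanding
        self.pending[key] = now_ms
        return True

    def on_update(self, prefix, router_id):
        # The fix: an update from the same router for the same prefix
        # counts as a reply, so the request may be sent again at once.
        if self.clear_on_update:
            for key in [k for k in self.pending if k[:2] == (prefix, router_id)]:
                del self.pending[key]


def rounds_to_converge(start_seqno, needed_seqno, tracker, rtt_ms=100):
    """Return (requests sent, elapsed ms) until the update is feasible."""
    now_ms, seqno, rounds = 0, start_seqno, 0
    key = ("P", "B", needed_seqno)
    while seqno < needed_seqno:
        if tracker.should_send(key, now_ms):
            seqno += 1  # assumed: responder bumps its seqno by one
            rounds += 1
            tracker.on_update("P", "B")
        now_ms += rtt_ms  # wait one round trip either way
    return rounds, now_ms


print(rounds_to_converge(3, 6, RequestTracker(clear_on_update=True)))
print(rounds_to_converge(3, 6, RequestTracker(clear_on_update=False)))
```

With the fix each round costs one round trip; without it each further
request has to wait for the two-second expiry, and in both cases the
number of rounds grows with the gap between the old and new seqno.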

>> Or should A immediately consider the update with seqno S' as feasible
>> because its selected route has metric infinity?
>
> As you mentioned, that would be unsafe.  Consider the following topology:
>
>   ::0 --- A --- B
>
> A loses its Internet connection, so it sets the metric of its selected
> route to infinity.  Before it sends a retraction, it receives an
> unfeasible update from B (with router-id A) -- routing loop.

Right, thought so :)
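
To make the loop concrete, a small Python sketch of that topology. All
names and numbers are hypothetical (seqno 1, metrics 0 and 96), seqno
wraparound is ignored, and the "relaxed" flag stands for the unsafe
rule of accepting any update while the selected route has infinite
metric.

```python
INFINITY = 0xFFFF  # infinite metric


def feasible(update, source):
    """Strict feasibility; both arguments are (seqno, metric) pairs."""
    seqno, metric = update
    src_seqno, src_metric = source
    return seqno > src_seqno or (seqno == src_seqno and metric < src_metric)


def next_hops_after_failure(relaxed):
    """Next hops for ::/0 right after A loses its upstream link."""
    source_at_a = (1, 0)             # what A itself last advertised
    update_from_b = (1, 96)          # B re-advertising A's route
    selected_metric_at_a = INFINITY  # A's selected route just failed
    next_hop = {"A": None, "B": "A"}  # B has not seen a retraction yet

    accept = feasible(update_from_b, source_at_a)
    if relaxed and selected_metric_at_a == INFINITY:
        accept = True  # the unsafe shortcut
    if accept:
        next_hop["A"] = "B"
    return next_hop


def has_loop(next_hop, start):
    """Follow next hops from start and report whether a node repeats."""
    seen = set()
    node = start
    while node is not None:
        if node in seen:
            return True
        seen.add(node)
        node = next_hop.get(node)
    return False


print(has_loop(next_hops_after_failure(relaxed=False), "A"))
print(has_loop(next_hops_after_failure(relaxed=True), "A"))
```

Under the strict rule A keeps no next hop and simply retracts; under
the relaxed rule A points at B while B still points at A.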

-Toke