Re: [manet] OLSRv2 router restart

"Rogge, Henning" <henning.rogge@fkie.fraunhofer.de> Tue, 27 July 2021 12:23 UTC

From: "Rogge, Henning" <henning.rogge@fkie.fraunhofer.de>
To: MANET IETF <manet@ietf.org>
CC: Justin Dean <bebemaster@gmail.com>, Christopher Dearlove <christopher.dearlove@gmail.com>
Date: Tue, 27 Jul 2021 12:22:58 +0000
Message-ID: <1627388578611.27171@fkie.fraunhofer.de>
References: <1626937943164.99401@fkie.fraunhofer.de> <CA+-pDCdGjVPyVxnfMt2trN_Rk5J_btZrt2teFg43JSEbo0sn6g@mail.gmail.com> <1627020703962.66100@fkie.fraunhofer.de>, <23565714-003C-43B4-B367-16AA3EC35FA5@gmail.com>
In-Reply-To: <23565714-003C-43B4-B367-16AA3EC35FA5@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/manet/yDAq0O7zpgyJkZ6cYpt-PrgITW8>
Subject: Re: [manet] OLSRv2 router restart

> I would start with the principle that we don’t want to change OLSRv2, create incompatibilities etc. unless we actually have to. And I don’t think we need to.
Yes, breaking the protocol would be bad.

> So what we should start with is advice to restarting routers. Can we solve all our problems that way?

> Before addressing that, you might say, what about the case of routers that restart but don’t follow advice, can they mess us up? Possibly (though I think it needs a matching ANSN, not just an expired one). But that’s the nature of OLSRv2 and similar protocols, a poorly behaving router can always mess you up. Worse if it is malicious, but that’s a different problem.

Any advice we give should, hopefully, fall back to "solve with a timeout" as the last resort.

> And let’s suppose we managed to come up with a set of really good rules for restarting routers. What to do with them? Could be an informational RFC, advice on restarting a router. Or a standards track RFC, what you should and what you must do on restarting a router. I could see either. Or a non-RFC approach.
> But that’s getting ahead. What advice could we give?

> First, if the router remembers its old MSN and ANSN (and if necessary, PSN) there’s no problem. (Obviously you’d include this in an RFC/whatever but it’s a trivial case.)

There is also the issue that you might not want to constantly write to flash memory.

> The second trivial case is that you just wait longer than your data will take to time out. Often the easiest solution. But if you might have been using long timeouts, that might be a problem. Or if you want rapid readmission. But that needs other routers to also act rapidly, and might not be an option.

Yes.

> So let’s assume you want back in, have no idea of your old MSN/ANSN, and there might be data out there that needs replacing, and you can’t just wait.

> Handling two sequence numbers (MSN/ANSN) together would be tricky. But fortunately we can do these one at a time. Because to get back in, first we need HELLO exchange (NHDP). That only uses MSN. So first we fix MSN, then we move on to ANSN. (Until we’ve exchanged HELLOs, no one will forward our TC messages.)

I already have a mechanism in place to fix the problem for the MSN. Because MSNs always increase, I just "restart" the old/new decision as soon as I get 8 consecutive "too old but increasing" MSNs without a single valid one. With this kind of strategy you can reduce the waiting time from TC-VTIME to TC-ITIME*factor.
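
A minimal sketch of the heuristic described above, not the actual implementation. The class and function names are hypothetical. It assumes a 16-bit MSN, a wraparound comparison in which a gap of exactly N/2 counts as "not newer", and that the eighth stale-but-increasing MSN is itself accepted as the new reference point.

```python
N = 1 << 16            # 16-bit sequence number space
RESYNC_THRESHOLD = 8   # from the text: 8 consecutive stale-but-increasing MSNs


def seq_newer(a: int, b: int) -> bool:
    """True if a is newer than b with wraparound.

    Assumption: a gap of exactly N/2 is treated as 'not newer'.
    """
    return 0 < ((a - b) % N) < N // 2


class MsnTracker:
    """Hypothetical per-originator state for the old/new MSN decision."""

    def __init__(self) -> None:
        self.last_valid = None  # last accepted MSN
        self.last_stale = None  # last rejected MSN in the current run
        self.stale_run = 0      # consecutive rejected-but-increasing MSNs

    def accept(self, msn: int) -> bool:
        # Normal case: first message ever, or newer than the last accepted one.
        if self.last_valid is None or seq_newer(msn, self.last_valid):
            self._reset_to(msn)
            return True
        # Stale message: extend the run only if it increases on the previous stale one.
        if self.last_stale is None or seq_newer(msn, self.last_stale):
            self.stale_run += 1
        else:
            self.stale_run = 1  # run broken (duplicate or replay), start over
        self.last_stale = msn
        # Enough increasing stale MSNs in a row: assume the originator restarted.
        if self.stale_run >= RESYNC_THRESHOLD:
            self._reset_to(msn)
            return True
        return False

    def _reset_to(self, msn: int) -> None:
        self.last_valid = msn
        self.last_stale = None
        self.stale_run = 0
```

With this shape, a restarted router is accepted again after roughly RESYNC_THRESHOLD message intervals rather than after the full validity time, which is the TC-VTIME versus TC-ITIME*factor trade-off mentioned above.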

> So pick a random MSN. If we pick badly, we will be ignored. We could wait, see if we are ignored, if we are, move on. (We might be ignored because messages are lost. I’ll ignore that for now. You might want to send messages more than once before moving on to handle that.)
> But maybe we would rather not wait that long. So here’s an approach I think works. Haven’t been able to test it - my former employer has my OLSRv2 code unfortunately and won’t let me have it - but that’s the point of discussion.
> So we pick three or four equispaced (or roughly so) numbers. Two won’t work: the case of separation by exactly N/2, where N is 2^16, is unreliable. (The version in OLSRv1 is specified but has strange consequences, so we left it unspecified in OLSRv2.) Four is easiest (0, N/4, N/2, 3N/4).

> So we send a HELLO (or more than one) with one of those MSNs. Then, without needing to wait, do the same for the rest of the numbers in order. At first we might be ignored, but at some point we will get past the last one we used. Then we will start overriding ourselves. And the last one we send will be accepted. (We can stop early if we get a response, that tells us we have an MSN we can use.)

Hmm... interesting idea...
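
To convince myself, the equispaced-MSN idea can be checked with a small simulation. This is only a sketch under stated assumptions: all function names are made up, packet loss is ignored, the neighbour accepts a message only if its MSN is strictly newer than the one it remembers, and a gap of exactly N/2 is treated as "not newer" (OLSRv2 leaves that case unspecified).

```python
from typing import List

N = 1 << 16  # size of the 16-bit sequence number space


def seq_newer(a: int, b: int) -> bool:
    """Wraparound comparison; a gap of exactly N/2 counts as 'not newer' (assumption)."""
    return 0 < ((a - b) % N) < N // 2


def probe_sequence(start: int, count: int = 4) -> List[int]:
    """Roughly equispaced sequence numbers, to be sent in this order."""
    return [(start + i * N // count) % N for i in range(count)]


def last_probe_accepted(old: int, probes: List[int]) -> bool:
    """Simulate a neighbour that remembers `old` and accepts only newer numbers."""
    stored = old
    accepted = False
    for p in probes:
        accepted = seq_newer(p, stored)
        if accepted:
            stored = p  # the neighbour now remembers the probe it accepted
    return accepted     # result for the final probe


def always_recovers(count: int, start: int = 0) -> bool:
    """Check every possible remembered value for one fixed starting probe."""
    probes = probe_sequence(start, count)
    return all(last_probe_accepted(old, probes) for old in range(N))
```

Under these assumptions, always_recovers(4) and always_recovers(3) hold for every remembered value, while always_recovers(2) fails because of the N/2 case.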

> Is that ugly? Yes. But we are trying to solve the problem of having no information and not being prepared to wait. There are more efficient solutions if we modify OLSRv2 (e.g. reserve 0 or N-1 as a “forced reset” number) but I don’t want to do that as noted.

And a single "reset" number would fail as soon as that one packet is lost.

> Now we have established an MSN, once we have data to send in TC messages we could do the same with the ANSN. Or actually we don’t need to wait, we can do that with empty TC messages. Complete ones. (Note that, unlike OLSRv1, empty messages are recorded.)

If we don't worry about packet loss, we could send all 4 empty TCs in a single OLSR packet.

> Why might this not work? (Apart, of course, from something I’ve overlooked.) I think there is implicit permission for a router to ignore messages with a big gap of sequence number from last seen as a security measure. We need that not to be done. (That would be firming up behaviour.) But - and as ever there will be a security implications section - we really want to be running with authenticated messages anyway.

Let's not worry about security; just point to "sign messages" if someone asks.

> Is this fast enough? That depends on how long your data would take to expire anyway. This is really for cases where long duration data can be assumed.

> If anyone does want to take this further, I’m around.

This is definitely a nice set of ideas; I have to think about them a bit.

Christopher