Re: [manet] RFC7181 (OLSRv2) trouble with ANSN and router restart

"Rogge, Henning" <henning.rogge@fkie.fraunhofer.de> Tue, 27 July 2021 12:10 UTC

From: "Rogge, Henning" <henning.rogge@fkie.fraunhofer.de>
To: "manet@ietf.org" <manet@ietf.org>
CC: Christopher Dearlove <christopher.dearlove@gmail.com>
Date: Tue, 27 Jul 2021 12:09:42 +0000
Message-ID: <1627387782340.59393@fkie.fraunhofer.de>
References: <1626937943164.99401@fkie.fraunhofer.de>, <D4B167E8-160E-4CD0-8800-A5D8F0D50967@gmail.com>
In-Reply-To: <D4B167E8-160E-4CD0-8800-A5D8F0D50967@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/manet/ryNw2ZhOp2sCzl4vozB75jCAPik>

> I’m failing to understand the specific problem here. There are restart issues that we never found the bandwidth to formally address, but they are different to this one.

>> The sequence that leads to the issue is:
>>
>> 1) router A restarts
>> 2) router A (randomly?) selects a new Message Sequence Number which is HIGHER (in terms of cyclical comparison) than the last one it used
>> 3) router A selects a new ANSN which is LOWER (or the same) than the last one it used

> This is unfortunate, and part of the problem, router A will find it hard to get back into the network. The easiest way is simply to wait out expiry times, but that might not be an issue if those have become long (which is a bandwidth saving approach in a stable network - but nothing unfortunately is free).

Yes, waiting for the configured VTIME before you restart the routing is a valid option, but it costs a lot of time.

>> 4) router B sees the new message sequence number/ANSN in TCs from router A
>>   => router B does not allow the old TC data to timeout (message sequence number is higher!)
>>   => router B does NOT overwrite the old TC data (ANSN is lower)

> Yes, the old data won’t be overwritten. But nothing affects the timeout. It doesn’t happen, but it isn’t extended. 16.3.3.1 point 1, the TC message will be discarded. So Topology Tuples etc. aren’t modified and will in due course timeout. Maybe that’s too slow, but that’s a different issue.

I think Ronald and I overlooked this detail... I have to check my implementation, but I think I do this correctly.
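
A minimal sketch of the receiver-side ANSN check discussed above, assuming the usual wrap-around comparison for 16-bit sequence numbers. The function names and the returned labels are hypothetical and only illustrate the three cases (older, same, newer); this is not taken from any implementation.

```python
MAXVALUE = 1 << 16  # one more than the largest 16-bit sequence number


def seq_greater(s1: int, s2: int) -> bool:
    """True if s1 is newer than s2 under cyclic 16-bit comparison."""
    return (s1 > s2 and s1 - s2 < MAXVALUE // 2) or (
        s2 > s1 and s2 - s1 > MAXVALUE // 2
    )


def classify_ansn(stored_ansn: int, received_ansn: int) -> str:
    """Hypothetical helper: how a receiver might treat a TC's ANSN."""
    if seq_greater(stored_ansn, received_ansn):
        # Stored ANSN is newer: the TC is discarded, stored tuples are
        # neither modified nor extended and will time out on their own.
        return "discard"
    if stored_ansn == received_ansn:
        # The ambiguous case b): same ANSN before and after the restart.
        return "same"
    return "newer"  # stored data is replaced by the new TC content


if __name__ == "__main__":
    print(classify_ansn(500, 400))   # restart picked a lower ANSN
    print(classify_ansn(500, 500))   # restart picked the same ANSN
    print(classify_ansn(65530, 5))   # normal wrap-around
```

With a stored ANSN of 500, a restarted router announcing ANSN 400 falls into the "discard" case, so the old data stays until its validity time runs out.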

>> b) the ANSN after the restart is the SAME as before... this is tricky, I have no idea how to resolve this at the receiver without comparing the TC data with the database, which is not reliable when we deal with incomplete TCs.

> This one might be a problem. Unlucky with 2^16 numbers to pick. But not unlucky enough. But we are into the region of how to restart a router, and advice to give it. Using more than one ANSN is one way to solve that. However the composite of MSN and ANSN is a tricky one. I can see (I think) a horribly inefficient way to make it work, but an efficient way - other than just wait out timeouts - needs more thought.

Yes, waiting for the validity time is often a bad option.

> A restarting router that uses incomplete TCs though is not good behaviour. That would be part of the advice to restarting routers.

Sometimes the router has no choice, e.g. if the list of attached prefixes is just LONG. Long ago I wrote a hack for OLSRv1 that allowed adding the "range" of a partial update, so the receiver could clear every prefix within the range that is not in the TC.
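
The range idea can be sketched roughly as follows. This is an illustration only: the function name, the way the range is expressed (a first and last prefix in a sorted order) and the ordering key are all assumptions, not the format the OLSRv1 hack actually used.

```python
import ipaddress


def apply_partial_update(known, advertised, range_first, range_last):
    """Return the prefix set after a partial update covering a range.

    Every known prefix inside [range_first, range_last] that the update
    does not advertise is dropped; prefixes outside the range are kept.
    """

    def key(net):
        # Assumed ordering: by network address, then by prefix length.
        return (int(net.network_address), net.prefixlen)

    lo, hi = key(range_first), key(range_last)
    advertised = set(advertised)
    kept = {p for p in known if not (lo <= key(p) <= hi) or p in advertised}
    return kept | advertised


if __name__ == "__main__":
    net = ipaddress.ip_network
    known = {net("10.0.1.0/24"), net("10.0.2.0/24"),
             net("10.0.3.0/24"), net("10.0.9.0/24")}
    update = {net("10.0.1.0/24"), net("10.0.3.0/24")}
    result = apply_partial_update(known, update,
                                  net("10.0.1.0/24"), net("10.0.3.0/24"))
    print(sorted(str(p) for p in result))
```

In this example 10.0.2.0/24 lies inside the announced range but is missing from the update, so it is cleared, while 10.0.9.0/24 lies outside the range and is left alone.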

But the most likely way to go for me is to define an OLSRv2 extension so I don't have to send the (mostly static) Attached Networks with every TC. Or finally look into my ideas about RFC 5444 compression.

> (I assume we are considering routers that restart with no memory of their last MSN and ANSN. Those with memory can simply continue.)

Yes, storing all relevant sequence numbers would also help.

> First, I don’t think you have the receiver side issue you note. The information will eventually timeout unless the sender sends incomplete messages with valid MSN and latest ANSN. But that’s something it could do anyway.

> There are heuristics you might add to say “ignore this message as it fails a smell test”. You can’t stop a router doing that, nor should you. There are various heuristics you could try. Some relevant to this situation, others not. One such is that if you see a message where MSN and ANSN show opposite behaviour, that’s a sign that router is misbehaving. If you interpret misbehaving as “probably reset, I should wipe all its data” that’s drastic but possible. But doing so on the basis of one message is a bit much. And I don’t think you’ll ever get a set of rules applicable to all situations. Which is why I used the term heuristics.

> Either in an informational RFC on router reset behaviour, or in a non-normative annex to a standards track RFC, you could include advice on behaviour to look out for that suggests a misbehaving/reset router and that you MAY (no stronger) discard messages or even wipe the database if you see that, probably more than once.

I think an informational document that points out the issue and offers a few suggestions for what you could do would be a good way forward.
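
One of the heuristics mentioned above (MSN and ANSN showing opposite behaviour, seen more than once) might look roughly like this. It is a sketch under stated assumptions: the class name, the threshold of two observations and the cyclic comparison of message sequence numbers are hypothetical choices for illustration, not something RFC 7181 specifies.

```python
from collections import defaultdict

MAXVALUE = 1 << 16  # one more than the largest 16-bit sequence number


def seq_greater(s1: int, s2: int) -> bool:
    """True if s1 is newer than s2 under cyclic 16-bit comparison."""
    return (s1 > s2 and s1 - s2 < MAXVALUE // 2) or (
        s2 > s1 and s2 - s1 > MAXVALUE // 2
    )


class ResetDetector:
    """Hypothetical heuristic: flag an originator as 'probably reset'."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold          # assumed: act on 2nd sighting
        self.last = {}                      # originator -> (msn, ansn)
        self.suspicious = defaultdict(int)  # originator -> strike count

    def observe(self, originator, msn: int, ansn: int) -> bool:
        """Return True when the caller may want to wipe this router's data."""
        prev = self.last.get(originator)
        if prev is None:
            self.last[originator] = (msn, ansn)
            return False
        last_msn, last_ansn = prev
        if not seq_greater(msn, last_msn):
            return False  # old or duplicate message, ignore it
        if seq_greater(last_ansn, ansn):
            # MSN moved forward while ANSN moved backward.
            self.suspicious[originator] += 1
            if self.suspicious[originator] >= self.threshold:
                self.last[originator] = (msn, ansn)
                self.suspicious[originator] = 0
                return True
            return False
        # Normal progress: remember the state and clear any strikes.
        self.last[originator] = (msn, ansn)
        self.suspicious[originator] = 0
        return False
```

The detector keeps the last consistent (MSN, ANSN) pair until the threshold is reached, so a single odd message does not trigger anything. It does not cover the case where the ANSN after the restart is the same as before.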

> In short, I don’t think this is a soluble, here are rules that always work, problem. But if you find a real problem in a real network you could include code that makes a tradeoff of handling that case while possibly making other cases behave less well. (For example the multiple MSNs/ANSNs I suggest in my last email might trigger some heuristics.) Information might help a network operator. If you are supplying an OLSRv2 implementation and have a heuristic you think is often useful, you could include it as an option. But it would be very easy to think up heuristics that in reality aren’t much use.

I think my confusion was mostly triggered by the fact that the same router coming online again can give me a validity-time-long interval of bad network behavior...

Henning Rogge