Re: [arch-d] Centralization or diversity

Toerless Eckert <> Tue, 07 January 2020 16:53 UTC

Date: Tue, 7 Jan 2020 17:53:01 +0100
From: Toerless Eckert <>
To: Spencer Dawkins at IETF <>
Cc: Andrew Campling <>, "" <>

On Tue, Jan 07, 2020 at 10:22:56AM -0600, Spencer Dawkins at IETF wrote:
> I see that this topic is fragmenting into different threads, which is fine,
> so I'm not sure where I should insert this comment, but when we're talking
> about "centralized" and "decentralized", it's worth noting that
> "decentralized" fault tolerance can trip over implementation errors that
> are common across many devices owned or operated by a centralized entity.
> One of the largest SS7 outages in the US happened in 1990, when a one-line
> error that was present in a large number of switches caused >50 percent
> call failures across the entire AT&T network.

Do I hear a slight criticism/concern here about most networks' current SDN ==
centralized intelligence & failure dependency? Or is it just me?

> So, implementation diversity matters, not just decentralization.

I think I had mentioned this already in the very beginning of this thread.

Btw: the historic, often-cited cause of Internet service outages in Germany
affecting as much as half the country was broken RADIUS/accounting servers.
Of course, under volume-based accounting you could not make money while
those servers were broken, but if I remember correctly these
issues/dependencies persisted even after the accounting models changed.

In other words: design-wise you need to make sure you have distributed
resilience up to, let's say, the "cash registers", so that services which
keep working under failure are also still accounted and charged for.

Aka: distributed "routing" and other lower-layer resilience is not
sufficient in a commercial environment. And I cannot remember that
DoD/Arpanet ever tried to fund research into these commercial aspects.

> There's a nice write-up at
>, that
> matches my recollection of reports at the time.
> For me, the money quote was
> When the destination switch received the second of the two closely timed
> messages while it was still busy with the first (buffer not empty, line 7),
> the program should have dropped out of the if clause (line 7), processed
> the incoming message, and set up the pointers to the database (line 11).
> Instead, because of the break statement in the else clause (line 10), the
> program dropped out of the case statement entirely and began doing optional
> parameter work which overwrote the data (line 13). Error correction
> software detected the overwrite and shut the switch down while it could
> reset. Because every switch contained the same software, the resets
> cascaded down the network, incapacitating the system.

Yes. I think newer incidents of this type can no longer be explained that
easily at the low level. A few years back, if I remember correctly, there
was a big breakdown at Google caused by too many conflicting automation
systems. Try to explain in detail how that worked...


> Best,
> Spencer
> On Wed, Nov 13, 2019 at 12:57 AM Andrew Campling
> <> wrote:
> > "Martin Thomson" <> wrote on Tue,
> > 05 November 2019 22:58:
> >
> > The draft specifically calls out the notion of a single point of failure
> > being a problem.  But my experience with centralized services is that they
> > aren't centralized in the fault tolerance sense.  If I look at the big
> > services, that scale is only achieved with careful distributed systems
> > design.  Name any modern service of even modest scale and you generally
> > find excellent fault tolerance.
> >
> > I thought that the document made it quite clear that it wasn't
> > specifically referring to a single point of failure in a technical, fault
> > tolerance sense.  In fact it made this clear by, for example, also
> > highlighting issues such as "administrative or governance system can become
> > weak through too much power or imagined power concentrated in one place".
> >
> > Finally, I don't like the emphasis on DNS in this document.  It only
> > serves to sensationalize.
> >
> > I thought that the reference to DNS was particularly helpful given one of
> > the potential side-effects of the push behind DoH could be to centralise
> > what is currently a highly decentralised system.  I agree with the comment
> > in section 4 that "where such centralised points are created, they will
> > eventually fail, or they will be misused through surveillance or legal
> > actions regardless of the best efforts of the Internet community.  The best
> > defense to data leak is to avoid creating that data store to begin with".
> >
> > In addition, noting the references to RFC 1958 and RFC 3935, I believe that
> > it would be prudent for RFC 8484 to be reviewed accordingly.
> >
> >
> > *Andrew*
> >
> >
> > _______________________________________________
> > Architecture-discuss mailing list
> >
> >
> >
