Re: [Dcrouting] draft-przygienda-rift-03

Robert Raszuk <robert@raszuk.net> Thu, 11 January 2018 18:54 UTC

In-Reply-To: <CA+wi2hNbhXuXLKPD_0FL2csv1o9d37hF0XFex632z1skXUji+w@mail.gmail.com>
References: <CA+b+ERnOc7V7+OL2wsfZsRsdSpjeSQmQQdH7SX_WLbySaVtxKw@mail.gmail.com> <CA+wi2hNbhXuXLKPD_0FL2csv1o9d37hF0XFex632z1skXUji+w@mail.gmail.com>
From: Robert Raszuk <robert@raszuk.net>
Date: Thu, 11 Jan 2018 19:54:07 +0100
X-Google-Sender-Auth: 6e6RpwM60kiISNZLH8BBjo12tnQ
Message-ID: <CA+b+ERmFL_vnu3h9P2S+T1=kb0GKugUk9LWH8eYJnnPOtVkKRQ@mail.gmail.com>
To: Tony Przygienda <tonysietf@gmail.com>
Cc: rift@ietf.org, spring@ietf.org, dcrouting@ietf.org
Content-Type: multipart/alternative; boundary="f403045e97f844a0bc056284ac2a"
Archived-At: <https://mailarchive.ietf.org/arch/msg/dcrouting/NHFnuS4uFRnj-orUnMWsuLezTxY>
Subject: Re: [Dcrouting] draft-przygienda-rift-03

Hi Tony,

Thx for elaborating ...

Two small comments:

A) A SID/SR use case in the underlay could be as simple as gracefully taking a
fabric node out of service. Not much OPEX is needed if your NMS is decent.
Otherwise, in a normal link-state protocol I can set the overload bit; in BGP
there are a number of solutions, from shutdown to MED to LP, depending on your
BGP design. How do you do that in RIFT? Note that an overlay (if one exists)
does not help here.
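
For reference, the link-state behavior mentioned in A) can be sketched as
follows. This is only an illustration of the general overload-bit idea (an
overloaded node stays reachable but is not used for transit). It is not RIFT
and not any particular IGP implementation. The function name `spf`, the
`FABRIC` topology and the unit link costs are assumptions made up for the
example.

```python
import heapq

def spf(graph, source, overloaded=frozenset()):
    """Dijkstra with ECMP first hops; overloaded nodes are never used as transit."""
    dist = {source: 0}
    first_hops = {source: set()}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        # An overloaded node is still a valid destination, but we do not
        # expand paths through it.
        if node in overloaded and node != source:
            continue
        for neigh, cost in graph[node].items():
            nd = d + cost
            hops = {neigh} if node == source else first_hops[node]
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                first_hops[neigh] = set(hops)
                heapq.heappush(heap, (nd, neigh))
            elif nd == dist[neigh]:
                first_hops[neigh] |= hops  # equal-cost path, merge first hops
    return dist, first_hops

# Hypothetical 2-leaf, 2-spine fabric with unit link costs.
FABRIC = {
    "leaf1": {"spine1": 1, "spine2": 1},
    "leaf2": {"spine1": 1, "spine2": 1},
    "spine1": {"leaf1": 1, "leaf2": 1},
    "spine2": {"leaf1": 1, "leaf2": 1},
}

_, normal = spf(FABRIC, "leaf1")
_, drained = spf(FABRIC, "leaf1", overloaded={"spine1"})
print(sorted(normal["leaf2"]))   # ['spine1', 'spine2']
print(sorted(drained["leaf2"]))  # ['spine2']
```

With spine1 marked overloaded, leaf1 stops using it towards leaf2 while spine1
itself remains reachable, which is the "graceful drain" effect being asked
about.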

B) For horizontal links, imagine you have servers with 40 Gb/s ports to the TOR.
Then you have Nx100 Gb/s from the TOR up. You are most likely going to
oversubscribe on the TOR (servers to fabric) at 2:1, 3:1, etc. So if I want to
interconnect TORs because I know that servers behind those TORs need to
talk to each other in a non-blocking fashion _and_ I have spare 100 Gb/s
ports on the TORs, a routing protocol which does not allow me to do that
seems pretty limited - wouldn't you agree?
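
The arithmetic behind B) can be sketched as below. The port counts (48
servers, 8 or 6 uplinks, 10 servers needing line rate to the neighboring TOR)
are assumptions picked only to land in the 2:1 to 3:1 range mentioned above.
The helper names are hypothetical.

```python
from fractions import Fraction

def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing capacity to fabric-facing capacity on one TOR."""
    return Fraction(server_ports * server_gbps, uplinks * uplink_gbps)

def horizontal_links_needed(talking_servers, server_gbps, link_gbps):
    """Fewest TOR-to-TOR links that carry the given servers at line rate."""
    demand = talking_servers * server_gbps
    return -(-demand // link_gbps)  # integer ceiling division

# Assumed TOR: 48 x 40 Gb/s down, 8 x 100 Gb/s up.
print(float(oversubscription(48, 40, 8, 100)))  # 2.4
# Same TOR with only 6 uplinks.
print(float(oversubscription(48, 40, 6, 100)))  # 3.2
# Assumed 10 servers that must reach the neighboring TOR without blocking.
print(horizontal_links_needed(10, 40, 100))     # 4
```

Under these assumed numbers, four spare 100 Gb/s ports would cover the
TOR-to-TOR demand that the oversubscribed uplinks cannot guarantee.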

Thx,
R.







On Thu, Jan 11, 2018 at 6:40 PM, Tony Przygienda <tonysietf@gmail.com>
wrote:

> Robert, productive points, thanks for raising them ... I'll go a bit in depth
>
> 1. I saw no _real_ use-cases for SID in DC so far to be frank (once you
> run RIFT). The only one that comes up regularly is egress engineering and
> that IMO is equivalent to SID=leaf address (which could be a HV address of
> course once you have RIFT all way down to server) so really, what's the
> point to have a SID? It's probably much smarter to use IBGP & so on overlay
> to do this kind of synchronization if needed since labels/SIDs become very
> useful in overlay to distinguish lots of stuff there like VPNs/services which
> you'd carry e.g. in MPLSoUDP. In underlay just use the destination v4/v6
> address. Having said that, discussion always to be had if you pay me dinner
> ;--) and I know _how_ we can do SIDs in RIFT since I thought it through but
> again, no _real_ use case so far. And if your only concern is to "shape
> towards a prefix" we have PGP in the draft which doesn't need new silicon
> ;-P And then ultimately, yes, if you really, really want a SID per prefix
> everywhere then you'll carry SIDs to everywhere since unicast SIDs are
> really just a glorified way to say "I have this non-aggregatable 20 bit IP
> host address" which architecturally is a very interesting proposition in
> terms of scaling (but then again, there's no accounting for taste and RFC1925
> clause 3 applies) ... Your LSDB will still be much smaller, your SPF will
> still be simple on leaf in RIFT but your FIB will blow up and anything
> changing on a leaf shakes all other leafs (unless you start to run policies to
> control distribution, @ which point in time you start to baby-sit your fabric
> @ high OPEX). One of the reasons to do per-prefix SID would be non-ECMP anycast
> (where SIDs _are_ in fact useful) but if you read RIFT draft carefully you
> will observe that RIFT can do anycast without need for ECMP, i.e. true
> anycast in a sense and with that having anycast SID serves no real purpose
> in RIFT and is actually generally much harder to do since you need globally
> unique label blocks and so on ...
>
> 2. Horizontal links on CLOSes are not normally used that way, from all I saw,
> since your blocking goes to hell unless you provision some kind of really
> massive parallel links between ToRs _and_ understand your load. We _could_
> build RIFT that way but you give up balancing through the fabric and the
> loop-free property in a sense (that's a longish discussion) and scaling, since
> now you have prefixes showing up in all kinds of crazy places instead of
> default. If I see enough demand, we'll get there ... Otherwise RFC1925 clauses
> 10 and 5.
>
> 3. PS1: Yes, lots of things "could" be done and then we "could" build a
> protocol to do that, and RFC1925 clauses 7 and 8 apply. Such horizontal
> links, unless provisioned correctly, will pretty much just ruin your
> blocking/loss on the fabric, in my experience (which the math supports). In
> a sense if you know your big flows you can build a specialized topology to
> do the optimal distribution (MPLS tunnels anyone ;-) but the point of
> fabric is that it's a fabric (i.e. load agnostic, cheap, no OPEX and easily
> scalable). Otherwise a good analogy would be that you like to build special
> RAM chips for the type of data structures you are storing and we know how
> well that scales over time. We know now that within 3-4 years
> characteristics of DC flows flip upside down without a sweat when people go
> from server/client to microservices, from servers to containers and so on
> and so on. So if you can't predict your load all the time you need a
> _regular_ topology where _regular_ is more of a mathematical than a
> protocol discussion. Fabric analogy of "buy more RAM chips in Fry's and
> just stick them in" applies here. So RIFT is done largely to serve a
> well-known structure called a "lattice" (with some restrictions) since we
> need an "up" and "down". Things like hypercubes, toroidal meshes and so on
> and so on exist but CLOS won for a very good reason in history for that
> kind of problem (once you move to NUMA other things win ;-) And if you
> know your loads and you can heft the OPEX and you like to play with
> protocols generally and if you can support the scale in terms of leaf FIB
> sizes, flooding, slower convergence & so on & so on, then you run flat IGP on
> some kind of stuff that you build that doesn't even have to be regular in
> any sense. We spent many years solving THAT problem obviously and doing
> something like RIFT to replace normal IGP is of limited interest IMO
> (albeit certain aspects having to do with modern implementation techniques
> may get us there one day but it's much less of a pressing problem than
> solving specialized DC routing well IMO again).
>
> 4. PS2: RIFT cannot build an "unsupported topology" no matter how you
> cable (that's the point of it) or rather we have miscabling detection and
> do not form adjacencies, as you'll see if you read the draft carefully. That's your
> "flash red light" and it comes included for free with my compliments  ;-)
> ... Otherwise RFC1925 clause 10.
>
> Otherwise, if you have concrete charter points you'd like to add, be more
> specific in your asks and we'll see what the list thinks afterwards ...
>
> thanks
>
> --- tony
>
>
> On Thu, Jan 11, 2018 at 1:30 AM, Robert Raszuk <robert@raszuk.net> wrote:
>
>> Hi,
>>
>> I have one little question/doubt on scalability point of RIFT ...
>>
>> Assume that someone would like to signal IPv6 prefix SID for Segment
>> Routing in the underlay within RIFT.
>>
>> Wouldn't it result in an amount of protocol state fully analogous to massive
>> deaggregation - which as of today is designed to be a very careful and
>> limited operation used only at moments of failure(s)?
>>
>> I sort of find it a bit surprising that RIFT draft does not provide
>> encoding for SID distribution when it is positioned as an alternative to
>> other protocols (IGPs or BGP) which already provide ability to carry all
>> types of SIDs.
>>
>> Cheers,
>> Robert.
>>
>> PS1: Horizontal links which were discussed could be installed to offload
>> massive amounts of data (ex: storage mirroring) from fabric transit, directly
>> between leafs or L3 TORs, and not be treated as "backup".
>>
>> PS2: Restricting any protocol to specific topologies seems like a pretty
>> slippery slope to me. In any case, if a protocol does that it should also
>> contain a self-detection mechanism for "unsupported topology" and flash a red
>> light in any NOC.
>>