Re: [Dcrouting] draft-przygienda-rift-03

Tony Przygienda <tonysietf@gmail.com> Thu, 11 January 2018 21:20 UTC

From: Tony Przygienda <tonysietf@gmail.com>
Date: Thu, 11 Jan 2018 13:19:20 -0800
Message-ID: <CA+wi2hMmR-pGmvpH476kUPKXbuShVi-fRiT_mqvNJ19ac_xG1A@mail.gmail.com>
To: Robert Raszuk <robert@raszuk.net>
Cc: rift@ietf.org, spring@ietf.org, dcrouting@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/dcrouting/o8otPAUPHg7reaZW7Aec_o9HRT4>
Subject: Re: [Dcrouting] draft-przygienda-rift-03

A. Reading the draft you'll find the overload bit in RIFT, which IMO is the
best out-of-production solution.
B. Depends what your LEAF is. If LEAF is your TOR (most stuff today) then
RIFT will do that just fine, using LEAF2LEAF procedures in fact. If we start
to run RIFT @ the server level then we'd need to discuss, but people do not do
horizontal links non-leaf-to-non-leaf as far as I saw (except protection
here and there); then again, they mostly don't extend the fabric routing
protocol into the server (which RIFT intends to do). So, for today you're
good. Once you want RIFT on the server, I think having spine horizontals is
not a good idea in general.
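
To make point A concrete, here is a minimal sketch of what an overload bit
buys you when taking a fabric node out of service. It assumes generic
link-state overload semantics (an overloaded node stays reachable as a
destination but is never used for transit); it is not taken from the RIFT
draft, and the function name, node names, and toy two-spine fabric are all
hypothetical.

```python
from collections import deque


def next_hops(graph, src, dst, overloaded=frozenset()):
    """Return the ECMP next hops from src towards dst.

    Assumption (generic link-state semantics, not RIFT-specific): a node in
    `overloaded` can still be a destination but must not carry transit traffic.
    """
    # BFS outward from the destination to get hop counts towards it.
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        # Expanding a node means its neighbors may route *through* it,
        # so an overloaded node is only expanded when it is the destination.
        if node in overloaded and node != dst:
            continue
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)

    if src not in dist:
        return set()  # destination unreachable without overloaded transit

    # Keep neighbors that are one hop closer and allowed to forward.
    return {
        nb
        for nb in graph[src]
        if dist.get(nb) == dist[src] - 1 and (nb == dst or nb not in overloaded)
    }


# Hypothetical toy fabric: two leaves, two spines, full bipartite cabling.
fabric = {
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
}

print(next_hops(fabric, "leaf1", "leaf2"))
print(next_hops(fabric, "leaf1", "leaf2", overloaded={"spine1"}))
```

With spine1 marked overloaded, leaf-to-leaf traffic drains onto spine2 while
spine1 itself stays reachable, which is the "graceful out of service"
behaviour asked about.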

I hope that computes and I'd encourage you to read the draft again, maybe
without the "it should work just like BGP or just like today's IGP" hat on
...  ;-)  Of course, paying me fancy dinners and asking questions, letting
me ramble then is a valid substitute ;-)

--- tony

On Thu, Jan 11, 2018 at 10:54 AM, Robert Raszuk <robert@raszuk.net> wrote:

> Hi Tony,
>
> Thx for elaborating ...
>
> Two small comments:
>
> A) A SID/SR use case in the underlay could be as simple as gracefully taking
> a fabric node out of service. Not much OPEX needed if your NMS is decent.
> Otherwise in normal link state I can do overload bit, in BGP a number of
> solutions from shutdown to MED to LP ... depending on your BGP design.
> In RIFT how do you do that? Note that an overlay (if such exists) does not
> help here.
>
> B) For horizontal links imagine you have servers with 40 Gb ports to TOR.
> Then you have Nx100 Gb from TOR up. You are going to oversubscribe on TOR
> (servers to fabric) most likely 2:1 .. 3:1 etc. So if I want to
> interconnect TORs because I do know that servers behind those TORs need to
> talk to each other in a non-blocking fashion _and_ I have spare 100 Gb
> ports on TORs, having a routing protocol which does not allow me to do that
> seems pretty limited - wouldn't you agree?
>
> Thx,
> R.
>
> On Thu, Jan 11, 2018 at 6:40 PM, Tony Przygienda <tonysietf@gmail.com>
> wrote:
>
>> Robert, productive points, thanks for raising them ... I'll go a bit in depth
>>
>> 1. I saw no _real_ use-cases for SID in DC so far to be frank (once you
>> run RIFT). The only one that comes up regularly is egress engineering and
>> that IMO is equivalent to SID=leaf address (which could be a HV address of
>> course once you have RIFT all the way down to the server) so really, what's
>> the point of having a SID? It's probably much smarter to use IBGP & so on in
>> the overlay to do this kind of synchronization if needed, since labels/SIDs
>> become very useful in overlay to distinguish lots of stuff there like
>> VPNs/services which you'd carry e.g. in MPLSoUDP. In underlay just use the
>> destination v4/v6 address. Having said that, discussion always to be had if
>> you pay me dinner ;--) and I know _how_ we can do SIDs in RIFT since I
>> thought it through but again, no _real_ use case so far. And if your only
>> concern is to "shape towards a prefix" we have PGP in the draft which
>> doesn't need new silicon ;-P And then ultimately, yes, if you really, really
>> want a SID per prefix everywhere then you'll carry SIDs to everywhere since
>> unicast SIDs are really just a glorified way to say "I have this
>> non-aggregatable 20 bit IP host address" which architecturally is a very
>> interesting proposition in terms of scaling (but then again, no accounting
>> for taste and RFC1925 clause 3 applies) ...  Your LSDB will still be much
>> smaller, your SPF will still be simple on leaf in RIFT but your FIB will
>> blow up and anything changing on a leaf shakes all other leafs (unless you
>> start to run policies to control distribution @ which point in time you
>> start to baby-sit your fabric @ high OPEX). One of the reasons to do
>> per-prefix SID would be non-ECMP anycast (where SIDs _are_ in fact useful)
>> but if you read the RIFT draft carefully you will observe that RIFT can do
>> anycast without need for ECMP, i.e. true anycast in a sense, and with that
>> having anycast SID serves no real purpose in RIFT and is actually generally
>> much harder to do since you need globally unique label blocks and so on ...
>>
>> 2. Horizontal links on CLOSes are not normally used that way from all I saw,
>> since your blocking goes to hell unless you provision some kind of really
>> massive parallel links between ToRs _and_ understand your load. We _could_
>> build RIFT that way but you give up balancing through the fabric and
>> loop-free property in a sense (that's a longish discussion) and scaling,
>> since now you have prefixes showing up in all kinds of crazy places instead
>> of default. If I see enough demand, we get there ...  Otherwise RFC1925
>> clauses 10 and 5.
>>
>> 3. PS1: Yes, lots of things "could" be done and then we "could" build a
>> protocol to do that, and RFC1925 clauses 7 and 8 apply. Such horizontal
>> links, unless provisioned correctly, will pretty much just ruin your
>> blocking/loss on the fabric, is the experience (which the math supports). In
>> a sense if you know your big flows you can build a specialized topology to
>> do the optimal distribution (MPLS tunnels anyone ;-) but the point of
>> fabric is that it's a fabric (i.e. load agnostic, cheap, no OPEX and easily
>> scalable). Otherwise a good analogy would be that you like to build special
>> RAM chips for the type of data structures you are storing and we know how
>> well that scales over time. We know now that within 3-4 years
>> characteristics of DC flows flip upside down without a sweat when people go
>> from server/client to microservices, from servers to containers and so on
>> and so on. So if you can't predict your load all the time you need a
>> _regular_ topology where _regular_ is more of a mathematical than a
>> protocol discussion. Fabric analogy of "buy more RAM chips in Fry's and
>> just stick them in" applies here. So RIFT is done largely to serve a
>> well-known structure called a "lattice" (with some restrictions) since we
>> need an "up" and "down". Things like hypercubes, toroidal meshes and so on
>> and so on exist but CLOS won for a very good reason in history for that
>> kind of problem (once you move to NUMA other things win ;-) And if you
>> know your loads and you can heft the OPEX and you like to play with
>> protocols generally and if you can support the scale in terms of leaf FIB
>> sizes, flooding, slower convergence & so on & so on, then you run a flat IGP
>> on some kind of stuff that you build that doesn't even have to be regular in
>> any sense. We spent many years solving THAT problem obviously and doing
>> something like RIFT to replace normal IGP is of limited interest IMO
>> (albeit certain aspects having to do with modern implementation techniques
>> may get us there one day but it's much less of a pressing problem than
>> solving specialized DC routing well, IMO again).
>>
>> 4. PS2: RIFT cannot build an "unsupported topology" no matter how you
>> cable (that's the point of it), or rather we have miscabling detection and
>> do not form adjacencies, as you'll see when you read the draft carefully.
>> That's your "flash red light" and it comes included for free with my
>> compliments  ;-) ... Otherwise RFC1925 clause 10.
>>
>> Otherwise, if you have concrete charter points you'd like to add, be more
>> specific in your asks and we see what the list thinks after ...
>>
>> thanks
>>
>> --- tony
>>
>>
>> On Thu, Jan 11, 2018 at 1:30 AM, Robert Raszuk <robert@raszuk.net> wrote:
>>
>>> Hi,
>>>
>>> I have one little question/doubt on the scalability of RIFT ...
>>>
>>> Assume that someone would like to signal IPv6 prefix SID for Segment
>>> Routing in the underlay within RIFT.
>>>
>>> Wouldn't it result in an amount of protocol state fully analogous to
>>> massive deaggregation - which as of today is designed to be a very careful
>>> and limited operation only at moments of failure(s)?
>>>
>>> I sort of find it a bit surprising that the RIFT draft does not provide
>>> an encoding for SID distribution when it is positioned as an alternative to
>>> other protocols (IGPs or BGP) which already provide the ability to carry
>>> all types of SIDs.
>>>
>>> Cheers,
>>> Robert.
>>>
>>> PS1: Horizontal links which were discussed could be installed to offload
>>> massive amounts of data (ex: storage mirroring) from fabric transit,
>>> directly between leafs or L3 TORs, and not to be treated as "backup".
>>>
>>> PS2: Restricting any protocol to specific topologies seems like a pretty
>>> slippery slope to me. In any case if a protocol does that it should also
>>> contain a self-detection mechanism for "unsupported topology" and flash a
>>> red light in any NOC.