Re: [Dcrouting] draft-przygienda-rift-03

Robert Raszuk <robert@raszuk.net> Sat, 13 January 2018 19:21 UTC

From: Robert Raszuk <robert@raszuk.net>
Date: Sat, 13 Jan 2018 20:21:34 +0100
Message-ID: <CA+b+ERmC0iKDprFqKt8YYv3B6YGKmwM=tjOuvFkYSrLcEbY4gA@mail.gmail.com>
To: Tony Przygienda <tonysietf@gmail.com>
Cc: rift@ietf.org, spring@ietf.org, dcrouting@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/dcrouting/SOlnzKx2HftAbkLzjKhL2VOWTmI>

Hi Tony,

> if you're willing to provision things correctly yourself via S-PGP.

I am not willing to do that. I would like routing to do it for me
auto-magically. But yes, you got the question right. I was asking for
shortcuts between the last levels of the fabric. Not so much between Node 111 &
Node 112 - but you said it is optional, so OK.

Side note: the abbreviation PGP, for the vast majority of people, means a
completely different thing than what you defined it to mean locally in your
draft. I highly recommend you rename it in the -05 version to PGD (policy
guided destination(s)) or PGR (policy guided reachability/routing).

Now, the requirement to switch off miscabling detection is not acceptable. You
are stating that only RIFT knows how I should cable my fabric? And if I
cable it some other way it will be flagged as miscabling? I think in most
cases of correct cabling checks you actually define the intent, then the
network detects whether that intent is met.
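To make the "define the intent, then detect whether it is met" point concrete, here is a toy sketch of such a check. The node names are hypothetical (borrowed from the diagram below), not from the draft, and this is not how RIFT's miscabling detection works internally:

```python
# Toy intent-based cabling check: the operator declares the intended
# adjacencies; the check flags cabled links absent from the intent and
# intended links that are missing, instead of hard-coding "legal" topologies.

intended_links = {
    ("Node111", "Spine21"), ("Node111", "Spine22"),
    ("Node112", "Spine21"), ("Node112", "Spine22"),
    ("Node112", "Node121"),  # a deliberately intended E/W shortcut
}

def check_cabling(observed_links):
    """Return (unexpected, missing) links relative to the declared intent."""
    observed = {tuple(sorted(link)) for link in observed_links}
    intended = {tuple(sorted(link)) for link in intended_links}
    return observed - intended, intended - observed

unexpected, missing = check_cabling([
    ("Node111", "Spine21"), ("Node111", "Spine22"),
    ("Node112", "Spine21"), ("Node112", "Spine22"),
    ("Node111", "Node122"),  # a cable the intent does not cover
])
# unexpected holds the Node111-Node122 cable; missing holds the intended
# Node112-Node121 E/W link that was never plugged in.
```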

> Node112 can actually even go haywire

That may not be the best property of a routing protocol :)

Best,
R.


On Sat, Jan 13, 2018 at 7:16 PM, Tony Przygienda <tonysietf@gmail.com>
wrote:

> So I thought over your horizontal link case to shortcut the spine levels
> for some kind of traffic again and I think the current draft actually
> covers that
> if you're willing to provision things correctly yourself via S-PGP.
> Let me see whether we agree on this picture first:
>
> .                +--------+          +--------+
> .                |        |          |        |          ^ N
> .                |Spine 21|          |Spine 22|          |
> .Level 2         ++-+--+-++          ++-+--+-++        <-*-> E/W
> .                 | |  | |            | |  | |           |
> .             P111/2|  |P121          | |  | |         S v
> .                 ^ ^  ^ ^            | |  | |
> .                 | |  | |            | |  | |
> .  +--------------+ |  +-----------+  | |  | +---------------+
> .  |                |    |         |  | |  |                 |
> . South +-----------------------------+ |  |                 ^
> .  |    |           |    |         |    |  |              All TIEs
> .  0/0  0/0        0/0   +-----------------------------+     |
> .  v    v           v              |    |  |           |     |
> .  |    |           +-+    +<-0/0----------+           |     |
> .  |    |             |    |       |    |              |     |
> .+-+----++ optional +-+----++     ++----+-+           ++-----++
> .|       | E/W link |       +=====+       |           |       |
> .|Node111+----------+Node112|     |Node121|           |Node122|
> .+-+---+-+          ++----+-+     +-+---+-+           ++---+--+
> .  |   |             |   South      |   |              |   |
> .  |   +---0/0--->-----+ 0/0        |   +----------------+ |
> . 0/0                | |  |         |                  | | |
> .  |   +---<-0/0-----+ |  v         |   +--------------+ | |
> .  v   |               |  |         |   |                | |
> .+-+---+-+          +--+--+-+     +-+---+-+          +---+-+-+
> .|       |  (L2L)   |       |     |       |  Level 0 |       |
> .|Leaf111~~~~~~~~~~~~Leaf112|     |Leaf121|          |Leaf122|
> .+-+-----+          +-+---+-+     +--+--+-+          +-+-----+
> .  +                  +    \        /   +              +
> .  Prefix111   Prefix112    \      /   Prefix121    Prefix122
> .                          multi-homed
> .                            Prefix
> .+---------- Pod 1 ---------+     +---------- Pod 2 ---------+
>
> I assume here that what you ask for is the following scenario:
> a) POD1 being a compute generating very heavy load towards storage in
>    Prefix121.
> b) traffic from POD1 NOT being balanced through the spines but taking
>    a horizontal link Node112 to Node121 to reach your storage in
> Prefix121
>    to save bandwidth? or delay?
> c) The key to riches is -04 section 4.2.5.1
> <https://tools.ietf.org/html/draft-przygienda-rift-04#section-4.2.5.1>,
> Northbound SPF,
>
>   in the paragraph "Other south prefixes found when crossing E-W link MAY
> be used IIF". Now, we could make it a MUST (but it's really an
> implementation knob IMO), and what it says is that if you are willing to
> inject an "S-PGP" @ Node121 for Prefix121, it will get flooded to Node112
> and Node112 will have a more specific match than the default in N-SPF. From
> Node121 normal RIFT takes over, since the normal N-Prefix for Leaf121 kicks
> in on Node121. I assume Node112's policy on ingress is to not propagate the
> S-PGP south but to use it for N-SPF only.
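The "more specific match than the default" effect above is plain longest-prefix match on the resulting RIB. A minimal sketch (the prefixes and next-hop labels are hypothetical, not from the draft):

```python
# Node112's RIB after the S-PGP for Prefix121 is installed: the /24
# injected via the E/W link beats the southbound default on prefix length.
import ipaddress

rib = {
    ipaddress.ip_network("0.0.0.0/0"):   "via Spine21/Spine22 (default)",
    ipaddress.ip_network("10.2.1.0/24"): "via Node121 (S-PGP for Prefix121)",
}

def lookup(dst: str) -> str:
    """Longest-prefix match: among covering prefixes, pick the longest."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in rib if addr in net]
    return rib[max(matches, key=lambda net: net.prefixlen)]

# Traffic to storage in Prefix121 takes the E/W shortcut;
# everything else still follows the default towards the spines.
assert lookup("10.2.1.7") == "via Node121 (S-PGP for Prefix121)"
assert lookup("10.3.9.1") == "via Spine21/Spine22 (default)"
```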
>
>
> Observe that
>
> a) you have to switch off miscabling detection for PoD# on those nodes
> since you are "crossing PoDs illegally"
>
> b) if you want whole Pod#1 to do that you either cable Node111 to Node121
> as well (which will load balance whole Pod#1 towards storage without using
> Spine)  OR you propagate S-PGPs south towards leafs (which will cost you
> leaf FIB of course but make sure ALL traffic to storage goes over Node112
> only).
>
> c) Your forwarding on Node112 can actually even go haywire & load-balance
> some of the traffic to Prefix121 using this S-PGP over Node121 and some
> still using the default route towards the spine (which here is a blatant
> violation of LPM, of course) and RIFT will work just fine (unless you loop
> yourself to death with PGPs you install), but that of course is a deep
> rathole in itself. I just mention it to show why the "non-looping" design
> is so important and makes up for the shortcomings of SPF on a fabric,
> predicted by the non-directional mesh property that Dijkstra solved in his
> time.
>
> so?
>
> --- tony
>
>
>
>
>
>
> On Thu, Jan 11, 2018 at 10:54 AM, Robert Raszuk <robert@raszuk.net> wrote:
>
>> Hi Tony,
>>
>> Thx for elaborating ...
>>
>> Two small comments:
>>
>> A) SID/SR use case in underlay could be as simple as gracefully taking a
>> fabric node out of service. Not much OPEX needed if your NMS is decent.
>> Otherwise, in normal link state I can set the overload bit; in BGP there
>> are a number of solutions, from shutdown to MED to LP ... depending on
>> what your BGP design is. In RIFT, how do you do that? Note that the
>> overlay (if such exists) does not help here.
>>
>> B) For horizontal links, imagine you have servers with 40 GB ports to the
>> ToR. Then you have Nx100 GB from the ToR up. You are going to oversubscribe
>> on the ToR (servers to fabric) most likely 2:1 .. 3:1 etc. So if I want to
>> interconnect ToRs because I do know that the servers behind those ToRs need
>> to talk to each other in a non-blocking fashion _and_ I have spare 100 GB
>> ports on the ToRs, having a routing protocol which does not allow me to do
>> that seems pretty limited - wouldn't you agree?
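The oversubscription figures quoted above are just a ratio of downlink to uplink capacity. A back-of-the-envelope sketch, with hypothetical port counts (only the 40G/100G speeds come from the mail):

```python
# ToR oversubscription ratio: total server-facing capacity divided by
# total fabric-facing (uplink) capacity.

def oversubscription(server_ports: int, server_speed_g: int,
                     uplinks: int, uplink_speed_g: int) -> float:
    return (server_ports * server_speed_g) / (uplinks * uplink_speed_g)

# e.g. 20 servers at 40G behind 4x100G uplinks -> 2:1
assert oversubscription(20, 40, 4, 100) == 2.0
# 30 servers at 40G behind the same uplinks -> 3:1
assert oversubscription(30, 40, 4, 100) == 3.0
```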
>>
>> Thx,
>> R.
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Jan 11, 2018 at 6:40 PM, Tony Przygienda <tonysietf@gmail.com>
>> wrote:
>>
>>> Robert, productive points, thanks for raising them ... I go a bit in
>>> depth
>>>
>>> 1. I saw no _real_ use-cases for SID in DC so far to be frank (once you
>>> run RIFT). The only one that comes up regularly is egress engineering and
>>> that IMO is equivalent to SID=leaf address (which could be a HV address of
>>> course once you have RIFT all way down to server) so really, what's the
>>> point to have a SID? It's probably much smarter to use IBGP & so on overlay
>>> to do this kind of synchronization if needed since labels/SIDs become very
>>> useful in overlay to distinguish lots of stuff there like VPNs/services which
>>> you'd carry e.g. in MPLSoUDP. In underlay just use the destination v4/v6
>>> address. Having said that, discussion always to be had if you pay me dinner
>>> ;--) and I know _how_ we can do SIDs in RIFT since I thought it through but
>>> again, no _real_ use case so far. And if your only concern is to "shape
>>> towards a prefix" we have PGP in the draft which doesn't need new silicon
>>> ;-P And then ultimately, yes, if you really, really want a SID per prefix
>>> everywhere then you'll carry  SIDs to everywhere since unicast SIDs are
>>> really just a glorified way to say "I have this non-aggregatable 20-bit IP
>>> host address" which architecturally is a very interesting proposition in
>>> terms of scaling (but then again, no account for taste and RFC1925 clause 3
>>> applies) ...  Your LSDB will be still much smaller, your SPF will be still
>>> simple on leaf in RIFT but your FIB will blow up and anything changing on a
>>> leaf shakes all other leafs (unless you start to run policies to control
>>> distribution @ which point in time you start to baby-sit your fabric @ high
>>> OPEX). One of the reasons to do per-prefix SID would be non-ECMP anycast
>>> (where SIDs _are_ in fact useful) but if you read RIFT draft carefully you
>>> will observe that RIFT can do anycast without need for ECMP, i.e. true
>>> anycast in a sense and with that having anycast SID serves no real purpose
>>> in RIFT and is actually generally much harder to do since you need globally
>>> unique label blocks and so on ...
>>>
>>> 2. Horizontal links on CLOSes are normally not used that way, in all I saw,
>>> since your blocking goes to hell unless you provision some kind of really
>>> massive parallel links between ToRs _and_ understand your load. We _could_
>>> build RIFT that way but you give up balancing through the fabric and
>>> loop-free property in a sense (that's a longish discussion  and scaling
>>> since now you have prefixes showing up all kind of crazy places instead of
>>> default). If I see enough demand, we get there ...  Otherwise RFC1925 clause
>>> 10 and 5.
>>>
>>> 3. PS1: Yes, lots of things "could" be done and then we "could" build a
>>> protocol to do that, and RFC1925 clauses 7 and 8 apply. Such horizontal
>>> links, unless provisioned correctly, will pretty much just ruin your
>>> blocking/loss on the fabric, in my experience (which the math supports). In
>>> a sense if you know your big flows you can build a specialized topology to
>>> do the optimal distribution (MPLS tunnels anyone ;-) but the point of
>>> fabric is that it's a fabric (i.e. load agnostic, cheap, no OPEX and easily
>>> scalable). Otherwise a good analogy would be that you like to build special
>>> RAM chips for the type of data structures you are storing and we know how
>>> well that scales over time. We know now that within 3-4 years
>>> characteristics of DC flows flip upside down without a sweat when people go
>>> from server/client to microservices, from servers to containers and so on
>>> and so on. So if you can't predict your load all the time you need a
>>> _regular_ topology where _regular_ is more of a mathematical than a
>>> protocol discussion. Fabric analogy of "buy more RAM chips in Fry's and
>>> just stick them in" applies here. So RIFT is done largely to serve a
>>> well-known structure called a "lattice" (with some restrictions) since we
>>> need an "up" and "down". Things like hypercubes, toroidal meshes and so on
>>> and so on exist but CLOS won for a very good reason in history for that
>>> kind of problems (once you move to NUMA other things win ;-) And if you
>>> know your loads and you can heft the OPEX and you like to play with
>>> protocols generally and if you can support the scale in terms of leaf FIB
>>> sizes, flooding, slower convergence & so on & so on and you run flat IGP on
>>> some kind of stuff that you build that doesn't even have to be regular in
>>> any sense. We spent many years solving THAT problem obviously and doing
>>> something like RIFT to replace normal IGP is of limited interest IMO
>>> (albeit certain aspects having to do with modern implementation techniques
>>> may get us there one day but it's much less of pressing problem than
>>> solving specialized DC routing well IMO again).
>>>
>>> 3. PS2: RIFT cannot build an "unsupported topology" no matter how you
>>> cable (that's the point of it) or rather we have miscabling detection and
>>> do not form adjacencies when you read the draft carefully. That's your
>>> "flash red light" and it comes included for free with my compliments  ;-)
>>> ... Otherwise RFC1925 clause 10.
>>>
>>> Otherwise, if you have concrete charter points you'd like to add, be
>>> more specific in your asks and we see what the list thinks after ...
>>>
>>> thanks
>>>
>>> --- tony
>>>
>>>
>>> On Thu, Jan 11, 2018 at 1:30 AM, Robert Raszuk <robert@raszuk.net>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have one little question/doubt on the scalability point of RIFT ...
>>>>
>>>> Assume that someone would like to signal IPv6 prefix SID for Segment
>>>> Routing in the underlay within RIFT.
>>>>
>>>> Wouldn't it result in an amount of protocol state in full analogy to
>>>> massive deaggregation - which as of today is designed to be a very careful
>>>> and limited operation, only at moments of failure(s)?
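The state concern can be quantified with a toy model: with aggregation a leaf holds roughly a default route plus its own prefixes, while per-prefix SID distribution is state-wise equivalent to full deaggregation (one entry per prefix, fabric-wide). The counts below are hypothetical:

```python
# Rough leaf FIB size model: aggregated (default + local prefixes)
# versus fully deaggregated (every leaf prefix everywhere), which is
# what fabric-wide per-prefix SID distribution amounts to.

def leaf_fib_entries(leaves: int, prefixes_per_leaf: int,
                     per_prefix_sid: bool) -> int:
    if per_prefix_sid:
        return leaves * prefixes_per_leaf   # every prefix, everywhere
    return 1 + prefixes_per_leaf            # default route + own prefixes

# 1000 leaves, 10 prefixes each: 11 entries aggregated vs 10000 deaggregated.
assert leaf_fib_entries(1000, 10, per_prefix_sid=False) == 11
assert leaf_fib_entries(1000, 10, per_prefix_sid=True) == 10_000
```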
>>>>
>>>> I sort of find it a bit surprising that the RIFT draft does not provide
>>>> an encoding for SID distribution when it is positioned as an alternative
>>>> to other protocols (IGPs or BGP) which already provide the ability to
>>>> carry all types of SIDs.
>>>>
>>>> Cheers,
>>>> Robert.
>>>>
>>>> PS1: Horizontal links which were discussed could be installed to
>>>> offload from fabric transit a massive amount of data (ex: storage
>>>> mirroring) directly between leafs or L3 ToRs, and not to be treated as
>>>> "backup".
>>>>
>>>> PS2: Restricting any protocol to specific topologies seems like a pretty
>>>> slippery slope to me. In any case, if a protocol does that it should also
>>>> contain a self-detection mechanism for "unsupported topology" and flash a
>>>> red light in any NOC.
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dcrouting mailing list
>>>> Dcrouting@ietf.org
>>>> https://www.ietf.org/mailman/listinfo/dcrouting
>>>>
>>>>
>>>
>>
>