Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Zhenghewen <zhenghewen@huawei.com> Fri, 11 October 2013 02:11 UTC

From: Zhenghewen <zhenghewen@huawei.com>
To: 'David Allan I' <david.i.allan@ericsson.com>, 'Thomas Narten' <narten@us.ibm.com>, 'Suresh Krishnan' <suresh.krishnan@ericsson.com>
Date: Fri, 11 Oct 2013 10:11:28 +0800
Cc: 'Internet Area' <int-area@ietf.org>
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi, Dave,

    Let me restate my viewpoint: in the cloud data centre model, the edge nodes will have to learn more addresses (host IP or MAC addresses) from other data centres than in the traditional Internet data centre model, because a larger layer 2 domain exposes more addresses to the edge nodes. How many? Perhaps the addresses of all hosts in the whole layer 2 domain, perhaps only part of them. The more data centres join the cloud, the more addresses the edge nodes must hold.
	
    I use Figure 1 ("Interconnecting TRILL Networks with TRILL-EVPN") of draft-ietf-l2vpn-trill-evpn-01 as the basic topology to explain the scenario, with some annotations added below.
    (In my first attempt to reply, I had to attach the figure as a picture because Outlook could not display it correctly, but that reply was blocked by the IETF mail system with "Message body is too big: 73900 bytes with a limit of 40 KB". So I have re-drawn the figure in plain text below; I hope your mail client renders it. If it does not, please refer to Figure 1 of draft-ietf-l2vpn-trill-evpn-01. If the IETF mail system eventually delivers the first reply, please ignore it; sorry for the duplicate mail.)

                  +-------+      +-----------------------------------+      +-------+
 +------------+   |       |      |                MPLS               |      |       |   +------------+
 | TRILL Edge |   |       |      |                                   |      |       |   | TRILL Edge |
 |    RB#1    |---|       |      |  +-------------+  +-------------+ |      |       |---|    RB#3    |
 |   (TOR)    |   |       |      |  |  MPLS Edge  |  |  MPLS Edge  | |      |       |   |   (TOR)    |
 +------------+   | TRILL |------|  |    PE#1     |  |    PE#2     | |------| TRILL |   +------------+
                  |       |      |  | (DC Gateway)|  | (DC Gateway)| |      |       |
 +------------+   |       |      |  +-------------+  +-------------+ |      |       |   +------------+
 | TRILL Edge |   |       |      |                                   |      |       |   | TRILL Edge |
 |    RB#2    |---|       |      |             Backbone              |      |       |---|    RB#4    |
 |   (TOR)    |   |       |      |                                   |      |       |   |   (TOR)    |
 +------------+   +-------+      +-----------------------------------+      +-------+   +------------+

 |<--------Data Centre #1--------->|<------------Core------------->|<--------Data Centre #2--------->|

 |<-------------IS-IS------------->|<-------------BGP------------->|<-------------IS-IS------------->|  CP

 |<--------------------------------------------- TRILL --------------------------------------------->|  DP
                                   |<------------MPLS------------->|

    There are two data centres (Data Centre #1 and Data Centre #2) interconnected via an MPLS-based L2VPN to form one cloud data centre. The layer 2 domain spanning both data centres is larger than in either single data centre, so the edge nodes will necessarily learn more hosts' addresses.
    TRILL-EVPN (draft-ietf-l2vpn-trill-evpn-01) uses C-MAC address transparency to reduce the FDB burden on the MPLS edge nodes; it can do so only because the MPLS edges do not act as TRILL edges.
    But we have not seen any solution that addresses the FDB burden on the TRILL edge nodes (e.g., the TORs).

Hewen

-----Original Message-----
From: David Allan I [mailto:david.i.allan@ericsson.com] 
Sent: Friday, October 11, 2013 4:10 AM
To: Zhenghewen; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: RE: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi Hewen:

My point would be that something like MACinMAC (or another tunneling mechanism that implements VLANs) not only hides state from the core; the state at the edge is also a function of the sum of endpoints in the set of locally attached VLANs, not of all the MACs in all VLANs supported by the whole network. So I am not seeing a scenario whereby any edge device sees all MACs in a network, just a fraction of them... even when the edge device grows rather large...

If you have a use case whereby this is not true (and not just a consequence of extrapolating a bad design) I would be interested.

I hope that is clearer
Dave

-----Original Message-----
From: Zhenghewen [mailto:zhenghewen@huawei.com] 
Sent: Wednesday, October 09, 2013 7:43 PM
To: David Allan I; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: RE: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi, Dave,

    It is true that MPLS, PBB, or TRILL can use a shim so that intermediate nodes need not be aware of the addresses in the inner payload. That reduces the FDB burden on the intermediate nodes, but it does not help the edge nodes. Those solutions do not solve the MAC address scalability issue of the edge nodes, which include the TOR switches and DC gateways.
    Each host consumes at least one MAC entry in the FDB of the edge nodes; VLANs only divide those MAC addresses into groups, they do not reduce the total consumption.
    Some L2VPN solutions, such as PBB or VPLS, solve the MAC address scalability issue of the core network, but the scalability issue of the edge nodes is out of their scope; for example, we have not seen a solution to the C-MAC scalability issue on the CE or DC gateway.
    In the cloud era, network virtualization breaks a subnet into many pieces, so edge nodes (e.g., the DC gateway or TOR) will have to learn MAC addresses from other data centres, for part or all of the hosts there. That is a fact.

Hewen

-----Original Message-----
From: David Allan I [mailto:david.i.allan@ericsson.com] 
Sent: Wednesday, October 09, 2013 10:01 PM
To: Zhenghewen; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: RE: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

My point was I would not have a design whereby every MAC for every vNIC in a cloud had to be in every FDB, which is what I believe you are describing.

I do not think I've seen mention of how SARP plays into an Ethernet active topology or VLANs. But if  you ran a single instance of spanning tree and a single VLAN and each VM had a sufficiently diverse set of peers that MAC table entries never aged out, you would end up with the scenario you described.  I do not think that is a realistic scenario.

Once it is partitioned into multiple VLANs and uses MSTP or SPBV, the number of MACs in any individual FDB starts to divide; if you add MACinMAC/SPBM, it divides significantly further, by orders of magnitude. The set of MACs a TOR would see would be a fraction of the whole network even in extreme scenarios. It is not a simple linear sum of endpoints.
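
A rough numeric sketch of this division argument; every input below is an assumed round figure for illustration only, not data from this thread:

    # Back-of-envelope: how VLAN partitioning and MACinMAC (802.1ah) shrink
    # the FDB any single edge device has to hold.  All inputs are assumptions.

    TOTAL_MACS    = 10_000_000   # all VM MACs in the whole network (assumed)
    NUM_VLANS     = 4000         # VLANs/I-SIDs the network is split into (assumed)
    VLANS_PER_TOR = 50           # VLANs actually attached to one TOR (assumed)
    BACKBONE_MACS = 2000         # B-MACs visible in a MACinMAC core (assumed)

    # Flat single-VLAN network: every FDB may need every MAC.
    flat_fdb = TOTAL_MACS

    # Partitioned: a TOR only learns MACs in its locally attached VLANs
    # (assuming MACs are spread roughly evenly across VLANs).
    per_vlan = TOTAL_MACS / NUM_VLANS
    tor_fdb  = VLANS_PER_TOR * per_vlan

    # MACinMAC core: core FDBs hold B-MACs only, not customer MACs.
    core_fdb = BACKBONE_MACS

    print(f"flat network FDB          : {flat_fdb:>12,.0f} entries")
    print(f"TOR FDB after partitioning: {tor_fdb:>12,.0f} entries")
    print(f"core FDB with MACinMAC    : {core_fdb:>12,.0f} entries")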

Dave

-----Original Message-----
From: Zhenghewen [mailto:zhenghewen@huawei.com] 
Sent: Tuesday, October 08, 2013 6:11 PM
To: David Allan I; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: RE: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi, Dave,
	
    I do not follow your point, and I think perhaps you have not understood mine.
    My description shows that current FDB tables are small, especially for a cloud data centre; it has nothing to do with VLANs. It is not about one big subnet ("subnet" here meaning IP subnet) of 10M MACs; it is only about a larger layer 2 domain, for example an operator providing layer 2 service across data centres for tenants, where those tenants can own separate and overlapping IP address spaces.
    I do not see any other solution for this issue; could you provide some information? Thanks.

Hewen

-----Original Message-----
From: int-area-bounces@ietf.org [mailto:int-area-bounces@ietf.org] On Behalf Of David Allan I
Sent: Tuesday, October 08, 2013 10:12 PM
To: Zhenghewen; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

I'm confused here, is the problem you are trying to solve one big subnet of 10M MACs? Your message would seem to suggest this.

IMO the problem is a large number of VLANs with a much much smaller number of MACs per VLAN, which makes the problem divisible and tractable with current technology. I consider that a solved problem.

Dave



-----Original Message-----
From: int-area-bounces@ietf.org [mailto:int-area-bounces@ietf.org] On Behalf Of Zhenghewen
Sent: Tuesday, October 08, 2013 5:08 AM
To: 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi Thomas,

> 1) As DaveA points out, FDB tables are large on chips these days.

[Hewen] In the real world, FDB tables are small, especially for large data centres in the cloud era.
    Engineers have at least two choices for implementing the FDB table: TCAM or DRAM. TCAM is very expensive and consumes more power (and therefore runs hot), which limits the feasible TCAM capacity on a board, so TCAM is not a good choice for very large tables. DRAM is more common and cheap, but DRAM chips are limited in capacity and speed; a single DRAM chip on the market today typically tops out at 4 Gb.
    Now, let's assume the packet lookup chipset uses 128-bit search/hash items and that one MAC lookup requires at least 3 items; then a 4 Gb DRAM chip can support roughly 10M MAC entries.
    In the real world, one operator has 80 data centres within a region about 1000 km across; each data centre has on average 1,500 racks, each rack has on average 20 servers, and each server hosts 20 VMs. That means up to 48 million MAC entries. Even if only 20% of the VMs send traffic outside their local data centre, that is still about 10M MAC entries.
    It seems that a single DRAM chip would be enough, but there are other obvious limitations:
    First, is 20% a safe assumption? And is the entire capacity of the DRAM really available for MAC entries (nothing for the L3 FIB, nothing for ACLs)?
    Second, each DRAM gives the packet lookup chipset roughly one SerDes link at 12.5 Gbps. If we want 125 Gbps line rate, we must place 10 copies of the DRAM holding the 10M MAC entries; if we want 500 Gbps line rate, that means 40 DRAM chips for the same 10M entries. That is not feasible for engineers, especially considering the available board space, power consumption, and number of SerDes links.
    So, as we see, current FDB tables are small, especially for a cloud data centre; a high-speed switch or router usually has only a limited table size, for example 128K or 512K MAC entries.
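
A minimal back-of-envelope sketch of the arithmetic above, reusing the figures quoted in this mail (128-bit items, 3 items per MAC lookup, 12.5 Gbps per SerDes link, 80 x 1,500 x 20 x 20 VMs); treat them as rough assumptions rather than vendor data:

    # Rough FDB sizing sketch using the figures quoted in this mail.
    # All inputs are assumptions / round numbers, not vendor data.

    GBIT = 10**9

    # How many MAC entries fit in one 4 Gb DRAM chip?
    dram_bits        = 4 * GBIT
    bits_per_item    = 128          # assumed search/hash item size
    items_per_entry  = 3            # assumed items per MAC lookup
    entries_per_chip = dram_bits // (bits_per_item * items_per_entry)
    print(f"MAC entries per 4 Gb DRAM : {entries_per_chip:,}")       # ~10.4M

    # How many MAC entries does the example cloud need?
    data_centres, racks, servers, vms = 80, 1500, 20, 20
    total_vms = data_centres * racks * servers * vms
    print(f"total VM MACs in the cloud: {total_vms:,}")              # 48,000,000
    print(f"20% active off-site       : {int(total_vms * 0.2):,}")   # ~9.6M

    # How many DRAM copies for line rate, at ~12.5 Gbps per SerDes link
    # (i.e., one lookup stream per chip)?
    serdes_gbps = 12.5
    for line_rate in (125, 500):
        copies = line_rate / serdes_gbps
        print(f"{line_rate} Gbps line rate needs ~{copies:.0f} DRAM copies of the table")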

    Perhaps growth in DRAM capacity or speed will solve this issue, but according to RFC 4984, "Historically, DRAM capacity grows at about 4x every 3.3 years.  This translates to 2.4x every 2 years, so DRAM capacity actually grows faster than Moore's law would suggest.  DRAM speed, however, only grows about 10% per year, or 1.2x every 2 years". It seems we will need several years before a single high-speed DRAM chip can support 50 million MAC entries.
    A solution to this issue will be valuable until high-speed, high-capacity DRAM chips arrive.
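
A rough extrapolation of "some years", reusing the 384-bit-per-entry assumption from above and the RFC 4984 growth rates; this only illustrates the trend, it is not a forecast:

    # How long until one DRAM chip holds 50M MAC entries, and until one chip
    # is fast enough?  Growth rates are the RFC 4984 figures quoted above;
    # the 384-bit entry footprint is the assumption used earlier in this mail.
    import math

    bits_per_entry  = 384
    target_entries  = 50_000_000
    start_chip_bits = 4 * 10**9                  # 4 Gb chip today

    # Capacity: 4x every 3.3 years.
    needed_growth  = target_entries * bits_per_entry / start_chip_bits   # ~4.8x
    years_capacity = math.log(needed_growth, 4) * 3.3
    print(f"capacity catches up in ~{years_capacity:.1f} years")         # ~3.7

    # Speed: only ~10% per year, so the lookup-bandwidth gap closes much
    # more slowly, e.g. growing from 1 chip-equivalent of bandwidth to the
    # 10x needed for 125 Gbps line rate:
    years_speed = math.log(10, 1.10)
    print(f"speed catches up in ~{years_speed:.0f} years")               # ~24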

Best regards,
Hewen

-----Original Message-----
From: int-area-bounces@ietf.org [mailto:int-area-bounces@ietf.org] On Behalf Of Thomas Narten
Sent: Saturday, September 28, 2013 5:28 AM
To: Suresh Krishnan
Cc: Internet Area
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Given the flurry of mail, I went back and reviewed my notes on this document.

I believe I understand the problem (reduce size of FDB), but am skeptical that solving it via the approach in this document is worthwhile.

1) As DaveA points out, FDB tables are large on chips these days. However, at least initially, SARP would be implemented in the service processor (consuming CPU cycles). It would be years (if ever) before silicon would implement this. But in parallel, silicon will just get better and have bigger FDB tables... Moreover, the SARP motivation section specifically cites a shortage of processor cycles as a problem... but doing SARP in software will increase the demands on the CPU... It's unclear to me that the increased CPU burden SARP implies would be offset by the reduced amount of ARP processing that the solution is hoping will result... I.e., at least in the short term, it's unclear that SARP would actually help in practice.

2) Doing L2 NAT at line rates (an absolute requirement for edge
devices) will only happen if this is done in silicon. I don't see that happening unless there is strong support from vendors/operators/chip makers... Software based solutions simply will not have sufficient performance. I think the IEEE would be a better place to have the discussion about what L2 chip makers are willing to implement...

3) Only works for IP. Doesn't work for non-IP L2. Doesn't that mean it only solves part of the problem?

4) High availability is complex (but presumably also necessary). It smells a lot like multi-chassis LAG (with the need to synchronize state between tightly-coupled peers). Although IEEE is working in this general area now, there are tons of vendor-specific solutions for doing this at L2. Do we really want to tackle standardizing this in the IETF?  Isn't the relevant expertise for this general area in IEEE?

5) This solution touches on both L2 (NATing L2 addresses) and L3 (ARPing for IP addresses). We absolutely would need to coordinate with IEEE before deciding to take on this work.

6) ARP caching will be a tradeoff. You want to cache responses for better performance, but long cache timeouts will result in black holes after VMs move. There is no great answer here. I expect long timeouts to be unacceptable operationally, which means that the benefits of caching will be limited (and these benefits are the key value proposition of this approach). It is really hard to estimate whether the benefits will be sufficient in practice. Gratuitous ARPs can help, but they are not reliable, so they have limitations...
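
A toy model of this tradeoff (all numbers below are invented for illustration): longer cache timeouts cut the ARP refresh load on a proxy, but stretch the worst-case window during which traffic to a moved VM can be black-holed:

    # Toy model of the caching trade-off in point 6: longer proxy cache
    # timeouts mean fewer ARP refreshes hitting the control plane, but a
    # longer worst-case window of stale (black-holed) traffic after a VM
    # move, absent a reliable gratuitous ARP.  All inputs are made up.

    peers_per_proxy  = 50_000    # remote IP/MAC bindings one proxy resolves (assumed)
    queries_per_peer = 1         # one refresh per peer per timeout period (assumed)

    for timeout_s in (30, 300, 3600):
        refresh_rate = peers_per_proxy * queries_per_peer / timeout_s
        print(f"timeout {timeout_s:>5}s: ~{refresh_rate:8.1f} ARP refreshes/s, "
              f"worst-case stale window {timeout_s}s after a VM move")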

7) I haven't fully worked this out, but I wonder if loops can form between proxies. There is the notion that when a VM moves, proxies will need to update their tables. But careful analysis will be needed to be sure that one can't ever have loops where proxies end up pointing to each other. And since the packets are L2 (with no TTL), such loops would be disastrous. (Note point 6: we are using heuristics (like gratuitous ARP) to get tables to converge after a change. Heuristics tend to have transient inconsistencies... i.e., possibly leading to loops.)

Again, overall, I understand the generic problem and that it would be nice to have a solution. However, I don't see a simple solution here. I see a fair amount of complexity, and I'm skeptical that it's worth it (e.g., when the next gen of silicon will just have a larger FDB).

What I'd really like to see (before having the IETF commit to this
work) is:

1) operators who are feeling the pain described in the document stepping forward and saying they think the solution being proposed is something they would be willing to deploy and is better than other approaches they have in their toolkit.

2) Vendors (including silicon) saying they see the need for this and think they would implement it.

Thomas

_______________________________________________
Int-area mailing list
Int-area@ietf.org
https://www.ietf.org/mailman/listinfo/int-area