RE: [Ipoverib] Please read - proposed WG termination
Vivek Kashyap <kashyapv@us.ibm.com> Fri, 02 September 2005 01:07 UTC
Date: Thu, 01 Sep 2005 17:58:30 -0700
From: Vivek Kashyap <kashyapv@us.ibm.com>
X-X-Sender: kashyapv@localhost.localdomain
To: Michael Krause <krause@cup.hp.com>
Subject: RE: [Ipoverib] Please read - proposed WG termination
In-Reply-To: <6.2.0.14.2.20050901115228.028d4458@esmail.cup.hp.com>
Message-ID: <Pine.LNX.4.62.0509011446400.16505@localhost.localdomain>
References: <200509011818.j81IIPJV251312@jurassic.eng.sun.com> <6.2.0.14.2.20050901115228.028d4458@esmail.cup.hp.com>
Cc: margaret@thingmagic.com, Bill_Strahm@McAfee.com, "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>, gdror@mellanox.co.il, ipoverib-bounces@ietf.org, ipoverib@ietf.org
On Thu, 1 Sep 2005, Michael Krause wrote:

> At 11:19 AM 9/1/2005, H.K. Jerry Chu wrote:
>> [co-chair hat off]
>>
>> ...
>> <snip>
>>
>> >These performance problems are primarily implementation-specific and have
>> >little to do with IB technology itself. In addition, nearly all IB
>> >solutions use a 2KB not the smallest MTU to transfer data - no different
>> >than Ethernet.
>>
>> Ethernet is adopting jumboframe to get more firing power. Where is IB's
>> equivalent of jumboframe?
>
> Jumboframe causes QoS issues for multiple flows. The whole reason for the
> variable MTU was to allow people to configure the different virtual lanes /
> paths to meet QoS concerns. Examination of the hardware-based SAR provided
> by IB illustrates that there is no definitive performance difference between
> using a 2KB per and say a larger 9KB MTU. BTW, examination of many IP
> workloads shows that majority of the units of work exchange fit quite nicely
> in a 2KB payload. Those that are larger, are easily SAR without impacting
> either side of the communication, i.e. requiring a connection to be
> established.

NFS will happily use a 32K block size, and it is the default on some NFS
configs. There will be other examples, though I agree that a large workload
set is probably satisfied with the 2KB MTU. That is why the IPoIB-CM portion
is proposed as an optional, interoperable solution for networks that need it.

>
> Keep in mind that Ethernet is also doing other activities to solve various
> issues - multi-path support to compensate for the spanning tree limitations,
> conceptually similar virtual lanes are being investigated, improved security,
> link-level congestion management, etc. All of these changes combined with de
> facto use of large send / receive-side scaling, etc. will likely make the
> issue of 9KB rather moot except for bulk data transfers in the WAN (even then
> the QoS issues of multiple flows across a common link will likely limit the
> use of 9KB).
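For a rough sense of scale on the 32K NFS block size mentioned above, the
sketch below counts the IPv4 fragments needed to carry one 32 KB block as a
single UDP datagram at a few link MTUs. The function name and the MTU values
are assumptions for illustration only: 2044 is taken as a 2 KB IB MTU less a
4-byte encapsulation header, 9000 is the usual Ethernet jumbo size, and 65520
stands in for a large connected-mode MTU. None of these figures come from this
thread.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def ipv4_fragments(payload_bytes, mtu):
    """Count IPv4 fragments for one UDP datagram carrying payload_bytes."""
    datagram = UDP_HEADER + payload_bytes   # data carried inside the IP packet
    last_capacity = mtu - IPV4_HEADER       # the final fragment may be any size
    if datagram <= last_capacity:
        return 1
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = last_capacity // 8 * 8
    return 1 + math.ceil((datagram - last_capacity) / per_fragment)


if __name__ == "__main__":
    # Hypothetical MTUs: IPoIB-UD style, Ethernet jumbo, large connected mode.
    for mtu in (2044, 9000, 65520):
        print(mtu, ipv4_fragments(32 * 1024, mtu))
```

Under these assumptions one 32 KB block takes 17 fragments at 2044, 4 at 9000,
and a single packet at 65520. This says nothing about whether hardware SAR or
LSO hides that cost, which is the point under debate here.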
>
> In the end, IB HCA implementations can adopt the same techniques as Ethernet
> and deliver the same performance as a connected model without the overhead /

However, connected mode is already available to us in all HCAs. Why not use it
as it is?

> complexity and ecosystem cost of a new specification. IB already has enough
> problems getting the basics into the world today. Adding in yet another spec
> isn't going to solve these problems or make life easier - too many things
> within the ecosystem are impacted.
>
>> >As I and others have raised over the years, the enablement
>> >of IP over IB to perform well is a local HCA issue not a standards
>> >issue. Addition of checksum off-load support to the HCA is rather trivial
>> >and does not require standardization (this is what is done for Ethernet
>> >today and is non-standard). Addition of large send off-load support is a
>> >local HCA issue not a standards issue and effectively provides the same
>> >benefit as connected mode.
>>
>> Yes LSO (or TSO as some call it) is relatively easy. But LRO (large receive
>> offload) is a heck more difficult. IB connected transports already have all
>> silicons to do it. Why not just use it?
>
> Large receive off-load isn't that hard - really an implementation problem
> from most of the designs I've examined. I helped implement something
> equivalent to TSO and LRO over 10 years ago and it wasn't that bad or
> expensive. As for why not just use it? Simple, cost to specify, implement,
> get adopted within the industry is already high for IB. Adding more to the
> stack of things when there is questionable benefit compared to simply
> re-using the existing Ethernet infrastructure seems like a poor choice of
> resource expenditure. Of course, most people only care about making Linux
> happen these days so I guess feel free to knock yourselves out if that is all
> that is required.
>
>> >The use of multiple QP to spread work across
>> >CPU for both send / receive ala the multi-queue support I've worked with
>> >various Ethernet IHV to get in place is again a local HCA issue (does not
>> >have to be visible as part of the layer 2 address resolution). One can
>> >construct a very nice performing IP over IB solution but there hasn't been
>> >much public progress to implement these de facto capabilities found in
>> >Ethernet solutions on IB. Getting these into a HCA implementation is a
>> >heck of a lot easier and faster to do than to develop a standard and
>> >getting all of the OS changes made (the HCA implementation issues can all
>> >be done underneath the IP stack just like with Ethernet so no real OS
>> >impacts).
>>
>> I don't understand the large MTU issue to the OS (requiring continguous
>> physical addresses). Aren't all decent hardware capable of scatter/gather
>> these days?
>
> Nothing above indicates any issue regarding physical addresses. IB hardware
> as does Ethernet comprehends V-to-P mapping and makes no requirements on it
> being physically contiguous beyond the base physical page size.
>
>> What's more hairy to the OS stack is the per-destination MTU and different
>> MTU for multicast than for unicast inherited in IPoIB CM.

Even as it is now, a subnet can be set up with a common MTU, leaving the
'per-connection MTU' to implementations that want to use it. Such setups have
existed before in many stacks, for example to support token rings with
different MTUs.

The WG discussed making the IB mode a link characteristic, i.e. IPoIB-UD and
IPoIB-CM connectivity would be through a router. The WG at the time chose
(see late 2004/early 2005 threads) a common subnet for both. However, if we
resurrect the separate subnet proposal, a common MTU becomes easier across any
given subnet.

Multicast will stay limited to the physical MTU, however. It is a tossup
between using a different mechanism for multicasting or using what the medium
provides.
Also, it is easier to make the unicast to multicast distinction in a stack.
For example, BSD stacks by default used to limit multicast to the physical
MTU, whereas they allowed IP packets to be larger than it (and those were
fragmented in the case of Ethernet).

Vivek

>
> Agreed. This is one aspect of the additional complexity that is simply
> easier to avoid. The cost / benefit of connected mode is quite questionable
> compared to leveraging the existing Ethernet implementation concepts and
> associated infrastructure.
>
> Mike
>
>> Jerry
>>
>> >
>> >>For commercial clusters, if IB is used for storage, then you save a network
>> >>by having fast IP performance and can use the IB network for both. Why use
>> >>IB and another network for the commercial cluster, when the other network
>> >>supports similar bandwidth for storage and IP.
>> >
>> >There will always be Ethernet in any cluster so the fabric is there. The
>> >question is whether it is just for low-bandwidth / management services or
>> >for applications. For storage, need to separate the discussion into
>> >whether it is block or file. For block, IB gateways to Fibre Channel, etc.
>> >can and are being used today quite nicely. Performance is reasonable and
>> >the ecosystem costs, target availability, customer "pain", etc. are much
>> >lower than attempting to move to native IB storage. The same applies to
>> >file based where IB gateways to Ethernet which then attaches to file
>> >servers works quite nicely. In fact, the original vision of IB was that of
>> >an I/O fabric to create modular server solutions. The addition of IPC came
>> >later in the process when it was found to be relatively low cost to
>> >define. So, IB is successful in the HPC world and slowly entering some
>> >commercial solutions. To state that its future relies on getting an IP
>> >over IB RC solution is perhaps blowing it a bit out of proportion.
>> >The easier path for all is to simply use the techniques I and others have
>> >advocated for years now and solve the problems within the HCA
>> >implementation. Much lower costs and will result in delivering a good
>> >performance solution.
>> >
>> >BTW, RNIC / Ethernet solutions implement these techniques today. With the
>> >arrival of 10 GbE and the lower prices of RNIC and 10 GbE switch ports,
>> >lower latency switches (competitive enough with IB for commercial and many
>> >HPC clusters), etc. the success of IB must lie elsewhere and not on an IETF
>> >spec. This was noted at the recent IEEE Hot Interconnects conference as
>> >well so isn't just my opinion.
>> >
>> >Mike
>> >
>> >>Implementing IPoIB-CM makes IB viable in the HPC cluster and some
>> >>commercial clusters. Otherwise I don't think it competes economically with
>> >>other network technologies.
>> >>
>> >>Regards.
>> >>
>> >>Bernie King-Smith
>> >>IBM Corporation
>> >>Server Group
>> >>Cluster System Performance
>> >>wombat2@us.ibm.com (845)433-8483
>> >>Tie. 293-8483 or wombat2 on NOTES
>> >>
>> >>"We are not responsible for the world we are born into, only for the world
>> >>we leave when we die.
>> >>So we have to accept what has gone before us and work to change the only
>> >>thing we can,
>> >>-- The Future." William Shatner
>> >>
>> >>Dror Goldenberg <gdror@mellanox.co.il>
>> >>Sent by: ipoverib-bounces@ietf.org
>> >>08/30/2005 09:32 AM
>> >>To: kashyapv@us.ltcfwd.linux.ibm.com, "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>
>> >>cc: margaret@thingmagic.com, ipoverib@ietf.org, Bill_Strahm@McAfee.com
>> >>Subject: RE: [Ipoverib] Please read - proposed WG termination
>> >>
>> >> > From: Vivek Kashyap [mailto:kashyapv@us.ibm.com]
>> >> > Sent: Tuesday, August 30, 2005 8:39 AM
>> >> >
>> >> > On Mon, 29 Aug 2005, H.K. Jerry Chu wrote:
>> >> >
>> >> ><snip>
>> >> >
>> >> > > 1. IPoIB connected mode draft-ietf-ipoib-connected-mode-00.txt
>> >> > > updated recently
>> >> >
>> >> > Well, in recent days there has been a discussion going on based on
>> >> > Dror's input. I also made some updates after some discussion on
>> >> > OpenIB (not on IETF though). This draft itself became a working group
>> >> > draft this february after some lively discussion just before that. It
>> >> > appears to me that we should be possible to finalise this draft soon
>> >> > enough.
>> >> >
>> >> > 20th sept. might be long enough to know one way or the other...
>> >> >
>> >> > vivek
>> >> >
>> >>
>> >>We would like to see IPoIB-CM being finalized in IETF. We see
>> >>great value in having a standard for connected mode which effectively
>> >>increases the MTU. We are willing to contribute to the standardization
>> >>effort. We're also looking at the implementation of IPoIB-CM in Linux.
>> >>
>> >>-Dror
>> >>_______________________________________________
>> >>IPoverIB mailing list
>> >>IPoverIB@ietf.org
>> >>https://www1.ietf.org/mailman/listinfo/ipoverib

_______________________________________________
IPoverIB mailing list
IPoverIB@ietf.org
https://www1.ietf.org/mailman/listinfo/ipoverib
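The unicast/multicast MTU distinction discussed in this message can be
sketched in a few lines. This is only an illustration of the idea, not any
stack's actual code: effective_mtu, the per-peer table, and the 2044-byte link
MTU are hypothetical names and values chosen for the example.

```python
import ipaddress

LINK_MTU = 2044  # assumed physical (datagram-mode) MTU for the subnet


def effective_mtu(dst, cm_mtu_by_peer, link_mtu=LINK_MTU):
    """Pick the MTU for a destination address.

    Multicast is always capped at the link MTU. Unicast uses a
    per-connection MTU when one has been negotiated with that peer,
    and falls back to the common link MTU otherwise.
    """
    if ipaddress.ip_address(dst).is_multicast:
        return link_mtu
    return cm_mtu_by_peer.get(dst, link_mtu)


if __name__ == "__main__":
    # Hypothetical table: one peer with a connected-mode link and a 32 KB MTU.
    peers = {"10.0.0.2": 32768}
    for dst in ("10.0.0.2", "10.0.0.3", "224.0.0.1"):
        print(dst, effective_mtu(dst, peers))
```

A subnet that wants a single common MTU simply leaves the per-peer table
empty, in which case every destination gets the link MTU. Limited broadcast
is not treated specially in this sketch.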