RE: [Ipoverib] Please read - proposed WG termination

Vivek Kashyap <kashyapv@us.ibm.com> Fri, 02 September 2005 01:07 UTC

Date: Thu, 01 Sep 2005 17:58:30 -0700
From: Vivek Kashyap <kashyapv@us.ibm.com>
To: Michael Krause <krause@cup.hp.com>
Subject: RE: [Ipoverib] Please read - proposed WG termination
In-Reply-To: <6.2.0.14.2.20050901115228.028d4458@esmail.cup.hp.com>
Message-ID: <Pine.LNX.4.62.0509011446400.16505@localhost.localdomain>
References: <200509011818.j81IIPJV251312@jurassic.eng.sun.com> <6.2.0.14.2.20050901115228.028d4458@esmail.cup.hp.com>
Cc: margaret@thingmagic.com, Bill_Strahm@McAfee.com, "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>, gdror@mellanox.co.il, ipoverib-bounces@ietf.org, ipoverib@ietf.org

On Thu, 1 Sep 2005, Michael Krause wrote:

> At 11:19 AM 9/1/2005, H.K. Jerry Chu wrote:
>> [co-chair hat off]
>> 
>> ...
>> <snip>
>> 
>> >These performance problems are primarily implementation-specific and have
>> >little to do with IB technology itself.  In addition, nearly all IB
>> >solutions use a 2KB MTU, not the smallest, to transfer data - no different
>> >than Ethernet.
>> 
>> Ethernet is adopting jumbo frames to get more firepower. Where is IB's
>> equivalent of the jumbo frame?
>
> Jumbo frames cause QoS issues for multiple flows.  The whole reason for the 
> variable MTU was to allow people to configure the different virtual lanes / 
> paths to meet QoS concerns.  Examination of the hardware-based SAR provided 
> by IB illustrates that there is no definitive performance difference between 
> using a 2KB MTU and, say, a larger 9KB MTU.  BTW, examination of many IP 
> workloads shows that the majority of the units of work exchanged fit quite 
> nicely in a 2KB payload.  Those that are larger are easily SAR'd without 
> impacting either side of the communication, i.e. without requiring a 
> connection to be established.

NFS will happily use a 32K block size, and it is the default on some NFS configs.
There will be other examples, though I agree that a large set of workloads is 
probably satisfied with the 2KB MTU. That is why the IPoIB-CM portion is proposed 
as an optional, interoperable solution for networks that need it.
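
To put a rough number on it (purely a back-of-the-envelope sketch; the 2044-byte
UD payload MTU and 65520-byte connected-mode MTU below are assumed example values,
not taken from the drafts), compare the datagram counts for a single 32K NFS block
carried over UDP:

/* Sketch: compare IP fragment counts for a 32KB NFS block over a small
 * UD-style MTU versus a large connected-mode MTU.  MTU values are
 * illustrative assumptions only. */
#include <stdio.h>

#define IP_HDR      20          /* IPv4 header, no options           */
#define UDP_HDR     8           /* UDP header rides in fragment 0    */
#define NFS_BLOCK   (32 * 1024) /* 32KB NFS read/write block         */

/* Fragments needed to carry one UDP datagram of 'len' payload bytes over a
 * link with the given MTU; non-final fragment payloads must be multiples
 * of 8 bytes because fragment offsets are in 8-byte units. */
static int ip_fragments(int len, int mtu)
{
    int per_frag = ((mtu - IP_HDR) / 8) * 8;   /* usable payload per fragment */
    int total = len + UDP_HDR;
    return (total + per_frag - 1) / per_frag;
}

int main(void)
{
    printf("2044-byte MTU:  %d fragments per 32KB block\n",
           ip_fragments(NFS_BLOCK, 2044));
    printf("65520-byte MTU: %d fragment(s) per 32KB block\n",
           ip_fragments(NFS_BLOCK, 65520));
    return 0;
}

Either way the block gets across, but the small-MTU case costs roughly 17 packets
(and their per-packet processing) where the large MTU needs one.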

>
> Keep in mind that Ethernet is also doing other activities to solve various 
> issues - multi-path support to compensate for the spanning tree limitations, 
> conceptually similar virtual lanes are being investigated, improved security, 
> link-level congestion management, etc.  All of these changes combined with de 
> facto use of large send / receive-side scaling, etc. will likely make the 
> issue of 9KB rather moot except for bulk data transfers in the WAN (even then 
> the QoS issues of multiple flows across a common link will likely limit the 
> use of 9KB).
>
> In the end, IB HCA implementations can adopt the same techniques as Ethernet 
> and deliver the same performance as a connected model without the overhead /

However, connected mode is already available to us in all HCAs. Why not use it
as it is?

> complexity and ecosystem cost of a new specification.  IB already has enough 
> problems getting the basics into the world today.  Adding in yet another spec 
> isn't going to solve these problems or make life easier - too many things 
> within the ecosystem are impacted.
>
>
>> >As I and others have raised over the years, the enablement
>> >of IP over IB to perform well is a local HCA issue not a standards
>> >issue.  Addition of checksum off-load support to the HCA is rather trivial
>> >and does not require standardization (this is what is done for Ethernet
>> >today and is non-standard).  Addition of large send off-load support is a
>> >local HCA issue not a standards issue and effectively provides the same
>> >benefit as connected mode.
>> 
>> Yes, LSO (or TSO as some call it) is relatively easy. But LRO (large receive
>> offload) is a heck of a lot more difficult. IB connected transports already
>> have all the silicon to do it. Why not just use it?
>
> Large receive off-load isn't that hard - really an implementation problem 
> from most of the designs I've examined.  I helped implement something 
> equivalent to TSO and LRO over 10 years ago and it wasn't that bad or 
> expensive.  As for why not just use it?  Simple: the cost to specify, implement, 
> and get adopted within the industry is already high for IB.  Adding more to the 
> stack of things when there is questionable benefit compared to simply 
> re-using the existing Ethernet infrastructure seems like a poor choice of 
> resource expenditure.  Of course, most people only care about making Linux 
> happen these days so I guess feel free to knock yourselves out if that is all 
> that is required.
>
>
>> >The use of multiple QP to spread work across
>> >CPU for both send / receive ala the multi-queue support I've worked with
>> >various Ethernet IHV to get in place is again a local HCA issue (does not
>> >have to be visible as part of the layer 2 address resolution).  One can
>> >construct a very nice performing IP over IB solution but there hasn't been
>> >much public progress to implement these de facto capabilities found in
>> >Ethernet solutions on IB.  Getting these into a HCA implementation is a
>> >heck of a lot easier and faster to do than to develop a standard and
>> >getting all of the OS changes made (the HCA implementation issues can all
>> >be done underneath the IP stack just like with Ethernet so no real OS 
>> impacts).
>> 
>> I don't understand the large MTU issue for the OS (requiring contiguous
>> physical addresses). Isn't all decent hardware capable of scatter/gather
>> these days?
>
> Nothing above indicates any issue regarding physical addresses.  IB hardware, 
> as does Ethernet, comprehends V-to-P mapping and makes no requirement that 
> buffers be physically contiguous beyond the base physical page size.
>
>
>> What's more hairy for the OS stack is the per-destination MTU and the
>> different MTU for multicast than for unicast inherent in IPoIB-CM.

Even as it is now, a subnet can be set up with a common MTU, leaving the
'per-connection MTU' to implementations that want to use it. Such 
setups have existed before in many stacks, for example to support 
token rings with different MTUs.
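
As an illustration of what such a setup looks like in a stack (a sketch only;
the structures and field names are hypothetical, not from any draft or existing
implementation), the common subnet MTU is the default and a per-destination MTU
is used only where a connection has been negotiated:

/* Sketch: per-destination MTU alongside a common subnet MTU.
 * Names are illustrative assumptions. */
#include <stdint.h>

struct link_dev {
    uint32_t link_mtu;        /* common (UD/broadcast) MTU for the subnet   */
};

struct neigh_entry {
    uint32_t path_mtu;        /* per-connection MTU; 0 = not negotiated     */
    int      connected;       /* nonzero if an RC connection is established */
};

/* Choose the MTU for a unicast send: use the negotiated per-connection MTU
 * when a connection exists, otherwise fall back to the common subnet MTU. */
static uint32_t tx_mtu(const struct link_dev *dev, const struct neigh_entry *ne)
{
    if (ne && ne->connected && ne->path_mtu)
        return ne->path_mtu;
    return dev->link_mtu;
}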

The WG discussed making the IB mode a link characteristic, i.e. connectivity
between IPoIB-UD and IPoIB-CM would be through a router.
The WG at the time chose (see the late 2004/early 2005 threads) a common subnet
for both. However, if we resurrect the separate-subnet proposal, a common MTU
becomes easier across any given subnet.

Multicast will stay limited to the physical MTU, however. It is a tossup 
between using a different mechanism for multicasting and using what the medium 
provides. Also, it is easier to make the unicast/multicast distinction in 
a stack: BSD stacks, for example, by default used to limit multicast to the
physical MTU while allowing unicast IP packets to be larger (those were
fragmented in the case of Ethernet).
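
In other words, the choice can be as simple as the following sketch
(illustrative only, not BSD's actual code):

/* Sketch: multicast stays at the physical (UD) MTU, unicast may use a
 * larger connected-mode MTU when one has been negotiated. */
#include <stdint.h>
#include <netinet/in.h>

static uint32_t select_mtu(uint32_t dst_addr,      /* destination IPv4, host order */
                           uint32_t phys_mtu,      /* UD / physical MTU            */
                           uint32_t connected_mtu) /* negotiated RC MTU, 0 if none */
{
    if (IN_MULTICAST(dst_addr) || connected_mtu == 0)
        return phys_mtu;        /* multicast is limited to the physical MTU   */
    return connected_mtu;       /* unicast may use the larger connected MTU   */
}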

Vivek

>
> Agreed.  This is one aspect of the additional complexity that is simply 
> easier to avoid.   The cost / benefit of connected mode is quite questionable 
> compared to leveraging the existing Ethernet implementation concepts and 
> associated infrastructure.
>
> Mike
>
>
>> Jerry
>> 
>> >
>> >
>> >>For commercial clusters, if IB is used for storage, then you save a network
>> >>by having fast IP performance and can use the IB network for both. Why use
>> >>IB and another network for the commercial cluster, when the other network
>> >>supports similar bandwidth for storage and IP.
>> >
>> >There will always be Ethernet in any cluster so the fabric is there.  The
>> >question is whether it is just for low-bandwidth / management services or
>> >for applications.  For storage, need to separate the discussion into
>> >whether it is block or file.  For block, IB gateways to Fibre Channel, etc.
>> >can and are being used today quite nicely.  Performance is reasonable and
>> >the ecosystem costs, target availability, customer "pain", etc. are much
>> >lower than attempting to move to native IB storage.  The same applies to
>> >file based where IB gateways to Ethernet which then attaches to file
>> >servers works quite nicely.  In fact, the original vision of IB was that of
>> >an I/O fabric to create modular server solutions.  The addition of IPC came
>> >later in the process when it was found to be relatively low cost to
>> >define.  So, IB is successful in the HPC world and slowly entering some
>> >commercial solutions.  To state that its future relies on getting an IP
>> >over IB RC solution is perhaps blowing it a bit out of proportion.   The
>> >easier path for all is to simply use the techniques I and others have
>> >advocated for years now and solve the problems within the HCA
>> >implementation.  Much lower costs and will result in delivering a good
>> >performance solution.
>> >
>> >BTW, RNIC / Ethernet solutions implement these techniques today.  With the
>> >arrival of 10 GbE and the lower prices of RNIC and 10 GbE switch ports,
>> >lower latency switches (competitive enough with IB for commercial and many
>> >HPC clusters), etc. the success of IB must lie elsewhere and not on an IETF
>> >spec.   This was noted at the recent IEEE Hot Interconnects conference as
>> >well so isn't just my opinion.
>> >
>> >Mike
>> >
>> >>Implementing IPoIB-CM makes IB viable in the HPC cluster and some
>> >>commercial clusters. Otherwise I don't think it competes economically with
>> >>other network technologies.
>> >>
>> >>Regards.
>> >>
>> >>Bernie King-Smith
>> >>IBM Corporation
>> >>Server Group
>> >>Cluster System Performance
>> >>wombat2@us.ibm.com    (845)433-8483
>> >>Tie. 293-8483 or wombat2 on NOTES
>> >>
>> >>"We are not responsible for the world we are born into, only for the 
>> world
>> >>we leave when we die.
>> >>So we have to accept what has gone before us and work to change the only
>> >>thing we can,
>> >>-- The Future." William Shatner
>> >>
>> >>
>> >>
>> >>From: Dror Goldenberg <gdror@mellanox.co.il>
>> >>Sent by: ipoverib-bounces@ietf.org
>> >>Date: 08/30/2005 09:32 AM
>> >>To: kashyapv@us.ltcfwd.linux.ibm.com, "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>
>> >>Cc: margaret@thingmagic.com, ipoverib@ietf.org, Bill_Strahm@McAfee.com
>> >>Subject: RE: [Ipoverib] Please read - proposed WG termination
>> >>
>> >> > From: Vivek Kashyap [mailto:kashyapv@us.ibm.com]
>> >> > Sent: Tuesday, August 30, 2005 8:39 AM
>> >> >
>> >> > On Mon, 29 Aug 2005, H.K. Jerry Chu wrote:
>> >> >
>> >>
>> >>
>> >><snip>
>> >>
>> >>
>> >> > > 1. IPoIB connected mode draft-ietf-ipoib-connected-mode-00.txt
>> >> > > updated recently
>> >> >
>> >> > Well, in recent days there has been a discussion going on
>> >> > based on Dror's input. I also made some updates after some
>> >> > discussion on OpenIB (not on IETF though).  This draft itself
>> >> > became a working group draft this February after some lively
>> >> > discussion just before that.  It appears to me that it should
>> >> > be possible to finalise this draft soon enough.
>> >> >
>> >> > 20th Sept. might be long enough to know one way or the other...
>> >> >
>> >> > vivek
>> >> >
>> >>
>> >>
>> >>We would like to see IPoIB-CM being finalized in IETF. We see
>> >>great value in having a standard for connected mode which effectively
>> >>increases the MTU. We are willing to contribute to the standardization
>> >>effort. We're also looking at the implementation of IPoIB-CM in Linux.
>> >>
>> >>
>> >>-Dror
>

_______________________________________________
IPoverIB mailing list
IPoverIB@ietf.org
https://www1.ietf.org/mailman/listinfo/ipoverib