RE: [Ipoverib] Please read - proposed WG termination

Michael Krause <krause@cup.hp.com> Thu, 01 September 2005 19:11 UTC

Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1EAuT3-0008T0-QK; Thu, 01 Sep 2005 15:11:21 -0400
Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1EAuSv-0008OK-0A; Thu, 01 Sep 2005 15:11:19 -0400
Received: from ietf-mx.ietf.org (ietf-mx [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA24139; Thu, 1 Sep 2005 15:11:09 -0400 (EDT)
Received: from palrel10.hp.com ([156.153.255.245]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1EAuUu-0001Sd-87; Thu, 01 Sep 2005 15:13:17 -0400
Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel10.hp.com (Postfix) with ESMTP id 27A32288C; Thu, 1 Sep 2005 12:11:10 -0700 (PDT)
Received: from MK73191c.cup.hp.com ([15.244.205.99]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id MAA04944; Thu, 1 Sep 2005 12:03:18 -0700 (PDT)
Message-Id: <6.2.0.14.2.20050901115228.028d4458@esmail.cup.hp.com>
X-Mailer: QUALCOMM Windows Eudora Version 6.2.0.14
Date: Thu, 01 Sep 2005 12:04:30 -0700
To: "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>, wombat2@us.ibm.com, gdror@mellanox.co.il
From: Michael Krause <krause@cup.hp.com>
Subject: RE: [Ipoverib] Please read - proposed WG termination
In-Reply-To: <200509011818.j81IIPJV251312@jurassic.eng.sun.com>
References: <200509011818.j81IIPJV251312@jurassic.eng.sun.com>
Mime-Version: 1.0
X-Spam-Score: 0.0 (/)
X-Scan-Signature: 9d7e8d783239e9f0c425c823a9c950ff
Cc: margaret@thingmagic.com, kashyapv@us.ibm.com, ipoverib-bounces@ietf.org, ipoverib@ietf.org, Bill_Strahm@McAfee.com
X-BeenThere: ipoverib@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: IP over InfiniBand WG Discussion List <ipoverib.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/ipoverib>, <mailto:ipoverib-request@ietf.org?subject=unsubscribe>
List-Post: <mailto:ipoverib@ietf.org>
List-Help: <mailto:ipoverib-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/ipoverib>, <mailto:ipoverib-request@ietf.org?subject=subscribe>
Content-Type: multipart/mixed; boundary="===============1237461479=="
Sender: ipoverib-bounces@ietf.org
Errors-To: ipoverib-bounces@ietf.org

At 11:19 AM 9/1/2005, H.K. Jerry Chu wrote:
>[co-chair hat off]
>
>...
><snip>
>
> >These performance problems are primarily implementation-specific and have
> >little to do with IB technology itself.  In addition, nearly all IB
> >solutions use a 2KB MTU, not the smallest, to transfer data - no different
> >than Ethernet.
>
>Ethernet is adopting jumbo frames to get more firing power. Where is IB's
>equivalent of jumbo frames?

Jumbo frames cause QoS issues for multiple flows.  The whole reason for the 
variable MTU was to allow people to configure the different virtual lanes / 
paths to meet QoS concerns.  Examination of the hardware-based SAR provided 
by IB illustrates that there is no definitive performance difference 
between using a 2KB MTU and, say, a larger 9KB MTU.  BTW, examination of 
many IP workloads shows that the majority of the units of work exchanged 
fit quite nicely in a 2KB payload.  Those that are larger are easily 
segmented and reassembled without impacting either side of the 
communication, i.e. without requiring a connection to be established.
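To make the SAR point concrete, here is a rough sketch in Python. The constants and function names are illustrative only, not from any IB or IETF specification: a unit of work larger than the 2KB link MTU is split on send and stitched back together on receive, with neither endpoint needing to establish a connection.

```python
# Sketch of the segmentation-and-reassembly (SAR) idea: split a payload
# into MTU-sized segments and reassemble them in order on the far side.
# MTU value and helper names are hypothetical illustrations.

MTU = 2048  # 2KB link MTU

def segment(payload: bytes, mtu: int = MTU) -> list[bytes]:
    """Split a payload into at most mtu-sized segments."""
    return [payload[i:i + mtu] for i in range(0, len(payload), mtu)]

def reassemble(segments: list[bytes]) -> bytes:
    """Reassemble in-order segments into the original payload."""
    return b"".join(segments)

data = bytes(9000)             # e.g. a 9KB unit of work
segs = segment(data)
assert len(segs) == 5          # four full 2KB segments plus an 808-byte tail
assert reassemble(segs) == data
assert len(segment(bytes(1500))) == 1   # typical unit of work: one segment
```

Note that a payload already at or under 2KB passes through as a single segment, which is the common case the paragraph above describes.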

Keep in mind that Ethernet is also pursuing other work to solve various 
issues - multi-path support to compensate for spanning tree limitations, 
conceptually similar virtual lanes being investigated, improved security, 
link-level congestion management, etc.  All of these changes, combined with 
the de facto use of large send offload / receive-side scaling, will likely 
make the 9KB issue rather moot except for bulk data transfers in the WAN 
(and even then, the QoS issues of multiple flows across a common link will 
likely limit the use of 9KB).

In the end, IB HCA implementations can adopt the same techniques as 
Ethernet and deliver the same performance as a connected model without the 
overhead / complexity and ecosystem cost of a new specification.  IB 
already has enough problems getting the basics into the world 
today.  Adding in yet another spec isn't going to solve these problems or 
make life easier - too many things within the ecosystem are impacted.


> >As I and others have raised over the years, the enablement
> >of IP over IB to perform well is a local HCA issue not a standards
> >issue.  Addition of checksum off-load support to the HCA is rather trivial
> >and does not require standardization (this is what is done for Ethernet
> >today and is non-standard).  Addition of large send off-load support is a
> >local HCA issue not a standards issue and effectively provides the same
> >benefit as connected mode.
>
>Yes LSO (or TSO as some call it) is relatively easy. But LRO (large receive
>offload) is a heck of a lot more difficult. IB connected transports already
>have all the silicon to do it. Why not just use it?

Large receive off-load isn't that hard - it is really an implementation 
problem in most of the designs I've examined.  I helped implement something 
equivalent to TSO and LRO over 10 years ago, and it wasn't that bad or 
expensive.  As for why not just use it?  Simple: the cost to specify, 
implement, and get adopted within the industry is already high for IB.  
Adding more to the stack of things, when the benefit is questionable 
compared to simply re-using the existing Ethernet infrastructure, seems 
like a poor choice of resource expenditure.  Of course, most people only 
care about making Linux happen these days, so I guess feel free to knock 
yourselves out if that is all that is required.
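For illustration, here is a toy sketch of the LRO-style coalescing being discussed: in-order segments of the same flow are merged into one large unit before the IP stack sees them, cutting per-packet overhead. The flow keys and fields are hypothetical simplifications, not any real NIC interface.

```python
# Toy large-receive-offload coalescer: merge consecutive in-order
# payloads per flow. "seq" is treated as a byte offset from zero;
# an out-of-order segment simply restarts the flow's buffer (a real
# implementation would flush the merged unit up the stack instead).
from collections import defaultdict

def coalesce(packets):
    """packets: iterable of (flow_key, seq, payload) tuples."""
    flows = defaultdict(list)            # flow key -> in-order payloads
    for flow_key, seq, payload in packets:
        segs = flows[flow_key]
        expected = sum(len(p) for p in segs)
        if seq == expected:              # in-order: safe to coalesce
            segs.append(payload)
        else:                            # out of order: restart buffer
            flows[flow_key] = [payload]
    return {k: b"".join(v) for k, v in flows.items()}

pkts = [(("10.0.0.1", "10.0.0.2", 80), 0, b"a" * 1460),
        (("10.0.0.1", "10.0.0.2", 80), 1460, b"b" * 1460)]
merged = coalesce(pkts)
assert len(merged[("10.0.0.1", "10.0.0.2", 80)]) == 2920  # one large unit
```

The point is that this is bookkeeping logic, not new wire protocol - it lives entirely within the adapter and driver.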


> >The use of multiple QP to spread work across
> >CPU for both send / receive ala the multi-queue support I've worked with
> >various Ethernet IHV to get in place is again a local HCA issue (does not
> >have to be visible as part of the layer 2 address resolution).  One can
> >construct a very nice performing IP over IB solution but there hasn't been
> >much public progress to implement these de facto capabilities found in
> >Ethernet solutions on IB.  Getting these into a HCA implementation is a
> >heck of a lot easier and faster to do than to develop a standard and
> >getting all of the OS changes made (the HCA implementation issues can all
> >be done underneath the IP stack just like with Ethernet so no real OS 
> impacts).
>
>I don't understand the large MTU issue to the OS (requiring contiguous
>physical addresses). Isn't all decent hardware capable of scatter/gather
>these days?

Nothing above indicates any issue regarding physical addresses.  IB 
hardware, like Ethernet hardware, comprehends virtual-to-physical mapping 
and imposes no requirement that memory be physically contiguous beyond the 
base physical page size.
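A minimal sketch of that point, assuming a hypothetical page map: the adapter works from a virtual-to-physical scatter/gather list, so a buffer only needs to be contiguous within the base page size.

```python
# Translate a virtually contiguous buffer into a scatter/gather list of
# (physical address, length) entries. Page size and the v2p mapping are
# made up for illustration; no real adapter interface is implied.
PAGE = 4096

def build_sg_list(virt_addr: int, length: int, v2p: dict[int, int]):
    """Walk the buffer page by page, emitting one entry per physical chunk."""
    entries = []
    offset = virt_addr
    end = virt_addr + length
    while offset < end:
        page = offset - (offset % PAGE)              # containing virtual page
        chunk = min(end - offset, page + PAGE - offset)
        entries.append((v2p[page] + (offset - page), chunk))
        offset += chunk
    return entries

# Three virtually consecutive pages mapped to scattered physical pages:
v2p = {0x10000: 0x7A000, 0x11000: 0x23000, 0x12000: 0x5F000}
sg = build_sg_list(0x10000 + 100, 9000, v2p)
assert sum(n for _, n in sg) == 9000
assert len(sg) == 3   # spans three pages, none physically adjacent
```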


>What's more hairy to the OS stack is the per-destination MTU, and the
>different MTU for multicast than for unicast, inherent in IPoIB CM.

Agreed.  This is one aspect of the additional complexity that is simply 
easier to avoid.  The cost / benefit of connected mode is quite 
questionable compared to leveraging the existing Ethernet implementation 
concepts and associated infrastructure.
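To illustrate the wrinkle Jerry describes, here is a toy model (hypothetical values and structure, not the actual IPoIB CM draft) of the per-destination MTU lookup a stack would have to perform on every send, with multicast pinned at the datagram-mode MTU.

```python
# Toy per-destination MTU table: each connected unicast neighbor can
# advertise its own MTU, while multicast remains at the datagram-mode
# MTU. All addresses and MTU values here are illustrative.
DATAGRAM_MTU = 2044            # assumed datagram-mode payload size

neighbor_mtu = {               # per-neighbor MTU learned per connection
    "10.0.0.2": 65520,
    "10.0.0.3": 32768,
}

def mtu_for(dst: str) -> int:
    """Return the MTU to use for a given destination address."""
    if dst.startswith("224."):             # IPv4 multicast: datagram MTU
        return DATAGRAM_MTU
    return neighbor_mtu.get(dst, DATAGRAM_MTU)

assert mtu_for("10.0.0.2") == 65520
assert mtu_for("224.0.0.1") == DATAGRAM_MTU   # multicast differs from unicast
assert mtu_for("10.0.0.99") == DATAGRAM_MTU   # unknown neighbor falls back
```

Every send path, routing decision, and path MTU interaction has to consult a table like this, which is exactly the kind of OS-stack complexity the paragraph above argues is easier to avoid.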

Mike


>Jerry
>
> >
> >
> >>For commercial clusters, if IB is used for storage, then you save a network
> >>by having fast IP performance and can use the IB network for both. Why use
> >>IB and another network for the commercial cluster, when the other network
> >>supports similar bandwidth for storage and IP.
> >
> >There will always be Ethernet in any cluster so the fabric is there.  The
> >question is whether it is just for low-bandwidth / management services or
> >for applications.  For storage, need to separate the discussion into
> >whether it is block or file.  For block, IB gateways to Fibre Channel, etc.
> >can and are being used today quite nicely.  Performance is reasonable and
> >the ecosystem costs, target availability, customer "pain", etc. are much
> >lower than attempting to move to native IB storage.  The same applies to
> >file based where IB gateways to Ethernet which then attaches to file
> >servers works quite nicely.  In fact, the original vision of IB was that of
> >an I/O fabric to create modular server solutions.  The addition of IPC came
> >later in the process when it was found to be relatively low cost to
> >define.  So, IB is successful in the HPC world and slowly entering some
> >commercial solutions.  To state that its future relies on getting an IP
> >over IB RC solution is perhaps blowing it a bit out of proportion.   The
> >easier path for all is to simply use the techniques I and others have
> >advocated for years now and solve the problems within the HCA
> >implementation.  Much lower costs and will result in delivering a good
> >performance solution.
> >
> >BTW, RNIC / Ethernet solutions implement these techniques today.  With the
> >arrival of 10 GbE and the lower prices of RNIC and 10 GbE switch ports,
> >lower latency switches (competitive enough with IB for commercial and many
> >HPC clusters), etc. the success of IB must lie elsewhere and not on an IETF
> >spec.   This was noted at the recent IEEE Hot Interconnects conference as
> >well so isn't just my opinion.
> >
> >Mike
> >
> >>Implementing IPoIB-CM makes IB viable in the HPC cluster and some
> >>commercial clusters. Otherwise I don't think it competes economically with
> >>other network technologies.
> >>
> >>Regards.
> >>
> >>Bernie King-Smith
> >>IBM Corporation
> >>Server Group
> >>Cluster System Performance
> >>wombat2@us.ibm.com    (845)433-8483
> >>Tie. 293-8483 or wombat2 on NOTES
> >>
> >>"We are not responsible for the world we are born into, only for the world
> >>we leave when we die.
> >>So we have to accept what has gone before us and work to change the only
> >>thing we can,
> >>-- The Future." William Shatner
> >>
> >>
> >>
> >> From: Dror Goldenberg <gdror@mellanox.co.il>
> >> Sent by: ipoverib-bounces@ietf.org
> >> Date: 08/30/2005 09:32 AM
> >> To: kashyapv@us.ltcfwd.linux.ibm.com, "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>
> >> cc: margaret@thingmagic.com, ipoverib@ietf.org, Bill_Strahm@McAfee.com
> >> Subject: RE: [Ipoverib] Please read - proposed WG termination
> >> > From: Vivek Kashyap [mailto:kashyapv@us.ibm.com]
> >> > Sent: Tuesday, August 30, 2005 8:39 AM
> >> >
> >> > On Mon, 29 Aug 2005, H.K. Jerry Chu wrote:
> >> >
> >>
> >>
> >><snip>
> >>
> >>
> >> > > 1. IPoIB connected mode draft-ietf-ipoib-connected-mode-00.txt
> >> > > updated recently
> >> >
> >> > Well, in recent days there has been a discussion going on
> >> > based on Dror's input. I also made some updates after some
> >> > discussion on OpenIB (not on
> >> > IETF though).  This draft itself became a working group draft
> >> > this February after some lively discussion just before that.
> >> > It appears to me that it should be possible to finalise this
> >> > draft soon enough.
> >> >
> >> > 20th sept. might be long enough to know one way or the other...
> >> >
> >> > vivek
> >> >
> >>
> >>
> >>We would like to see IPoIB-CM being finalized in IETF. We see
> >>great value in having a standard for connected mode which effectively
> >>increases the MTU. We are willing to contribute to the standardization
> >>effort. We're also looking at the implementation of IPoIB-CM in Linux.
> >>
> >>
> >>-Dror
> >>
> >>_______________________________________________
> >>IPoverIB mailing list
> >>IPoverIB@ietf.org
> >>https://www1.ietf.org/mailman/listinfo/ipoverib
>
_______________________________________________
IPoverIB mailing list
IPoverIB@ietf.org
https://www1.ietf.org/mailman/listinfo/ipoverib