RE: [Ipoverib] Please read - proposed WG termination

Michael Krause <krause@cup.hp.com> Thu, 01 September 2005 17:48 UTC

Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1EAtAu-0008DT-SW; Thu, 01 Sep 2005 13:48:32 -0400
Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1EAtAs-0008CI-Ap; Thu, 01 Sep 2005 13:48:30 -0400
Received: from ietf-mx.ietf.org (ietf-mx [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA16962; Thu, 1 Sep 2005 13:48:27 -0400 (EDT)
Received: from palrel10.hp.com ([156.153.255.245]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1EAtCp-0006NO-SA; Thu, 01 Sep 2005 13:50:34 -0400
Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel10.hp.com (Postfix) with ESMTP id C88BB2585; Thu, 1 Sep 2005 10:48:21 -0700 (PDT)
Received: from MK73191c.cup.hp.com ([15.244.205.99]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id KAA28326; Thu, 1 Sep 2005 10:40:18 -0700 (PDT)
Message-Id: <6.2.0.14.2.20050901102429.028498c8@esmail.cup.hp.com>
X-Mailer: QUALCOMM Windows Eudora Version 6.2.0.14
Date: Thu, 01 Sep 2005 10:38:50 -0700
To: Bernard King-Smith <wombat2@us.ibm.com>, Dror Goldenberg <gdror@mellanox.co.il>
From: Michael Krause <krause@cup.hp.com>
Subject: RE: [Ipoverib] Please read - proposed WG termination
In-Reply-To: <OFEAD798BC.5BD12489-ON8525706E.00447E83-8525706F.0000FE88@us.ibm.com>
References: <506C3D7B14CDD411A52C00025558DED60893AD3C@mtlex01.yok.mtl.com> <OFEAD798BC.5BD12489-ON8525706E.00447E83-8525706F.0000FE88@us.ibm.com>
Mime-Version: 1.0
X-Spam-Score: 0.8 (/)
X-Scan-Signature: 4a96669441ad70ecf6aebb4b47b971cd
Cc: "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>, Bill_Strahm@McAfee.com, margaret@thingmagic.com, kashyapv@us.ibm.com, ipoverib-bounces@ietf.org, ipoverib@ietf.org
X-BeenThere: ipoverib@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: IP over InfiniBand WG Discussion List <ipoverib.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/ipoverib>, <mailto:ipoverib-request@ietf.org?subject=unsubscribe>
List-Post: <mailto:ipoverib@ietf.org>
List-Help: <mailto:ipoverib-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/ipoverib>, <mailto:ipoverib-request@ietf.org?subject=subscribe>
Content-Type: multipart/mixed; boundary="===============1220239762=="
Sender: ipoverib-bounces@ietf.org
Errors-To: ipoverib-bounces@ietf.org

At 05:10 PM 8/31/2005, Bernard King-Smith wrote:




>Having IPoIB-CM is a very important feature for making IB a viable
>interconnect in clustered systems. Without IPoIB-CM, both HPC clusters and
>commercial clusters using IB to SAN need two networks for good total
>cluster performance: one for IB ( non-IP traffic ) and an IP performance
>network like GigE.  This means that IB is not as cost effective as a GigE (
>or 10 GigE ) network, which handles both types of traffic reasonably well.
>
>In the HPC world most clusters use the cluster fabric ( and IB is the
>future direction ) for both MPI and IP traffic. The IP traffic is usually
>for parallel file systems and for system management and control. This high
>bandwidth IP network is required in most production HPC clusters.  With the
>current IPoIB using only UD, the performance is dismal. Our simulations
>using the small packet MTU of IB suggest that the parallel file systems (
>GPFS, PVFS, Lustre etc ) can only get 25% of a 4X IB link today, and at 12X
>it will be about 10%. The problem is that the IP drivers are single
>threaded per adapter. Also, the CPU utilization of TCP/IP at the MTU of the
>IB link is very high because of the per-packet stack processing. Going to
>IPoIB-CM means we can cut the number of TCP/IP stack traversals from
>32 to 1 for a 60K IP packet. This means that 30 times as much data is
>transmitted per device driver call. This will enable IP to show bandwidth
>with multiple sockets similar to other protocols that can use RC or
>fragment within the device driver.
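The fragmentation arithmetic in the quoted text can be sanity-checked with a quick sketch. The MTU and message sizes below are assumed values (a 2KB IB UD path MTU versus a 64KB IP MTU presented by connected mode); note that with a 2KB MTU a 60K message yields 30 fragments, which matches the "30 times as much data" figure, while the "32 to 1" count presumably reflects a slightly different accounting.

```python
# Rough sketch of per-packet TCP/IP stack traversals for a 60KB IP message.
# Assumed values: 2KB IB UD path MTU vs. a 64KB MTU enabled by IPoIB-CM.

def traversals(message_bytes: int, mtu_bytes: int) -> int:
    """Number of MTU-sized fragments, i.e. per-packet stack traversals."""
    return -(-message_bytes // mtu_bytes)  # ceiling division

ud_mtu = 2048          # IB UD path MTU (2KB, as typically used)
cm_mtu = 64 * 1024     # large IP MTU made possible by connected mode
message = 60 * 1024    # the 60K IP packet from the quoted example

ud = traversals(message, ud_mtu)   # fragments over UD
cm = traversals(message, cm_mtu)   # fragments over connected mode
print(ud, cm, ud // cm)            # 30 1 30
```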

These performance problems are primarily implementation-specific and have 
little to do with IB technology itself.  In addition, nearly all IB 
solutions use a 2KB MTU, not the smallest, to transfer data - no different 
from Ethernet.   As I and others have raised over the years, enabling 
IP over IB to perform well is a local HCA issue, not a standards 
issue.  Adding checksum off-load support to the HCA is rather trivial 
and does not require standardization (this is what is done for Ethernet 
today, and it is non-standard).  Adding large-send off-load support is 
likewise a local HCA issue, not a standards issue, and it effectively 
provides the same benefit as connected mode.  The use of multiple QPs to 
spread work across CPUs for both send and receive, a la the multi-queue 
support I've worked with various Ethernet IHVs to get in place, is again a 
local HCA issue (it does not have to be visible as part of the layer 2 
address resolution).  One can construct a very nicely performing IP over IB 
solution, but there hasn't been much public progress implementing these de 
facto capabilities, found in Ethernet solutions, on IB.  Getting these into 
an HCA implementation is a heck of a lot easier and faster than developing 
a standard and getting all of the OS changes made (the HCA implementation 
issues can all be handled underneath the IP stack, just as with Ethernet, 
so there are no real OS impacts).
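The off-load argument above can be sketched abstractly. In the model below, every name is illustrative (this is not any real driver interface): the host stack makes a single call with a large buffer, and the segmentation into MTU-sized frames plus per-frame checksumming are modeled as happening in the HCA, below the IP stack - which is why large-send off-load approximates the benefit of connected mode with no wire-protocol or OS change.

```python
# Illustrative model of large-send off-load (LSO): the host stack hands
# the driver one large buffer; segmentation into MTU-sized frames and
# per-frame checksumming are done by the HCA, not the host IP stack.
# All names here are hypothetical; this is not a real driver API.

MTU = 2048  # assumed 2KB IB link MTU

def host_stack_send(buffer: bytes) -> int:
    """One stack traversal and one driver call, regardless of buffer size."""
    return hca_large_send(buffer)

def hca_large_send(buffer: bytes) -> int:
    """Model of the HCA: segment and checksum each frame off the host CPU."""
    frames = [buffer[i:i + MTU] for i in range(0, len(buffer), MTU)]
    for frame in frames:
        _checksum = sum(frame) & 0xFFFF  # stand-in for a real IP/TCP checksum
    return len(frames)

# A 60KB send costs one driver call but still puts 30 frames on the wire.
print(host_stack_send(bytes(60 * 1024)))  # 30
```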


>For commercial clusters, if IB is used for storage, then you save a network
>by having fast IP performance and can use the IB network for both. Why use
>IB plus another network for a commercial cluster, when the other network
>alone supports similar bandwidth for both storage and IP?

There will always be Ethernet in any cluster, so that fabric is there.  The 
question is whether it is just for low-bandwidth / management services or 
for applications.  For storage, the discussion needs to be separated into 
block and file.  For block, IB gateways to Fibre Channel, etc. 
can be and are being used today quite nicely.  Performance is reasonable, and 
the ecosystem costs, target availability, customer "pain", etc. are much 
lower than attempting to move to native IB storage.  The same applies to 
file-based storage, where an IB gateway to Ethernet, which then attaches to 
file servers, works quite nicely.  In fact, the original vision of IB was that 
of an I/O fabric to create modular server solutions.  The addition of IPC came 
later in the process, when it was found to be relatively low cost to 
define.  So, IB is successful in the HPC world and slowly entering some 
commercial solutions.  To state that its future relies on getting an IP 
over IB RC solution is perhaps blowing it a bit out of proportion.   The 
easier path for all is simply to use the techniques I and others have 
advocated for years now and solve the problems within the HCA 
implementation.  That has much lower costs and will deliver a good 
performance solution.

BTW, RNIC / Ethernet solutions implement these techniques today.  With the 
arrival of 10 GbE and the lower prices of RNICs and 10 GbE switch ports, 
lower-latency switches (competitive enough with IB for commercial and many 
HPC clusters), etc., the success of IB must lie elsewhere and not on an IETF 
spec.   This was noted at the recent IEEE Hot Interconnects conference as 
well, so it isn't just my opinion.

Mike

>Implementing IPoIB-CM makes IB viable in the HPC cluster and some
>commercial clusters. Otherwise I don't think it competes economically with
>other network technologies.
>
>Regards.
>
>Bernie King-Smith
>IBM Corporation
>Server Group
>Cluster System Performance
>wombat2@us.ibm.com    (845)433-8483
>Tie. 293-8483 or wombat2 on NOTES
>
>"We are not responsible for the world we are born into, only for the world
>we leave when we die.
>So we have to accept what has gone before us and work to change the only
>thing we can,
>-- The Future." William Shatner
>
>
>
>Dror Goldenberg <gdror@mellanox.co.il>
>Sent by: ipoverib-bounces@ietf.org
>08/30/2005 09:32 AM
>
>To: kashyapv@us.ltcfwd.linux.ibm.com, "H.K. Jerry Chu" <Jerry.Chu@eng.sun.com>
>cc: margaret@thingmagic.com, ipoverib@ietf.org, Bill_Strahm@McAfee.com
>Subject: RE: [Ipoverib] Please read - proposed WG termination
>
> > From: Vivek Kashyap [mailto:kashyapv@us.ibm.com]
> > Sent: Tuesday, August 30, 2005 8:39 AM
> >
> > On Mon, 29 Aug 2005, H.K. Jerry Chu wrote:
> >
>
>
><snip>
>
>
> > > 1. IPoIB connected mode draft-ietf-ipoib-connected-mode-00.txt
> > > updated recently
> >
> > Well, in recent days there has been a discussion going on
> > based on Dror's input. I also made some updates after some
> > discussion on OpenIB (not on IETF, though).  This draft itself
> > became a working group draft this February, after some lively
> > discussion just before that.  It appears to me that it should
> > be possible to finalise this draft soon enough.
> >
> > 20th Sept. might be long enough to know one way or the other...
> >
> > vivek
> >
>
>
>We would like to see IPoIB-CM being finalized in IETF. We see
>great value in having a standard for connected mode which effectively
>increases the MTU. We are willing to contribute to the standardization
>effort. We're also looking at the implementation of IPoIB-CM in Linux.
>
>
>-Dror
>
>_______________________________________________
>IPoverIB mailing list
>IPoverIB@ietf.org
>https://www1.ietf.org/mailman/listinfo/ipoverib
_______________________________________________
IPoverIB mailing list
IPoverIB@ietf.org
https://www1.ietf.org/mailman/listinfo/ipoverib