RE: [Ips] Recent comments about FCoE and iSCSI

Michael Krause <krause@cup.hp.com> Mon, 30 April 2007 19:21 UTC

Return-path: <ips-bounces@ietf.org>
Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1HibR5-0003ET-4I; Mon, 30 Apr 2007 15:21:23 -0400
Received: from ips by megatron.ietf.org with local (Exim 4.43) id 1HibR3-0003EJ-T5 for ips-confirm+ok@megatron.ietf.org; Mon, 30 Apr 2007 15:21:21 -0400
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1HibR3-0003EB-J0 for ips@ietf.org; Mon, 30 Apr 2007 15:21:21 -0400
Received: from palrel10.hp.com ([156.153.255.245]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1HibR0-00073G-Tz for ips@ietf.org; Mon, 30 Apr 2007 15:21:21 -0400
Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.13.191.130]) by palrel10.hp.com (Postfix) with ESMTP id 1C18534A22; Mon, 30 Apr 2007 12:21:14 -0700 (PDT)
Received: from MK73191c.cup.hp.com (je061164.ssr.hp.com [15.47.61.164]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id MAA04887; Mon, 30 Apr 2007 12:14:58 -0700 (PDT)
Message-Id: <6.2.0.14.2.20070430082328.02812c58@esmail.cup.hp.com>
X-Mailer: QUALCOMM Windows Eudora Version 6.2.0.14
Date: Mon, 30 Apr 2007 08:26:56 -0700
To: Julian Satran <Julian_Satran@il.ibm.com>, nab@kernel.org
From: Michael Krause <krause@cup.hp.com>
Subject: RE: [Ips] Recent comments about FCoE and iSCSI
In-Reply-To: <OF82ADC07C.2D166BD5-ON852572CA.0043E6BD-852572CA.00451739@il.ibm.com>
References: <1177648868.5355.122.camel@haakon2.linux-iscsi.org> <OF82ADC07C.2D166BD5-ON852572CA.0043E6BD-852572CA.00451739@il.ibm.com>
Mime-Version: 1.0
X-Spam-Score: 0.5 (/)
X-Scan-Signature: 16a1775db2061587296285ba70384116
Cc: Eric Hall <ehall@ehsco.com>, ips@ietf.org, Mike Mazarick <mazarick@bellsouth.net>, nab@linux-iscsi.org, Zack Best <zbest28@yahoo.com>
X-BeenThere: ips@ietf.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: IP Storage <ips.ietf.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/ips>, <mailto:ips-request@ietf.org?subject=unsubscribe>
List-Post: <mailto:ips@ietf.org>
List-Help: <mailto:ips-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/ips>, <mailto:ips-request@ietf.org?subject=subscribe>
Content-Type: multipart/mixed; boundary="===============0827729375=="
Errors-To: ips-bounces@ietf.org

Just a data point:

The Open Group did specify an asynchronous sockets API about a year or so 
ago.  It was designed with RDMA interconnects and solutions in mind, so it 
provides explicit memory management, completion events, and so on.  
Unfortunately, although it was developed by a number of people with 
extensive sockets design and implementation experience, the specification 
has yet to be implemented, not for lack of interest but mostly due to 
business priorities.

In any case, extensions were also proposed that would eliminate the need 
for a ULP such as SDP by adding a few explicit RDMA calls.  People might 
want to check out the work that has already been done; a great deal of 
thought was put into that API to support IPC as well as block and file storage.
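
To give a feel for the shape of such an interface, here is a rough sketch 
in C.  The names and signatures below are hypothetical, chosen only to 
illustrate the general approach (explicit memory registration, completion 
events, and a few one-sided RDMA calls layered on a socket descriptor); 
they are not taken from the Open Group specification itself.

/* Hypothetical sketch of an async sockets interface with explicit RDMA --
 * illustrative names only, not the actual Open Group specification. */

#include <stddef.h>
#include <stdint.h>

typedef struct es_memory_region {      /* an explicitly registered buffer */
    void     *addr;
    size_t    length;
    uint64_t  handle;                  /* opaque key for the provider */
} es_memory_region_t;

typedef struct es_completion {
    uint64_t  user_context;            /* cookie passed on the post call */
    int       status;                  /* 0 on success */
    size_t    bytes_transferred;
} es_completion_t;

/* Explicit memory management: register buffers so the provider can DMA
 * into and out of them directly. */
int es_register_memory(int sd, void *addr, size_t len, es_memory_region_t *mr);
int es_deregister_memory(int sd, es_memory_region_t *mr);

/* Asynchronous data transfer: post calls return immediately and
 * completions are reaped from an event queue instead of blocking I/O. */
int es_post_send(int sd, const es_memory_region_t *mr, size_t len, uint64_t ctx);
int es_post_recv(int sd, const es_memory_region_t *mr, size_t len, uint64_t ctx);
int es_get_completion(int eq, es_completion_t *comp, int timeout_ms);

/* The proposed extensions mentioned above: explicit one-sided RDMA calls,
 * which are what would remove the need for a shim ULP such as SDP. */
int es_rdma_write(int sd, const es_memory_region_t *local, uint64_t remote_handle,
                  uint64_t remote_offset, size_t len, uint64_t ctx);
int es_rdma_read(int sd, const es_memory_region_t *local, uint64_t remote_handle,
                 uint64_t remote_offset, size_t len, uint64_t ctx);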

Mike


At 05:34 AM 4/27/2007, Julian Satran wrote:

>Great comments. You are all certainly aware that sockets are also 
>undergoing a transformation (asynchronous sockets), but even with 
>synchronous sockets, and with some care not to break existing applications 
>that use them, a restructuring of the stack may enable (as shown by 
>the Intel and IBM-Haifa work) great increases in performance.
>Software RDMA for the new class of multicore engines is definitely an 
>interesting proposition (on highly multithreaded engines it should come 
>with no cost associated with it, or almost no cost).
>
>I wish I knew more about the decrease in latencies in the switch fabric 
>(it would be interesting if somebody could comment) as large Layer-2 
>fabrics have some inherent latency issues.
>
>FCoE is asking us to forget all this, go back and pay the hardware 
>price for several more years, and ignore IP-land, and nothing that I 
>have heard convinces me that we should do so.
>
>Regards,
>Julo
>
>
>"Nicholas A. Bellinger" <nab@kernel.org>
>
>27/04/07 00:41
>Please respond to
>nab@kernel.org
>
>To
>Julian Satran/Haifa/IBM@IBMIL
>cc
>Zack Best <zbest28@yahoo.com>om>, ips@ietf.org, nab@linux-iscsi.org, Mike 
>Mazarick <mazarick@bellsouth.net>et>, Eric Hall <ehall@ehsco.com>
>Subject
>RE: [Ips] Recent comments about FCoE and iSCSI
>
>
>
>
>A quick comment regarding the large abundance of computing resources
>available for initiator-side software IP storage services.  Also Julo,
>many thanks for posting this great thread. :)
>
>--nab
>
>------------------------------------------------------------------------
>
>As the work of the DDP TWG continues onward and second-generation
>hardware iWARP engines start to come online, the prospect of a hybrid
>software implementation, with host OS network stack modifications in the
>kernel above TCP and SCTP, starts to pose a question:
>what real savings can hybrid iSER nodes achieve using software DDP, and
>what changes are required to make high-performance software DDP a
>reality?
>
>As osc-iwarp has found out, there is significant CPU overhead
>associated with sockets and software VERBS, but I think this can be
>minimized with the right set of changes, namely moving away
>from receive-side sockets for software iSER mode.  These changes will
>start to become attractive for new product designs because they will allow
>RNIC hardware engines to scale further using a saner, less painful method
>(depending on who you ask; OFA uses a hybrid IB-VERBS approach) than
>traditional TOEs with specialty engines.  Really taking advantage of
>what the DDP and iWARP metadata says about the framed
>network transport can help in RDMA WRITE scenarios, because the software
>RNIC would already have STag-registered memory ready to go in the iSER
>case.  This is especially true for the API of the iSER stack: the goal is
>a single codebase, with vendors writing hardware drivers instead of
>re-inventing the wheel with sockets.  I believe the smart software RNICs
>of the future will direct RDMA traffic directly into host OS SCSI memory
>buffers, and, like today, use something similar to sendpage() for TX.
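>
>To make the placement idea concrete, here is a toy sketch in C. None of
>this is real kernel or RNIC driver code; the structures and names below
>are made up for illustration only. The point is just that a DDP
>tagged-buffer header carries an STag and an offset, so a software RNIC
>can land the payload directly in a pre-registered SCSI buffer instead of
>staging it through socket receive buffers.
>
>/* Toy model of DDP tagged-buffer placement -- illustrative only. */
>
>#include <stdint.h>
>#include <string.h>
>#include <errno.h>
>
>struct stag_entry {              /* one registered buffer (e.g. a SCSI I/O) */
>    uint32_t  stag;
>    uint8_t  *base;
>    uint32_t  length;
>};
>
>#define STAG_TABLE_SIZE 256
>static struct stag_entry stag_table[STAG_TABLE_SIZE];
>
>/* Register a buffer (what iSER would do when a SCSI command is queued). */
>static int stag_register(uint32_t stag, uint8_t *base, uint32_t length)
>{
>    for (unsigned i = 0; i < STAG_TABLE_SIZE; i++) {
>        if (stag_table[i].base == NULL) {
>            stag_table[i] = (struct stag_entry){ stag, base, length };
>            return 0;
>        }
>    }
>    return -ENOMEM;
>}
>
>/* Minimal view of a DDP tagged-buffer segment header (illustrative layout). */
>struct ddp_tagged_hdr {
>    uint32_t  stag;              /* names the registered buffer        */
>    uint64_t  tagged_offset;     /* where in that buffer to place data */
>};
>
>/* Place one segment's payload straight into the tagged buffer. */
>static int ddp_place_segment(const struct ddp_tagged_hdr *hdr,
>                             const uint8_t *payload, uint32_t payload_len)
>{
>    for (unsigned i = 0; i < STAG_TABLE_SIZE; i++) {
>        struct stag_entry *e = &stag_table[i];
>        if (e->base == NULL || e->stag != hdr->stag)
>            continue;
>        if (hdr->tagged_offset + payload_len > e->length)
>            return -EINVAL;      /* placement outside the registered region */
>        /* A real zero-copy path would DMA or remap pages; even a memcpy
>         * here skips the extra socket-side staging buffer. */
>        memcpy(e->base + hdr->tagged_offset, payload, payload_len);
>        return 0;
>    }
>    return -ENOENT;              /* unknown STag */
>}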
>
>Multi-core microprocessor designs with large, intelligent shared
>caches, CPU cache coherency, and I/O interconnects that in the 90s
>were only available in the Alpha EV67 and the highest-end shared-memory
>supercomputers and clusters are now starting to become the norm.
>Pushing software iSER to the next level and beyond is surely not going
>to happen with a 30-year-old API (sockets).  Also, for the data center
>story with a traditional tiered SAN architecture and a software stack, the
>hybrid iWARP software stack on the initiator will not get a whole lot of
>interest until it can show improved performance, with overhead that is
>acceptable relative to traditional iSCSI today.  For third-generation IP
>storage stacks, typical multiport 1G workloads are what will really drive
>interest in areas where a hardware RNIC will not be cost
>feasible for some time.
>
>But just as with traditional iSCSI, we can also scale software iSER down
>to platforms with more modest computing resources, such as
>low-power, wireless devices.  Even on the type of mobile devices that IP
>storage services have been prototyped on today, being
>able to scale server-side hardware RNICs more efficiently is not software
>iSER's only benefit.  On a side note, consider the transparency that
>connection recovery in traditional iSCSI and iSER brings to inter-nexus
>multiplexing, as well as to end-user requirements for configuration and
>management scenarios.  Using an active-active recovery mechanism that is
>as close to completely transparent as possible (which ERL=2 is, IMHO) is,
>I think, what mobile IP storage service users need to be demanding from
>their transports.
>
>Thanks for listening!
>
>On Thu, 2007-04-26 at 21:16 -0400, Julian Satran wrote:
> >
> > Excellent comments. My take (if not obvious from the previous text) is
> > that data centers will be very large, and compute power (as evidenced
> > by multicore) and advances in stack implementation are bound to
> > improve substantially the performance of the protocol stacks (see Intel
> > and our work) and of layer 3 switching.
> > It is also important to point out that Ethernet has substantial
> > latencies if only bridging is used, and replacement technologies (such
> > as RBridges or others) may take some time to appear.
> >
> > Julo
> >
> >
> > Zack Best <zbest28@yahoo.com>
> > 25/04/07 16:37
> >
> > To: ips@ietf.org
> > cc:
> > Subject: RE: [Ips] Recent comments about FCoE and iSCSI
> >
> >
> > The real debate here is between two types of networks.
> > The first is reliable at the link level and does not
> > drop packets under congestion.  The second is running
> > a reliable transport protocol (i.e. TCP) over an
> > unreliable link level network.
> >
> > I agree with the scaling argument.  For sufficiently
> > large networks, reliable link level doesn't work well
> > because network component failure, or chronically
> > congested links are not handled well.  For
> > sufficiently small networks, reliable link level has
> > some significant advantages in simplicity, low
> > hardware cost, performance, and worst case latency.
> >
> > My personal view is that the vast majority of
> > enterprise storage networks fall in the "sufficiently
> > small" category.  This view has to some extent been
> > vindicated by the continuing success of Fibre Channel
> > in this space and the inability of iSCSI to displace
> > FC in any significant way for enterprise storage.  Of
> > course, this may or may not change in the future.
> >
> > Whether FC is simpler than iSCSI depends largely on
> > your definition of simplicity.  If one defines
> > simplicity/complexity as the number of gates or lines
> > of code to reduce the protocol to hardware or
> > firmware, then my experience is that iSCSI is 2X to 3X
> > the complexity of FC.  This has implications in cost
> > and reliability.
> >
> > Particularly problematic with iSCSI is the
> > unpredictability of the performance.  Performance is
> > great with no packet drop.  However, even a small
> > amount of congestion can cause a sudden large drop in
> > performance.  This can be difficult to predict as a
> > network that is almost but not quite congested can run
> > great, but a small incremental change of any sort can
> > cause the performance to become suddenly unacceptable.
> > For FC, or other protocol using link level flow
> > control, the reduction in performance is much more
> > graceful and incremental when the level of congestion
> > is small and intermittent.
> >
> > A second major problem with iSCSI is the unbounded
> > nature of worst case latency.  When a storage network
> > fails, it is desirable to detect the failure in a
> > fraction of a second and transition to a backup
> > network.  TCP, when implemented to the standards, can
> > take many seconds or minutes to determine that a
> > network has failed and close the connection.  RFC
> > 2988, for instance, requires that the minimum
> > retransmission be one second.  This means a single
> > dropped packet may add one second to the latency of
> > outstanding commands.  This is a huge amount of time
> > on a 10G link.  No doubt this could be mitigated by
> > drastically reducing the timeouts within TCP, but the
> > market seems to be surprisingly resistant to tampering
> > with accepted standards here.
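> >
> > To put rough numbers on that, the little program below is nothing more
> > than back-of-the-envelope arithmetic (no measurement involved): it sets
> > the one-second RTO floor against what a 10G link can carry in that time
> > and against the serialization time of a single 4 KB I/O.
> >
> > /* Back-of-the-envelope only: the 1-second minimum RTO of RFC 2988
> >  * compared with 10 Gb/s link speeds. */
> >
> > #include <stdio.h>
> >
> > int main(void)
> > {
> >     const double link_bps  = 10e9;    /* 10 Gb/s line rate         */
> >     const double min_rto_s = 1.0;     /* RFC 2988 minimum RTO      */
> >     const double io_bytes  = 4096.0;  /* a typical 4 KB OLTP block */
> >
> >     double stall_capacity = link_bps / 8.0 * min_rto_s;   /* bytes   */
> >     double io_serialize_s = io_bytes * 8.0 / link_bps;    /* seconds */
> >
> >     printf("line capacity covered by one RTO stall: %.2f GB\n",
> >            stall_capacity / 1e9);
> >     printf("serialization time of one 4 KB I/O:     %.2f us\n",
> >            io_serialize_s * 1e6);
> >     printf("stall / serialization ratio:            %.0fx\n",
> >            min_rto_s / io_serialize_s);
> >     return 0;
> > }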
> >
> > Overall, the FC and FCP protocol have a lot in common
> > with the Intel i86 instruction set architecture.  They
> > are overly complex, and rather poorly designed by
> > modern standards.  But they are good enough, and there
> > is a huge amount of value add that has been built on
> > top of them, and therefore little incentive to change.
> > FCoE is an interesting idea because it preserves 90%
> > of the existing value add of FC, unifies the physical
> > link with Ethernet, and uses the reliable link method
> > of packet delivery.
> >
> > There are two significant possibilities for iSCSI to
> > displace FC (or FCoE) in enterprise storage networks.
> > First is if the networks start to scale to large
> > enough size that FC can't be made sufficiently
> > reliable, and second if CPU compute cycles become
> > sufficiently cheap that the iSCSI protocol can be run
> > in host software with no negative performance impact.
> > Barring either of these, it seems that iSCSI will have
> > an uphill battle, and FCoE may have a place.
> >
> > -----Original Message-----
> > From: Julian Satran [mailto:Julian_Satran@il.ibm.com]
> > Sent: Tuesday, April 24, 2007 3:10 PM
> > To: ips@ietf.org
> > Subject: [Ips] Recent comments about FCoE and iSCSI
> >
> >
> >
> > Dear All,
> >
> > The trade press is lately full of comments about the
> > latest and greatest reincarnation of Fiber Channel
> > over Ethernet.
> > It made me try to summarize all the long and hot
> > debates that preceded the advent of iSCSI.
> > Although FCoE proponents make it look like no debate
> > preceded iSCSI, that was not so - FCoE was considered
> > even then and was dropped as a dumb idea.
> >
> > Here is a summary (as far as I can remember) of the
> > main arguments. They are not bad arguments even in
> > retrospect, and technically FCoE doesn't look better
> > than it did then.
> >
> > Feel free to use this material in any form. I expect
> > this group to seriously expand my arguments and make
> > them public - in personal or collective form.
> >
> > And do not forget - it is a technical dispute -
> > although we all must have some doubts about the way it
> > is pursued.
> >
> > Regards,
> > Julo
> >
> > ---------------------------------------------------------------------
> >
> >
> > What a piece of nostalgia :-)
> >
> > Around 1997, when a team at IBM Research (Haifa and
> > Almaden) started looking at connecting storage to
> > servers using the "regular network" (the ubiquitous
> > LAN), we considered many alternatives (another team
> > even had a look at ATM - still a computer network
> > candidate at the time). I won't take you through all of
> > our rationale (we went over some of it again at
> > the end of 1999 with a team from CISCO before we
> > convened the first IETF BOF in 2000 at Adelaide that
> > resulted in iSCSI and all the rest), but the
> > reasons we chose to drop Fiber Channel over raw
> > Ethernet were multiple:
> >
> > Fiber Channel Protocol (SCSI over the Fiber Channel
> > link) is "mildly" effective because:
> > - it implements endpoints in a dedicated engine
> >   (offload)
> > - it has no transport layer (recovery is done at the
> >   application layer under the assumption that the
> >   error rate will be very low)
> > - the network is limited in physical span and logical
> >   span (number of switches)
> > - flow-control/congestion control is achieved with a
> >   mechanism adequate for a limited-span network
> >   (credits; a toy sketch of the idea follows this
> >   list). The packet loss rate is almost nil, and that
> >   allows FCP to avoid using a transport (end-to-end)
> >   layer
> > - FCP switches are simple (addresses are local and
> >   the memory requirements can be limited through the
> >   credit mechanism)
> > However:
> > - FCP endpoints are inherently costlier than simple
> >   NICs – the cost argument (initiators are more
> >   expensive)
> > - The credit mechanism is highly unstable for large
> >   networks (check switch vendors' planning docs for
> >   the network diameter limits) – the scaling argument
> > - The assumption of low losses due to errors might
> >   radically change when moving from 1 to 10 Gb/s – the
> >   scaling argument
> > - Ethernet has no credit mechanism, and any mechanism
> >   with a similar effect increases the endpoint cost.
> >   Building a transport layer in the protocol stack has
> >   always been the preferred choice of the networking
> >   community – the community argument
> > - The "performance penalty" of a complete protocol
> >   stack has always been overstated (and overrated).
> >   Advances in protocol stack implementation and finer
> >   tuning of the congestion control mechanisms make
> >   conventional TCP/IP perform well even at 10 Gb/s and
> >   over. Moreover, the multicore processors that have
> >   become dominant on the computing scene have enough
> >   compute cycles available to make any "offloading"
> >   possible as a mere code-restructuring exercise (see
> >   the stack reports from Intel, IBM etc.)
> > - Building on a complete stack makes available a
> >   wealth of operational and management mechanisms
> >   built over the years by the networking community
> >   (routing, provisioning, security, service location
> >   etc.) – the community argument
> > - Higher-level storage access over an IP network is
> >   widely available, and having both block and file
> >   served over the same connection with the same
> >   support and management structure is compelling – the
> >   community argument
> > - Highly efficient networks are easy to build over IP
> >   with optimal (shortest-path) routing, while Layer 2
> >   networks use bridging and are limited by the logical
> >   tree structure that bridges must follow. The effort
> >   to combine routers and bridges (RBridges) promises
> >   to change that, but it will take some time to
> >   finalize (and we don't know exactly how it will
> >   operate). Until then the scale of Layer 2 networks
> >   is going to be seriously limited – the scaling
> >   argument
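> >
> > Here is the toy sketch of the credit mechanism promised above. It is
> > purely illustrative C, not how FC buffer-to-buffer credit handling is
> > actually implemented in hardware; it only shows why a credit-based link
> > applies back-pressure (the sender simply stalls) instead of dropping
> > frames, which is what lets FCP get away without an end-to-end transport
> > layer on a small fabric.
> >
> > /* Toy model of credit-based (buffer-to-buffer) flow control.
> >  * Illustrative only; real FC credit handling lives in link hardware. */
> >
> > #include <stdbool.h>
> > #include <stdio.h>
> >
> > struct credit_link {
> >     int credits;              /* frames the peer has buffer space for */
> > };
> >
> > /* Sender side: transmit only while credits remain, otherwise wait.
> >  * Because the sender stalls instead of transmitting, the receiver
> >  * never has to drop a frame. */
> > static bool try_send_frame(struct credit_link *link)
> > {
> >     if (link->credits == 0)
> >         return false;         /* hold the frame; no drop, no timeout */
> >     link->credits--;
> >     return true;
> > }
> >
> > /* Receiver side: each freed receive buffer returns one credit. */
> > static void return_credit(struct credit_link *link)
> > {
> >     link->credits++;
> > }
> >
> > int main(void)
> > {
> >     struct credit_link link = { .credits = 4 };  /* small, fixed pool */
> >     int sent = 0, held = 0;
> >
> >     for (int frame = 0; frame < 10; frame++) {
> >         if (try_send_frame(&link))
> >             sent++;
> >         else
> >             held++;           /* back-pressure instead of packet loss */
> >         if (frame % 3 == 2)
> >             return_credit(&link);  /* receiver drains a buffer */
> >     }
> >     printf("sent=%d held=%d remaining_credits=%d\n",
> >            sent, held, link.credits);
> >     return 0;
> > }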
> >
> >
> > As a side argument – a performance comparison made in
> > 1998 showed SCSI over TCP (a predecessor of the later
> > iSCSI) to perform better than FCP at 1 Gb/s for block
> > sizes typical for OLTP (4-8 KB). That was what
> > convinced us to take the path that led to iSCSI – and
> > we used plain-vanilla x86 servers with plain-vanilla
> > NICs and Linux (with similar measurements conducted on
> > Windows).
> > The networking and storage community acknowledged
> > those arguments and developed iSCSI and the companion
> > protocols for service discovery, boot etc.
> >
> > The community also acknowledged the need to support
> > existing infrastructure and extend it in a reasonable
> > fashion, and developed two protocols: iFCP (to support
> > hosts with FCP drivers and IP connections, connecting
> > to storage through a simple conversion from FCP to TCP
> > packets) and FCIP, which extends the reach of FCP through
> > IP (connecting FCP islands through TCP links). Both have
> > been implemented and their foundation is solid.
> >
> > The current attempt at developing a "new-age" FCP over
> > an Ethernet link goes against most of the
> > arguments that gave us iSCSI etc.
> >
> > It ignores networking layering practice by building an
> > application protocol directly above a link, and thus
> > limits scaling, mandates elements at the link layer
> > and application layer that make applications more
> > expensive, and leaves aside the whole "ecosystem" that
> > accompanies TCP/IP (and not Ethernet).
> >
> > In a related effort (and at one point also while
> > developing iSCSI) we also considered moving away from
> > SCSI (as some non-standardized but, in some circles,
> > popular software did – e.g., NBP), but decided against
> > it.  SCSI is a mature and well-understood access
> > architecture for block storage and is implemented by
> > many device vendors. Moving away from it would not
> > have been justified at the time.
> >
> >
>
>
_______________________________________________
Ips mailing list
Ips@ietf.org
https://www1.ietf.org/mailman/listinfo/ips