RE: [nfsv4] pnfs: efficient file-layout striping proposal

"Noveck, Dave" <Dave.Noveck@netapp.com> Fri, 07 July 2006 10:33 UTC

Subject: RE: [nfsv4] pnfs: efficient file-layout striping proposal
Date: Fri, 07 Jul 2006 06:33:43 -0400
Message-ID: <C98692FD98048C41885E0B0FACD9DFB8028CE8CF@exnane01.hq.netapp.com>
From: "Noveck, Dave" <Dave.Noveck@netapp.com>
To: Marc Eshel <eshel@almaden.ibm.com>, "Goodson, Garth" <Garth.Goodson@netapp.com>
Cc: nfsv4@ietf.org

Marc Eshel wrote:
> I would vote against the last option to have the client share parts
> of the layouts among different files; it sounds too complicated for
> the client to manage and it is a big change to introduce at this
> point.

I want to clarify something here.  As I understand it, this proposal
does not involve any client-managed sharing, which would indeed be
complicated.  The proposal is that the server defines any sharing
(typically among the layouts for all files on a given fs) and presents
the layouts to the client with the common/shared portions factored out
and made referenceable as shared quasi-devices.  I don't see any
additional management difficulty for the client here.  It gets the
layout information (now smaller than it was), fetches the device info
for the devices mentioned in the layout (now bigger than it was) if it
hasn't already fetched it for a different file, and proceeds just as it
did before.  The point is that if the client accesses more than one
file, the total data it has to fetch is considerably smaller, given the
sharing.
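
To make that concrete, here is a rough sketch (in C, purely
illustrative; the cache and the fetch callback are stand-ins of my own,
not anything from the drafts) of a client keeping a single device-info
cache across files, so each shared quasi-device is fetched at most
once:

    /* Hypothetical per-client cache of device info, keyed by device ID.
     * The per-file layout only names a device ID; the large shared
     * table lives in the device info and is fetched once per device. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint64_t deviceid_t;

    struct device_info {
        deviceid_t  id;
        uint32_t    table_len;     /* length of the shared stripe table */
        uint32_t   *stripe_table;  /* the quasi-random rotation         */
    };

    struct cache_entry {
        struct device_info *dev;
        struct cache_entry *next;
    };

    static struct cache_entry *device_cache;   /* one per client */

    /* Return device info for 'id', invoking the fetch callback (e.g. a
     * GETDEVICEINFO wrapper) only on a cache miss, i.e. at most once
     * per quasi-device no matter how many files reference it. */
    struct device_info *
    get_device(deviceid_t id, struct device_info *(*fetch)(deviceid_t))
    {
        for (struct cache_entry *e = device_cache; e; e = e->next)
            if (e->dev->id == id)
                return e->dev;       /* already shared with another file */

        struct device_info *dev = fetch(id);
        struct cache_entry *ent = malloc(sizeof(*ent));
        if (ent) {
            ent->dev = dev;
            ent->next = device_cache;
            device_cache = ent;
        }
        return dev;
    }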

> I am not sure that the second option is saving much more so I 
> would vote for skipping the additional complexity unless you 
> have some realistic examples that show the actual space saving.

Let me give you a real example from a real file system that I happen to
know about: what our marketing has dubbed the ONTAP GX High-Performance
Option.  My expectation, though, is that this issue will be seen by
other file systems that need to stripe data, and the intention is to be
able to solve it for all such file systems.

When you stripe data according to a fixed rotation among devices, you
are subject to problems when an application just happens to access data
repetitively through the file and the access interval happens to match
the stripe size times the number of stripes.  There go the benefits of
striping, and the application might take ten times as long as the
customer expected.
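
For example (illustrative numbers only): with a 256K stripe unit across
four devices, an application that touches the file at 1M intervals
lands on the same device on every single access.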

One answer is to give people control over the stripe size and 
the number of stripes and then when the application runs much 
more slowly than expected, you can say, "We gave you the rope 
and you hung yourself.  Gosh, that's too bad".  Often, however, 
customers are reluctant to see things in that light, and so 
there is a need for a different solution.

One answer here is to add a degree of quasi-randomness to the rotation
among devices.  Then any particular access pattern an application
exhibits may hit the same device for a bit, but won't do so for long.

So if you do that and want to pnfs-ize access, then either the
client and the server have to agree on the quasi-random algorithm, 
which would complicate pnfs unduly (since different file systems 
want slightly different algorithms), or the server sends a table
that reflects the algorithm, which is obviously better from
a specification/interoperability point of view.
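
In code terms, the client-side mapping is trivially simple either way;
the difference is just where the pattern comes from.  A rough sketch
(in C, illustrative only, not from any draft):

    #include <stdint.h>

    /* Fixed rotation: an access stride of stripe_unit * ndevs always
     * lands on the same device. */
    static uint32_t
    dev_fixed(uint64_t offset, uint64_t stripe_unit, uint32_t ndevs)
    {
        return (uint32_t)((offset / stripe_unit) % ndevs);
    }

    /* Table-driven: the server ships 'table' (possibly thousands of
     * entries reflecting its quasi-random algorithm); the client just
     * indexes it and never needs to know the algorithm itself. */
    static uint32_t
    dev_table(uint64_t offset, uint64_t stripe_unit,
              const uint32_t *table, uint32_t table_len)
    {
        return table[(offset / stripe_unit) % table_len];
    }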

The problem that Garth is addressing here is that that table can, and
normally will, be big.  If it is small, then you basically have a fixed
stripe rotation again, albeit one that is a little larger than the
number of devices.  So file systems will want the table to be rather
large, to avoid the possibility of applications hitting particular
strides that wind up on the same devices.  And in fact, in this
particular file system, the table for each file would be four thousand
entries, which, depending on the details of how we decide to do the
encoding, is between 16K and maybe 32K or so of layout data for each
file.  That's what leads to the desire to share among files.

It is true that you could shorten the table for small files, but that
adds lots of complexity since files can be extended.  The file layout
variant works best when all clients have the layout information readily
at hand, which makes partial layouts very undesirable.

So the saving grace here, in this file system, and I expect in others,
is that because the same algorithm generated the table for each file,
the tables will be essentially the same for different files, making the
benefits of sharing immense.  In the case of this file system, the only
difference between the device striping patterns is that each file
starts in a different place in the same table (because you don't want
everybody's stripe zero to be in the same place), and I would imagine
that would be a common pattern.  So now you have a single quasi-random
pattern, many K of data, which you can either ship over in each layout,
or ship over once and have individual layouts with very small numbers
of bytes referring to this large shared entity (in Garth's proposal, in
the form of a quasi-device).  What is left in the layout is so small
compared to what you save that the saving is basically some number like
16K or 32K times (the number of different files the client is
referencing, minus one), which can get to be a pretty big number.
 

-----Original Message-----
From: Marc Eshel [mailto:eshel@almaden.ibm.com] 
Sent: Friday, July 07, 2006 1:12 AM
To: Goodson, Garth
Cc: Noveck, Dave; nfsv4@ietf.org
Subject: Re: [nfsv4] pnfs: efficient file-layout striping proposal

Garth,
There is no question that number one (Simplest) is a big space saver.
Repeating the FH for each segment in a one-to-one mapping is silly.
So I definitely vote for the first option.  I am not sure that the
second option is saving much more so I would vote for skipping the
additional complexity unless you have some realistic examples that show
the actual space saving.  I would vote against the last option to have
the client share parts of the layouts among different files; it sounds
too complicated for the client to manage and it is a big change to
introduce at this point.
Marc.

Garth Goodson <Garth.Goodson@netapp.com> wrote on 07/06/2006 06:12:38
PM:

> Our striped file-system implementation uses a complex algorithm to
> generate a large table (1000s of entries) where each entry contains a
> data server ID.  The order of these entries dictates the striping
> order; data server IDs can be repeated multiple times within this
> table and the order may be pseudo-random.  For example:
> 
> 5 Data servers: 0, 1, 2, 3, 4
> 
> Stripe table example: 0 1 3 1 4 2 0 3 2 4 ...
> 
> With the current representation, the client would use a huge amount
> of memory to store layouts with large stripe rules like this;
> currently each table entry holds the <device_id, equivalent
> device_ids; filehandle>.
> 
> There are a number of optimizations that can be made to increase the
> amount of the layout that can be shared between files.  I'll describe
> two such optimizations, going from simplest to more aggressive.
> Lastly, I'll describe another optimization that can be made by putting
> these together.
> 
> 1) Simplest:
> ------------
> 
> struct nfsv4_file_layout {
>          pnfs_deviceid4          dev_id<>;
>          nfs_fh4                 fh;
> };
> 
> struct nfsv4_file_layouttype4 {                /* Per file */
>      stripetype4             stripe_type;
>      bool                    commit_through_mds;
>      length4                 stripe_unit;
>      length4                 file_size;
>      uint32_t                stripe_devs<>;    <---- NEW
>      nfsv4_file_layout       dev_list<>;
> };
> 
> This modification changes the nfsv4_file_layouttype4 structure by 
> introducing a new array.  This array ('stripe_devs') is simply an 
> array of indices into the 'dev_list' array.
> 
> From the example above, the stripe_devs array holds: 0 1 3 1 4 2 0 3
> 2 4 ...  Each entry in the stripe_devs array indexes into the dev_list
> array.  The dev_list array is of length 5 and holds the device IDs and
> filehandles for each of the 5 devices.  As one can see, depending on
> the length of stripe_devs, there can be a significant space reduction
> by using this structure; filehandles aren't repeated multiple times.
> 
> If the size of stripe_devs is zero, the layout reverts to its previous
> layout behavior (striping across the entries listed in dev_list).
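
As an illustration of how little per-file work (1) implies, a client
might resolve an offset to a dev_list entry roughly as below (a C
sketch; the C representation and the helper name are mine, the fields
mirror the XDR above):

    #include <stdint.h>

    struct file_layout_entry {        /* one element of dev_list<>      */
        uint64_t dev_id;              /* first of dev_id<>, for brevity */
        /* nfs_fh4 fh; */             /* filehandle omitted here        */
    };

    struct file_layout {              /* mirrors nfsv4_file_layouttype4 */
        uint64_t                        stripe_unit;
        uint32_t                        stripe_devs_len;
        const uint32_t                 *stripe_devs;  /* indices        */
        uint32_t                        dev_list_len;
        const struct file_layout_entry *dev_list;
    };

    /* Return the dev_list entry (device + filehandle) serving 'offset'. */
    static const struct file_layout_entry *
    entry_for_offset(const struct file_layout *l, uint64_t offset)
    {
        uint64_t su = offset / l->stripe_unit;
        uint32_t i;

        if (l->stripe_devs_len == 0)                    /* old behavior */
            i = (uint32_t)(su % l->dev_list_len);
        else                                            /* table lookup */
            i = l->stripe_devs[su % l->stripe_devs_len];

        return &l->dev_list[i];
    }

With the example table, stripe unit 9 indexes stripe_devs[9] = 4 and
therefore dev_list[4].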
> 
> 2) More aggressive space optimization:
> -------------------------------------
> 
> struct nfsv4_file_layout_simple_device4 {
>      string  r_netid<>; /* network ID */
>      string  r_addr<>;  /* universal address */
> };
> 
> union nfsv4_file_layout_device4 switch (file_layout_device_type) {
>      case SIMPLE:                                 <--- NEW
>          nfsv4_file_layout_simple_device4 dev;    <--- NEW
>      case COMPLEX:                                <--- NEW
>          pnfs_deviceid4 dev_list<>;               <--- NEW
>      default:
>          void;
> };
> 
> This step adds slightly more complexity.  It replaces the device ID
> for file layouts with a union of either a 'SIMPLE' or a 'COMPLEX'
> device.  A simple device is the same as the device structure we had
> in the past ({r_netid<>, r_addr<>}).  We have added a COMPLEX device
> which can be constructed out of multiple SIMPLE devices.  My initial
> feeling is that the COMPLEX device is simply an array of other SIMPLE
> device IDs.
> 
> We could extend this such that COMPLEX devices may be made up of other
> COMPLEX devices.  In this case there MUST be no loops; each device
> must resolve down to a simple device.  My feeling is that we won't
> have complicated structures of devices, just COMPLEX devices made up
> of simple devices.
> 
> There are also two ways of striping over these COMPLEX devices:
> 1) distributing the stripe_unit uniformly across all devices once the
> device list has been flattened, or 2) distributing the stripe_unit
> across the top-level devices and letting the next level take a share
> of the top-level node's stripe unit.
> 
> E.g.  With COMPLEX devices 1 and 4 constructed as such:
> 1 -> {2,3}
> 4 -> {5,6}
> 
> Device tree:
>    1       4
>   / \     / \
>  2   3   5   6  <- leaf devices (actual data servers)
> 
> If the stripe unit is 256KB, then in (1) 2, 3, 5, 6 are uniformly
> striped with 256KB on each node.  In (2) 2, 3 each have 128KB (half of
> 1's 256KB), and 5, 6 also each have 128KB (half of 4's 256KB).
> 
> I think my preference is for the simplicity of (1).  The main idea is
> that once the client has this set of devices, it flattens it and uses
> it as the device list over which to stripe data.
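
For what it's worth, the flattening in (1) is a small piece of client
code.  A sketch (C, illustrative only; the lookup callback stands in
for however the client stores its fetched device info):

    #include <stdint.h>
    #include <stddef.h>

    enum dev_type { SIMPLE, COMPLEX };

    struct device {
        enum dev_type   type;
        /* SIMPLE: r_netid / r_addr would live here */
        uint32_t        ndevs;     /* COMPLEX: number of child devices */
        const uint64_t *dev_list;  /* COMPLEX: child device IDs        */
    };

    /* Append the leaf (SIMPLE) devices under 'id' to out[], depth
     * first.  Assumes no loops, as the proposal requires; returns the
     * new element count. */
    static size_t
    flatten(uint64_t id, const struct device *(*lookup)(uint64_t),
            uint64_t *out, size_t count, size_t max)
    {
        const struct device *d = lookup(id);

        if (d->type == SIMPLE) {
            if (count < max)
                out[count++] = id;
            return count;
        }
        for (uint32_t i = 0; i < d->ndevs; i++)
            count = flatten(d->dev_list[i], lookup, out, count, max);
        return count;
    }

With the tree above, flattening the list {1, 4} yields {2, 3, 5, 6},
over which the 256KB stripe unit then rotates uniformly, per
interpretation (1).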
> 
> 3) One more optimization -- offsets into COMPLEX devices...
> ------------------------
> 
> It may also be useful to shift the layouts of files slightly for each
> new file.  For instance, using the above example of 0 1 3 1 4 2 0 3 2
> 4, the second file created may use the same layout shifted by 1, i.e.
> starting at offset 1 and wrapping offset 0 to the end: 1 3 1 4 2 0 3 2
> 4 0.  Today, these two files would have two totally different
> layouts -- although there really is a lot of sharing between them.
> 
> In order to achieve the greatest degree of sharing it is necessary to
> encode the starting offset within the table.  The offset can either be
> given in the layout itself as an offset within a complex device (in
> the 'nfsv4_file_layout' structure), or it can be given within the
> COMPLEX device structure; the latter allows one complex device to be
> defined in terms of another (e.g., device B is device A with a shift
> of X).
> 
> My preference is the former: adding an offset to the
> 'nfsv4_file_layout' structure indicating the starting offset within
> the complex device.  This works well if we either a) limit ourselves
> to complex devices comprised of simple devices, or b) flatten complex
> devices completely before use.
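
Under that choice, the per-file state really is just one small number.
A sketch of the lookup once the complex device has been flattened (C,
illustrative only; 'start_idx' is my stand-in for the proposed offset
field in 'nfsv4_file_layout'):

    #include <stdint.h>

    /* flat[]    : the flattened (leaf) device list, i.e. the shared
     *             striping pattern such as 0 1 3 1 4 2 0 3 2 4 ...
     * start_idx : per-file starting offset into that pattern; file A
     *             might use 0, file B 1, and so on, so the big table
     *             itself is shared by every file. */
    static uint64_t
    dev_for_offset(const uint64_t *flat, uint32_t nflat, uint32_t start_idx,
                   uint64_t offset, uint64_t stripe_unit)
    {
        return flat[(start_idx + offset / stripe_unit) % nflat];
    }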
> 
> 
> Lastly: Device Equivalence
> --------------------------
> 
> COMPLEX devices make it harder to think about device equivalence for
> multi-pathing.  Device equivalence is currently indicated by an array
> of device ids ('dev_id<>') within the 'nfsv4_file_layout' structure.
> 
> Since COMPLEX devices can be constructed from multiple devices, it may
> not make sense for device equivalence to be conveyed through the
> layout.
> 
> My suggestion is to push it down to the device structure
> ('nfsv4_file_layout_device4') by changing the SIMPLE device type to be
> an array of equivalent SIMPLE devices.
> 
> So, the following changes:
> 
> struct nfsv4_file_layout {
>          pnfs_deviceid4          dev_id;  <--- Changed
>          nfs_fh4                 fh;
> };
> 
> union nfsv4_file_layout_device4 switch (file_layout_device_type) {
>      case SIMPLE:
>          nfsv4_file_layout_simple_device4 dev<>;  <--- Changed
>      case COMPLEX:
>          pnfs_deviceid4 dev_list<>;
>      default:
>          void;
> };
> 
> 
> Conclusion
> ----------
> 
> For our implementation these optimizations can result in a huge space 
> savings at a small increase in complexity.  It allows the client to 
> share layouts and device striping patterns much more efficiently.  It 
> also allows for complex stripe patterns.
> 
> Please read this carefully and provide constructive feedback.
> 
> PS. Thanks to Dave Noveck for helping with this design.
> 
> -Garth
> 