RE: [nfsv4] pnfs: efficient file-layout striping proposal

Marc Eshel <eshel@almaden.ibm.com> Fri, 07 July 2006 18:16 UTC

In-Reply-To: <C98692FD98048C41885E0B0FACD9DFB8028CE8CF@exnane01.hq.netapp.com>
To: "Noveck, Dave" <Dave.Noveck@netapp.com>
Subject: RE: [nfsv4] pnfs: efficient file-layout striping proposal
Message-ID: <OF7204E1BF.529FA258-ON882571A4.00571D13-882571A4.00646C15@us.ibm.com>
From: Marc Eshel <eshel@almaden.ibm.com>
Date: Fri, 07 Jul 2006 11:16:45 -0700
Cc: "Goodson, Garth" <Garth.Goodson@netapp.com>, nfsv4@ietf.org

"Noveck, Dave" <Dave.Noveck@netapp.com> wrote on 07/07/2006 03:33:43 AM:

> Marc Eshel wrote:
> > I would vote against the last option to have the client share parts
> > of the layouts among different files, it sounds too complicated for 
> > the client to manage and it is a big change to introduce at this
> point. 
> 
> I want to clarify something here.  As I understand it, this proposal
> does
> not have any client-managed sharing, which would indeed be complicated.
> The proposal here is that the server defines any sharing (typically
> among layouts for all files on a given fs) and presents the layouts
> to the client with the common/shared portions factored out and made
> referenceable as shared quasi-devices.  I don't see any additional
> management difficulty for the client here.  He gets the layout
> information (now smaller than it was), fetches the device info
> for devices mentioned in the layout (now bigger than it was) if it
> has not already been fetched for a different file, and proceeds
> just as he did before.  The point is that if he accesses more than
> one file, the total data he has to fetch is considerably smaller,
> given the sharing.
 
Where does the client hang the shared parts of the layout? How long should 
the client keep them around? How does the server know that the client 
still holds the shared part? These are a few of the issues that will 
make client management of layouts more complicated.
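To make those questions concrete, one possible shape for the client-side 
bookkeeping (purely an illustrative C sketch; none of these names come 
from the draft) is a reference-counted cache of fetched (quasi-)device 
info, keyed by device ID:

#include <stdint.h>
#include <stdlib.h>

struct dev_entry {
    uint64_t dev_id;        /* device or quasi-device ID from a layout */
    void    *dev_info;      /* decoded GETDEVICEINFO payload, e.g. a stripe table */
    uint32_t refcount;      /* layouts currently referencing this entry */
    struct dev_entry *next;
};

/* Find a cached entry and bump its refcount; NULL means the client
 * still has to issue GETDEVICEINFO for this ID. */
static struct dev_entry *dev_cache_lookup(struct dev_entry *head, uint64_t id)
{
    for (struct dev_entry *e = head; e != NULL; e = e->next)
        if (e->dev_id == id) {
            e->refcount++;
            return e;
        }
    return NULL;
}

/* Insert a freshly fetched entry at the head of the cache. */
static struct dev_entry *dev_cache_insert(struct dev_entry **head,
                                          uint64_t id, void *info)
{
    struct dev_entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    e->dev_id = id;
    e->dev_info = info;
    e->refcount = 1;
    e->next = *head;
    *head = e;
    return e;
}

/* Called when a layout referencing the entry is returned; the entry
 * could be freed at zero, or kept until the server invalidates it. */
static void dev_cache_release(struct dev_entry *e)
{
    if (e->refcount > 0)
        e->refcount--;
}

That still leaves open the question of when the server may assume the 
client has dropped a shared entry, which is really the point above.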

> > I am not sure that the second option is saving much more so I 
> > would vote for skipping the additional complexity unless you 
> > have some realistic examples that show the actual space saving.
> 
> Let me give you a real example from a real file system that it
> just so happens I know about: what our marketing has dubbed
> the ONTAP GX High-Performance Option.  My expectation, though,
> is that this issue will be seen by other file systems that need
> to stripe data, and the intention is to be able to solve the issue
> for all such file systems.
> 
> When you stripe data according to a fixed rotation among devices,
> you are subject to problems when applications just happen to
> access data repetitively through the file and the interval happens
> to match the stripe size times the number of stripes.  There go
> the benefits of striping, and the application might take ten times
> as long as the customer expected.
> 
> One answer is to give people control over the stripe size and 
> the number of stripes and then when the application runs much 
> more slowly than expected, you can say, "We gave you the rope 
> and you hung yourself.  Gosh, that's too bad".  Often, however, 
> customers are reluctant to see things in that light, and so 
> there is a need for a different solution.
> 
> One answer here is to add a degree of quasi-randomness to the 
> rotation among devices.  Then any particular pattern of data
> access that an application exhibits may hit the same device
> for a bit but won't do so for long.
> 
> So if you do that and want to pnfs-ize access, then either the
> client and the server have to agree on the quasi-random algorithm, 
> which would complicate pnfs unduly (since different file systems 
> want slightly different algorithms), or the server sends a table
> that reflects the algorithm, which is obviously better from
> a specification/interoperability point of view.

I agree that file systems use some algorithm for the allocation, but at 
least in our case it is not guaranteed that it will be followed. It is 
possible that one device will run out of space and you will need to 
redirect allocations to another device, or that a file is re-striped for 
better space management. My point is that once the planned pattern is 
broken, you need to check the actual mapping to see where each segment 
really resides. So I don't know about other file systems, but ours will 
not be able to use this option.
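For what it is worth, here is a minimal C sketch (all names made up) of 
the table-driven mapping Dave describes, just to make the mechanism 
concrete; it assumes the table the server ships is authoritative, which 
is exactly what is not guaranteed in our case:

#include <stdint.h>

struct stripe_layout {
    uint64_t  stripe_unit;      /* bytes per stripe unit */
    uint32_t  table_len;        /* e.g. 4000 entries */
    uint32_t *table;            /* quasi-random rotation of device indices */
};

/* Map a byte offset in the file to the index of the device that,
 * according to the table, holds that stripe unit. */
static uint32_t device_for_offset(const struct stripe_layout *l,
                                  uint64_t offset)
{
    uint64_t stripe_no = offset / l->stripe_unit;
    return l->table[stripe_no % l->table_len];
}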
 
> The problem that Garth is addressing here is that that table
> can, and normally will be, big.  If it is small, then you 
> basically have a fixed stripe rotation again, albeit one
> that is a little bit larger than the number of devices.  So
> file systems will want this to be rather large to avoid the
> possibility of applications hitting particular strides that
> wind up on the same devices.  And in fact, in this particular
> file system, the table for each file would be four thousand
> entries, which depending on the details of how we decide to
> do the encoding is between 16K and maybe 32K or so of layout
> data, for each file.  That's what leads to the desire to share 
> among files.
> 
> It is true that you could shorten the table for small files but
> that adds lots of complexity since files can be extended.  The
> file variant works best when all clients have the layout 
> information readily at hand, which makes partial layouts 
> very undesirable.
 
We can have different layout types to keep the really simple case simple.

> So the saving grace here, in this file system, and I expect in
> others, is that because the same algorithm generated the table
> for each file, the tables will be essentially the same for 
> different files, making the benefits of sharing immense.  In
> the case of this file system, the only difference between the
> device striping patterns is that each file starts in a different
> place (because you don't want everybody's stripe zero to be in
> the same place) in the same table and I would imagine that would
> be a common pattern.  So now you have a single quasi-random
> pattern which is many K of data, which you can ship over in each
> layout, or ship over once and have individual layouts with
> very small numbers of bytes referring to this large shared
> entity, in Garth's proposal in the form of a quasi-device. 
> What is left in the layout is so small compared to what you save 
> that the saving is basically some number like 16K or 32K
> times the number of different files the client is referencing 
> minus one, which can get to be a pretty big number.

Maybe if we have this pattern per file system, it will be simple enough to 
implement, and it might be useful for creating new files, where the 
pattern is a recommendation (in our case). I still think that this is a 
big change to take at this point.
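(For a rough sense of the numbers quoted above: a table of four thousand 
entries at roughly 4 to 8 bytes per entry is the 16K to 32K of layout 
data Dave mentions, so a client touching, say, 100 such files would 
otherwise hold on the order of 1.6MB to 3.2MB of nearly identical 
tables, versus one shared copy plus a small per-file reference under the 
proposal.)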
Marc. 
 
 
> -----Original Message-----
> From: Marc Eshel [mailto:eshel@almaden.ibm.com] 
> Sent: Friday, July 07, 2006 1:12 AM
> To: Goodson, Garth
> Cc: Noveck, Dave; nfsv4@ietf.org
> Subject: Re: [nfsv4] pnfs: efficient file-layout striping proposal
> 
> Garth,
> There is no question that number one (Simplest) is a big space saver. 
> Repeating the FH for each segment in a one-to-one mapping is silly.
> So I definitely vote for the first option. I am not sure that the second
> option is saving much more so I would vote for skipping the additional
> complexity unless you have some realistic examples that show the actual
> space saving. I would vote against the last option to have the client
> share parts of the layouts among different files, it sounds too
> complicated for the client to manage and it is a big change to introduce
> at this point.
> Marc.
> 
> Garth Goodson <Garth.Goodson@netapp.com> wrote on 07/06/2006 06:12:38 PM:
> 
> > Our striped file-system implementation uses a complex algorithm to 
> > generate a large table (1000s of entries) where each entry contains a 
> > data server ID.  The order of these entries dictate the striping 
> > order; data server IDs can be repeated multiple times within this 
> > table and the
> 
> > order may be pseudo random.  For example:
> > 
> > 5 Data servers: 0, 1, 2, 3, 4
> > 
> > Stripe table example: 0 1 3 1 4 2 0 3 2 4 ...
> > 
> > Large stripe rules like this need an efficient representation: given
> > the current one, the client will use a huge amount of memory to store
> > these layouts, since each table entry holds the <device_id,
> > equivalent device_ids, filehandle>.
> > 
> > There are a number of optimizations that can be made to increase the 
> > amount of the layout that can be shared between files.  I'll describe 
> > two such optimizations, going from simplest to more aggressive. 
> > Lastly, I'll describe another optimization that can be made by
> > putting these together.
> > 
> > 1) Simplest:
> > ------------
> > 
> > struct nfsv4_file_layout {
> >          pnfs_deviceid4          dev_id<>;
> >          nfs_fh4                 fh;
> > };
> > 
> > struct nfsv4_file_layouttype4 {                /* Per file */
> >      stripetype4             stripe_type;
> >      bool                    commit_through_mds;
> >      length4                 stripe_unit;
> >      length4                 file_size;
> >      uint32_t                stripe_devs<>;    <---- NEW
> >      nfsv4_file_layout       dev_list<>;
> > };
> > 
> > This modification changes the nfsv4_file_layouttype4 structure by 
> > introducing a new array.  This array ('stripe_devs') is simply an 
> > array of indices into the 'dev_list' array.
> > 
> > From the example above, the stripe_devs array holds: 0 1 3 1 4 2 0 3
> > 2 4 ...  Each entry in the stripe_devs array indexes into the
> > dev_list array.  The dev_list array is of length 5 and holds the
> > device IDs and filehandles for each of the 5 devices.  As one can
> > see, depending on the length of stripe_devs, there can be a
> > significant space reduction by using this structure; filehandles
> > aren't repeated multiple times.
> > 
> > If the size of stripe_devs is zero, the layout reverts to its
> > previous layout behavior (striping across the entries listed in
> > dev_list).
> > 
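As an aside, a minimal C sketch of how a client might resolve an I/O 
offset under option (1); the types loosely follow the XDR above and the 
helper name is made up:

#include <stdint.h>
#include <stddef.h>

struct file_layout_entry {          /* one dev_list<> element */
    uint64_t dev_id;
    /* nfs_fh4 fh;  omitted for the sketch */
};

struct file_layout {
    uint64_t  stripe_unit;
    uint32_t *stripe_devs;          /* indices into dev_list, may be empty */
    size_t    stripe_devs_len;
    struct file_layout_entry *dev_list;
    size_t    dev_list_len;
};

/* Pick the dev_list entry (device ID plus filehandle) that an I/O at
 * the given byte offset should be sent to. */
static const struct file_layout_entry *
entry_for_offset(const struct file_layout *l, uint64_t offset)
{
    uint64_t su = offset / l->stripe_unit;

    if (l->stripe_devs_len == 0)    /* previous behavior: stripe over dev_list */
        return &l->dev_list[su % l->dev_list_len];

    return &l->dev_list[l->stripe_devs[su % l->stripe_devs_len]];
}
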
> > 2) More aggressive space optimization:
> > -------------------------------------
> > 
> > struct nfsv4_file_layout_simple_device4 {
> >      string  r_netid<>; /* network ID */
> >      string  r_addr<>;  /* universal address */
> > };
> > 
> > union nfsv4_file_layout_device4 switch (file_layout_device_type) {
> >      case SIMPLE:                                 <--- NEW
> >          nfsv4_file_layout_simple_device4 dev;    <--- NEW
> >      case COMPLEX:                                <--- NEW
> >          pnfs_deviceid4 dev_list<>;               <--- NEW
> >      default:
> >          void;
> > };
> > 
> > This step adds slightly more complexity.  It replaces the device ID 
> > for file layouts with a union of either a 'SIMPLE' or a 'COMPLEX' 
> > device.  A simple device is the same as the device structure we had
> > in the past
> > ({r_netid<>, r_addr<>}).  We have added a COMPLEX device which can be 
> > constructed out of multiple SIMPLE devices.  My initial feeling is 
> > that the COMPLEX device is simply an array of other SIMPLE device IDs.
> > 
> > We could extend this such that COMPLEX devices may be made up of
> > other COMPLEX devices.  In this case there MUST be no loops, each device
> > must resolve down to a simple device.  My feeling is that we won't 
> > have complicated structures of devices, just COMPLEX devices made up 
> > of simple devices.
> > 
> > There are also two ways of striping over these COMPLEX devices.  1) 
> > distributing the stripe_unit uniformly across all devices once the 
> > device list has been flattened.  2) distributing the stripe_unit 
> > across the top level devices and letting the next level take a
> > share of the top level node's stripe unit.
> > 
> > E.g., with COMPLEX devices 1 and 4 constructed as:
> >      1 -> {2,3}
> >      4 -> {5,6}
> > 
> > Device tree:
> >    1       4
> >   / \     / \
> > 2   3   5   6  <- leaf devices (actual data servers)
> > 
> > If the stripe unit is 256KB, then in (1) 2, 3, 5, 6 are uniformly
> > striped with 256KB on each node.  In (2) 2 and 3 each have 128KB
> > (half of 1's 256KB), and 5 and 6 also each have 128KB (half of 4's
> > 256KB).
> > 
> > I think my preference is for the simplicity of (1). The main idea is 
> > that once the client has this set of devices, it flattens it and
> > uses it as the device list over which to stripe data.
> > 
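A rough sketch, under interpretation (1), of how a client could flatten 
a COMPLEX device before striping; the lookup callback and all of the 
names here are hypothetical, not part of the proposal:

#include <stdint.h>
#include <stddef.h>

enum dev_type { DEV_SIMPLE, DEV_COMPLEX };

struct pnfs_device {
    enum dev_type   type;
    /* DEV_SIMPLE: addressing info (r_netid/r_addr) omitted here */
    const uint64_t *components;     /* DEV_COMPLEX: member device IDs */
    size_t          n_components;
};

/* Caller-supplied resolver, e.g. backed by the client's device cache. */
typedef const struct pnfs_device *(*dev_lookup_fn)(uint64_t dev_id);

/* Append all leaf (SIMPLE) device IDs reachable from dev_id to out[]
 * and return the new count; the "no loops" rule above guarantees that
 * the recursion terminates. */
static size_t flatten(uint64_t dev_id, dev_lookup_fn lookup,
                      uint64_t *out, size_t n, size_t max)
{
    const struct pnfs_device *d = lookup(dev_id);

    if (d->type == DEV_SIMPLE) {
        if (n < max)
            out[n++] = dev_id;
        return n;
    }
    for (size_t i = 0; i < d->n_components; i++)
        n = flatten(d->components[i], lookup, out, n, max);
    return n;
}

The flattened array would then serve as the list over which stripe 
units are distributed uniformly.
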
> > 3) One more optimization -- offsets into COMPLEX devices...
> > ------------------------
> > 
> > It may also be useful to shift the layouts of files slightly for each 
> > new file.  For instance, using the above example of: 0 1 3 1 4 2 0 3 2 4.
> > The second file created may use the same layout, shifted by 1, or
> > starting at offset 1, wrapping offset 0 to the end: 1 3 1 4 2 0 3 2 4 0.
> > Today, these two files would have two totally different layouts -- 
> > although there really is a lot of sharing between them.
> > 
> > In order to achieve the greatest degree of sharing it is necessary to 
> > encode the starting offset within the table.  The offset can either
> > be given in the layout itself as an offset within a complex device
> > (in the 'nfsv4_file_layout' structure), or it can be given within
> > the COMPLEX device structure; this allows complex devices to be
> > defined in terms of another (e.g., device B is device A with a shift
> > of X).
> > 
> > My preference is the former: adding an offset to the
> > 'nfsv4_file_layout' structure indicating the starting offset within
> > the complex device.  This works well if we either a) limit ourselves
> > to complex devices comprised of simple devices, or b) flatten complex
> > devices completely before use.
> > 
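A tiny sketch of the former: the per-file layout carries just a start 
offset into the shared table (the quasi-device), and the lookup adds it 
in.  All names here are illustrative:

#include <stdint.h>

struct shared_stripe_table {        /* the shared quasi-device, fetched once */
    uint32_t  len;                  /* e.g. 4000 */
    uint32_t *devs;                 /* indices of simple devices */
};

struct per_file_layout {
    uint64_t stripe_unit;
    uint32_t start_offset;          /* this file's rotation into the table */
    const struct shared_stripe_table *table;
};

/* Map a byte offset to a device index, rotating the shared table by
 * this file's start offset (so file B is file A shifted by one, etc.). */
static uint32_t dev_index_for_offset(const struct per_file_layout *l,
                                     uint64_t offset)
{
    uint64_t su = offset / l->stripe_unit;
    return l->table->devs[(su + l->start_offset) % l->table->len];
}
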
> > 
> > Lastly: Device Equivalence
> > --------------------------
> > 
> > COMPLEX devices make it harder to think about device equivalence for 
> > multi-pathing.  Device equivalence is currently indicated by an
> > array of device ids ('dev_id<>') within the 'nfsv4_file_layout'
> > structure.
> > 
> > Since COMPLEX devices can be constructed from multiple devices, it
> > may not make sense for device equivalence to be conveyed through the
> > layout.
> > 
> > My suggestion is to push it down to the device structure
> > ('nfsv4_file_layout_device4') by changing the SIMPLE device type to
> > be an array of equivalent SIMPLE devices.
> > 
> > So, the following changes:
> > 
> > struct nfsv4_file_layout {
> >          pnfs_deviceid4          dev_id;  <--- Changed
> >          nfs_fh4                 fh;
> > };
> > 
> > union nfsv4_file_layout_device4 switch (file_layout_device_type) {
> >      case SIMPLE:
> >          nfsv4_file_layout_simple_device4 dev<>;  <--- Changed
> >      case COMPLEX:
> >          pnfs_deviceid4 dev_list<>;
> >      default:
> >          void;
> > };
> > 
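With equivalence pushed down this way, client-side multipathing could 
look something like the following; connect_to() and the structure names 
are assumptions for illustration, not from the proposal:

#include <stddef.h>

struct simple_dev_addr {
    const char *r_netid;            /* e.g. "tcp" */
    const char *r_addr;             /* universal address */
};

struct simple_device {
    struct simple_dev_addr *addrs;  /* equivalent paths to one data server */
    size_t addrs_len;
};

/* Hypothetical transport helper; a real client would attempt the
 * connection here.  Stubbed to always succeed for the sketch. */
static int connect_to(const struct simple_dev_addr *a)
{
    (void)a;
    return 0;
}

/* Return the index of the first equivalent address that works, or -1
 * if none of the paths to this data server are reachable. */
static int pick_path(const struct simple_device *d)
{
    for (size_t i = 0; i < d->addrs_len; i++)
        if (connect_to(&d->addrs[i]) == 0)
            return (int)i;
    return -1;
}
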
> > 
> > Conclusion
> > ----------
> > 
> > For our implementation these optimizations can result in huge space 
> > savings at the cost of a small increase in complexity.  It allows the client to 
> > share layouts and device striping patterns much more efficiently.  It 
> > also allows for complex stripe patterns.
> > 
> > Please read this carefully and provide constructive feedback.
> > 
> > PS. Thanks to Dave Noveck for helping with this design.
> > 
> > -Garth
> > 
> 


_______________________________________________
nfsv4 mailing list
nfsv4@ietf.org
https://www1.ietf.org/mailman/listinfo/nfsv4