[nfsv4] issue 69: file-layout file striping proposal

I'm attaching the proposal for a more efficient/flexible scheme for 
file-layout file striping.  It is based on my previous posts on this topic.

Most of section 17.1 has been changed, and one line of 17.1.3 has been 
changed.  Feedback and clarity recommendations for the text are most 
welcome.

-Garth

17.  The NFSv4 File Layout Type

    This section describes the semantics and format of NFSv4 file-based
    layouts.

17.1.  File Striping and Data Access

    The file layout type describes a method for striping data across
    multiple devices.  The data for each stripe unit is stored within an
    NFSv4 file located on a particular storage device.

    Before discussing the file layout, it is necessary to describe the
    file layout device type; the structures are as follows:

     struct nfsv4_file_layout_simple_device4 {
            string  r_netid<>; /* network ID */
            string  r_addr<>;  /* universal address */
     };

     union nfsv4_file_layout_device4 switch (file_layout_device_type) {
            case SIMPLE:
                    nfsv4_file_layout_simple_device4 dev_list<>;
            case COMPLEX:
                    pnfs_deviceid4 dev_list<>;
            default:
                    void;
     };

    The "nfsv4_file_layout_device4" structure is a union composed of
    a SIMPLE or a COMPLEX device type.  A Simple device is composed
    of an array of nfsv4_file_layout_simple_device4 structures.  All
    devices identified by a Simple device must be 'equivalent' and
    are used for device multipathing; see Section 17.1.3, Device
    Multipathing, for more details on equivalent devices.  Simple
    devices always refer to actual physical devices.  On the other
    hand, a Complex device is a virtual device that is constructed
    of multiple Simple devices.  Each device within the Complex
    device list is identified by its device ID.  A Complex device
    MUST NOT reference other Complex devices; only Simple devices
    are to be referenced.  This enables multiple physical devices to
    be identified through a single device ID and provides a
    space-efficient mechanism by which to identify multiple devices
    within a layout.  Complex devices can be thought of as a table
    of devices.  Complex and Simple devices share the same device ID
    space and should be cached similarly by the client.
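
    The device table described above can be modelled on the client
    roughly as follows.  This is only a sketch: the Python class names
    and the validate_device_table() helper are hypothetical, and it
    assumes the client already holds every device referenced by a
    Complex device in its cache.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class SimpleAddr:
    """One network path; mirrors nfsv4_file_layout_simple_device4."""
    r_netid: str
    r_addr: str


@dataclass
class SimpleDevice:
    """SIMPLE arm: equivalent paths to one physical device."""
    dev_list: List[SimpleAddr]


@dataclass
class ComplexDevice:
    """COMPLEX arm: device IDs, each of which must name a Simple device."""
    dev_list: List[int]


Device = Union[SimpleDevice, ComplexDevice]


def validate_device_table(table: Dict[int, Device]) -> None:
    """Raise ValueError if a Complex device references anything but a
    known Simple device (assumption: unknown IDs are treated as errors)."""
    for dev_id, dev in table.items():
        if not isinstance(dev, ComplexDevice):
            continue
        for member in dev.dev_list:
            target = table.get(member)
            if target is None:
                raise ValueError(f"device {dev_id}: unknown member {member}")
            if isinstance(target, ComplexDevice):
                raise ValueError(f"device {dev_id}: member {member} is Complex")
```

    Because Simple and Complex devices share one device ID space, a
    single dictionary keyed by device ID is enough to cache both.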

    The structures used to describe the stripe layout are as follows:

     enum stripetype4 {
            STRIPE_SPARSE = 1,
            STRIPE_DENSE = 2
     };

     struct nfsv4_file_layouthint {
            stripetype4             stripe_type;
            length4                 stripe_unit;
            uint32_t                stripe_width;
     };

     struct nfsv4_file_layout {                   /* Per data stripe */
            pnfs_deviceid4          dev_id;
            uint32_t                dev_index;
            nfs_fh4                 fh;
     };

     struct nfsv4_file_layouttype4 {              /* Per file */
            stripetype4             stripe_type;
            bool                    commit_through_mds;
            length4                 stripe_unit;
            length4                 file_size;
            uint32_t                stripe_devs<>;
            nfsv4_file_layout       dev_list<>;
     };

    At a high level, the file layout specifies an ordered array of
    <deviceID, filehandle> tuples, as well as the stripe size, type of
    stripe layout (discussed later), and the file's current size as of
    LAYOUTGET time.

    The "dev_list" array within the nfsv4_file_layouttype4 contains a
    list of nfsv4_file_layout structures.  Each of these structures
    describes one or more physical devices that contribute to a
    stripe of the file.  The "stripe_devs" array contains a list of
    indices into the "dev_list" array; an index of zero specifies
    the first "dev_list" entry.  Each successive index selects a
    "dev_list" entry whose filehandle and device ID are to be used
    next in sequence for that stripe.  This allows an arbitrary
    sequencing through the possible devices to be encoded compactly.
    When the "stripe_devs" array is of zero length, the elements of the
    "dev_list" array are simply used in order, so that the portion of
    the stripe held by the corresponding entry is determined by its
    position within the device list.

    Each "dev_list" entry (the nfsv4_file_layout structure) contains a
    filehandle, device ID, and device index.  The filehandle, "fh",
    identifies the file on the storage device identified by "dev_id".
    The device ID ("dev_id") may refer to either a Simple or a Complex
    device; see the description of the nfsv4_file_layout_device4 for
    details.  When a Complex device is referenced by the "dev_id", the
    "dev_index" field specifies the starting index within the device's
    "dev_list".  If the "dev_id" references a Simple device, the
    "dev_index" has no meaning and should be zero.  The "dev_index"
    plays a critical role in the flattening of a Complex device.

    The client is expected to construct a 'flat' list of devices over
    which the file is striped.  A flat device list can be constructed
    by concatenating each device encountered while traversing
    "stripe_devs" (or "dev_list" in the case of a zero-length
    "stripe_devs" array), while expanding out each Complex device.  The
    flat device list must contain only Simple devices.  The client must
    expand the Complex device's device list by starting at the device
    indexed by "dev_index" and ending with the device prior to
    "dev_index".  All devices in the device list must be consumed; this
    may require wrapping around the end of the array if "dev_index" is
    non-zero.  The stripe width is determined by the stripe unit size
    multiplied by the number of device entries within the flattened
    stripe.
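
    The wrap-around expansion and the stripe width calculation can be
    illustrated with a short sketch.  The function names
    expand_complex() and stripe_width() are hypothetical, and
    rejecting an out-of-range "dev_index" is an assumption; the text
    above does not say how a client should handle that case.

```python
from typing import List


def expand_complex(member_ids: List[int], dev_index: int) -> List[int]:
    """Return every member of a Complex device's list, starting at
    dev_index and wrapping around to end just before dev_index."""
    if not 0 <= dev_index < len(member_ids):
        # Assumption: an out-of-range index is treated as an error.
        raise ValueError("dev_index out of range")
    return member_ids[dev_index:] + member_ids[:dev_index]


def stripe_width(stripe_unit: int, flat_entries: int) -> int:
    """Stripe width = stripe unit size * entries in the flattened list."""
    return stripe_unit * flat_entries
```

    For instance, a Complex device whose list is <3, 4, 5> and a
    "dev_index" of 1 expands to 4, 5, 3: all three devices are
    consumed, with the traversal wrapping past the end of the array.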

    Consider the following example:

    Given a set of devices as follows:

    1->{simple}; 2->{complex, dev_list=<3,4>}; 3->{simple}; 4->{simple}

    Device IDs 1, 3, and 4 are Simple devices.  Device ID 2 is a Complex
    device constructed of Simple devices 3 and 4.

    Within the nfsv4_file_layouttype4, imagine a "dev_list" constructed
    of <device ID, device index, FH> tuples:

    dev_list = [<1, 0, 0x12>, <2, 0, 0x13>, <3, 0, 0x14>, <4, 0, 0x15>]

    And a "stripe_devs" array containing the following indices:

    stripe_devs = [2, 3, 0, 1]

    Using the stripe_devs as indices into the dev_list, we get the
    following ordered list of nfsv4_file_layouts:

    [<3, 0, 0x14>, <4, 0, 0x15>, <1, 0, 0x12>, <2, 0, 0x13>]

    Continuing to flatten the Complex devices gives us the following
    list of 5 simple <device ID, FH> tuples.  Note device 2 is a
    Complex device that gets replaced with devices 3 and 4:

    [<3, 0x14>, <4, 0x15>, <1, 0x12>, <3, 0x13>, <4, 0x13>]
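
    The worked example can be reproduced with a small sketch of the
    flattening procedure.  The flatten() function and the
    representation of the device table (None for a Simple device, a
    list of member IDs for a Complex device) are illustrative
    assumptions, not part of the proposed protocol.

```python
from typing import Dict, List, Optional, Tuple

# None marks a Simple device; a list holds a Complex device's member IDs.
DeviceTable = Dict[int, Optional[List[int]]]
LayoutEntry = Tuple[int, int, int]  # (dev_id, dev_index, fh)


def flatten(devices: DeviceTable,
            dev_list: List[LayoutEntry],
            stripe_devs: List[int]) -> List[Tuple[int, int]]:
    """Build the flat list of (Simple device ID, filehandle) tuples."""
    # A zero-length stripe_devs means dev_list is used in order.
    order = stripe_devs if stripe_devs else range(len(dev_list))
    flat: List[Tuple[int, int]] = []
    for i in order:
        dev_id, dev_index, fh = dev_list[i]
        members = devices[dev_id]
        if members is None:
            flat.append((dev_id, fh))
        else:
            # Start at dev_index and wrap so every member is consumed.
            rotated = members[dev_index:] + members[:dev_index]
            flat.extend((m, fh) for m in rotated)
    return flat


devices = {1: None, 2: [3, 4], 3: None, 4: None}
dev_list = [(1, 0, 0x12), (2, 0, 0x13), (3, 0, 0x14), (4, 0, 0x15)]
stripe_devs = [2, 3, 0, 1]
flat = flatten(devices, dev_list, stripe_devs)
```

    With these inputs, "flat" holds the same five tuples as the list
    above; note that both entries expanded from Complex device 2 carry
    that entry's filehandle, 0x13.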

    The flattened device list specifies the order over which the
    devices must be striped.  It also specifies the filehandle to be
    used for each stripe unit.  Data must be written in increments of
    the stripe unit size.  Devices may be repeated multiple times
    within the flattened device list.  However, if a dense stripe type
    is used (described later), the same filehandle MUST NOT be used on
    the same device for different stripe units of the same file.
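
    A client or server could check the dense-stripe restriction on a
    flattened list with something like the following sketch.  The
    helper name dense_layout_is_valid() is hypothetical, and
    filehandles are shown as integers only for brevity.

```python
from typing import List, Tuple


def dense_layout_is_valid(flat: List[Tuple[int, int]]) -> bool:
    """Return True if no (device ID, filehandle) pair appears more than
    once in the flattened device list."""
    return len(set(flat)) == len(flat)
```

    A device may still appear several times in the list, as long as
    each appearance uses a different filehandle.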

    A data file stored on a storage device MUST map to a single file as
    defined by the metadata server; i.e., data from two files as viewed
    by the metadata server MUST NOT be stored within the same data file
    on any storage device.

    The "stripe_type" field specifies how the data is laid out within the
    data file on a storage device.  It allows for two different data
    layouts: sparse, and dense (or packed).  The stripe type determines the
    calculation that must be made to map the client visible file offset
    to the offset within the data file located on the storage device.
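
    As an illustration of the kind of calculation involved, the sketch
    below maps a file offset to an index in the flattened device list
    and an offset within the data file.  The formulas are assumptions
    based on the usual meaning of the terms (sparse: the data file
    offset equals the file offset; dense: stripe units are packed
    back to back in the data file); they are not taken from this
    section, and map_offset() is a hypothetical name.

```python
from typing import Tuple

STRIPE_SPARSE = 1
STRIPE_DENSE = 2


def map_offset(offset: int, stripe_unit: int, flat_len: int,
               stripe_type: int) -> Tuple[int, int]:
    """Return (index into flattened device list, offset in data file)."""
    unit_no = offset // stripe_unit      # which stripe unit of the file
    flat_index = unit_no % flat_len      # which flattened entry holds it
    if stripe_type == STRIPE_SPARSE:
        # Assumed: sparse data files use the client-visible offset as is.
        return flat_index, offset
    if stripe_type == STRIPE_DENSE:
        # Assumed: each data file holds one unit per full stripe, packed.
        full_stripes = unit_no // flat_len
        return flat_index, full_stripes * stripe_unit + offset % stripe_unit
    raise ValueError("unknown stripe type")
```

    The dense formula relies on each data file receiving exactly one
    stripe unit per full stripe, which holds when no filehandle is
    reused on the same device within the flattened list.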

    The layout hint structure is described in more detail in
    Section 4.15.  It is used by the client, as the value of the
    FILE_LAYOUT_HINT attribute, to specify the type of layout to be
    used for a newly created file.

17.1.3.  Device Multipathing

    The NFSv4 file layout supports multipathing to 'equivalent' devices.
    Device-level multipathing is primarily of use in the case of a data
    server failure --- it allows the client to switch to another storage
    device that is exporting the same data stripe, without having to
    contact the metadata server for a new layout.

    To support device multipathing, an array of device IDs is encoded
+  within the SIMPLE case of the nfsv4_file_layout_device4 union.  This
    array represents an ordered list of devices where the first element
    has the highest priority.  Each device in the list MUST be
    'equivalent' to every other device in the list and each device must
    be attempted in the order specified.

    Equivalent devices MUST export the same system image (e.g., the
    stateids and filehandles that they use are the same) and must provide
    the same consistency guarantees.  Two equivalent storage devices must
    also have sufficient connections to the storage, such that writing to
    one storage device is equivalent to writing to another; this also
    applies to reading.  Also, if multiple copies of the same data exist,
    reading from one must provide access to all existing copies.  As
    such, it is unlikely that multipathing will provide additional
    benefit in the case of an I/O error.
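
    A client's use of the ordered list of equivalent devices might
    look like the sketch below.  The io_with_multipath() function and
    the "do_io" callback are hypothetical, and treating only
    connection failures as a reason to move to the next device is an
    assumption that follows the remark above about I/O errors.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def io_with_multipath(paths: Sequence[str],
                      do_io: Callable[[str], T]) -> T:
    """Attempt the I/O on each equivalent device in priority order,
    moving to the next one only when the connection fails."""
    last_error: Optional[Exception] = None
    for path in paths:
        try:
            return do_io(path)
        except ConnectionError as err:
            # Device unreachable: fall through to the next equivalent
            # device.  Other errors (e.g. I/O errors) propagate.
            last_error = err
    raise ConnectionError("all equivalent devices failed") from last_error
```

    No new layout is requested from the metadata server during the
    loop; the same filehandle and stateid are reused on each device.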
