[nfsv4] issue 69: file-layout file striping proposal
Garth Goodson <Garth.Goodson@netapp.com> Fri, 28 July 2006 00:47 UTC
Message-ID: <44C95EBC.6020602@netapp.com>
Date: Thu, 27 Jul 2006 17:47:56 -0700
From: Garth Goodson <Garth.Goodson@netapp.com>
To: nfsv4@ietf.org, Dave Noveck <dnoveck@netapp.com>
Subject: [nfsv4] issue 69: file-layout file striping proposal

I'm attaching the proposal for a more efficient/flexible scheme for file-layout file striping. It is based on my previous posts on this topic. Most of section 17.1 has been changed, and one line of 17.1.3 has been changed. Feedback and clarity recommendations for the text are most welcome.

-Garth

17. The NFSv4 File Layout Type

This section describes the semantics and format of NFSv4 file-based layouts.

17.1. File Striping and Data Access

The file layout type describes a method for striping data across multiple devices. The data for each stripe unit is stored within an NFSv4 file located on a particular storage device. Before discussing the file layout, it is necessary to describe the file layout device type; the structures are as follows:

    struct nfsv4_file_layout_simple_device4 {
        string r_netid<>;   /* network ID */
        string r_addr<>;    /* universal address */
    };

    union nfsv4_file_layout_device4 switch (file_layout_device_type) {
        case SIMPLE:
            nfsv4_file_layout_simple_device4 dev_list<>;
        case COMPLEX:
            pnfs_deviceid4 dev_list<>;
        default:
            void;
    };

The "nfsv4_file_layout_device4" structure is a union composed of a SIMPLE or a COMPLEX device type.

A Simple device is composed of an array of nfsv4_file_layout_simple_device4 structures. All devices identified by a Simple device must be 'equivalent' and are used for device multipathing; see Section 17.1.3 (Device Multipathing) for more details on equivalent devices. Simple devices always refer to actual physical devices.

On the other hand, a Complex device is a virtual device that is constructed of multiple Simple devices. Each device within the Complex device list is identified by its device ID. A Complex device MUST NOT reference other Complex devices; only Simple devices are to be referenced. This enables multiple physical devices to be identified through a single device ID and provides a space-efficient mechanism by which to identify multiple devices within a layout. Complex devices can be thought of as a table of devices.
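
As an illustration only, the Simple/Complex device model described above could be mirrored on the client side roughly as follows. This is a minimal Python sketch, not part of the proposal: the class names, the `validate_device_table` helper, and the sample addresses are all hypothetical, and the device table is assumed to be a plain dictionary keyed by device ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class SimpleDevice:
    # Each (r_netid, r_addr) pair names one of the 'equivalent' devices
    # that may be used for multipathing (hypothetical representation).
    addrs: List[Tuple[str, str]]


@dataclass
class ComplexDevice:
    # Device IDs of the Simple devices that make up this virtual device.
    dev_list: List[int]


Device = Union[SimpleDevice, ComplexDevice]


def validate_device_table(table: Dict[int, Device]) -> None:
    """Raise ValueError if a Complex device references a non-Simple device."""
    for dev_id, dev in table.items():
        if isinstance(dev, ComplexDevice):
            for member in dev.dev_list:
                if not isinstance(table.get(member), SimpleDevice):
                    raise ValueError(
                        f"Complex device {dev_id} references "
                        f"non-Simple device {member}"
                    )


# Simple and Complex devices share one device ID space, so one table
# holds both kinds. The addresses below are made-up documentation values.
table: Dict[int, Device] = {
    1: SimpleDevice([("tcp", "192.0.2.1.8.1")]),
    2: ComplexDevice([3, 4]),
    3: SimpleDevice([("tcp", "192.0.2.3.8.1")]),
    4: SimpleDevice([("tcp", "192.0.2.4.8.1")]),
}
validate_device_table(table)
```

The check enforces only the "Complex MUST NOT reference Complex" rule; equivalence of the addresses inside a Simple device cannot be verified from the table alone.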
Complex and Simple devices share the same device ID space and should be cached similarly by the client.

The structures used to describe the stripe layout are as follows:

    enum stripetype4 {
        STRIPE_SPARSE = 1,
        STRIPE_DENSE = 2
    };

    struct nfsv4_file_layouthint {
        stripetype4 stripe_type;
        length4 stripe_unit;
        uint32_t stripe_width;
    };

    struct nfsv4_file_layout { /* Per data stripe */
        pnfs_deviceid4 dev_id;
        uint32_t dev_index;
        nfs_fh4 fh;
    };

    struct nfsv4_file_layouttype4 { /* Per file */
        stripetype4 stripe_type;
        bool commit_through_mds;
        length4 stripe_unit;
        length4 file_size;
        uint32_t stripe_devs<>;
        nfsv4_file_layout dev_list<>;
    };

At a high level, the file layout specifies an ordered array of <deviceID, filehandle> tuples, as well as the stripe size, type of stripe layout (discussed later), and the file's current size as of LAYOUTGET time.

The "dev_list" array within the nfsv4_file_layouttype4 contains a list of nfsv4_file_layout structures. Each of these structures describes one or more physical devices that contribute to a stripe of the file.

The "stripe_devs" array contains a list of indices into the "dev_list" array; an index of zero specifies the first "dev_list" entry. Each successive index selects a "dev_list" entry whose filehandle and device ID are to be used next in sequence for that stripe. This allows an arbitrary sequencing through the possible devices to be encoded compactly. When the "stripe_devs" array is of zero length, the elements of the "dev_list" array are simply used in order, so that the portion of the stripe held by the corresponding entry is determined by its position within the device list.

Each "dev_list" entry (the nfsv4_file_layout structure) contains a filehandle, device ID, and device index. The filehandle, "fh", identifies the file on the storage device identified by "dev_id". The device ID ("dev_id") may refer to either a Simple or a Complex device; see the description of the nfsv4_file_layout_device4 for details.
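
The relationship between "stripe_devs" and "dev_list" can be sketched as below. This is a hedged Python illustration under stated assumptions: `LayoutEntry` is a hypothetical stand-in for the nfsv4_file_layout structure, `order_entries` is an invented helper name, and the sample device IDs and filehandles are made up.

```python
from typing import List, Sequence, Tuple

# (dev_id, dev_index, fh): hypothetical stand-in for nfsv4_file_layout.
LayoutEntry = Tuple[int, int, int]


def order_entries(dev_list: Sequence[LayoutEntry],
                  stripe_devs: Sequence[int]) -> List[LayoutEntry]:
    """Return the dev_list entries in the order they are striped over."""
    if len(stripe_devs) == 0:
        # Zero-length stripe_devs: dev_list is used in its own order.
        return list(dev_list)
    # Otherwise each index (zero-based) selects the next entry in sequence.
    return [dev_list[i] for i in stripe_devs]


# Made-up sample values, for illustration only.
dev_list = [(10, 0, 0xA1), (11, 0, 0xA2), (12, 0, 0xA3)]

in_order = order_entries(dev_list, [])
reordered = order_entries(dev_list, [2, 0, 1])
```

Here `in_order` is simply a copy of `dev_list`, while `reordered` begins with the entry for device 12. Because "stripe_devs" holds only small integers, a long or repetitive sequence costs one uint32_t per step instead of a full <device ID, device index, filehandle> entry.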
When a Complex device is referenced by the "dev_id", the "dev_index" field specifies the starting index within the device's "dev_list". If the "dev_id" references a Simple device, the "dev_index" has no meaning and should be zero.

The "dev_index" plays a critical role in the flattening of a Complex device. The client is expected to construct a 'flat' list of devices over which the file is striped. A flat device list can be constructed by concatenating each device encountered while traversing "stripe_devs" (or "dev_list" in the case of a zero-sized "stripe_devs" array), while expanding out each Complex device. The flat device list must contain only Simple devices. The client must expand the Complex device's device list by starting at the device indexed by "dev_index" and ending with the device prior to "dev_index". All devices in the device list must be consumed; this may require wrapping around the end of the array if "dev_index" is non-zero. For instance, a Complex device whose list is <3, 4> expands to 4, 3 when "dev_index" is 1.

The stripe width is determined by the stripe unit size multiplied by the number of device entries within the flattened stripe.

Consider the following example:

Given a set of devices as follows:

    1 -> {simple}
    2 -> {complex, dev_list=<3,4>}
    3 -> {simple}
    4 -> {simple}

Device IDs 1, 3, and 4 are Simple devices. Device ID 2 is a Complex device constructed of Simple devices 3 and 4.

Within the nfsv4_file_layouttype4, imagine a "dev_list" constructed of <device ID, device index, FH> tuples:

    dev_list = [<1, 0, 0x12>, <2, 0, 0x13>, <3, 0, 0x14>, <4, 0, 0x15>]

And a "stripe_devs" array containing the following indices:

    stripe_devs = [2, 3, 0, 1]

Using the stripe_devs as indices into the dev_list, we get the following ordered list of nfsv4_file_layouts:

    [<3, 0, 0x14>, <4, 0, 0x15>, <1, 0, 0x12>, <2, 0, 0x13>]

Continuing to flatten the Complex devices gives us the following list of 5 simple <device ID, FH> tuples.
Note that device 2 is a Complex device that gets replaced with devices 3 and 4:

    [<3, 0x14>, <4, 0x15>, <1, 0x12>, <3, 0x13>, <4, 0x13>]

The flattened device list specifies the order over which the devices must be striped. It also specifies the filehandle to be used for each stripe unit. Data must be written in increments of the stripe unit size.

Devices may be repeated multiple times within the flattened device list. However, if a dense stripe type is used (described later), the same filehandle MUST NOT be used on the same device for different stripe units of the same file.

A data file stored on a storage device MUST map to a single file as defined by the metadata server; i.e., data from two files as viewed by the metadata server MUST NOT be stored within the same data file on any storage device.

The "stripe_type" field specifies how the data is laid out within the data file on a storage device. It allows for two different data layouts: sparse and dense (or packed). The stripe type determines the calculation that must be made to map the client-visible file offset to the offset within the data file located on the storage device.

The layout hint structure is described in more detail in Section 4.15. It is used by the client, through the FILE_LAYOUT_HINT attribute, to specify the type of layout to be used for a newly created file.

17.1.3. Device Multipathing

The NFSv4 file layout supports multipathing to 'equivalent' devices. Device-level multipathing is primarily of use in the case of a data server failure: it allows the client to switch to another storage device that is exporting the same data stripe, without having to contact the metadata server for a new layout.

To support device multipathing, an array of device IDs is encoded within the SIMPLE case of the nfsv4_file_layout_device4 union. This array represents an ordered list of devices where the first element has the highest priority.
Each device in the list MUST be 'equivalent' to every other device in the list, and each device must be attempted in the order specified.

Equivalent devices MUST export the same system image (e.g., the stateids and filehandles that they use are the same) and must provide the same consistency guarantees. Two equivalent storage devices must also have sufficient connections to the storage, such that writing to one storage device is equivalent to writing to another; this also applies to reading. Also, if multiple copies of the same data exist, reading from one must provide access to all existing copies. As such, it is unlikely that multipathing will provide additional benefit in the case of an I/O error.
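
Returning to the flattening procedure of Section 17.1, the expansion of Complex devices with "dev_index" wrap-around, and the resulting stripe width, can be sketched as follows. This is a minimal Python illustration rather than a definitive client implementation: the `flatten` and `stripe_width` names and the `DeviceTable` shape (None for a Simple device, a list of member IDs for a Complex device) are assumptions, and the 64 KiB stripe unit is an arbitrary sample value. The device set and tuples are the ones from the example in Section 17.1.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical device table: device ID -> None for a Simple device,
# or the list of member Simple device IDs for a Complex device.
DeviceTable = Dict[int, Optional[List[int]]]


def flatten(ordered: Sequence[Tuple[int, int, int]],
            devices: DeviceTable) -> List[Tuple[int, int]]:
    """Expand ordered (dev_id, dev_index, fh) entries into (dev_id, fh) pairs."""
    flat: List[Tuple[int, int]] = []
    for dev_id, dev_index, fh in ordered:
        members = devices[dev_id]
        if members is None:
            # Simple device: dev_index has no meaning and is ignored.
            flat.append((dev_id, fh))
        else:
            # Complex device: start at dev_index and wrap around the end
            # of the list so that every member is consumed exactly once.
            n = len(members)
            for k in range(n):
                flat.append((members[(dev_index + k) % n], fh))
    return flat


def stripe_width(stripe_unit: int, flat: Sequence[Tuple[int, int]]) -> int:
    """Stripe unit size multiplied by the number of flattened entries."""
    return stripe_unit * len(flat)


# Device set and ordered entries from the example in Section 17.1.
devices: DeviceTable = {1: None, 2: [3, 4], 3: None, 4: None}
ordered = [(3, 0, 0x14), (4, 0, 0x15), (1, 0, 0x12), (2, 0, 0x13)]

flat = flatten(ordered, devices)
width = stripe_width(65536, flat)  # 65536 is an assumed stripe unit
```

With these inputs `flat` holds the five <device ID, FH> pairs listed in the example, with filehandle 0x13 carried over to both members of Complex device 2. Changing the last entry's "dev_index" to 1 would emit device 4 before device 3.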