Re: [nfsv4] Review of draft-ietf-nfsv4-flex-files-08 (part two of three)

Thomas Haynes <loghyr@primarydata.com> Mon, 17 July 2017 21:08 UTC

From: Thomas Haynes <loghyr@primarydata.com>
To: Dave Noveck <davenoveck@gmail.com>
CC: Thomas Haynes <loghyr@primarydata.com>, Benny Halevy <bhalevy@gmail.com>, "nfsv4@ietf.org" <nfsv4@ietf.org>
Date: Mon, 17 Jul 2017 21:08:00 +0000
Message-ID: <D9A4169C-8E61-4F84-AF42-B9D9C76596F8@primarydata.com>
References: <CADaq8jePBxsJxBwV-KkPdNjGJdBGwDsgxesayOuOF6k=O3u9Gw@mail.gmail.com>
In-Reply-To: <CADaq8jePBxsJxBwV-KkPdNjGJdBGwDsgxesayOuOF6k=O3u9Gw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/i4CIa8D1EkEcl1ZvofNJulUr4YM>

On May 21, 2016, at 4:00 PM, David Noveck <davenoveck@gmail.com<mailto:davenoveck@gmail.com>> wrote:

Review Structure

This email is the second part of a three-part review.

Note that the overall comments are contained in the first part of this review.  These contain:

  *   Background of Review
  *   General Evaluation
  *   Issues Blocking Working Group Last Call
  *   Other Noteworthy Issues

Per-section Comments (From Section 2.3 through Section 5.1.1)

2.3. State and Locking Models:

This section consists of two parts:


  *   The first part describes a locking model which I presume is the locking model that applies in the loose coupling case.
  *   The second part, the last two paragraphs, describes how certain features of the environment govern which locking model is to be selected.

The problem with this structure is that the second part should be at the start and you would then be in a position to describe each of the locking models.  I think the better structure would be to start with what are now the two final paragraphs and then have subsections that describe the two locking models.

There are a number of editorial issues in the last two paragraphs:

  *   In the last sentence of the last paragraph, "described in [RFC5661]" is wrong since there is no protocol described there.
  *   Using "NFSv4" to mean "NFSv4.0" is a likely source of confusion.
  *   In many cases, mention of NFSv3 is missing.

I propose rewriting the current final two paragraphs as follows:

The choice of locking models is governed by the following rules:


  *   Data storage devices implementing the NFSv3 and NFSv4.0 protocols are always treated as loosely coupled.
  *   NFSv4.1+ storage devices that do not return the EXCHGID4_FLAG_USE_PNFS_DS flag in response to EXCHANGE_ID are indicating that they are to be treated as loosely coupled. As such, they are treated, from the locking viewpoint, in the same way as NFSv4.0 storage devices.
  *   NFSv4.1+ storage devices that do identify themselves by returning the EXCHGID4_FLAG_USE_PNFS_DS flag in response to EXCHANGE_ID are considered strongly coupled. They will use a back-end control protocol as provided for in [RFC5661] to implement the global stateid model as defined there.
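The three rules above can be read as a simple selection function. The sketch below is purely illustrative (the StorageDevice type and the way the EXCHANGE_ID flag is represented are my own assumptions, not anything from the draft):

```python
from dataclasses import dataclass

LOOSE, TIGHT = "loose", "tight"

@dataclass
class StorageDevice:
    # nfs_version: (major, minor), e.g. (3, 0), (4, 0), (4, 1)
    nfs_version: tuple
    # Whether EXCHGID4_FLAG_USE_PNFS_DS came back in the EXCHANGE_ID reply
    use_pnfs_ds_flag: bool = False

def coupling_model(ds: StorageDevice) -> str:
    major, minor = ds.nfs_version
    # Rule 1: NFSv3 and NFSv4.0 data servers are always loosely coupled.
    if major == 3 or (major == 4 and minor == 0):
        return LOOSE
    # Rules 2 and 3: for NFSv4.1+, the EXCHANGE_ID reply flag decides.
    return TIGHT if ds.use_pnfs_ds_flag else LOOSE
```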

With regard to the tight coupling case, I presume that the appropriate locking model is the one described in Chapter 13 of [RFC5661], but I think there should be some discussion of what exactly this means in practice and of how the new/different features of the mapping type interact with the locking model.

Now to go back to the first paragraphs, the second sentence of the second paragraph is wrong and needs to be changed as it contradicts what is written about stateids in 5.1.  ff_layout4.  Based on my discussion with Tom, I am assuming that anonymous stateids will be used to do IO in the loosely coupled case.



I don’t see anyone doing an NFSv4.0 storage device.

And due to the issue with ffm_stateid, either the anonymous stateid is used or we restrict the size of ffds_fh_vers to be 1.


Once that issue is resolved, there needs to be some discussion of how the fact that all IO will be stateid-anonymous will be dealt with.  I am going to be assuming that it will be in this section, rather than in 5.1.  ff_layout4.

With regard to mandatory byte-range locking, we need an explicit statement that this is not (i.e. cannot be) supported with loose coupling.



Agreed
With regard to mandatory locking due to share reservations one doesn't have the option of simply not supporting the functionality. The spec will have to clearly explain how it is to be done. Some likely elements:


  *   In the case in which each of the clients with a particular file opened has the same IO rights, the MDS has to ensure via layout recalls (and potentially indicating that layouts are unavailable) that a client which has no owner allowed a particular form of IO holds no layouts that allow that form of IO to be done. (It may already say that, but it probably needs clarification.)
  *   In the case in which a particular client has multiple owners with different levels of IO rights, the spec either has to ask the pNFS client to do the enforcement itself, or it has to provide that layouts are to be unavailable to this client and require the client to perform the IO via the MDS.

Once that is addressed, we have to face the fundamental problem with this section. It has to do with the stateids that are returned to clients, rather than the ones that appear (or don't) in layouts.

From what is written there now, it is hard to determine what is actually intended.  A lot of confusion results from the multiple and uncertain meanings of the preposition "against".

In the first sentence, the phrase "against the metadata server" simply indicates that the operations in question are directed to the metadata server.  As this paragraph, unlike the following one, applies to both loose and tight coupling, it should stay where it is.  I suggest redrafting it as follows:

Clients always perform locking-related operations by interacting with the metadata server.  These include operations related to open files (OPEN, OPEN_DOWNGRADE, and CLOSE), byte-range locking (LOCK, LOCKT, and LOCKU), delegation management (DELEGRETURN), and stateid management (TEST_STATEID and FREE_STATEID).  Delegation recall is effected by the metadata server sending a callback to the client.

In all cases, the stateids that result from executing these operations are returned by the metadata server to the client, and the client uses these stateids in subsequent locking-related operations.  The means by which these stateids are maintained and the handling of IO operations differ with the coupling strength in effect for the connection.

The existing second paragraph is not clear but, for a number of reasons, I don't believe that it is a good basis for an eventual subsection describing the loose-coupling locking model.


  *   Although the introductory sentences mention OPEN, LOCK, and DELEGATION, the rest of the discussion focuses on opens, leaving it very unclear how byte-range locks and delegations will/should/might be dealt with. I think this is primarily an editorial problem, although there are potential interactions with fundamental technical choices regarding NFSv4.x.
  *   When mirroring and/or striping is in effect, doing an open "against" the data files will result in multiple stateids.
  *   In the loose-coupling case, the three NFS protocols are treated as essentially the same, despite their very real differences. This is, in part, an editorial problem, but it appears to me that once the editorial problems are addressed, one could face significant technical issues, See below for details.
At this point I can't figure out the locking models that are actually intended but, as a way of continuing the discussion, I have drafted some descriptions below of something that I believe is workable in this context.  Although I may not be right in my guesses about how this will work, it seems to me that the items that are mentioned have to be addressed somehow to clearly describe a locking model.

Here is what I've come up with for Section 2.3.1.  Loose-coupling Locking Model:

When locking-related operations are requested, they are primarily dealt with by the metadata server, which generates the appropriate stateids.  When an NFSv4 version is used as the data access protocol, the metadata server may make stateid-related requests of the data storage devices.  However, it is not required to do so, and the resulting stateids are known only to the metadata server and the data storage device.

Given this basic structure, locking-related operations are handled as follows:


  *   OPENs are dealt with primarily on the metadata server.  Stateids are selected by the metadata server and associated with the client id describing the client's connection to the metadata server.  The metadata server may need to interact with the data storage device to locate the file to be opened, but no locking-related functionality need be used on the data storage device.

OPEN_DOWNGRADE and CLOSE only require local execution on the metadata server.

  *   Advisory byte-range locks can be implemented locally on the metadata server.  As in the case of OPENs, the stateids associated with byte-range locks are assigned by the metadata server and only used on the metadata server.

For reasons explained below, mandatory byte-range locks are not supported when loose coupling is in effect.

  *   Delegations are assigned by the metadata server, which initiates recalls when conflicting OPENs are processed.  No data storage device involvement is required.
  *   TEST_STATEID and FREE_STATEID are processed locally on the metadata server, without data storage device involvement.
All IO operations to the data storage device are done using the anonymous stateid.  Thus, the data storage device has no information about the openowner and lockowner responsible for issuing a particular IO operation.  As a result,


  *   Mandatory byte-range locking cannot be supported because the data storage device has no way of distinguishing IOs done on behalf of the lock owner from those done by others.
  *   Enforcement of share reservations is the responsibility of the client.  Even though IO is done using the anonymous stateid, the client must ensure that it has a valid stateid associated with the openowner that allows the IO being done, before issuing the IO.
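The share-reservation check in the last bullet might be sketched as follows; the function and the shape of the client's open-state table are illustrative assumptions of mine, though the share-access bit values are the ones NFSv4 uses:

```python
# Illustrative sketch: with loose coupling all IO uses the anonymous stateid,
# so before issuing IO the client itself checks that some open stateid it
# holds for the file grants the needed access.
OPEN4_SHARE_ACCESS_READ = 0x00000001   # bit values as in NFSv4
OPEN4_SHARE_ACCESS_WRITE = 0x00000002

def client_may_do_io(held_opens, is_write):
    """held_opens: list of (stateid, share_access) pairs for this client's
    openowners on the file; returns True if the IO may be issued."""
    need = OPEN4_SHARE_ACCESS_WRITE if is_write else OPEN4_SHARE_ACCESS_READ
    return any(access & need for _stateid, access in held_opens)
```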

In the event that a stateid is revoked, the metadata server is responsible for preventing client access, since the metadata server has no way of being sure that the client is aware that the stateid in question has been revoked.

As the client never receives a stateid generated by the data storage device, there is no client lease on the data storage device and no prospect of lease expiration, even when NFSv4 protocols are used to access the data storage device.  Clients will have leases on the metadata server, which are subject to expiration.  In dealing with lease expiration, the metadata server may need to use fencing to prevent revoked stateids from being relied upon by a client unaware of the fact that they have been revoked.



Adapted

Here is what I've come up with for Section 2.3.2.  Tight-coupling Locking Model:

When locking-related operations are requested, they are primarily dealt with by the metadata server, which generates the appropriate stateids.  These stateids must be made known to the data storage device using control protocol facilities, the details of which are not discussed in this document.

Given this basic structure, locking-related operations are handled as follows:


  *   OPENs are dealt with primarily on the metadata server.  Stateids are selected by the metadata server and associated with the client id describing the client's connection to the metadata server.  The metadata server needs to interact with the data storage device to locate the file to be opened, and to make the data storage device aware of the association between the metadata-server-chosen stateid and the client and openowner that it represents.

OPEN_DOWNGRADE and CLOSE are executed initially on the metadata server but the state change made must be propagated to the data storage device.

  *   Advisory byte-range locks can be implemented locally on the metadata server.  As in the case of OPENs, the stateids associated with byte-range locks are assigned by the metadata server and are available for use on the metadata server.  Because IO operations are allowed to present lock stateids, the metadata server needs the ability to make the data storage device aware of the association between the metadata-server-chosen stateid and the corresponding open stateid it is associated with.

  *   Mandatory byte-range locks can be supported when both the metadata server and the data storage devices have the appropriate support.  As in the case of advisory byte-range locks, these are assigned by the metadata server and are available for use on the metadata server.  To enable mandatory lock enforcement on the data storage device, the metadata server needs the ability to make the data storage device aware of the association between the metadata-server-chosen stateid and the client, openowner, and lock (i.e., lockowner, byte-range, lock-type) that it represents.  Because IO operations are allowed to present lock stateids, this information needs to be propagated to all data storage devices to which IO might be directed, rather than only to the data storage devices that contain the locked region.

  *   Delegations are assigned by the metadata server, which initiates recalls when conflicting OPENs are processed.  Because IO operations are allowed to present delegation stateids, the metadata server requires the ability to make the data storage device aware of the association between the metadata-server-chosen stateid and the filehandle and delegation type it represents, and to break such an association.
  *   TEST_STATEID is processed locally on the metadata server, without data storage device involvement.
  *   FREE_STATEID is processed on the metadata server but the metadata server requires the ability to propagate the request to the corresponding data storage devices.
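To make the propagation requirements above concrete, here is a rough sketch. The control-protocol verbs (register_stateid, invalidate_stateid) are hypothetical names of my own; the actual control protocol is deliberately left unspecified:

```python
# Illustrative sketch of MDS-to-DS stateid propagation under tight coupling.
class DataStorageDevice:
    def __init__(self):
        self.known_stateids = {}   # stateid -> association (client/owner info)

    # Hypothetical control-protocol verbs:
    def register_stateid(self, stateid, association):
        self.known_stateids[stateid] = association

    def invalidate_stateid(self, stateid):
        self.known_stateids.pop(stateid, None)

class MetadataServer:
    def __init__(self, storage_devices):
        self.dses = storage_devices

    def open(self, stateid, client_id, openowner):
        # OPEN: make the DSes aware of the stateid-to-openowner association
        # (for simplicity this sketch pushes to every DS).
        for ds in self.dses:
            ds.register_stateid(stateid, (client_id, openowner))

    def free_stateid(self, stateid):
        # FREE_STATEID: executed on the MDS, then propagated to the DSes.
        for ds in self.dses:
            ds.invalidate_stateid(stateid)
```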

Because the client will possess and use stateids valid on the data storage device, there will be a client lease on the data storage device and the possibility of lease expiration does exist.  The best approach for the data storage device is to retain these locks as a courtesy.  However, if it does not do so, control protocol facilities need to provide the means to synchronize lock state between the metadata server and data storage device.

Clients will also have leases on the metadata server, which are subject to expiration.  In dealing with lease expiration, the metadata server would be expected to use control protocol facilities enabling it to invalidate revoked stateids on the data storage device.  In the event the data storage device is not responsive, the metadata server may need to use fencing to prevent revoked stateids from being acted upon by the data storage device.


The last sentence does not make sense. If the storage device is not responsive, then it is not going to be able to react to a revoked stateid. Unless you mean that the metadata server changes the stateid that the client has, then it subsequently uses that stateid in communicating to the storage device. In that case, though, this would not be fencing as it is not the client being denied access, it is the storage device.

I’m going to read this as “In the event the client is not responsive, …”


As a result of describing the tight-coupling locking model in parallel with the loose-coupling locking model, I've come to the conclusion that the phrase "global stateid model", while a useful and compact summary, has made the function of the control protocol seem more difficult/mysterious than it needs to be.  Since the goal is to make it clear what is needed to implement flex-files, including the tight-coupling option, I think it would be helpful if the flex-files spec retained the additional detail that appears above.

Now I'm getting beyond the scope of a review of the flex-files spec but I'd like to note that the flex-files layout work has already made it clear that a large part of control protocol functionality is already present in the NFSv4 base protocol. Perhaps an NFSv4.x extension could be defined to provide the remainder and be usable for both the RFC5661-specified files layout and the flex-files layout with tight coupling.  Perhaps this could be discussed in Berlin?

4.1.  ff_device_addr4:

In the third non-CODE paragraph, I suggest the following changes, primarily to reflect the fact that pNFS client use of layouts is never mandatory:


Either the client accesses the file through non-pNFS access to the metadata server or it uses pNFS to access the file through the storage device. Since we are specifying the use of ff_device_addr4, the client has already made the decision to use pNFS and hence MUST is correct versus MAY ONLY.

  *   In the penultimate sentence, suggest replacing "MUST access the storage device" by "MAY ONLY access the storage device directly".
  *   In the last sentence, suggest replacing "MUST access the storage device using NFSv4" by "MAY ONLY access the storage device directly using the corresponding minor version of NFSv4".

Tom believes that the two suggestions above imply that the client can use an unsupported protocol version.

Where do I state that belief? I am quite firm in stating only specific protocol versions can be used.

  I disagree.  This issue needs to be resolved.


There is no issue.

5.  Flexible File Layout type
"type" needs to be capitalized in the title.  This is a new issue introduced by a change in -08.

5.1. ff_layout4:

There are two remaining issues in this section that were in the -06.


  *   The contradiction between this section and Section 2.3.  State and Locking Models.
  *   The fact that there is no use for ffds_stateid, since the anonymous stateid is used in the loose-coupling case and a globally valid one is used in the tight coupling case.

In addition, the new FF_FLAGS_NO_IO_THROUGH_MDS in -07 raises some issues that need to be addressed:


  *   First of all, it isn't clear that "SHOULD" is actually intended/appropriate. According to RFC2119, this means "that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course". In particular, the text does not give one a basis to understand the implications of choosing to do IO using the MDS, when this flag is present.  Perhaps "should" is more appropriate here?

I believe the definition of SHOULD fits the use here. The implementation of the metadata server has decided for some reason of its own that it wants the client to route data requests directly to the storage device and unless the client implementation is aware of those reasons, it SHOULD really, really avoid sending it to the metadata server.

It isn’t a MUST because it is a hint.
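That reading of the flag as an advisory SHOULD might be sketched like this (the flag's bit value and the decision function are illustrative assumptions, not taken from the draft):

```python
# Illustrative sketch: the flag is a hint (SHOULD), so the client normally
# sends IO straight to the storage device, but a client that has understood
# and weighed the implications may still proxy through the MDS.
FF_FLAGS_NO_IO_THRU_MDS = 0x2   # bit value assumed for illustration

def choose_io_target(layout_flags, has_reason_to_override=False):
    if layout_flags & FF_FLAGS_NO_IO_THRU_MDS and not has_reason_to_override:
        return "storage-device"
    return "client-choice"   # DS or MDS, at the client's discretion
```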



  *   The statement "even if a storage device is partitioned from the client, the client SHOULD not try to proxy the IO through the metadata server" raises additional issues. I assume that partitioning might happen after the layout in question is recalled and is part of the revocation process for the layout in question. Thus this flag seems to be giving directions regarding metadata-directed IO after the layout in question no longer applies. ????

partitioned in the sense of “network partition” - i.e., happening before the recall.

Made this a bit clearer in draft 10.



  *   Given that base NFSv4 IO does not require use of layouts, it isn't clear that the client would actually use layouts and, even if it did, it would not require one for areas to which it is doing IO directed at the metadata server.  Because of this, a client might not see the RECOMMENDATION/recommendation before doing the IO being warned against.

I don’t want to point out everywhere that pNFS need not be used. By asking for a LAYOUT in the first place, the client is doing pNFS. Yes, it may decide against using pNFS once it sees that the metadata server is advocating FF_FLAGS_NO_IO_THRU_MDS, but the use of non-pNFS here is not interesting.


Although how this might be dealt with is going to depend on the resolution of the should-vs.-SHOULD question mentioned above, I'm concerned that someone contributing to this specification, not necessarily one of the authors, is assuming a level of metadata server direction with regard to client IO that is inconsistent with the pNFS model.  Within pNFS, a client's ability to do IO to the metadata server is defined by the base NFSv4.1 semantics,  while the layout type may impose, using layouts, any restrictions it wants for IO through the data storage devices.


I don’t follow how this differs from what I have presented. The text states that the flex files layout type will act in a certain way; it makes no requirements on the more general pNFS model.


5.1.1.  Error codes from LAYOUTGET

I'm doubtful about the use of "SHOULD" in the cases for NFS4ERR_{LAYOUTTRYLATER, DELAY}.  It seems to me that the author is telling me that, when the client has a layout, it is either desirable or undesirable for me to continue to use it.  But there is no basis given for considering this an interoperability issue, or for letting the reader understand the consequences of taking the choice considered undesirable.  I think these "SHOULD"s should be "should"s.

We seem to disagree upon SHOULD/should.

To me, SHOULD means I’d really love to make this a MUST, but there exists enough prior art to prevent that.

Okay, in thinking about it, the use of FF_FLAGS_NO_IO_THROUGH_MDS is a SHOULD and here it is a should. I.e., the new flag is to allow for the server to define specific behavior for flex files, but these are “interpretations”.