[storm] Review comments on draft-ietf-storm-rdmap-ext

Tom Talpey <ttalpey@microsoft.com> Wed, 10 July 2013 16:10 UTC

Return-Path: <ttalpey@microsoft.com>
X-Original-To: storm@ietfa.amsl.com
Delivered-To: storm@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 7E9A121F9956 for <storm@ietfa.amsl.com>; Wed, 10 Jul 2013 09:10:49 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -103.599
X-Spam-Level:
X-Spam-Status: No, score=-103.599 tagged_above=-999 required=5 tests=[BAYES_00=-2.599, RCVD_IN_DNSWL_LOW=-1, USER_IN_WHITELIST=-100]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Z0+2uRgzZlfi for <storm@ietfa.amsl.com>; Wed, 10 Jul 2013 09:10:43 -0700 (PDT)
Received: from na01-bl2-obe.outbound.protection.outlook.com (mail-bl2lp0206.outbound.protection.outlook.com [207.46.163.206]) by ietfa.amsl.com (Postfix) with ESMTP id 9DB4F21F9B52 for <storm@ietf.org>; Wed, 10 Jul 2013 09:10:42 -0700 (PDT)
Received: from BY2FFO11FD010.protection.gbl (10.1.15.200) by BY2FFO11HUB018.protection.gbl (10.1.14.92) with Microsoft SMTP Server (TLS) id 15.0.717.3; Wed, 10 Jul 2013 16:10:35 +0000
Received: from TK5EX14HUBC102.redmond.corp.microsoft.com (131.107.125.37) by BY2FFO11FD010.mail.protection.outlook.com (10.1.14.74) with Microsoft SMTP Server (TLS) id 15.0.717.3 via Frontend Transport; Wed, 10 Jul 2013 16:10:35 +0000
Received: from TK5EX14MBXC285.redmond.corp.microsoft.com ([169.254.3.149]) by TK5EX14HUBC102.redmond.corp.microsoft.com ([157.54.7.154]) with mapi id 14.03.0136.001; Wed, 10 Jul 2013 16:10:30 +0000
From: Tom Talpey <ttalpey@microsoft.com>
To: "storm@ietf.org" <storm@ietf.org>
Thread-Topic: Review comments on draft-ietf-storm-rdmap-ext
Thread-Index: Ac5865W+3+PDQi76TQWRLH3sDqnoew==
Date: Wed, 10 Jul 2013 16:10:29 +0000
Message-ID: <614F550557B82C44AC27C492ADA391AA148BF8B1@TK5EX14MBXC285.redmond.corp.microsoft.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [157.54.51.33]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Forefront-Antispam-Report: CIP:131.107.125.37; CTRY:US; IPV:CAL; IPV:NLI; EFV:NLI; SFV:NSPM; SFS:(51704005)(199002)(189002)(74662001)(50466002)(81542001)(23726002)(4396001)(47446002)(56816003)(63696002)(50986001)(6806003)(76482001)(76786001)(31966008)(49866001)(47736001)(77982001)(74876001)(44976004)(76796001)(80022001)(81342001)(59766001)(54356001)(79102001)(47976001)(47776003)(53806001)(69226001)(20776003)(33656001)(74366001)(46102001)(65816001)(54316002)(74706001)(56776001)(51856001)(74502001)(16406001)(83072001)(66066001)(46406003)(55846006)(77096001)(76176001); DIR:OUT; SFP:; SCL:1; SRVR:BY2FFO11HUB018; H:TK5EX14HUBC102.redmond.corp.microsoft.com; CLIP:131.107.125.37; RD:InfoDomainNonexistent; A:1; MX:1; LANG:en;
X-OriginatorOrg: microsoft.onmicrosoft.com
X-O365ENT-EOP-Header: Message processed by - O365_ENT: Allow from ranges (Engineering ONLY)
X-Forefront-PRVS: 0903DD1D85
Subject: [storm] Review comments on draft-ietf-storm-rdmap-ext
X-BeenThere: storm@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Storage Maintenance WG <storm.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/storm>, <mailto:storm-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/storm>
List-Post: <mailto:storm@ietf.org>
List-Help: <mailto:storm-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/storm>, <mailto:storm-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 10 Jul 2013 16:10:49 -0000

I've read and reviewed the latest RDMA Extensions draft (draft-ietf-storm-rdmap-ext-04) and have the following comments.

I have two high-level concerns. These extensions to the RDMAP protocol are made without any versioning or capability detection between peers at the RDMAP level. There is a statement deep in section 5 that "discovery" of Atomics needs to be performed by the upper layer or application, and no similar statement is made in section 6 regarding Immediate Data.

This seems problematic, for two reasons. First, it imposes an interoperability burden on upper layers to perform this negotiation. That's perhaps acceptable, but more discussion as to this requirement needs to be made, and in particular some discussion of why it is *not* being performed by the RDMAP protocol in this extension.

Second, there is no discussion of why the chosen set of operations is necessary, or sufficient. The Introduction states that (unspecified) applications are able to use Atomics and Immediate Data on "other" (unspecified) RDMA transports, but there is no discussion of whether the operations are sufficient. And, when comparing the Atomics in this document to, for example, Infiniband, there is an additional operation ("Swap") which is not part of the IB specification. What other operations, or semantics, are possible? Are those left for future work?

In short, I would like to see a succinct Problem Statement, and a discussion of how the extension is to interoperate with existing implementations. Discussion of how the RDMAP protocol may be extended in the future would be good to have, too.


Specific issues on sections follow.

+ Section 1, paragraph 1

>   The RDMA Protocol [RFC5040] provides capabilities for zero copy and
>   kernel bypass data communications.  This document specifies the

There is much more to RDMAP than zero copy, I think. And the concept of "kernel bypass" is mentioned just once in RFC5040, and in the abstract; it seems irrelevant here and should be omitted. Paragraph 3 goes on to say:

>   Other RDMA transport protocols define the functionality added by
>   these extensions leading to differences in RDMA applications and/or

which makes no specific reference for either the protocols or the applications. In other words, what is the detailed problem statement that the extensions are proposed to solve?

It would be preferable to see specific reasons motivating these extensions, to justify them individually.


+ Editorial in section 1, is "following" the right word? Since the Immediate Data is carried by a new message, it seems more like a tinygram-style message in itself.

>   o  Immediate Data messages allow the ULP at the sender to provide a
>      small amount of data following an RDMA Message.


+ Section 3 "Atomic Operation"

>   Atomic Operation - is an operation that results in an execution of a
>   64-bit operation at a specific address on a remote node. The

The "64-bit" qualifier is not necessary for the glossary and can be removed here. Also, isn't it a specific "registered" address?

>   consumer can use Atomic Operations to read, modify and write at the
>   destination address while at the same time guarantee that no other
>   read or write operation will occur across any other RDMAP Streams on
>   an RNIC at the Data Sink.

Generalize the last sentence to "The consumer can use Atomic Operations to perform operations on the contents of the destination address with certain atomicity guarantees." I have other comments on the "across any other RDMAP Streams" later.

Even if you disagree with the prior sentence, the term "Data Sink" should be deleted. Atomics source and sink data at the same peer. In fact, the glossary goes on to define "Requester" and "Responder" just a few lines later.


+ Section 4 first paragraph

>   The control information of RDMA Messages is included in DDP protocol
>   [RFC5041] defined header fields, with the following new formats:

This would appear to propose an extension to RFC5041, as written. This should be stated, or made more clear. Any requirements made of other protocols are critical to capture.


+ Section 4.1, first paragraph

>   The RDMA Messages defined by this specification use all 8 bits of
>   the RDMAP Control Field. The first octet reserved for ULP use in the
>   DDP Protocol MUST be used by the RDMAP to carry the RDMAP Control

This statement appears to have been imported from RFC5040, and therefore is not part of the extension. This should be reworded to make it clear what is being extended and changed.


+ Section 4.1, second paragraph

>    0                   1                   2                   3
>    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>                                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>                                   |T|L| Resrv | DV| RV|Rsv| Opcode|
>   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>   |                     Invalidate STag                           |
>   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
>               Figure 1 DDP Control and RDMAP Control Fields

There are no differences in these fields from RFC5040. What is being extended in them? (Just the Opcodes, right?)


+ Section 4.1, table

> -------+-----------+-------------------------------------------------
> 1011b  | Atomic    | 0     | N/A  | 3     | N/A       | Yes
>        | Response  |       |      |       |           |
> -------+-----------+-------------------------------------------------

The newly-defined use of RDMAP queue #3 is quite surprising. What's the purpose of the assignment? What semantics are being enforced? The document is silent on them, and as noted in later comments, possibly inconsistent.


+ Section 5 third paragraph

>   Atomic operations as specified in this document execute a 64-bit
>   operation at a specified destination address on a remote node. The

It's much more than "specified", it's a _registered_ address, represented by an STag and thereby under the control of the RFC5042 security model. It is critical to state this, and make the reference.

Also, the address has other possible restrictions, with respect to alignment. The protocol should state whether these are required by the protocol itself, or if they are subject to local constraints imposed by the registration. If they are protocol-imposed, then there will be implications for the "offset" fields with regard to byte granularity, and how and where they will be checked. Either way, discussion here is required.

>   operations atomically read, modify and write back the contents of

Are read, modify and write the only actions possible in an atomic? It's not clear how the proposed "swap" operations apply here. The R/M/W meaning should be made clear.

>   the destination address and guarantee that Atomic Operations on this
>   address by other RDMAP Streams on the same RNIC do not occur between
>   the read and the write. 

This seems to be significantly less of an atomicity guarantee than expected. Normally, atomics protect all concurrent modifications to the target location, including those from local access via the CPU at the responder, and other RNICs. Is the scope intentionally narrowed to a single RNIC here? Whatever the guarantee, some statement of RNIC requirements to achieve this is needed, because the proposed protocol itself does not address it.

>   document MAY be implemented. The discovery of whether the Atomic
>   Operations are implemented or not is outside the scope of this
>   specification and it should be handled by the ULPs or applications.

This is the first mention in the document of this requirement, and it needs quite a bit more discussion on exactly how an upper layer is to proceed. Why is the "should" not normative here, and is it a MUST? See earlier comment.


+ Section 5.2 first paragraph

>   The Atomic Operation Request and Response are RDMA Messages.  An
>   Atomic Operation makes use of the DDP Untagged Buffer Model. Atomic
>   Operations use the same Queue Number as RDMA Read Requests (QN=1).

Earlier, atomic *requests* are specified to use QN=1, but atomic *responses* used QN=3. Which is correct? Note, section 5.2.2 bullet #3 also makes this statement.


+ Section 5.2.1 Atomic operation values

> ---------+-----------+----------+----------+---------+---------
>   0011b    |           |
>   to       | Reserved  |            Not Specified
>   1111b    |           |
>   ---------+-----------+-----------------------------------------

... and

>   4. At the Responder, when an invalid Atomic Operation Request
>      Message is delivered to the Remote Peer's RDMAP layer, an error
>      is surfaced.

Are the "Reserved" fields invalid?


+ Section 5.2.2 points #5 and #7

>   5. The RDMAP layer MUST Deliver the Atomic Operation Response
>      Message to the ULP.

>   7. The Responder RDMAP layer MUST pass Atomic Operation Response
>      Messages to the DDP layer, in the order that the Atomic Operation
>      Request Messages were received by the RDMAP layer, at the
>      Responder.

These steps appear out of order. I suggest switching them, so the sending of the response precedes its processing by the receiver.


+ Section 5.3. Atomicity Guarantees

>   Atomicity of the Read-Modify-Write (RMW) on the Responder's node by
>   the Atomic Operation MUST be assured in the presence of concurrent
>   atomic accesses by other RDMAP Streams on the same RNIC. 

Similar comment as earlier - are these the only types of concurrent access guarantees?


+ Section 5.4 Atomic completion rules

>   1. For an Atomic operation, the contents of the Tagged Buffer at the
>      Responder MAY be indeterminate until the Atomic Operation
>      Response Message has been Delivered at the Requester.

I don't understand this statement. The protocol defines no means for the Requester to acknowledge the Delivery of the Atomic Response, so how can this in any way influence the behavior at the Responder?

>   4. Atomic Operation Response Message processing at the Responder
>      MUST be started only after the Atomic Operation Request Message
>      has been Delivered by the DDP layer (thus, all previous RDMA
>      Messages have been properly submitted for ordered Placement).

Again, I don't understand. The DDP protocol uses Queue Numbers to enforce this ordering, and since the Atomic Request uses QN=1, the requirement for "all previous RDMA Messages" is not sufficiently clear. Is this making a requirement that previous RDMA Writes to the Atomic location be Placed (there's no mention of this in section 7)? Or, is the Atomic required to perform some sort of memory barrier against all other access?


+ Section 6. Immediate Data

>   The Immediate Data operation is used in conjunction with an RDMA
>   Operation to improve ULP processing efficiency by allowing 8 bytes
>   of immediate data to be delivered with the completion of the
>   previous operation after the previous operation has been delivered
>   at the Remote Peer.

I definitely do not understand this. The text appears to state that any immediate data is added to some *prior* completion. How can an operation, which has already been delivered, be modified by a later one?


+ Section 6.3 Immediate data handling

>   .  The Remote Peer MUST check that Immediate Data and Immediate Data
>      with SE Messages include exactly 8 bytes of data from the DDP
>      layer.

How does the receiver make this test? There is no length field in the Immediate Data message, it's a fixed 8-byte buffer.


+ Section 7 Ordering table

>----------+------------+-------------+-------------+-------------------
>Atomic    | Atomic     | No Placement| No Placement| Second Atomic
>          |            | Guarantee   | Guarantee   | Response
>          |            | between two | between two | Message will
>          |            | Atomic      | Atomic      | not be
>          |            | Requests    | Responses   | generated
>          |            |             |             | until first
>          |            |             |             | Atomic Response
>          |            |             |             | has been
>          |            |             |             | generated
>----------+------------+-------------+-------------+-------------------

This appears to be incorrect. Earlier, there are specific requirements that the RMW operation be performed prior to replying to a previous Atomic, and in fact all Atomic requests are serialized on a single QN (1). I'm not sure what application semantic would be satisfied by this rather weak guarantee as stated.


+ Section 11.1

[RFC5042] (RDMA security model) is missing. Needed for discussion of Atomics steering tag protection.

[RFC5226] is not a normative reference. Move to 11.2, if desired, but more likely, just delete.


+ Section 11.2

Would like to see references to Infiniband specifications here, as relevant for Atomics and other behavior.