Re: [tsvwg] RDMA Support by UDP FRAG Option

Tom Herbert <tom@herbertland.com> Sat, 19 June 2021 17:11 UTC

From: Tom Herbert <tom@herbertland.com>
Date: Sat, 19 Jun 2021 10:11:29 -0700
Message-ID: <CALx6S34DUrUBYd94LPPg4Hgh0FnZYZjZ4eKEYuaxb-7zbzb=pQ@mail.gmail.com>
To: "C. M. Heard" <heard@pobox.com>
Cc: Gorry Fairhurst <gorry@erg.abdn.ac.uk>, Joseph Touch <touch@strayalpha.com>, TSVWG <tsvwg@ietf.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/ICNmQrOt2xprx9Ksgoi3Hw_F9DE>
Subject: Re: [tsvwg] RDMA Support by UDP FRAG Option

On Fri, Jun 18, 2021 at 1:18 PM C. M. Heard <heard@pobox.com> wrote:
>
> On Fri, Jun 18, 2021 at 2:36 AM Gorry Fairhurst wrote:
>>
>> On 17/06/2021 18:30, Tom Herbert wrote:
>> > On Thu, Jun 17, 2021 at 7:41 AM Gorry Fairhurst wrote:
>> >> When [the] dust has settled, I expect we should see an updated
>> >> fragmentation text.  I suspect we should consider the final format for
>> >> suitability for offload (perhaps Tom Herbert would look?) and for ease
>> >> of processing in message buffers (Tom Jones offered to look).
>> >
>> > Please look at draft-herbert-udp-space-hdr-01 for my proposal as to
>> > how the UDP surplus space should be formatted. If there is interest, I
>> > could update the draft to include considerations for UDP options
>> > fragmentation and reassembly offload as well as header/data split
>> > which is needed for zero-copy receive (i.e. packet headers are DMA'ed
>> > into one set of buffers to be processed by the stack, payload is
>> > DMA'ed to another set of buffers to be processed by the application).
>>
>>
>> Thanks, I saw that a while ago, thanks for reminding people.
>
>
> Indeed. I wanted to undertake a detailed analysis of the effects of the primary competing formats on checksum offload, and I was looking around for the treatment in Appendix A of draft-herbert-udp-space-hdr-01, especially the discussion of how common offload hardware works, when the reminder came in.
>
> Since draft-herbert-udp-space-hdr-01 does not take into account the OCS pseudo-header (which is one 16-bit word consisting of the surplus area length), I'm going to do the analysis from scratch in order to account for this point. My main conclusions are that for the purposes of checksum offload, IT DOESN'T MATTER whether OCS is at a fixed position in an option header or in a TLV. The differences are very minor.
>
> What follows is a fairly detailed comparison of how the offload logic differs between the format proposed in draft-ietf-tsvwg-udp-options-12 and that in draft-herbert-udp-space-hdr-01. Those who do not care to pick through the entire analysis should skip down to the CONCLUSIONS section below.
>
> Tom's draft describes typical hardware support for transmit checksums as follows:
>
>    In generic checksum offload, for each packet the host indicates to
>    the device the starting offset where the checksum calculation begins
>    and the offset of the field to write the resultant checksum. The
>    extent of the checksum coverage is assumed to be the end of the
>    packet. In particular, this means that even if the UDP checksum is
>    being offloaded, the UDP surplus space is included in the device's
>    computation.
>
>
> As draft-herbert-udp-space-hdr-01 notes, there are (in general) two transmit checksums to be computed: the UDP checksum and the OCS. The idea is to use the hardware support to compute whichever checksum covers more data and relegate the other to the host CPU.
>
> When the packet is a standard UDP datagram with an options trailer in the surplus area, OCS is typically the one that covers fewer bytes, so the host CPU prepares and formats the option trailer either as described in draft-ietf-tsvwg-udp-options-12 or as in draft-herbert-udp-space-hdr-01 and computes the OCS as specified in draft-ietf-tsvwg-udp-options-12 (i.e., including the 1-word pseudo-header). Then the host CPU calculates the 1s complement sum of a modified UDP pseudo-header, in which the length is the IP payload length (not the UDP length), and puts this in the UDP checksum field. Finally, the host instructs the offload engine to calculate the Internet checksum over the entire IP payload and store the complement in the UDP checksum field. Clearly, nothing in the offload logic depends on the details of how the trailer is formatted, only that the 1s complement sum of all 16-bit words therein adds up to the 1s complement of the trailer length.
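
A minimal Python sketch of the transmit procedure quoted above. It assumes IPv4, an even-length payload (so the trailer starts on a 16-bit boundary), and a hypothetical trailer layout of a 2-byte OCS field followed by opaque option bytes; that layout, the addresses, the ports, and the helper names are illustrative and not taken from either draft. The simulated device result is compared with a UDP checksum computed in software the standard way.

```python
import struct

def ones_sum(data: bytes, start: int = 0) -> int:
    """Ones' complement sum of 16-bit big-endian words, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"
    total = start
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

SRC, DST = bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2])  # example addresses

def pseudo_sum(length: int) -> int:
    """Sum of an IPv4 UDP pseudo-header carrying the given length."""
    return ones_sum(SRC + DST + struct.pack("!BBH", 0, 17, length))

data = b"hello!"
udp_len = 8 + len(data)

# Host CPU: compute OCS over the trailer plus its length (the 1-word pseudo-header).
trailer = bytearray(b"\x00\x00\xaa\xbb\xcc\xdd")   # hypothetical layout
ocs = ~ones_sum(bytes(trailer), len(trailer)) & 0xFFFF
trailer[0:2] = struct.pack("!H", ocs)

# Host CPU: seed the UDP checksum field with the modified pseudo-header sum,
# which uses the IP payload length instead of the UDP length.
ip_payload_len = udp_len + len(trailer)
header = struct.pack("!HHHH", 4000, 5000, udp_len, pseudo_sum(ip_payload_len))
packet = bytearray(header + data + bytes(trailer))

# Simulated device: checksum the entire IP payload and store the complement.
offloaded = ~ones_sum(bytes(packet)) & 0xFFFF
packet[6:8] = struct.pack("!H", offloaded)

# Reference: standard UDP checksum (UDP length in the pseudo-header, no trailer).
plain_header = struct.pack("!HHHH", 4000, 5000, udp_len, 0)
software = ~ones_sum(plain_header + data, pseudo_sum(udp_len)) & 0xFFFF

print(hex(offloaded), hex(software))
```

The two values agree because the trailer's words sum to the complement of the trailer length, which cancels the extra length added to the pseudo-header.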
>
Mike,

There is another serious problem with transport checksums and use of
the UDP surplus area. Most of this discussion has presumed that the
UDP checksum is the one in the packet being offloaded, but that may
not be the case. Consider the case where a sender sends a TCP packet
that is encapsulated GRE/UDP or VXLAN (a very common use case in
virtual networks where VMs send packets on their virtual networks).
The stack will attempt to offload the innermost checksum which is the
TCP checksum. The TCP checksum is the one offloaded regardless of
whether or not the outer UDP checksum is zero (if it's non-zero then
the stack would set it using local checksum offload (LCO)). The
offloaded TCP checksum computation would run from the first byte of
the TCP header through the end of the whole packet. So unless the
surplus area is properly checksummed, the computed TCP checksum will
be invalid and the packet will be dropped at the receiver. This is
not just a problem for offload; I believe that this wouldn't work
properly in existing software stacks without some major changes.

So to make all uses of transport checksum computation and offload
reasonably robust, when the UDP surplus area is being used, both the
UDP checksum and the checksum over the surplus area MUST always be set.
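
A minimal Python sketch of this failure mode, under stated assumptions: the outer headers are omitted, the TCP header is a dummy 20-byte block, the pseudo-header sum is a made-up constant, and the "checksummed" surplus area is modeled simply as option bytes followed by a 16-bit value that makes the area sum to zero in ones' complement arithmetic. The OCS in draft-ietf-tsvwg-udp-options-12 also folds in a length pseudo-header, which this sketch does not model. All names are illustrative.

```python
def ones_sum(data: bytes, start: int = 0) -> int:
    """Ones' complement sum of 16-bit big-endian words, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"
    total = start
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

TCP_CSUM_OFFSET = 16   # offset of the checksum field within the TCP header
PSEUDO_SUM = 0x1234    # made-up pseudo-header sum, seeded by the host

def device_offload(packet: bytearray, csum_start: int, csum_offset: int) -> None:
    """Generic offload: checksum from csum_start through the END of the packet."""
    c = ~ones_sum(bytes(packet[csum_start:])) & 0xFFFF
    pos = csum_start + csum_offset
    packet[pos:pos + 2] = c.to_bytes(2, "big")

def inner_tcp_valid(surplus: bytes) -> bool:
    """Offload the TCP checksum with `surplus` appended, then verify as a receiver."""
    segment = bytearray(20) + b"payload!"            # dummy header + data
    segment[TCP_CSUM_OFFSET:TCP_CSUM_OFFSET + 2] = PSEUDO_SUM.to_bytes(2, "big")
    packet = segment + surplus                       # surplus follows the segment
    device_offload(packet, 0, TCP_CSUM_OFFSET)
    # The receiving TCP checks only the segment itself, not the surplus area.
    return ones_sum(bytes(packet[:len(segment)]), PSEUDO_SUM) == 0xFFFF

options = b"\x01\x01\x05\x06"                        # opaque example bytes
compensated = options + (~ones_sum(options) & 0xFFFF).to_bytes(2, "big")

print(inner_tcp_valid(options))      # False: surplus area not checksummed
print(inner_tcp_valid(compensated))  # True: surplus area sums to zero
```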

FYI, here is some nice background on checksum offload:
https://www.kernel.org/doc/html/latest/networking/checksum-offloads.html

Tom

> When the packet is a UDP fragment (including a degenerate one that is both an initial and terminal fragment), the host CPU calculates the UDP header checksum in the standard way (including using the UDP length of 8 in the pseudo-header). It then formats the trailer area either as described in draft-ietf-tsvwg-udp-options-12 or as in draft-herbert-udp-space-hdr-01 and populates the OCS field with the surplus area length (i.e., IP payload length - 8). Note that in the case when OCS is a TLV (draft-ietf-tsvwg-udp-options-12) the issue of alignment may arise: the OCS pseudo-header needs to be byte-swapped if the OCS checksum field is on an odd byte boundary. This can be avoided by judicious use of NOPs. The host CPU then instructs the offload engine to calculate the Internet checksum over the surplus area and store the complement in the OCS checksum field. Some offload hardware may not be able to store at an arbitrary byte boundary, but any hardware alignment requirements can again be accommodated by judicious use of NOPs. In this case some details of the offload logic are dependent on the surplus area format, but it is clear, I think, that either of the competing formats (draft-ietf-tsvwg-udp-options-12 or draft-herbert-udp-space-hdr-01) can be accommodated.
>
> It's easy to accommodate optional UDP checksum (UDP CS=0) and (conditionally) optional OCS: just omit the unneeded checksum computation(s). Conditionally optional OCS could be accommodated in the format proposed in draft-herbert-udp-space-hdr-01 by allowing OCS (therein called USH checksum) equal to zero to indicate absence.
>
> The draft describes typical hardware support for receive checksums as follows:
>
>    In the most generic form of receive checksum offload, a device
>    performs a running checksum calculation across a packet as it is
>    received. That is, it performs a running ones complement addition
>    over two byte words as they are received. The device then provides
>    the computed value, referred to as the "checksum complete" value, to
>    the host in the meta data (receive descriptor) for the packet. The
>    host can use this value to verify one or more packet checksums
>    contained in the packet.
>
>
> In order to use the raw checksum provided in this way, the host CPU needs to do the following:
>
> 1.) Compensate for any lower layer headers that may have been included in the raw checksum. This could include the IPv4/IPv6 header and any IPv6 extension headers that the NIC does not strip off. I'm not going to dwell further on this point because the effects are the same under the different approaches to UDP options.
>
> 2.) Determine whether the received packet is a standard UDP packet with an options trailer and also whether the UDP checksum is present or absent, and if the UDP checksum is absent, whether OCS is also absent. Determining whether the packet is a standard UDP packet with an options trailer or a UDP fragment (possibly a degenerate one that is both an initial and terminal fragment) is easy: just check to see if UDP Length = 8. Determining whether the UDP checksum is present is easy: just check to see if UDP checksum = 0. Determining whether OCS is present is a bit more difficult with the TLV format in draft-ietf-tsvwg-udp-options-12, but not much so, PROVIDED THAT OCS IS ALWAYS THE FIRST OPTION, apart from any NOPs that may be present for alignment.
>
> Let's consider first the case where both checksums are present. The first step for the host CPU is to compute the ones complement sum of a pseudo-header that is similar to the standard UDP pseudo-header but uses the IP payload length in place of the UDP length. When that is added to the raw checksum coming from the offload engine (following compensation for lower layer headers) the result will be zero if both the UDP checksum and the OCS are correct or both have errors that happen to cancel (see draft-fairhurst-udp-options-cco-00). So if this check passes, the host CPU needs to separately verify either the UDP checksum or the OCS in order to ensure that both are valid.
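
A minimal Python sketch of the combined receive check described above, assuming IPv4, an even-length payload, and a hypothetical trailer layout (a 2-byte OCS field followed by opaque bytes) that is not taken from either draft. The NIC's "checksum complete" value is simulated in software, and the addresses, ports, and helper names are illustrative. The corrupted packet shows two errors that cancel in the combined check, which is why one checksum still has to be verified separately.

```python
import struct

def ones_sum(data: bytes, start: int = 0) -> int:
    """Ones' complement sum of 16-bit big-endian words, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"
    total = start
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

SRC, DST = bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2])  # example addresses

def pseudo_sum(length: int) -> int:
    """Sum of an IPv4 UDP pseudo-header carrying the given length."""
    return ones_sum(SRC + DST + struct.pack("!BBH", 0, 17, length))

def build_ip_payload() -> bytes:
    """UDP header + data + trailer, with both UDP checksum and OCS filled in."""
    data = b"hello!"
    trailer = bytearray(b"\x00\x00\xaa\xbb\xcc\xdd")   # hypothetical layout
    ocs = ~ones_sum(bytes(trailer), len(trailer)) & 0xFFFF
    trailer[0:2] = struct.pack("!H", ocs)
    udp_len = 8 + len(data)
    header = bytearray(struct.pack("!HHHH", 4000, 5000, udp_len, 0))
    udp_cs = ~ones_sum(bytes(header) + data, pseudo_sum(udp_len)) & 0xFFFF
    header[6:8] = struct.pack("!H", udp_cs)
    return bytes(header) + data + bytes(trailer)

def combined_check(ip_payload: bytes) -> bool:
    """Raw sum over the IP payload plus a pseudo-header using the IP payload length."""
    raw = ones_sum(ip_payload)   # stands in for the NIC's checksum-complete value
    return ones_sum(b"", raw + pseudo_sum(len(ip_payload))) == 0xFFFF

def ocs_check(ip_payload: bytes) -> bool:
    """Separate OCS verification over the surplus area only."""
    udp_len = struct.unpack("!H", ip_payload[4:6])[0]
    trailer = ip_payload[udp_len:]
    return ones_sum(trailer, len(trailer)) == 0xFFFF

good = build_ip_payload()
bad = bytearray(good)
bad[13] += 1          # corrupt the last data byte ...
bad[19] -= 1          # ... and a trailer byte, so the two errors cancel
bad = bytes(bad)

print(combined_check(good), ocs_check(good))
print(combined_check(bad), ocs_check(bad))
```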
>
> When the packet is a standard UDP datagram with an options trailer in the surplus area, OCS is usually the cheaper one to verify, and the calculation required to do so is very little different whether the trailer uses the TLV format in draft-ietf-tsvwg-udp-options-12 or the format in draft-herbert-udp-space-hdr-01. Assertions to the contrary notwithstanding, IT IS NOT NECESSARY TO PARSE the OCS TLV in order to perform the OCS validation. One need only compute the Internet checksum over the trailer area using the trailer length as the pseudo-header, same as for the trailer format described in draft-herbert-udp-space-hdr-01.
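
A minimal Python sketch of the point that OCS can be validated without parsing options: the receiver sums the whole trailer and adds the trailer length, wherever the checksum field happens to sit. The two trailer layouts and the kind values below are hypothetical stand-ins for a TLV-style and a fixed-position format, not the encodings from either draft, and the checksum field is assumed to be 16-bit aligned.

```python
def ones_sum(data: bytes, start: int = 0) -> int:
    """Ones' complement sum of 16-bit big-endian words, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"
    total = start
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def fill_ocs(trailer: bytearray, pos: int) -> bytes:
    """Sender: write OCS at even offset `pos`; the pseudo-header is the trailer length."""
    trailer[pos:pos + 2] = b"\x00\x00"
    ocs = ~ones_sum(bytes(trailer), len(trailer)) & 0xFFFF
    trailer[pos:pos + 2] = ocs.to_bytes(2, "big")
    return bytes(trailer)

def ocs_valid(trailer: bytes) -> bool:
    """Receiver: no option parsing, just sum the whole area plus its length."""
    return ones_sum(trailer, len(trailer)) == 0xFFFF

NOP, OCS_KIND = 0x01, 0x02   # hypothetical kind values

# TLV-style: a NOP for alignment, a kind byte, then the checksum at offset 2.
tlv_style = fill_ocs(bytearray([NOP, OCS_KIND, 0, 0, 0xAA, 0xBB]), pos=2)
# Fixed-position style: the checksum is the first 16-bit word of the trailer.
fixed_style = fill_ocs(bytearray([0, 0, 0xAA, 0xBB, 0xCC, 0xDD]), pos=0)

print(ocs_valid(tlv_style), ocs_valid(fixed_style))
```

Only the sender needs to know where the checksum field is; the same `ocs_valid` call covers both layouts.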
>
> When the packet is a UDP fragment (including a degenerate one that is both an initial and terminal fragment), the host CPU validates the UDP header checksum in the standard way (including using the UDP length of 8 in the pseudo-header). There is no difference in the two formats for this.
>
> If UDP CS=0 and OCS is present, then the host CPU needs to validate OCS only. When the packet is a standard UDP datagram with an options trailer in the surplus area, the cheapest thing to do is likely just to calculate OCS directly. When the packet is a UDP fragment, then the host CPU can profitably use the output of the offload engine by adding the 1's complement of the 1's complement checksum of the UDP header and then adding the OCS pseudo-header. In both cases the amount of work is identical (or nearly so) for both formats.
>
> CONCLUSIONS:
>
> 1.) For checksum offload, the packet formats in draft-ietf-tsvwg-udp-options-12 and draft-herbert-udp-space-hdr-01 differ only at the edges. For the former it is necessary to impose a constraint on the placement of the OCS TLV in order for a receiver to safely determine whether it is present in the case where UDP CS=0. For the latter this placement constraint is automatic, but on the other hand there is now an alignment requirement when the options are in a trailer, which is something that the receiver needs to check. To me, that's six of one and half a dozen of the other, a wash. I do concede that there is a question of what happens if an OCS TLV shows up later in the packet when UDP CS = 0. My sense is that the receiver should just discard the option in that case and not check OCS.
>
> 2.) In both cases the receive logic would be cleaner if we reduced this to two cases: UDP CS<>0 and OCS is required (as in draft-ietf-tsvwg-udp-options-12) or UDP CS=0 and OCS is not required (and ignored if present). Then there would be no need to constrain the placement of the OCS TLV (since, as I point out above, it's always possible to check it without parsing the option TLVs). The alignment annoyance with the herbert-udp-space-hdr-01 trailer format remains. I personally have trouble seeing a lot of value in UDP CS=0 with OCS present, no matter how hard I squint; but if the WG really wants it, I'm OK to leave it in.
>
> 3.) If both checksums are always present whenever either is, a substantive simplification suggests itself, namely, not to worry about offsetting errors in UDP CS and OCS. That weakens the check, but not by a lot, and it streamlines both implementations that use offload and those that do not. Again, though, this is just a suggestion; it is not a point on which I feel strongly.
>
> Gorry also wrote:
>>
>> I think it would be good for the WG to focus on how to finish
>> draft-ietf-tsvwg-udp-options, but I do see some opportunities to use
>> some of these ideas for making the fragment header - because that also
>> places all data in the "option".
>
>
> I agree that we need to focus on getting this deliverable finished. I'm rapidly coming to the conclusion that to some extent at least -- and certainly for checksum offload -- the disagreements that Joe, Tom, and I have about the packet formats amount largely to bike-shedding. I'm going to shut up for a while, review the 50+ messages in the "A review of draft-ietf-tsvwg-udp-options-12" thread, and come back with some substantive suggestions for convergence.
>
> Mike Heard