Re: [nvo3] One comment for draft-dt-nvo3-encap-01

Lizhong Jin <lizho.jin@gmail.com> Thu, 25 May 2017 19:58 UTC

In-Reply-To: <CALx6S37PRMTdW=upUXczfkdtUs9PA6Oe112MCRjjCHuC4r_N3A@mail.gmail.com>
References: <CALx6S34nJd9VsGoaS4xuKv0-vR81VcwuDCXrRY034EdSC3k35Q@mail.gmail.com> <DEFD2911-803B-4EBB-9A61-9A2F171A7077@gmail.com> <CALx6S3474SiCypkSTEFdPqSCWH7ms5=y0460E9T2kOccLneTHw@mail.gmail.com> <1D6ECA75-5203-4DA4-896B-DCA9BA385B7A@gmail.com> <CALx6S344vjfcwvjFoz9OqTYTbFDjGoDG6XQw2-7QiG6YCB6wbA@mail.gmail.com> <DEE54A16-9A4D-410E-A303-2153E9A3EADE@gmail.com> <CALx6S37PRMTdW=upUXczfkdtUs9PA6Oe112MCRjjCHuC4r_N3A@mail.gmail.com>
From: Lizhong Jin <lizho.jin@gmail.com>
Date: Fri, 26 May 2017 03:58:35 +0800
Message-ID: <CAH==cJz3WP36_LKw4v-EvA-4g1W7kayFZ_LtfB8j_Pu6-14B-Q@mail.gmail.com>
To: Tom Herbert <tom@herbertland.com>
Cc: "nvo3@ietf.org" <nvo3@ietf.org>
Content-Type: multipart/alternative; boundary="94eb2c142c0c6e110405505ea5fc"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nvo3/RPaoifVaAT0XfEg_e2xGKKIpmps>
Subject: Re: [nvo3] One comment for draft-dt-nvo3-encap-01

Hi Tom,

Sorry for the late reply; I finally got time to read your document.
Yes, you are right about the Linux RFS implementation, where RFS is indexed
by hash value. But that is not the case for NIC hardware-accelerated RFS.
There the flow is indexed not by hash value but by a 5/4/3/2-tuple exact
match, which improves flow-steering performance. As we know, hash values
can collide. You could refer to some NIC datasheets for the details. So if
the NIC cannot parse the inner header, it will fail to provide the same
flow steering it does today.
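
To make the difference above concrete, here is a minimal Python sketch of
the two lookup styles. Everything in it is an assumption for illustration:
the class names, the tiny table size, and the toy hash are hypothetical and
do not describe any particular NIC or the Linux implementation.

```python
from typing import Dict, NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


TABLE_SIZE = 8  # deliberately tiny so a collision is easy to produce


def bucket(flow: FiveTuple) -> int:
    """Toy hash (an assumption, not a real NIC hash): fold the ports."""
    return (flow.src_port ^ flow.dst_port) % TABLE_SIZE


class HashIndexedTable:
    """Steering table indexed by hash bucket; colliding flows share a slot."""

    def __init__(self) -> None:
        self.cpu_by_bucket: Dict[int, int] = {}

    def steer(self, flow: FiveTuple, cpu: int) -> None:
        self.cpu_by_bucket[bucket(flow)] = cpu

    def lookup(self, flow: FiveTuple) -> Optional[int]:
        return self.cpu_by_bucket.get(bucket(flow))


class ExactMatchTable:
    """Steering table keyed by the full tuple; flows never share an entry."""

    def __init__(self) -> None:
        self.cpu_by_flow: Dict[FiveTuple, int] = {}

    def steer(self, flow: FiveTuple, cpu: int) -> None:
        self.cpu_by_flow[flow] = cpu

    def lookup(self, flow: FiveTuple) -> Optional[int]:
        return self.cpu_by_flow.get(flow)


# Two distinct flows that land in the same bucket under the toy hash.
flow_a = FiveTuple("10.0.0.1", "10.0.0.2", 6, 40000, 80)
flow_b = FiveTuple("10.0.0.3", "10.0.0.2", 6, 40008, 80)

hashed = HashIndexedTable()
exact = ExactMatchTable()
for table in (hashed, exact):
    table.steer(flow_a, cpu=1)
    table.steer(flow_b, cpu=3)

# In the hash-indexed table, steering flow_b overwrote flow_a's entry,
# so flow_a now goes to flow_b's CPU. The exact-match table keeps both.
hashed_cpu_for_a = hashed.lookup(flow_a)
exact_cpu_for_a = exact.lookup(flow_a)
```

The exact-match table only works if the tuple can be extracted from the
packet, which for an encapsulated packet means parsing the inner header.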



Regards

Lizhong

On Sun, May 7, 2017 at 12:32 AM, Tom Herbert <tom@herbertland.com> wrote:

> On Sat, May 6, 2017 at 9:15 AM, lizho.jin <lizho.jin@gmail.com> wrote:
> > Tom, see inline below.
> >
> >
> > Regards
> > Lizhong
> >
> > On 05/6/2017 23:45,Tom Herbert<tom@herbertland.com> wrote:
> >
> > On Sat, May 6, 2017 at 8:37 AM, lizho.jin <lizho.jin@gmail.com> wrote:
> >> I am not referring to RSS, but to RFS with HW acceleration. What I
> >> proposed is to use a hash value instead of the 5-tuple to do flow steering.
> >>
> > RFS works as is also. The only requirement for RFS is that the hash is
> > reasonably consistent for a flow. The host should never need to
> > reverse engineer the hash a NIC does.
> >
> > [Lizhong] But the consistency requirement will not always be met. The way
> > the source UDP port is generated is a private design choice. For example,
> > what is the rule for generating the source UDP port for the first fragment
> > of a TCP/UDP packet? Some may use the 5-tuple while some may use the 3-tuple.
> >
> Or they may use the same port all the time and get no entropy at all.
> But all the UDP encapsulation drafts say to set the UDP source port with
> flow entropy, and the reference implementation (Linux) does this
> automatically for such protocols. A UDP source port without flow entropy
> is an implementation edge case that I don't think justifies the
> complexity of solving it in hardware. UDP hashing works today across
> commodity hardware to give us RSS, RPS, and RFS. Note that checksum offload
> is similarly solved in a protocol-agnostic way, so we don't need explicit
> support in NICs for that either.
>
> Please see
> https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf
> for details.
>
> Tom
>
> > And because of hash collisions, many hardware-accelerated RFS
> > implementations do not use the hash to select the CPU core, but use the
> > 5-tuple instead. Meanwhile, some privately designed methods of source UDP
> > port generation use a very small port range, which worsens the hash
> > collisions.
> >
> >
> >
> > Tom
> >
> >> Sorry for the misunderstanding.
> >>
> >>
> >> Regards
> >> Lizhong
> >>
> >> On 05/6/2017 23:24,Tom Herbert<tom@herbertland.com> wrote:
> >>
> >> On Fri, May 5, 2017 at 6:39 PM, lizho.jin <lizho.jin@gmail.com> wrote:
> >>> Tom, thanks for the reply, see inline below.
> >>>
> >>> Regards
> >>> Lizhong
> >>>
> >>> On 05/6/2017 00:14,Tom Herbert<tom@herbertland.com> wrote:
> >>>
> >>> [Lizhong] Total option length will not solve the parser buffer issue.
> >>> The parser buffer is located before the parser, and for Geneve,
> >>> implementing a 512-byte buffer is the only way, since the longest Geneve
> >>> header is 260 bytes. At least in some implementations I know of, the
> >>> hardware first receives 512 bytes of each packet and sends those 512
> >>> bytes to the parser. The parser is then able to skip over options to get
> >>> to the inner payload. Have I misunderstood anything?
> >>>
> >>> [Tom] Skipping headers is useful so that transit devices can find the
> >>> inner headers. The fact that there is no way to skip over an IPv6
> >>> extension header chain to find the transport headers of a packet has
> >>> been a source of unhappiness.
> >>>
> >>>
> >>> [Lizhong] That's correct, and if we do not have any workaround,
> >>> some devices may fail to get the inner header, just as a device fails to
> >>> parse the transport header of an IPv6 packet with too many extension
> >>> headers. Many chips currently have this IPv6 extension header limitation.
> >>>
> >>>
> >>> [Tom] The parser buffer limit applies to all headers a device wishes
> >>> to inspect (some devices still may have less than 512 byte buffers
> >>> also). The best way to deal with this is to minimize the length of
> >>> headers. Geneve TLVs each have four bytes of overhead, so they are less
> >>> compact than other TLVs at a similar layer (IP options, TCP options, and
> >>> IPv6 options each have two bytes of overhead). The tradeoff made here is
> >>> probably to simplify alignment (I really don't see any rationale for
> >>> needing 24 bits to identify options). Bit-fields are still better in
> >>> this regard for being compact since there is no additional overhead
> >>> per each option.
> >>>
> >>>
> >>> [Lizhong] I suspect a 260-byte Geneve header is an overloaded design.
> >>> Since one purpose of having the NIC parse the inner header is to get a
> >>> hash value for flow steering, one way is to define a Geneve TLV, which
> >>> SHOULD be the first one, to carry the hash value of the inner 5-tuple and
> >>> also the hash algorithm. Then the NIC may only need to parse up to the
> >>> first Geneve TLV.
> >>>
> >>> Note that the source UDP port cannot serve that purpose, since that
> >>> port number cannot be predicted by the receiver.
> >>>
> >> Using the entropy in the UDP port number works perfectly well to get
> >> ECMP or RSS for any UDP encapsulation including Geneve, VXLAN, GUE,
> >> etc. If the UDP port number weren't good enough then the IPv6 flow
> >> label can be used (and that works for _any_ protocol not just UDP!).
> >>
> >>
> >> The goal should be to discourage intermediate devices from doing DPI
> >> into transport layer payloads. It requires a bunch of protocol
> >> specific logic and any interpretation may be completely wrong since
> >> port numbers don't have global meaning (e.g. if a device sees a UDP
> >> packet destined to port 6081 in the network, it may or may not be
> >> Geneve).
> >>
> >> Tom
> >>
> >>>
> >>>
> >>>
>