Re: [iccrg] BBR and acking stagnant windows

Neal Cardwell <ncardwell@google.com> Wed, 20 November 2019 22:37 UTC

MIME-Version: 1.0
References: <CAM4esxS3wSaccYYmCHH6O9k+0FZ-f2fdpMoHG1oDBbKAU5piHw@mail.gmail.com>
In-Reply-To: <CAM4esxS3wSaccYYmCHH6O9k+0FZ-f2fdpMoHG1oDBbKAU5piHw@mail.gmail.com>
From: Neal Cardwell <ncardwell@google.com>
Date: Wed, 20 Nov 2019 17:36:49 -0500
Message-ID: <CADVnQykOaeMTdsRf4=Mj5LYA=ox9fG2x980eXCo4HORjJUf03A@mail.gmail.com>
To: Martin Duke <martin.h.duke@gmail.com>
Cc: iccrg IRTF list <iccrg@irtf.org>, Yuchung Cheng <ycheng@google.com>, Priyaranjan Jha <priyarjha@google.com>, Soheil Hassas Yeganeh <soheil@google.com>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/iccrg/7Gd4hXxS0g4UUQHNpovYA-tALPE>
Subject: Re: [iccrg] BBR and acking stagnant windows

On Mon, Nov 18, 2019 at 11:59 PM Martin Duke <martin.h.duke@gmail.com> wrote:
>
> Neal,
>
> I'd like to follow up on our discussion in Singapore. I have no special knowledge of why Linux doesn't ack when the window doesn't move, but perhaps you can help me reason about it.
>
> From the most abstract principle, it seems sensible that if an endpoint can only process 100kbps, I wouldn't give it more bandwidth than that. That's what the current code does.
>
> However, considering the practical case, what would be the impact of removing this line of code?
> 1) if the connection is almost done, it can complete faster by consuming the relevant rwnd.
> 2) the RTT estimate will be closer to min_rtt, for algorithms that care.
> 3) if not done, we will end up rwnd-limited rather than cwnd-limited. I am less comfortable reasoning about this, but this may lead to burstier traffic and periods of idleness waiting for zero-window probes.
>
> Anyway, you alluded to some particular RPC use cases that cause more serious problems. Can you elaborate?

Hi Martin,

Thanks for following up.

This issue first showed up with an RPC workload with many client connections
sending large RPC requests to a server, which replied with small RPC responses.
Relative to the previous congestion control (CC) algorithm, the bbr2 flows
showed latency increases at the tail of the distribution:

       Metric           % change from baseline to bbr2
       =============    ==============================
       Latency (Avg)      3.5 +/- 0.1
       Latency (50%)     -3.9 +/- 0.1
       Latency (95%)     64.4 +/- 0.3
       Latency (99%)    103.3 +/- 0.8

As best as I can tell from analyzing the packet traces (like the one in the
BBR v2 IETF 106 ICCRG slide deck):

(1) With an immediate ACK for cases where unacked data > RCV.MSS,
    the connection stays pipelined:

  o there are almost always two offload bursts in flight

  o as soon as the receiver receives one of the offload bursts,
    it ACKs immediately

  o as soon as the sender receives that ACK, it sends the next data burst

(2) With the ACK delayed until the next application read(), the connection
    almost always stops pipelined behavior and degenerates into "stop and wait"
    behavior:

  o there is typically just one offload burst (the whole cwnd) in flight

  o when the receiver receives one of the data bursts,
    it waits until the app reads the data before ACKing

  o the data arrives in bursts, so often it takes the app a while
    to get around to reading a particular connection

  o while waiting for the app to read the data, and then for the
    ACK to reach the data sender, and then for the next chunk of
    data to be sent, the network link may go idle
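
As a rough illustration of cases (1) and (2), here is a toy ACK-clocked timing
model. It is only a sketch and is not derived from the packet traces: the
function name, the fixed per-burst read delay, and all numbers are assumptions
made for illustration, and it ignores serialization time, pacing, and loss.

```python
def last_arrival_time(num_bursts, window, rtt, read_delay, ack_on_read):
    """Time at which the final burst reaches the receiver (toy model).

    num_bursts  -- total bursts to deliver (must be >= 1)
    window      -- bursts the sender may have in flight at once
    rtt         -- network round-trip time; each direction takes rtt / 2
    read_delay  -- hypothetical time the app takes to read an arrived burst
    ack_on_read -- True: ACK is held until the app reads; False: immediate ACK
    """
    one_way = rtt / 2
    ack_hold = read_delay if ack_on_read else 0.0
    send_times = []
    for i in range(num_bursts):
        if i < window:
            # The initial window is sent at time zero.
            send_times.append(0.0)
        else:
            # Burst i is released by the ACK for burst i - window.
            earlier = send_times[i - window]
            send_times.append(earlier + one_way + ack_hold + one_way)
    return send_times[-1] + one_way


# Illustrative numbers only (milliseconds): 10 ms RTT, 20 ms app read delay.
pipelined = last_arrival_time(10, window=2, rtt=10.0, read_delay=20.0,
                              ack_on_read=False)
stop_and_wait = last_arrival_time(10, window=1, rtt=10.0, read_delay=20.0,
                                  ack_on_read=True)
print(pipelined, stop_and_wait)
```

With these made-up inputs the pipelined case delivers the last burst at 45 ms
and the read-gated case at 275 ms; the gap comes entirely from the sender
waiting on the application's read timing before it can send again.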

AFAICT, by waiting for the application to read the data out of the buffer
before generating the ack, the receive buffer loses its ability to be a good,
well, buffer. It is no longer a "buffer" to deal with temporary rate mismatches
between network and receiving application. Instead, the behavior has tightly
coupled the timing of the network sending process to the details of the timing
of the application reading process. And the reading process may be delayed or
bursty, due to many threads servicing a bursty client workload and having to
get around to reading from the many sockets.

We saw similar performance problems with the previous TCP CC when we tried
operating without faster ACKs and instead relying on the upstream Linux "ack if
new rwin >= old rwin" policy.

I agree it is possible that the receiver delays feed into a reduction in
pacing rate, as I believe Martin mentioned as a possibility. If so, then this
effect is apparently happening both for BBR-style pacing and for the pacing in
the base Linux TCP stack, which is a function of: pacing_rate = k * cwnd / srtt.
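
To make the possible pacing effect concrete, here is a minimal sketch of the
formula above. The value k = 1.2, the cwnd, and the delays are illustrative
assumptions, not measurements from this workload; the point is only that any
ACK hold time that lands in srtt lowers the computed rate proportionally.

```python
def pacing_rate(cwnd_bytes, srtt_seconds, k=1.2):
    """pacing_rate = k * cwnd / srtt, in bytes per second.

    k is an illustrative scaling factor chosen for this sketch.
    """
    if srtt_seconds <= 0:
        raise ValueError("srtt must be positive")
    return k * cwnd_bytes / srtt_seconds


cwnd = 100_000  # bytes, hypothetical

# srtt when ACKs are sent promptly: just the 10 ms network RTT (assumed).
prompt_acks = pacing_rate(cwnd, srtt_seconds=0.010)

# srtt when each ACK is held an extra 20 ms waiting for the app to read.
held_acks = pacing_rate(cwnd, srtt_seconds=0.010 + 0.020)

print(f"prompt ACKs: {prompt_acks:,.0f} B/s, held ACKs: {held_acks:,.0f} B/s")
```

Under these assumed numbers, tripling srtt cuts the pacing rate to about a
third, for the same cwnd.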

Whatever the exact set of dynamics leading to the performance problem,
the problem is not limited to BBRv2.

thanks,
neal