[quicwg/base-drafts] QUIC PTO is too conservative, causing a measurable regression in tail latency (#3526)

ianswett <notifications@github.com> Mon, 16 March 2020 13:44 UTC


Google and Chrome have been running experiments for a few months with the new IETF QUIC PTO algorithm post-handshake in Google QUIC, and it appears the first PTO is too conservative relative to Google QUIC. On the positive side, the rate of spurious TLPs/PTOs is MUCH lower, so that part is working as intended.

This slows down detection of tail losses and causes a regression in 95th-percentile (and higher) latency metrics. It's also visible as an increase in rebuffers for YouTube. The changes aren't enormous, but they're large enough that I can't launch this as is.

I believe it's possible to make the first PTO slightly earlier while staying aligned with the principles of TCP TLP and RTO. Specifically, my suggestion is to set the first PTO expiry from the left edge (see Section 5 of RFC 6298), not the right edge, with an additional safety measure of waiting 1.5 RTTs since the last ACK-eliciting packet was sent (1.5 is from TLP).
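To make the suggestion concrete, here is a minimal sketch of how the first PTO deadline could be computed under this proposal. It is an illustration, not the Chromium implementation: the function and parameter names are hypothetical, "left edge" is assumed to mean the send time of the oldest unacknowledged ACK-eliciting packet, and the PTO period uses the recovery draft's formula `smoothed_rtt + max(4 * rttvar, granularity) + max_ack_delay`. All times are in milliseconds.

```python
def first_pto_deadline(oldest_unacked_sent_ms,
                       last_ack_eliciting_sent_ms,
                       smoothed_rtt_ms,
                       rttvar_ms,
                       max_ack_delay_ms,
                       granularity_ms=1):
    """Return the time (ms) at which the first PTO would fire.

    Hypothetical helper: names and structure are assumptions made for
    illustration only.
    """
    # PTO period as defined in the QUIC recovery draft.
    pto_ms = smoothed_rtt_ms + max(4 * rttvar_ms, granularity_ms) + max_ack_delay_ms

    # Arm from the left edge: the oldest unacknowledged ACK-eliciting packet.
    left_edge_deadline = oldest_unacked_sent_ms + pto_ms

    # Safety measure: never fire sooner than 1.5 RTTs after the most recent
    # ACK-eliciting packet was sent (the 1.5 factor comes from TCP TLP).
    safety_floor = last_ack_eliciting_sent_ms + 1.5 * smoothed_rtt_ms

    return max(left_edge_deadline, safety_floor)
```

For example, with `smoothed_rtt_ms=100`, `rttvar_ms=10`, and `max_ack_delay_ms=25`, the PTO period is 165 ms. If the oldest unacknowledged packet was sent at t=0 and the last ACK-eliciting packet at t=50, the left-edge deadline is 165 and the safety floor is 200, so the timer fires at 200. A right-edge timer would fire at 215.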

It'll be a few more weeks before we have data from Chrome Stable, but I believe this may be enough to remove the regression.  Chromium code [here](https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/quic_sent_packet_manager.cc?sq=package:chromium&g=0&l=1054).

One other note: we found that sending only one PTO packet seems sufficient, as long as a packet number is skipped so that the delayed ACK timer doesn't come into play.
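A rough sketch of the packet-number skipping idea, assuming a simple monotonically increasing allocator. The class and method names are hypothetical and not taken from any implementation. The assumption is that the receiver sees a gap in packet numbers and acknowledges immediately instead of waiting for its delayed ACK timer.

```python
class PacketNumberAllocator:
    """Hypothetical allocator that skips one packet number before a PTO probe."""

    def __init__(self):
        self._next = 0

    def next_for_data(self):
        # Regular packets use consecutive packet numbers.
        pn = self._next
        self._next += 1
        return pn

    def next_for_probe(self):
        # Leave a one-number gap before the probe so the receiver
        # observes a missing packet number.
        self._next += 1
        return self.next_for_data()
```

With this allocator, two data packets get numbers 0 and 1, a probe sent next gets 3 (2 is skipped), and the following data packet gets 4.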

@mjoras or others may also have production metrics relevant to this issue.

https://github.com/quicwg/base-drafts/issues/3526