Re: [quicwg/base-drafts] QUIC PTO is too conservative, causing a measurable regression in tail latency (#3526)

Martin Thomson <notifications@github.com> Tue, 17 March 2020 03:49 UTC

Date: Mon, 16 Mar 2020 20:49:17 -0700
From: Martin Thomson <notifications@github.com>
Reply-To: quicwg/base-drafts <reply+AFTOJK2RHWAYPJU4Z5EUJZN4PQU33EVBNHHCFNKKLA@reply.github.com>
To: quicwg/base-drafts <base-drafts@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <quicwg/base-drafts/issues/3526/599863594@github.com>
In-Reply-To: <quicwg/base-drafts/issues/3526@github.com>
References: <quicwg/base-drafts/issues/3526@github.com>
Subject: Re: [quicwg/base-drafts] QUIC PTO is too conservative, causing a measurable regression in tail latency (#3526)
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic-issues/iBlPrAN46uwjrNIz5f2qJ2jV0gY>

Ah, I slipped into the same trap as usual.  What we are using is an EWMA approximation of the mean absolute deviation from the mean, not the standard deviation.  The mean absolute deviation is never greater than the standard deviation; for a normal distribution the ratio is ~0.8.
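For anyone who wants to check the ~0.8 figure: for a normal distribution the ratio of mean absolute deviation to standard deviation is sqrt(2/pi).  Below is a minimal sketch that computes the exact value and compares it against simulated samples.  The function names, the seed, and the 100 ms / 10 ms RTT parameters are made up for illustration and are not taken from any measurement.

```python
import math
import random


def mad_over_sigma_normal() -> float:
    """Exact ratio E|X - mu| / sigma for a normal distribution: sqrt(2/pi)."""
    return math.sqrt(2.0 / math.pi)


def empirical_ratio(samples) -> float:
    """Mean absolute deviation from the mean, divided by the population std dev."""
    n = len(samples)
    mean = sum(samples) / n
    mad = sum(abs(x - mean) for x in samples) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    return mad / sigma


# Hypothetical RTT samples in milliseconds, drawn from a normal distribution.
rng = random.Random(1)  # fixed seed so repeated runs draw the same samples
rtts = [rng.gauss(100.0, 10.0) for _ in range(100_000)]

exact = mad_over_sigma_normal()
measured = empirical_ratio(rtts)
print(f"exact ratio {exact:.4f}, measured ratio {measured:.4f}")

# A multiplier k applied to the mean absolute deviation covers roughly
# k * exact standard deviations, if (and only if) RTTs are close to normal.
for k in (2, 3, 4):
    print(f"{k} x MAD is about {k * exact:.2f} sigma")
```

Real RTT distributions are often skewed, so the normal case is only a reference point, which is why data on the actual distribution matters.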

Do you have information on the distribution of RTT values that might help determine whether 2 is the right multiplier?  If you have a normal distribution or something approximating that, then 3 would be closer (3.2 and change even).

I can justify capping PTO, but it requires that you be comfortable with being more aggressive.  But, at least to me, it seems like this is all about more aggressive sending, which naturally improves performance most of the time.  The network is likely to be able to absorb a minutely higher rate of probe packets as a result.

Based on recent history of the RTT estimate, the peer can produce an acknowledgment within sRTT most of the time.  Anything more than that value is included only to allow more time for the peer to respond.  However, if this is the trailing edge of a flight of packets, then any packet in that flight can be acknowledged.  Allowing for some amount of variation in delays relative to the leading edge seems sensible, but delays on the trailing edge do not need any such justification.  The only reason not to use the leading edge alone is that you don't want to send probes when you are sending productive data within the congestion window.

On that basis, probing on the RTT estimate is fine.  That means dead air of one RTT, and maybe some wasted probes because of jitter in the network that extends the effective round trip time.  But you don't need to avoid unnecessary probing with very high confidence; you just need to avoid it most of the time.
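To put rough numbers on the gap between probing at sRTT and probing at sRTT plus a variance term, here is a sketch of the usual estimator with a configurable multiplier.  It assumes the 1/8 and 1/4 EWMA gains and the `srtt + max(k * rttvar, granularity) + max_ack_delay` shape used in the recovery draft, and it leaves out the ack-delay and min_rtt adjustments.  The class and function names, the samples, and the 1 ms granularity are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RttEstimator:
    srtt: float    # smoothed RTT, in milliseconds
    rttvar: float  # EWMA of |srtt - sample|, in milliseconds

    @classmethod
    def from_first_sample(cls, sample: float) -> "RttEstimator":
        # Conventional initialisation: srtt = sample, rttvar = sample / 2.
        return cls(srtt=sample, rttvar=sample / 2)

    def update(self, sample: float) -> None:
        # rttvar is updated first, using the srtt from before this sample.
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
        self.srtt = 0.875 * self.srtt + 0.125 * sample


def pto(est: RttEstimator, multiplier: float,
        max_ack_delay: float = 0.0, granularity: float = 1.0) -> float:
    """Probe timeout with a configurable rttvar multiplier (assumed formula shape)."""
    return est.srtt + max(multiplier * est.rttvar, granularity) + max_ack_delay


# Hypothetical RTT samples in milliseconds.
samples = [100.0, 104.0, 98.0, 110.0, 101.0, 99.0]
est = RttEstimator.from_first_sample(samples[0])
for s in samples[1:]:
    est.update(s)

print(f"srtt          {est.srtt:.1f} ms")
print(f"pto, k = 2    {pto(est, 2):.1f} ms")
print(f"pto, k = 4    {pto(est, 4):.1f} ms")
```

The difference between the printed values is the extra dead air that buys fewer spurious probes.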

The risk with probing sooner is that RTT might genuinely increase, so that probes - which might exceed the congestion window - don't help at all.  That should be rare; we have to treat past performance as a predictor for future gains, and that's the basis of all of these systems.  As long as there is an exponential backoff on PTO, we're unlikely to cause significant congestion, as you only ever send a handful of probes before flipping to persistent congestion.
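The "handful of probes" point can be illustrated with a small sketch of a doubling timer.  It assumes the timer doubles after every expiry and that nothing is acknowledged in the meantime.  `deadline` is a hypothetical stand-in for whenever the sender would declare persistent congestion, not the draft's actual threshold computation, and the millisecond values are invented.

```python
def probe_times(base_pto: float, probes: int) -> list:
    """Cumulative times at which successive probes fire, doubling after each expiry."""
    if base_pto <= 0:
        raise ValueError("base_pto must be positive")
    times = []
    elapsed = 0.0
    interval = base_pto
    for _ in range(probes):
        elapsed += interval
        times.append(elapsed)
        interval *= 2  # exponential backoff
    return times


def probes_before(base_pto: float, deadline: float) -> int:
    """Number of probes that fire strictly before `deadline`."""
    if base_pto <= 0:
        raise ValueError("base_pto must be positive")
    count = 0
    elapsed = 0.0
    interval = base_pto
    while elapsed + interval < deadline:
        elapsed += interval
        count += 1
        interval *= 2
    return count


# Hypothetical values in milliseconds.
print(probe_times(100.0, 4))
print(probes_before(100.0, 1000.0), "probes with a 100 ms base PTO")
print(probes_before(50.0, 1000.0), "probes with a 50 ms base PTO")
```

Because of the doubling, halving the base timer adds only about one extra probe within the same fixed window.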

A shorter and more aggressive timer here also results in hitting persistent congestion earlier.  So if there really is a routing flap that increases RTT significantly - a case where you really want to reset the congestion window anyway - then you will just end up hitting that state sooner.

Another justification for a shorter PTO is that your estimation disadvantages shorter sequences of packets.  A single-packet flight doesn't get probed any sooner under your algorithm, but it is more likely to see loss of the entire flight than a longer sequence of packets is.
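As a rough illustration of why short flights are more exposed: if each packet were lost independently with probability p, the whole flight would be lost with probability p^n.  Independence is a simplifying assumption made only for this sketch, since real loss tends to be bursty.  The function name and the 2% loss rate are hypothetical.

```python
def whole_flight_loss_probability(loss_rate: float, flight_size: int) -> float:
    """Probability that every packet in a flight is lost, assuming independent loss."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if flight_size < 1:
        raise ValueError("flight_size must be at least 1")
    return loss_rate ** flight_size


# Hypothetical 2% per-packet loss rate.
for n in (1, 2, 4, 10):
    print(f"{n:2d} packets: {whole_flight_loss_probability(0.02, n):.3g}")
```

With correlated loss the gap narrows, but a single-packet flight is still the case that depends most on the timer rather than on acknowledgments of neighbouring packets.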

-- 
Reply to this email directly or view it on GitHub:
https://github.com/quicwg/base-drafts/issues/3526#issuecomment-599863594