Re: [tsvwg] links to Canary methods for roll-out of new transport features

"Gorry (erg)" <gorry@erg.abdn.ac.uk> Fri, 30 July 2021 09:13 UTC

From: "Gorry (erg)" <gorry@erg.abdn.ac.uk>
Date: Fri, 30 Jul 2021 10:12:36 +0100
Message-Id: <232F9BFA-0D05-48C5-807E-FA2A7904754A@erg.abdn.ac.uk>
References: <AF731D2C-B796-4B20-973D-6DB496DB1228@akamai.com>
Cc: tsvwg@ietf.org
In-Reply-To: <AF731D2C-B796-4B20-973D-6DB496DB1228@akamai.com>
To: "Holland, Jake" <jholland=40akamai.com@dmarc.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/PJKCr6J6n0tZwFOhz0cRaD-qS3g>

I am not sure, though. I think many CC-related topics can have potential collateral damage, and we have managed to deploy these gradually and improve the transport where necessary. So, what is different here compared to exploring methods such as a larger Initial Window, BBR, HyStart, etc.?

Gorry

> On 30 Jul 2021, at 03:02, Holland, Jake <jholland=40akamai.com@dmarc.ietf.org> wrote:
> 
> Hi Gorry,
> 
> A canary approach works very well for features that exhibit problems
> on the connections that are using a new feature.
> 
> However, it works less well for features that cause problems with
> competing traffic, as is the case for TCP Prague flows traversing
> a classic queue.
> 
> The issue is that if the new feature (such as a TCP Prague
> endpoint) is causing a problem for competing flows, possibly flows
> from a different operator, there is no indication of the
> problem that's visible to the TCP Prague sender except perhaps
> service calls when the issue can be traced back to the source by
> someone troubleshooting the impacted services.  From the experimental
> endpoint's point of view, the performance is working great.  (Higher
> throughput than usual!  Not much loss!)
> 
> So it's somewhat more challenging to build a canary that can detect
> problems with competing impacted traffic.  This is in contrast to the
> usage of canaries for most services, which will alert when the
> experimental service or upgrade introduces a regression for that
> service.
> 
> This problem is closely related to the difficulty of detecting a
> classic queue, and in most cases would suffer the same pathologies.
> 
> This is covered to some extent in the first resource you linked,
> under "Requirements of a Canary Process", and again under "Metrics
> Should Indicate Problems".
> 
> I hope that's helpful.
> 
> Best regards,
> Jake
> 
> 
> On 07-26, 3:59 PM, "Gorry Fairhurst" <gorry@erg.abdn.ac.uk> wrote:
> 
> So, there's been a change in the way people roll out new features, that
> maybe we could say more about in the L4S OPS draft. What I write below
> is not specific to L4S, and I'd really welcome others familiar with using
> and evaluating such methods to chime in and say more, but anyway here is
> a starter:
> 
> Canarying is a partial and time-limited deployment of a change in a 
> service/protocol and its evaluation as a part of the deployment. The 
> method is used throughout the roll-out and helps to decide whether or 
> not to continue with the rollout. The part of the service that receives 
> the change is “the canary,” and the remainder of the service is “the 
> control.” The canary deployment is performed on a small subset of the 
> network/users, smaller than the control. Canarying is evaluated as an A/B testing
> process, to check the impact of the (initial) deployment.
> 
> See these from Google:
> 
> https://sre.google/workbook/canarying-releases/
> 
> https://developer.android.com/distribute/best-practices/launch/test-tracks 
> 
> When working with QUIC, people have released an update to only a
> small subset of the user base, monitored stability or another metric
> of interest, and then decided whether to roll out the update to more
> users, to wait for more data to come in, or to halt the rollout
> altogether, for example if one of the monitored metrics is off, or if
> user reviews show issues or complaints on a specific topic. You don't
> need to enable a feature for any user/network that you might expect to
> be hurt.
> 
> ECN isn't just "automatically" used; the app (or at least the
> app supplier) can decide, and this will always be the case for QUIC
> anyway. The results of these tests provide the sort of data that has
> informed QUIC (e.g. Chrome Canary), and I expect they are the basis of
> what is reported by Google and others in MAPRG. The point is that this
> allows statistical testing without massive impact, and an incremental
> roll-out.
> 
> This says something about Akamai's use:
> 
> https://www.akamai.com/uk/en/products/performance/cloudlets/phased-release.jsp 
> 
> Cloudflare, etc. have used similar approaches:
> 
> https://medium.com/boozt-tech/canary-release-with-cloudflare-workers-84a9b45bac0f 
> 
> 
> Gorry
> 
> 
>
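As a rough illustration of the canary-versus-control comparison described in the thread, and of the point that a canary judged only on its own flows may not notice harm to competing traffic, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names (Cohort, relative_change, decide), the thresholds, and the numbers are hypothetical and are not taken from the messages above or from any real deployment.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Cohort:
    """Per-connection samples of one metric (here: goodput in Mbit/s)."""
    name: str
    samples: list[float]


def relative_change(treated: Cohort, baseline: Cohort) -> float:
    """Fractional change of the treated mean relative to the baseline mean."""
    base = mean(baseline.samples)
    return (mean(treated.samples) - base) / base


def decide(change: float, n_samples: int,
           min_samples: int = 100, max_regression: float = 0.05) -> str:
    """Pick the next rollout step: 'wait', 'halt' or 'expand'.

    The thresholds are hypothetical defaults, not recommended values.
    """
    if n_samples < min_samples:
        return "wait"    # not enough data to judge yet
    if change < -max_regression:
        return "halt"    # metric regressed beyond the tolerated margin
    return "expand"      # no regression seen, widen the rollout


# Invented numbers, for illustration only.
control = Cohort("control", [10.0] * 200)
canary = Cohort("canary", [12.0] * 200)  # flows using the new feature

# Competing flows sharing a bottleneck with the canary flows; the canary
# operator would only see these if it had a way to measure them.
competing_before = Cohort("competing, before rollout", [10.0] * 200)
competing_during = Cohort("competing, during rollout", [6.0] * 200)

own_view = decide(relative_change(canary, control), len(canary.samples))
shared_view = decide(relative_change(competing_during, competing_before),
                     len(competing_during.samples))

print("canary's own metric says:", own_view)
print("competing-traffic metric says:", shared_view)
```

With these made-up inputs, the canary's own metric looks healthy while the competing-traffic metric calls for a halt. The sketch assumes the competing flows can be measured at all, which is the hard part raised in the thread.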
