Re: [Moq] Latency @ Twitch

On 11/9/2021 2:58 PM, Luke Curley wrote:
> Maybe a dumb thought, but is the PROBE_RTT phase required when sufficiently
> application limited, as is primarily the case for live video? If I
> understand correctly, it's meant to drain the queue to remeasure the
> minimum RTT, but that doesn't seem necessary when the queue is constantly
> being drained due to a lack of data to send.

The probe RTT mechanism is not engaged if there was at least one
measurement at or below the stated RTT min in the last 10 seconds. So,
yes, if there are "natural silences" there will be a clean RTT min
measurement and no need to probe. But if there is competition with
other data sources, that may well not happen.
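
As a rough illustration, the rule above can be sketched in Python. The
class and method names are hypothetical and not taken from any QUIC
stack, and the sketch ignores what a real BBR implementation does once
the window expires (it would also replace the stale minimum).

```python
from dataclasses import dataclass

PROBE_RTT_INTERVAL = 10.0  # seconds; the window mentioned above


@dataclass
class MinRttTracker:
    """Hypothetical tracker, not the API of any real QUIC stack."""
    min_rtt: float = float("inf")
    min_rtt_stamp: float = 0.0  # time of the last sample at or below min_rtt

    def on_rtt_sample(self, now: float, rtt: float) -> None:
        # A sample at or below the current minimum refreshes the timestamp.
        # This is what "natural silences" provide for free.
        if rtt <= self.min_rtt:
            self.min_rtt = rtt
            self.min_rtt_stamp = now

    def needs_probe_rtt(self, now: float) -> bool:
        # Probe only if no sample matched the minimum during the interval.
        return now - self.min_rtt_stamp > PROBE_RTT_INTERVAL


tracker = MinRttTracker()
tracker.on_rtt_sample(now=0.0, rtt=0.050)  # clean measurement
tracker.on_rtt_sample(now=4.0, rtt=0.080)  # queue building, minimum unchanged
tracker.on_rtt_sample(now=8.0, rtt=0.050)  # a pause in sending drained the queue
print(tracker.needs_probe_rtt(now=15.0))   # 7 s since the last clean sample
print(tracker.needs_probe_rtt(now=20.0))   # 12 s since the last clean sample
```

With competing traffic the 80 ms style of sample would dominate, the
timestamp would never be refreshed, and the probe would be triggered.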

I could trivially provide an API in my QUIC stack to set the Probe RTT
delay to 5 minutes or 1 hour, and look at the consequences. Most of the
time it will probably work just fine, but I can see scenarios in which
not resetting the RTT is actually harmful. For example, what if the RTT
actually increased? So maybe we need to think about something smarter
than either "stop everything every 10 seconds" or "keep the same min RTT
for an hour."

If you are interested, there is quite a bit of literature about the RTT 
min issues and the "latecomer advantage" in the context of LEDBAT. Some 
of the LEDBAT issues apply almost directly to BBR. They may very well 
apply just the same to other CC algorithms based on delays.

> Either way, the issue is that existing TCP algorithms don't care about the
> live video use-case, and those are the ones that have been ported to QUIC
> thus far. But like Justin mentioned, this doesn't actually matter for the
> sake of standardizing a video over QUIC protocol provided the building
> blocks are in place.

Once QUIC developers start being interested in the use case, they will
most probably find ways to tune their stacks and make them usable. It
might be as simple as adding a dozen or so test cases in the validation
suites...

> The real question is: do QUIC ACKs contain enough signal to implement an
> adequate live video congestion control algorithm? If not, how can we
> increase that signal, potentially taking cues from RMCAT (ex. RTT on a
> per-packet basis)?

We could have an interesting debate about whether RTT measurements on a
per packet basis provide more signal or more noise. Since the feedback
loop has a delay of 1 RTT, it is not obvious that performing control
actions more often than 4 or 8 times per RTT brings you all that much.
In any case, QUIC ACKs carry a lot of information: the ACK value, the
delay between arrival of the highest sequence number packet and
departure of the ACK, and the list of packet ranges being acked with
this ACK.
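
A minimal Python sketch of how a sender could turn those three fields
into an RTT sample and a list of acknowledged packets. The `AckFrame`
type and the function names are assumptions made for illustration. The
ACK delay adjustment follows the RTT estimation rule in RFC 9002 as I
understand it, and the sketch omits its other conditions (for example,
that the largest acknowledged packet must be newly acknowledged).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AckFrame:
    """Illustrative stand-in for the fields listed above."""
    largest_acked: int             # the "ACK value"
    ack_delay: float               # seconds the receiver held the ACK
    ranges: List[Tuple[int, int]]  # inclusive (first, last) packet ranges


def rtt_sample(ack: AckFrame, sent_times: Dict[int, float],
               now: float, min_rtt: float) -> float:
    # Raw sample: time since the largest acknowledged packet was sent.
    latest = now - sent_times[ack.largest_acked]
    # Remove the receiver's delay unless that would go below min_rtt.
    if latest >= min_rtt + ack.ack_delay:
        return latest - ack.ack_delay
    return latest


def acked_packets(ack: AckFrame, sent_times: Dict[int, float]) -> List[int]:
    # Expand the ranges and keep only packets we actually have on record.
    return sorted(pn for first, last in ack.ranges
                  for pn in range(first, last + 1) if pn in sent_times)


sent = {1: 0.000, 2: 0.010, 3: 0.020, 5: 0.040}  # packet 4 was never sent here
ack = AckFrame(largest_acked=5, ack_delay=0.005, ranges=[(5, 5), (1, 3)])
print(rtt_sample(ack, sent, now=0.100, min_rtt=0.050))
print(acked_packets(ack, sent))
```

Only the largest acknowledged packet yields an RTT sample here; the
ranges tell the sender what was delivered, not when each packet arrived.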

Standard QUIC acks every two packets. Many QUIC stacks support the "ACK
Frequency" extension, usually to tune down the frequency of ACKs for
performance reasons. But the ACK frequency extension could also be used
to increase the ACK frequency if that makes more sense for the scenario.
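
For a sense of scale, here is a back-of-the-envelope Python calculation
of how many ACKs per RTT the "every two packets" rule produces. The
function name, the 1200-byte packet size, and the example bitrates are
assumptions chosen for illustration, and the estimate ignores the
receiver's delayed-ACK timer.

```python
def acks_per_rtt(bitrate_bps: float, rtt_s: float,
                 packet_bytes: int = 1200, packets_per_ack: int = 2) -> float:
    # Packets sent during one RTT at a steady bitrate.
    packets_per_rtt = bitrate_bps / 8 / packet_bytes * rtt_s
    # One ACK for every `packets_per_ack` packets received.
    return packets_per_rtt / packets_per_ack


for bitrate in (500_000, 3_000_000, 6_000_000):
    n = acks_per_rtt(bitrate, rtt_s=0.050)
    print(f"{bitrate / 1e6:.1f} Mbps, 50 ms RTT: about {n:.1f} ACKs per RTT")
```

Under these assumptions a 3 Mbps stream already lands near 8 ACKs per
RTT, while a low-bitrate stream gets far fewer, which is the case where
asking for more frequent ACKs might help.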

-- Christian Huitema

>
> On Tue, Nov 9, 2021, 10:27 AM Mo Zanaty (mzanaty) <mzanaty=
> 40cisco.com@dmarc.ietf.org> wrote:
>
>> All current QUIC CCs (BBRv1/2, CUBIC, NewReno, etc.) are not well suited
>> for real-time media, even for a rough “envelope” or “circuit-breaker”.
>> RMCAT CCs are explicitly designed for real-time media, but, of course, rely
>> on RTCP feedback, so must be adapted to QUIC feedback.
>>
>>
>>
>> Mo
>>
>>
>>
>>
>>
>> On 11/9/21, 1:13 PM, "Bernard Aboba"<bernard.aboba@gmail.com>  wrote:
>>
>>
>>
>> Justin said:
>>
>>
>>
>> "As others have noted, BBR does not work great out of the box for realtime
>> scenarios."
>>
>>
>>
>> [BA] At the ICCRG meeting on Monday, there was an update on BBR2:
>>
>>
>> https://datatracker.ietf.org/meeting/112/materials/slides-112-iccrg-bbrv2-update-00.pdf
>>
>>
>>
>> While there are some improvements, issues such as "PROBE_RTT" and rapid
>> rampup after loss remain, and overall, it doesn't seem like BBR2 is going
>> to help much with realtime scenarios.  Is that fair?
>>
>>
>>
>> On Tue, Nov 9, 2021 at 12:46 PM Justin Uberti <
>> juberti@alphaexplorationco.com> wrote:
>>
>> Ultimately we found that it wasn't necessary to standardize the CC as long
>> as the behavior needed from the remote side (e.g., feedback messaging)
>> could be standardized.
>>
>>
>>
>> As others have noted, BBR does not work great out of the box for realtime
>> scenarios. The last time this was discussed, the prevailing idea was to
>> allow the QUIC CC to be used as a sort of circuit-breaker, but within that
>> envelope the application could use whatever realtime algorithm it preferred
>> (e.g., goog-cc).
>>
>>
>>
>> On Thu, Nov 4, 2021 at 3:58 AM Piers O'Hanlon<piers.ohanlon@bbc.co.uk>
>> wrote:
>>
>>
>>
>> On 3 Nov 2021, at 21:46, Luke Curley<kixelated@gmail.com>  wrote:
>>
>>
>>
>> Yeah, there's definitely some funky behavior in BBR when application
>> limited but it's nowhere near as bad as Cubic/Reno. With those
>> algorithms you need to burst enough packets to fully utilize the congestion
>> window before it can be grown. With BBR I believe you need to burst just
>> enough to fully utilize the pacer, and even then this condition
>> <https://source.chromium.org/chromium/chromium/src/+/master:net/third_party/quiche/src/quic/core/congestion_control/bbr_sender.cc;l=393>  lets
>> you use application-limited samples if they would increase the send rate.
>>
>>
>>
>> And there’s also the idle cwnd collapse/reset behaviour to consider if
>> you’re sending a number of frames together and their inter-data gap exceeds
>> the RTO - I’m not quite sure how the various QUIC stacks have translated
>> RFC2861/7661 advice on this…?
>>
>>
>>
>> I started with BBR first because it's simpler, but I'm going to try out
>> BBR2 at some point because of the aforementioned PROBE_RTT issue. I don't
>> follow the congestion control space closely enough; are there any notable
>> algorithms that would better fit the live video use-case?
>>
>>
>>
>> I guess Google’s Goog_CC appears to be well used in the WebRTC space (e.g.
>> WEBRTC
>> <https://webrtc.googlesource.com/src/+/refs/heads/main/modules/congestion_controller/goog_cc>
>>   and aiortc
>> <https://github.com/aiortc/aiortc/blob/1a192386b721861f27b0476dae23686f8f9bb2bc/src/aiortc/rate.py#L271>)
>> despite the draft
>> <https://datatracker.ietf.org/doc/html/draft-ietf-rmcat-gcc>  never making
>> it to RFC status… There's also SCREAM
>> <https://datatracker.ietf.org/doc/rfc8298/>  which has an open
>> source implementation<https://github.com/EricssonResearch/scream>  but
>> not sure how widely deployed it is.
>>
>>
>>
>>
>>
>> On Wed, Nov 3, 2021 at 2:12 PM Ian Swett<ianswett@google.com>  wrote:
>>
>> From personal experience, BBR has some issues with application limited
>> behavior, but it is still able to grow the congestion window, at least
>> slightly, so it's likely an improvement over Cubic or Reno.
>>
>>
>>
>> On Wed, Nov 3, 2021 at 4:40 PM Luke Curley<kixelated@gmail.com>  wrote:
>>
>> I think resync points are an interesting idea although we haven't
>> evaluated them. Twitch did push for S-frames in AV1 which will be another
>> option in the future instead of encoding a full IDR frame at these resync
>> boundaries.
>>
>>
>>
>> An issue is you have to make the hard decision to abort the current
>> download and frantically try to pick up the pieces before the buffer
>> depletes. It's a one-way door (maybe your algorithm overreacted) and you're
>> going to be throwing out some media just to redownload it at a lower
>> bitrate.
>>
>>
>>
>> Ideally, you could download segments in parallel without causing
>> contention. The idea is to spend any available bandwidth on the new segment
>> to fix the problem, and any excess bandwidth on the old segment in
>> the event it arrives before the player buffer actually depletes. That's
>> more or less the core concept for what we've built using QUIC, and it's
>> compatible with resync points if we later go down that route.
>>
>>
>>
>>
>>
>> And you're exactly right Piers. The fundamental issue is that a web player
>> lacks the low level timing information required to infer the delivery rate.
>> You would want something like BBR's rate estimation
>> <https://datatracker.ietf.org/doc/html/draft-cheng-iccrg-delivery-rate-estimation>  which
>> inspects the time delta between packets to determine the send rate. That
>> gets really difficult when the OS and browser delay flushing data to the
>> application, be it for performance reasons or due to packet loss (to
>> maintain head-of-line blocking).
>>
>>
>>
>> I did run into CUBIC/Reno not being able to grow the congestion window
>> when frames are sent one at a time (application limited). I don't believe
>> BBR suffers from the same problem though due to the aforementioned rate
>> estimator.
>>
>>
>>
>> On Wed, Nov 3, 2021 at 10:05 AM Ali C. Begen<ali.begen@networked.media>
>> wrote:
>>
>>
>>
>>> On Nov 3, 2021, at 6:50 PM, Piers O'Hanlon<piers.ohanlon@bbc.co.uk>
>> wrote:
>>>
>>>
>>>> On 2 Nov 2021, at 20:39, Ali C. Begen <ali.begen=
>> 40networked.media@dmarc.ietf.org> wrote:
>>>>
>>>>
>>>>> On Nov 2, 2021, at 3:39 AM, Luke Curley<kixelated@gmail.com>  wrote:
>>>>>
>>>>> Hey folks, I wanted to quickly summarize the problems we've run into
>> at Twitch that have led us to QUIC.
>>>>>
>>>>> Twitch is a live one-to-many product. We primarily focus on video
>> quality due to the graphical fidelity of video games. Viewers can
>> participate in a chat room, which the broadcaster reads and can respond to
>> via video. This means that latency is also somewhat important to facilitate
>> this social interaction.
>>>>> A looong time ago we were using RTMP for both ingest and distribution
>> (Flash player). We switched to HLS for distribution to gain the benefit of
>> 3rd party CDNs, at the cost of dramatically increasing latency. A later
>> project lowered the latency of HLS using chunked-transfer delivery, very
>> similar to LL-DASH (and not LL-HLS). We're still using RTMP for
>> contribution.
>>> I guess Apple do also have their BYTERANGE/CTE mode for LL-HLS which is
>> pretty similar to LL-DASH.
>>
>> Yes, Apple can list the parts (chunks in LL-DASH) as byteranges in the
>> playlist but the frequent playlist refresh and part retrieval process is
>> inevitable in LL-HLS, which is one of the main differences from LL-DASH (no
>> need for manifest refresh and request per segment not chunk).
>>
>>>>> To summarize the issues with our current distribution system:
>>>>>
>>>>> 1. HLS suffers from head-of-line blocking.
>>>>> During congestion, the current segment stalls and is delivered slower
>> than the encoded bitrate. The player has no recourse other than to wait for the
>> segment to finish downloading, risking depleting the buffer. It can switch
>> down to a lower rendition at segment boundaries, but these boundaries occur
>> too infrequently (every 2s) to handle sudden congestion. Trying to switch
>> earlier, either by canceling the current segment or downloading the lower
>> rendition in parallel, only exacerbates the issue.
>>> Isn't the HoL limitation more down to the use of HTTP/1.1?
>>>
>>>> DASH has the concept of Resync points that were designed exactly for
>> this purpose (allowing you to emergency downshift in the middle of a
>> segment).
>>> I was curious if there are any studies or experience of how resync
>> points perform in practice?
>>
>> Resync points are pretty fresh out of the oven. dash.js has it in the
>> roadmap but not yet implemented (and we also need to generate test
>> streams). So, there is no data available yet with the real clients. But, I
>> suppose you can imagine how in-segment switching can help in sudden bw
>> drops especially for long segments.
>>
>>>>> 2. HLS has poor "auto" quality (ABR).
>>>>> The player is responsible for choosing the rendition to download. This
>> is a problem when media is delivered frame-by-frame (ie. HTTP
>> chunked-transfer), as we're effectively application-limited by the encoder
>> bitrate. The player can only measure the arrival timestamp of data and does
>> not know when the network can sustain a higher bitrate without just trying
>> it. We hosted an ACM challenge for this issue in particular.
>>> The limitation here may also be down to the lack of access to
>> sufficiently accurate timing information about data arrivals in the browser
>> - unfortunately the Streams API, which provides data from the fetch API,
>> doesn’t directly timestamp the data arrivals so the JS app has to timestamp
>> it which can suffer from noise such as scheduling etc - especially a
>> problem for small/fast data arrivals.
>>
>> Yes, you need to get rid of that noise (see LoL+).
>>
>>> I guess another issue could be that if the system is only sending single
>> frames then the network transport may be operating in application limited
>> mode so the cwnd doesn’t grow sufficiently to take advantage of the
>> available capacity.
>>
>> Unless the video bitrate is too low, this should not be an issue most of
>> the time.
>>
>>>> That exact challenge had three competing solutions, two of which are
>> now part of the official dash.js code. And yes, the player can figure what
>> the network can sustain *without* trying higher bitrate renditions.
>> https://github.com/Dash-Industry-Forum/dash.js/wiki/Low-Latency-streaming
>>>> Or read the paper that even had “twitch” in its title here:
>> https://ieeexplore.ieee.org/document/9429986
>>> There was a recent study that seems to show that none of the current
>> algorithms are that great for low latency, and the two new dash.js ones
>> appear to lead to much higher levels of rebuffering:
>>> https://dl.acm.org/doi/pdf/10.1145/3458305.3478442
>> Brightcove’s paper uses the LoL and L2A algorithms from the challenge
>> where low latency was the primary goal. For Twitch’s own evaluation, I
>> suggest you watch:
>> https://www.youtube.com/watch?v=rcXFVDotpy4
>> We later addressed the rebuffering issue, developed LoL+, which is the
>> version included in dash.js now and explained at the ieeexplore link I gave
>> above.
>>
>> Copying the authors in case they want to add anything for the paper you
>> cited.
>>
>> -acbegen
>>
>>
>>> Piers
>>>
>>>>> I believe this is why LL-HLS opts to burst small chunks of data
>> (sub-segments) at the cost of higher latency.
>>>>>
>>>>> Both of these necessitate a larger player buffer, which increases
>> latency. The contribution system has its own problems, but let me sync up with
>> that team first before I try to enumerate them.
>>>>> --
>>>>> Moq mailing list
>>>>> Moq@ietf.org
>>>>> https://www.ietf.org/mailman/listinfo/moq
>>
>