Re: [rtcweb] WebRTC and Real-time Translation
Justin Uberti <juberti@google.com> Thu, 04 October 2018 23:29 UTC
From: Justin Uberti <juberti@google.com>
Date: Thu, 04 Oct 2018 16:29:06 -0700
Message-ID: <CAOJ7v-0enaeWGRdnoTOOD3cetq-Eq+mEoKVKMA2WtN40vdmBrA@mail.gmail.com>
To: adamsobieski@hotmail.com
Cc: Bernard Aboba <bernard.aboba@gmail.com>, Ted Hardie <ted.ietf@gmail.com>, RTCWeb IETF <rtcweb@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtcweb/zzrQ4zt_CyOcn0YZEiL-PvL33-g>
Subject: Re: [rtcweb] WebRTC and Real-time Translation
I think much of what you mention will be possible with the new APIs discussed in https://w3c.github.io/webrtc-nv-use-cases/ (and probably can already be done to some extent with Web Audio). However, I'm not sure that a new transport is needed for this particular use case; it seems mostly satisfied by on-device processing capabilities.

On Thu, Oct 4, 2018 at 10:25 AM Adam Sobieski <adamsobieski@hotmail.com> wrote:

> A New Transport Protocol?
>
> The QUIC API [1] shows how to extend the WebRTC specification to enable the use of a new transport protocol.
>
> “This specification extends the WebRTC specification [WEBRTC] to enable the use of QUIC [QUIC-TRANSPORT] to exchange arbitrary data with remote peers using NAT-traversal technologies such as ICE, STUN, and TURN. Since QUIC can be multiplexed on the same port as RTP, RTCP, DTLS, STUN and TURN, this specification is compatible with all the functionality defined in [WEBRTC], including communication using audio/video media and SCTP data channels.”
>
> It could be that, for both real-time translation and video processing scenarios utilizing combinations of local components and remote services, a new transport protocol or a new version of an existing transport protocol is needed. Such a transport protocol could facilitate transmitting and routing certain streams (or copies of certain streams) outside of envelopes between two or more peers, such that those streams rejoin other streams in envelopes on remote peers.
>
> With such a transport protocol, one could specify, prepare and activate *processing graphs* for one or more audio, video or data streams. A processing graph could be such that a stream passes through speech recognition, translation and speech synthesis components or services between two or more peers. A processing graph could be such that a stream passes through a video processing component or service between two or more peers.
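As a rough illustration of the processing-graph idea above, a linear chain of stages can be modeled as composed async functions. The stage names and string payloads below are placeholders invented for this sketch, not a proposed API; a real graph would operate on media streams, not strings.

```javascript
// Sketch of a linear processing graph: each stage transforms a unit of
// data and hands the result to the next stage. Stages could be local
// components or remote services; here they are pure placeholders.
function makePipeline(stages) {
  return async (input) => {
    let value = input;
    for (const stage of stages) {
      value = await stage(value);
    }
    return value;
  };
}

// Placeholder stages standing in for real components/services.
const recognize  = async (audio) => `text(${audio})`;
const translate  = async (text)  => `fr:${text}`;
const synthesize = async (text)  => `audio(${text})`;

const translateSpeech = makePipeline([recognize, translate, synthesize]);
```

For example, `translateSpeech("frame0")` resolves to `"audio(fr:text(frame0))"`, mirroring a stream passing through recognition, translation and synthesis in turn.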
> Hopefully, I’ve indicated how tractable it is to add real-time translation and video processing to a next version of WebRTC. I’m confident that we can solve any remaining technical details in the upcoming years.
>
> What do you think about this approach: a solution including a new transport protocol, or a new version of a transport protocol, to provide real-time translation and real-time video processing utilizing interconnected local components and remote services?
>
> References
>
> [1] https://w3c.github.io/webrtc-quic/
>
> Best regards,
> Adam Sobieski
>
> ------------------------------
> *From:* rtcweb <rtcweb-bounces@ietf.org> on behalf of Adam Sobieski <adamsobieski@hotmail.com>
> *Sent:* Wednesday, October 3, 2018 6:14:51 PM
> *To:* Bernard Aboba; ted.ietf@gmail.com
> *Cc:* RTCWeb IETF
> *Subject:* Re: [rtcweb] WebRTC and Real-time Translation
>
> RTP Media API
>
> https://www.w3.org/TR/webrtc/#rtp-media-api
>
> “The RTP media API lets a web application send and receive MediaStreamTracks over a peer-to-peer connection. Tracks, when added to an RTCPeerConnection, result in signaling; when this signaling is forwarded to a remote peer, it causes corresponding tracks to be created on the remote side.”
>
> “The actual encoding and transmission of MediaStreamTracks is managed through objects called RTCRtpSenders. Similarly, the reception and decoding of MediaStreamTracks is managed through objects called RTCRtpReceivers. Each RTCRtpSender is associated with at most one track, and each track to be received is associated with exactly one RTCRtpReceiver.”
>
> Envisioned for real-time translation scenarios is that audio tracks – or copies of audio tracks – can be routed through one or more local components and remote services such that resultant output can be either sent to a remote side or multicast to multiple other peers.
> In particular, for scenarios which utilize remote services, audio tracks to be translated may travel outside of the envelopes for other tracks. Translated content should rejoin other tracks on the remote side for synchronized presentation or processing.
>
> Real-time audio-to-audio translation is one scenario. Another scenario is real-time audio-to-subtitles translation, where the results of real-time translation are desired to arrive as a subtitles track. A third scenario is where translation results are desired to arrive as data, for example to appear on-screen per the formatting and layout of a web application. The output from one or more interconnected components and services which perform real-time translation could then include: (1) audio, (2) subtitles, (3) data.
>
> Use Case: Funny Hats
>
> https://w3c.github.io/webrtc-nv-use-cases/#funnyhats
>
> The capability of routing one or more tracks through one or more local components and remote services also facilitates scenarios resembling those discussed in the Funny Hats use case.
>
> Differences include: (1) funny hats scenarios utilize video tracks and (2) real-time translation scenarios, while possible to present as singular services, may include a daisy-chaining or pipelining of a number of local components or remote services: speech recognition, translation and speech synthesis.
>
> As with real-time translation, we can envision free as well as priced video processing services.
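For the audio-to-subtitles scenario described above, one concrete serialization for a subtitles track is WebVTT. Below is a minimal formatter; the input segment shape (`{start, end, text}`, times in seconds) is an assumption invented for this sketch, standing in for whatever a translation service would actually emit.

```javascript
// Format translated segments as WebVTT cues so the results of real-time
// translation can arrive as a subtitles track.
function toTimestamp(seconds) {
  const h = String(Math.floor(seconds / 3600)).padStart(2, "0");
  const m = String(Math.floor((seconds % 3600) / 60)).padStart(2, "0");
  const s = (seconds % 60).toFixed(3).padStart(6, "0");
  return `${h}:${m}:${s}`;
}

function toWebVTT(segments) {
  const cues = segments.map(
    ({ start, end, text }) => `${toTimestamp(start)} --> ${toTimestamp(end)}\n${text}`
  );
  return ["WEBVTT", ...cues].join("\n\n");
}
```

A track built this way can be presented alongside the other tracks it rejoins on the remote side, or handed to the web application as data for its own formatting and layout.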
> Best regards,
> Adam Sobieski
>
> ------------------------------
> *From:* rtcweb <rtcweb-bounces@ietf.org> on behalf of Adam Sobieski <adamsobieski@hotmail.com>
> *Sent:* Thursday, September 27, 2018 8:50:02 PM
> *To:* Bernard Aboba; ted.ietf@gmail.com
> *Cc:* RTCWeb IETF
> *Subject:* Re: [rtcweb] WebRTC and Real-time Translation
>
> Bernard Aboba,
> Ted Hardie,
>
> Client-side Transcription and Translation
>
> With respect to client-side speech recognition, transcription, translation and speech synthesis scenarios, we can consider GPGPU approaches.
>
> HYDRA [1][2] is a “hybrid GPU/CPU-based speech recognition engine that leverages modern GPU-based parallel computing architectures to realize accurate real-time recognition with extremely large models.” In 2012, Professor Ian Lane indicated that HYDRA performs 20x faster than other approaches [3].
>
> Deep Speech [4][5][6] is a deep-learning-based approach to speech recognition which “outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set” and “handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems....”
>
> Articulatory synthesis can be accelerated by graphics cards [7].
>
> WaveNet [8][9] is a deep generative model of raw audio waveforms, including speech audio.
>
> Facebook AI Research recently advanced machine translation [10], improving performance metrics by 10 BLEU points.
>
> With respect to desktop-based translation, vendors such as SYSTRAN [11] offer desktop-based, server-based and cloud-based solutions.
>
> There are some desktop-based transcription and machine translation solutions [12], and it is expected that real-time client-side solutions for transcription and translation, processing speech audio, will exist in the upcoming years, at least for desktop computing if not mobile computing.
> On-premises Transcription and Translation
>
> In addition to client-side solutions, on-premises solutions can deliver lowered latency and enhanced privacy.
>
> Server-side and Cloud-based Transcription and Translation
>
> For a number of scenarios, including mobile computing, server-side and cloud-based transcription and translation services make sense.
>
> Major software vendors such as Amazon, Facebook, Google, IBM and Microsoft offer priced cloud-based services which include speech recognition, machine translation and speech synthesis.
>
> Post-text Speech Technology
>
> I am an advocate of post-text speech technologies. Speech-to-text is too lossy: information pertaining to prosody, intonation, emphases and pauses is discarded in text output. Such information can be useful, for example in informing machine translation components and services. In addition to speech-to-SSML speech recognition and SSML-to-SSML machine translation scenarios, we can envision new, intermediate data formats beyond SSML.
>
> The inputs and outputs of speech recognition, translation and speech synthesis components and services could be in multiple formats – formats other than text.
>
> API Sketch: Dataflow Graphs
>
> Sketches with respect to APIs include the declarative construction of dataflow graphs which interconnect abstract components. Such APIs can abstract away whether the interconnectable components are client-side, on-prem, server-side, third-party or cloud-based. Such APIs can abstract away whether the interconnectable components are free or priced to end-users. Considerations for such APIs include the data formats and stream specifications of components’ various inputs and outputs to be interconnected.
>
> Dataflow graphs can be an intuitive abstraction layer, one which provides intuitive and convenient programming while interconnecting arbitrary numbers of components and services.
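As a small concrete example of the post-text idea above: a recognizer's output serialized to SSML rather than flattened to plain text, so prosodic information survives for downstream translation and synthesis. The annotated-token input shape is hypothetical, invented for the sketch; the `<break>`, `<emphasis>` and `<prosody>` elements are standard SSML.

```javascript
// Serialize recognized tokens, with prosodic annotations, to SSML
// instead of discarding that information as plain-text output would.
// The token shape is a hypothetical stand-in for a recognizer's
// richer, post-text output.
function tokenToSSML(token) {
  if (token.pause) return `<break time="${token.pause}ms"/>`;
  let text = token.text;
  if (token.emphasis) text = `<emphasis level="${token.emphasis}">${text}</emphasis>`;
  if (token.rate) text = `<prosody rate="${token.rate}">${text}</prosody>`;
  return text;
}

function toSSML(tokens) {
  return `<speak>${tokens.map(tokenToSSML).join(" ")}</speak>`;
}
```

An SSML-to-SSML machine translation component could then consume such markup directly, carrying emphases and pauses across languages rather than reconstructing them.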
> Dataflow graphs can interconnect client-side and remote speech recognition, translation and speech synthesis components, as well as any other components which could reasonably be interconnected or pipelined.
>
> When such dataflow graphs are prepared for activation, it is envisioned that users will be provided with notifications, requests for permissions and options for payment.
>
> Potential IETF Work Items
>
> When such dataflow graphs are activated, it is envisioned that computer networking protocols will be utilized to notify remote components or services of proper data routings, e.g. daisy-chain or pipeline configurations, in a secure manner.
>
> That is, there may be new protocols and computer networking topics with regard to implementing the APIs for interconnecting WebRTC peers with speech recognition, translation and speech synthesis components and services.
>
> Conclusion
>
> Tight WebRTC integration is important for the envisioned efficient, low-latency, high-performance, scalable real-time translation scenarios.
>
> While there exist some ad hoc approaches to providing real-time translation with WebRTC, standardizing new APIs and protocols can be a convenience for developers and for end users, and can create new markets with respect to real-time translation scenarios.
>
> Thank you for considering adding real-time translation to the use cases for a next version of WebRTC. I look forward to any discussion on these topics.
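The declarative dataflow-graph API sketched in prose above could look something like the following. Everything here (class and method names, node shapes) is hypothetical, invented only to make the idea concrete; in a real design each node would wrap a local component or remote service (likely async, with permission and payment prompts at activation), whereas here nodes are pure functions so the wiring itself is visible.

```javascript
// Hypothetical sketch: declaratively construct a dataflow graph of
// abstract components, then activate it. The graph abstracts away
// whether a node is client-side, on-prem, server-side or cloud-based.
class DataflowGraph {
  constructor() {
    this.nodes = new Map(); // name -> { fn, inputs }
  }
  add(name, fn, inputs = []) {
    this.nodes.set(name, { fn, inputs });
    return this; // chainable, for declarative construction
  }
  // Activate: evaluate a node, recursively resolving its inputs.
  // Names absent from the graph are treated as caller-supplied sources.
  run(name, sources = {}) {
    const node = this.nodes.get(name);
    if (!node) return sources[name];
    return node.fn(...node.inputs.map((i) => this.run(i, sources)));
  }
}

// Example wiring: microphone -> recognize -> translate -> synthesize,
// with placeholder functions standing in for real components/services.
const graph = new DataflowGraph()
  .add("recognize",  (audio) => `text(${audio})`, ["mic"])
  .add("translate",  (text)  => `es:${text}`,     ["recognize"])
  .add("synthesize", (text)  => `audio(${text})`, ["translate"]);
```

Activating the graph with `graph.run("synthesize", { mic: "a0" })` yields `"audio(es:text(a0))"`; the same wiring shape could describe the daisy-chain routing that a notification protocol would communicate to remote services.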
> References
>
> [1] http://www.cs.cmu.edu/~ianlane/hydra/
> [2] https://www.youtube.com/watch?v=73rQ0lRx2aY
> [3] https://www.youtube.com/watch?v=Y7Jlj7QYrcg
> [4] https://arxiv.org/abs/1412.5567
> [5] https://devblogs.nvidia.com/deep-speech-accurate-speech-recognition-gpu-accelerated-deep-learning/
> [6] https://github.com/mozilla/DeepSpeech
> [7] https://open.library.ubc.ca/media/stream/pdf/24/1.0348751/3
> [8] https://deepmind.com/blog/wavenet-generative-model-raw-audio/
> [9] https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/
> [10] https://www.forbes.com/sites/williamfalcon/2018/09/01/facebook-ai-just-set-a-new-record-in-translation-and-why-it-matters/#205b9e5b3124
> [11] https://store.systran.us/lp/storeSystran?Langue=en_US
> [12] https://en.wikipedia.org/wiki/Comparison_of_machine_translation_applications
>
> *From:* Bernard Aboba <bernard.aboba@gmail.com>
> *Sent:* Thursday, September 27, 2018 12:58 AM
> *Subject:* Re: [rtcweb] WebRTC and Real-time Translation
>
> One of the key questions for "Next Version Use Cases" is what WebRTC deficiencies are preventing these use cases from being satisfactorily implemented today.
>
> For example, speech transcription cloud services have been implemented over WebSockets, where a snippet of speech is uploaded and a transcription is provided in reply. The latency is satisfactory for some use cases. Improvements can perhaps be made by sending an audio stream and receiving a transcription via the data channel, but this is also within the capabilities of the existing RTCWEB protocols and the WebRTC-PC API.
>
> What seems to differentiate *next version* scenarios are situations where the processing is best done on the device, in order to lower latency or enhance privacy. On-device processing brings in discussion of workers/worklets, access to raw audio/video, etc. However, so far I'm not aware of on-device implementations of transcription or translation.
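The snippet-upload approach described above implies segmenting captured audio before sending it to a transcription service. A trivial sample-based segmenter is sketched below; the sizes are illustrative, and a real implementation would respect encoder frame boundaries rather than cutting at arbitrary sample offsets.

```javascript
// Split captured audio samples into fixed-duration snippets, each small
// enough to upload (e.g. over a WebSocket) and receive a transcription
// in reply. Purely illustrative sizing.
function chunkSamples(samples, sampleRate, snippetSeconds) {
  const size = Math.round(sampleRate * snippetSeconds);
  const snippets = [];
  for (let i = 0; i < samples.length; i += size) {
    snippets.push(samples.slice(i, i + size));
  }
  return snippets;
}
```

At 16 kHz, one-second snippets of a 40,000-sample buffer yield chunks of 16,000, 16,000 and 8,000 samples; each chunk-plus-reply round trip is the latency unit Bernard describes.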
> On Wed, Sep 26, 2018 at 6:10 PM Adam Sobieski <adamsobieski@hotmail.com> wrote:
>
> IETF RTCWEB Working Group,
>
> Greetings. I opened an issue on *WebRTC and Real-time Translation* at the GitHub repository for WebRTC version next use cases (https://github.com/w3c/webrtc-nv-use-cases/issues/2).
>
> Introduction
>
> Real-time translation is both an interesting and important use case for a next version of WebRTC.
>
> Speech Recognition, Translation and Speech Synthesis
>
> Approaches to real-time speech-to-speech machine translation include those which interconnect speech recognition, translation and speech synthesis components and services. In that regard, we can consider client-side, on-prem, server-side, third-party and cloud-based components and services. We can also consider both free and priced components and services.
>
> We can envision *post-text* speech technology and machine translation components and services. Speech recognition need not output to text; we can consider speech-to-SSML. Machine translation need not input from nor output to text; we can consider SSML-to-SSML machine translation. Components and services may provide various options with respect to their input and output data formats.
>
> Connecting Components and Services by Constructing Graphs
>
> We can consider APIs which facilitate the construction of graphs which represent the flow of data between components and services. As these graphs are constructed, users could be apprised of relevant notifications, requests for permissions and options for payments. As these constructed graphs are activated, a number of protocols could be utilized to interconnect the components and services which, together, provide users with real-time translation.
> Hyperlinks
>
> WebRTC Translator Demo <https://www.youtube.com/watch?v=Tv8ilBOKS2o>
> Real Time Translation in WebRTC <https://www.youtube.com/watch?v=EPBWR_GNY9U>
>
> Best regards,
> Adam Sobieski
>
> http://www.phoster.com/contents/
>
> _______________________________________________
> rtcweb mailing list
> rtcweb@ietf.org
> https://www.ietf.org/mailman/listinfo/rtcweb