Re: [rtcweb] WebRTC and Real-time Translation

Adam Sobieski <adamsobieski@hotmail.com> Thu, 04 October 2018 17:25 UTC

From: Adam Sobieski <adamsobieski@hotmail.com>
To: Bernard Aboba <bernard.aboba@gmail.com>, "ted.ietf@gmail.com" <ted.ietf@gmail.com>
CC: RTCWeb IETF <rtcweb@ietf.org>
Date: Thu, 04 Oct 2018 17:25:00 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtcweb/JlZS4ZD9rTqAfyIDTH446je5OWk>
Subject: Re: [rtcweb] WebRTC and Real-time Translation

A New Transport Protocol?

The QUIC API [1] shows how to extend the WebRTC specification to enable the use of a new transport protocol.

“This specification extends the WebRTC specification [WEBRTC] to enable the use of QUIC [QUIC-TRANSPORT] to exchange arbitrary data with remote peers using NAT-traversal technologies such as ICE, STUN, and TURN. Since QUIC can be multiplexed on the same port as RTP, RTCP, DTLS, STUN and TURN, this specification is compatible with all the functionality defined in [WEBRTC], including communication using audio/video media and SCTP data channels.”
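For illustration, here is a rough sketch of what using such a transport could look like, loosely following the draft API shapes in [1]. That API was experimental and has since evolved, so every constructor and method name below should be read as an assumption rather than a settled interface:

```typescript
// Illustrative only: loosely follows early draft shapes from [1]; the exact
// constructor and method names are assumptions, not a settled API.
const iceTransport = new RTCIceTransport();
const quicTransport = new RTCQuicTransport(iceTransport);

// ICE parameters and candidates are assumed to be exchanged via application
// signaling before connecting.
quicTransport.connect();

quicTransport.onstatechange = () => {
  if (quicTransport.state === 'connected') {
    // A dedicated QUIC stream could carry a copy of an audio track's data to
    // a translation service, outside the media "envelope" between the peers.
    const stream = quicTransport.createStream();
    const encodedAudioChunk = new Uint8Array(0); // placeholder audio payload
    stream.write({ data: encodedAudioChunk });
  }
};
```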

For real-time translation and video processing scenarios that utilize combinations of local components and remote services, a new transport protocol, or a new version of an existing transport protocol, may be needed. Such a transport protocol could facilitate transmitting and routing certain streams (or copies of those streams) outside of the envelopes between two or more peers, so that those streams rejoin the other streams in the envelopes on the remote peers.

With such a transport protocol, one could specify, prepare and activate processing graphs for one or more audio, video or data streams. A processing graph might, for example, pass a stream through speech recognition, translation and speech synthesis components or services between two or more peers, or pass a stream through a video processing component or service.
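As a sketch of what specifying such a graph could look like — every name below (ProcessingGraph, node, connect, prepare, activate) is hypothetical, invented here for illustration:

```typescript
// Hypothetical API -- all names are invented for illustration.
// localAudioTrack and remotePeer are assumed to be in scope.
const graph = new ProcessingGraph();

const source     = graph.node({ kind: 'track-source', track: localAudioTrack });
const recognizer = graph.node({ kind: 'speech-recognition', output: 'ssml' });
const translator = graph.node({ kind: 'translation', from: 'en', to: 'fr' });
const synthesis  = graph.node({ kind: 'speech-synthesis', voice: 'fr-FR' });
const sink       = graph.node({ kind: 'peer-sink', peer: remotePeer });

// Daisy-chain: recognition -> translation -> synthesis, between two peers.
graph.connect(source, recognizer);
graph.connect(recognizer, translator);
graph.connect(translator, synthesis);
graph.connect(synthesis, sink);

// Preparing surfaces notifications, permission requests and payment options;
// activating instructs the transport to route the streams accordingly.
await graph.prepare();
await graph.activate();
```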

Hopefully, I’ve indicated how tractable it is to add real-time translation and video processing to a next version of WebRTC. I’m confident that the remaining technical details can be worked out in the coming years.

What do you think of this approach: a solution that includes a new transport protocol, or a new version of an existing transport protocol, to provide real-time translation and real-time video processing utilizing interconnected local components and remote services?

References

[1] https://w3c.github.io/webrtc-quic/


Best regards,
Adam Sobieski


________________________________
From: rtcweb <rtcweb-bounces@ietf.org> on behalf of Adam Sobieski <adamsobieski@hotmail.com>
Sent: Wednesday, October 3, 2018 6:14:51 PM
To: Bernard Aboba; ted.ietf@gmail.com
Cc: RTCWeb IETF
Subject: Re: [rtcweb] WebRTC and Real-time Translation

RTP Media API

https://www.w3.org/TR/webrtc/#rtp-media-api

“The RTP media API lets a web application send and receive MediaStreamTracks over a peer-to-peer connection. Tracks, when added to an RTCPeerConnection, result in signaling; when this signaling is forwarded to a remote peer, it causes corresponding tracks to be created on the remote side.”

“The actual encoding and transmission of MediaStreamTracks is managed through objects called RTCRtpSenders. Similarly, the reception and decoding of MediaStreamTracks is managed through objects called RTCRtpReceivers. Each RTCRtpSender is associated with at most one track, and each track to be received is associated with exactly one RTCRtpReceiver.”

Envisioned for real-time translation scenarios is that audio tracks – or copies of audio tracks – can be routed through one or more local components and remote services, with the resultant output either sent to a remote side or multicast to multiple other peers. In scenarios which utilize remote services in particular, audio tracks to be translated may travel outside of the envelopes for other tracks. Translated content should then rejoin the other tracks on the remote side for synchronized presentation or processing.
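With today’s WebRTC-PC API, such routing can be approximated ad hoc by cloning a track and sending the copy to a translation service over a second peer connection. A minimal sketch, with the service’s signaling assumed to be handled elsewhere:

```typescript
// Fork the microphone track: one copy to the human peer, one to a
// (hypothetical) translation service over a second RTCPeerConnection.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [audioTrack] = stream.getAudioTracks();

const peerConnection = new RTCPeerConnection();    // to the remote participant
const serviceConnection = new RTCPeerConnection(); // to the translation service

peerConnection.addTrack(audioTrack, stream);
serviceConnection.addTrack(audioTrack.clone(), stream); // copy leaves the peers' "envelope"

// On the remote side, translated audio arrives as an ordinary track; the
// application must re-synchronize it with the other tracks itself.
peerConnection.ontrack = (event) => {
  // Attach event.track (the translated audio) for playout.
};
```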

Real-time audio-to-audio translation is one scenario. A second scenario is real-time audio-to-subtitles translation, where the translation results arrive as a subtitles track. A third scenario is where translation results arrive as data, for example to appear on-screen per the formatting and layout of a web application. The output from one or more interconnected components and services which perform real-time translation could thus include: (1) audio, (2) subtitles, (3) data.
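For the subtitles-as-data case, translation results could arrive over a data channel. A minimal sketch, reusing peerConnection from the sketch above; the channel name and the JSON cue format are assumptions:

```typescript
// Receive translation results as data cues; the channel name and the cue
// format are assumptions for illustration.
const subtitleChannel = peerConnection.createDataChannel('subtitles');

subtitleChannel.onmessage = (event) => {
  const cue = JSON.parse(event.data); // e.g. { start: 1.2, end: 3.4, text: 'Bonjour' }
  renderSubtitle(cue);
};

function renderSubtitle(cue: { start: number; end: number; text: string }): void {
  // Application-defined: render the cue per the web app's own formatting/layout.
  console.log(`[${cue.start}s-${cue.end}s] ${cue.text}`);
}
```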

Use Case: Funny Hats

https://w3c.github.io/webrtc-nv-use-cases/#funnyhats*

The capability of routing one or more tracks through one or more local components and remote services also facilitates scenarios resembling those discussed in the Funny Hats use case.

Differences include: (1) Funny Hats scenarios utilize video tracks, and (2) real-time translation scenarios, while presentable as singular services, may involve daisy-chaining or pipelining a number of local components or remote services: speech recognition, translation and speech synthesis.

As with real-time translation, we can envision free as well as priced video processing services.


Best regards,
Adam Sobieski

________________________________
From: rtcweb <rtcweb-bounces@ietf.org> on behalf of Adam Sobieski <adamsobieski@hotmail.com>
Sent: Thursday, September 27, 2018 8:50:02 PM
To: Bernard Aboba; ted.ietf@gmail.com
Cc: RTCWeb IETF
Subject: Re: [rtcweb] WebRTC and Real-time Translation


Bernard Aboba,
Ted Hardie,

Client-side Transcription and Translation

With respect to client-side speech recognition, transcription, translation and speech synthesis scenarios, we can consider GPGPU approaches.

HYDRA [1][2] is a “hybrid GPU/CPU-based speech recognition engine that leverages modern GPU-based parallel computing architectures to realize accurate real-time recognition with extremely large models.” In 2012, Professor Ian Lane indicated that HYDRA performs 20x faster than other approaches [3].

Deep Speech [4][5][6] is a deep-learning-based approach to speech recognition which “outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set” and “handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems...”

Articulatory synthesis can be accelerated by graphics cards [7].

WaveNet [8][9] is a deep generative model of raw audio waveforms including speech audio.

Facebook AI Research recently advanced machine translation [10], improving performance metrics by 10 BLEU points.

With respect to desktop-based translation, vendors such as SYSTRAN [11] offer desktop-based, server-based and cloud-based solutions.

Some desktop-based transcription and machine translation solutions exist [12], and real-time client-side solutions for transcription and translation of speech audio can be expected in the upcoming years, at least for desktop computing if not mobile computing.

On-premises Transcription and Translation

In addition to client-side solutions, on-premises solutions can deliver lower latency and enhanced privacy.

Server-side and Cloud-based Transcription and Translation

For a number of scenarios including mobile computing, server-side and cloud-based transcription and translation services make sense.

Major software vendors such as Amazon, Facebook, Google, IBM and Microsoft offer priced cloud-based services which include speech recognition, machine translation and speech synthesis.

Post-text Speech Technology

I am an advocate of post-text speech technologies. Speech-to-text is too lossy: information pertaining to prosody, intonation, emphasis and pauses is discarded in text output. Such information can be useful, for example to inform machine translation components and services. In addition to speech-to-SSML speech recognition and SSML-to-SSML machine translation scenarios, we can envision new intermediate data formats beyond SSML.

The inputs and outputs of speech recognition, translation and speech synthesis components and services could be in any of multiple formats – formats other than text.
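For example, a recognizer that emits SSML instead of plain text could preserve emphasis, pauses and prosody for a downstream translator. The markup values below are invented for illustration:

```typescript
// Illustrative speech-to-SSML recognizer output: an emphasis, a pause and a
// prosody hint survive recognition and can inform translation.
const recognized: string = `
<speak version="1.1" xml:lang="en-US">
  <s>I <emphasis level="strong">really</emphasis> need this
     <break time="300ms"/> by <prosody rate="slow">tomorrow</prosody>.</s>
</speak>`;
```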

API Sketch: Dataflow Graphs

API sketches include the declarative construction of dataflow graphs which interconnect abstract components. Such APIs can abstract away whether the interconnectable components are client-side, on-premises, server-side, third-party or cloud-based, and whether they are free or priced to end users. Design considerations for such APIs include the data formats and stream specifications of the components’ various inputs and outputs to be interconnected.

Dataflow graphs can be an intuitive abstraction layer, one which provides convenient programming while interconnecting arbitrary numbers of components and services. Dataflow graphs can interconnect client-side and remote speech recognition, translation and speech synthesis components as well as any other components which could reasonably be interconnected or pipelined.

When such dataflow graphs are prepared for activation, it is envisioned that users will be provided with notifications, requests for permissions and options for payment.

Potential IETF Work Items

When such dataflow graphs are activated, it is envisioned that computer networking protocols will be utilized to notify remote components or services of proper data routings, e.g. daisy-chain or pipeline configurations, in a secure manner.

That is, there may be new protocols and computer networking topics with regard to implementing the APIs for interconnecting WebRTC peers with speech recognition, translation and speech synthesis components and services.
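Purely as a sketch, such a routing notification to a remote service might carry something like the following; every field below is hypothetical:

```typescript
// Hypothetical protocol message: what an activated dataflow graph might send
// to a remote service to establish its stage in a daisy-chain or pipeline.
const routingNotification = {
  graphId: 'b3b0c6e2',                                      // opaque graph identifier
  stage: 'translation',                                     // this service's role
  input:  { from: 'peer:alice.example',    format: 'ssml' },
  output: { to: 'service:tts.example.com', format: 'ssml' },
  authorization: 'bearer <token>'                           // credentials for the routed streams
};
```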

Conclusion

Tight WebRTC integration is important for envisioned efficient, low-latency, high-performance, scalable real-time translation scenarios.

While there exist some ad hoc approaches to providing real-time translation with WebRTC, standardizing new APIs and protocols can benefit developers and end users and create new markets with respect to real-time translation scenarios.

Thank you for considering adding real-time translation to the use cases for a next version of WebRTC. I look forward to any discussion on these topics.

References

[1] http://www.cs.cmu.edu/~ianlane/hydra/
[2] https://www.youtube.com/watch?v=73rQ0lRx2aY
[3] https://www.youtube.com/watch?v=Y7Jlj7QYrcg
[4] https://arxiv.org/abs/1412.5567
[5] https://devblogs.nvidia.com/deep-speech-accurate-speech-recognition-gpu-accelerated-deep-learning/
[6] https://github.com/mozilla/DeepSpeech
[7] https://open.library.ubc.ca/media/stream/pdf/24/1.0348751/3
[8] https://deepmind.com/blog/wavenet-generative-model-raw-audio/
[9] https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/
[10] https://www.forbes.com/sites/williamfalcon/2018/09/01/facebook-ai-just-set-a-new-record-in-translation-and-why-it-matters/#205b9e5b3124
[11] https://store.systran.us/lp/storeSystran?Langue=en_US
[12] https://en.wikipedia.org/wiki/Comparison_of_machine_translation_applications


From: Bernard Aboba<mailto:bernard.aboba@gmail.com>
Sent: Thursday, September 27, 2018 12:58 AM
Subject: Re: [rtcweb] WebRTC and Real-time Translation

One of the key questions for "Next Version Use Cases" is what WebRTC deficiencies are preventing these use cases from being satisfactorily implemented today.

For example, speech transcription cloud services have been implemented over WebSockets, where a snippet of speech is uploaded, and a transcription is provided in reply.  The latency is satisfactory for some use cases.
Improvements can perhaps be made by sending an audio stream and receiving a transcription via the data channel, but this is also within the capabilities of the existing RTCWEB protocols and WebRTC-PC API.
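A minimal sketch of this existing snippet-upload pattern, with the service endpoint and reply format assumed:

```typescript
// Ad hoc pattern available today: upload speech snippets over a WebSocket
// and receive transcriptions in reply. Endpoint and reply shape are assumed.
const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(micStream, { mimeType: 'audio/webm' });
const socket = new WebSocket('wss://transcribe.example.com/v1');

socket.onopen = () => recorder.start(1000);      // emit a snippet every second
recorder.ondataavailable = (event) => socket.send(event.data);

socket.onmessage = (event) => {
  const { transcript } = JSON.parse(event.data); // assumed reply format
  console.log('partial transcript:', transcript);
};
```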

What seems to differentiate *next version* scenarios are situations where the processing is best done on the device, in order to lower latency or enhance privacy.  On-device processing brings in discussion of workers/worklets, access to raw audio/video, etc.  However, so far I'm not aware of on-device implementations of transcription or translation.

On Wed, Sep 26, 2018 at 6:10 PM Adam Sobieski <adamsobieski@hotmail.com<mailto:adamsobieski@hotmail.com>> wrote:
IETF RTCWEB Working Group,

Greetings. I opened an issue on WebRTC and Real-time Translation at the GitHub repository for WebRTC version next use cases (https://github.com/w3c/webrtc-nv-use-cases/issues/2).

Introduction

Real-time translation is both an interesting and important use case for a next version of WebRTC.

Speech Recognition, Translation and Speech Synthesis

Approaches to real-time speech-to-speech machine translation include those which interconnect speech recognition, translation and speech synthesis components and services. In that regard, we can consider client-side, on-premises, server-side, third-party and cloud-based components and services, both free and priced.

We can envision post-text speech technology and machine translation components and services. Speech recognition need not output text; we can consider speech-to-SSML recognition. Machine translation need not take text as input nor produce text as output; we can consider SSML-to-SSML machine translation. Components and services may provide various options with respect to their input and output data formats.

Connecting Components and Services by Constructing Graphs

We can consider APIs which facilitate the construction of graphs which represent the flow of data between components and services. As these graphs are constructed, users could be apprised of relevant notifications, requests for permissions and options for payments. As these constructed graphs are activated, a number of protocols could be utilized to interconnect the components and services which, together, provide users with real-time translation.

Hyperlinks

WebRTC Translator Demo<https://www.youtube.com/watch?v=Tv8ilBOKS2o>
Real Time Translation in WebRTC<https://www.youtube.com/watch?v=EPBWR_GNY9U>


Best regards,
Adam Sobieski
http://www.phoster.com/contents/
