Re: [rtcweb] WebRTC and Real-time Translation

Justin Uberti <juberti@google.com> Thu, 04 October 2018 23:29 UTC

From: Justin Uberti <juberti@google.com>
Date: Thu, 04 Oct 2018 16:29:06 -0700
To: adamsobieski@hotmail.com
Cc: Bernard Aboba <bernard.aboba@gmail.com>, Ted Hardie <ted.ietf@gmail.com>, RTCWeb IETF <rtcweb@ietf.org>

I think much of what you mention will be possible with the new APIs
discussed in https://w3c.github.io/webrtc-nv-use-cases/ (and probably can
already be done to some extent with Web Audio).

However, I'm not sure that a new transport is needed for this particular
use case - it seems mostly satisfied by on-device processing capabilities.
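
For instance, a rough sketch of tapping a received audio track with Web
Audio so an on-device component can consume raw samples (receiver is an
RTCRtpReceiver; recognize() stands in for whatever consumes the samples):

  const ctx = new AudioContext();
  const src = ctx.createMediaStreamSource(new MediaStream([receiver.track]));
  const tap = ctx.createScriptProcessor(4096, 1, 1); // deprecated, but widely supported
  tap.onaudioprocess = (e) => recognize(e.inputBuffer.getChannelData(0));
  src.connect(tap);
  tap.connect(ctx.destination); // keeps the graph pulling samples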

On Thu, Oct 4, 2018 at 10:25 AM Adam Sobieski <adamsobieski@hotmail.com>
wrote:

> A New Transport Protocol?
>
> The QUIC API [1] shows how to extend the WebRTC specification to enable
> the use of a new transport protocol.
>
> “This specification extends the WebRTC specification [WEBRTC] to enable
> the use of QUIC [QUIC-TRANSPORT] to exchange arbitrary data with remote
> peers using NAT-traversal technologies such as ICE, STUN, and TURN. Since
> QUIC can be multiplexed on the same port as RTP, RTCP, DTLS, STUN and TURN,
> this specification is compatible with all the functionality defined in
> [WEBRTC], including communication using audio/video media and SCTP data
> channels.”
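>
> As a rough sketch, the draft's transport could be wired up along these
> lines (hedged: the constructor and method shapes below follow the draft
> in [1] as of this writing and may well change):
>
>   const quic = new RTCQuicTransport(iceTransport);
>   quic.connect();                // or quic.listen(remoteKey) when answering
>   const stream = quic.createStream();
>   stream.write({ data: chunk }); // chunk: an ArrayBuffer of application data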
>
> It could be that, for both real-time translation and video processing
> scenarios utilizing combinations of local components and remote services,
> a new transport protocol or a new version of an existing transport
> protocol is needed. Such a transport protocol could facilitate transmitting
> and routing certain streams (or copies of them) outside of the envelopes
> between two or more peers, so that those streams rejoin the other streams
> in envelopes on the remote peers.
>
> With such a transport protocol, one could specify, prepare and activate *processing
> graphs* for one or more audio, video or data streams. A processing graph
> could be such that a stream passes through speech recognition, translation
> and speech synthesis components or services between two or more peers. A
> processing graph could be such that a stream passes through a video
> processing component or service between two or more peers.
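>
> As a purely hypothetical illustration (none of the names below exist in
> any specification), a processing graph might be described declaratively:
>
>   const graph = {
>     nodes: [
>       { id: 'asr', service: 'wss://asr.example.com', output: 'ssml' },
>       { id: 'mt',  service: 'wss://mt.example.com',  from: 'en', to: 'fr' },
>       { id: 'tts', service: 'wss://tts.example.com', input: 'ssml' }
>     ],
>     edges: [['source', 'asr'], ['asr', 'mt'], ['mt', 'tts'], ['tts', 'sink']]
>   };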
>
> Hopefully, I’ve indicated how tractable it is to add real-time translation
> and video processing to a next version of WebRTC. I’m confident that we can
> resolve any remaining technical details in the coming years.
>
> What do you think about this approach: a solution that includes a new
> transport protocol, or a new version of an existing one, to provide
> real-time translation and real-time video processing utilizing
> interconnected local components and remote services?
> References
>
> [1] https://w3c.github.io/webrtc-quic/
>
>
>
>
>
> Best regards,
>
> Adam Sobieski
>
>
>
>
> ------------------------------
> *From:* rtcweb <rtcweb-bounces@ietf.org> on behalf of Adam Sobieski
> <adamsobieski@hotmail.com>
> *Sent:* Wednesday, October 3, 2018 6:14:51 PM
> *To:* Bernard Aboba; ted.ietf@gmail.com
> *Cc:* RTCWeb IETF
> *Subject:* Re: [rtcweb] WebRTC and Real-time Translation
>
> RTP Media API
>
> https://www.w3.org/TR/webrtc/#rtp-media-api
>
> “The RTP media API lets a web application send and receive
> MediaStreamTracks over a peer-to-peer connection. Tracks, when added to an
> RTCPeerConnection, result in signaling; when this signaling is forwarded to
> a remote peer, it causes corresponding tracks to be created on the remote
> side.”
>
> “The actual encoding and transmission of MediaStreamTracks is managed
> through objects called RTCRtpSenders. Similarly, the reception and decoding
> of MediaStreamTracks is managed through objects called RTCRtpReceivers.
> Each RTCRtpSender is associated with at most one track, and each track to
> be received is associated with exactly one RTCRtpReceiver.”
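>
> In code, standard WebRTC-PC usage of that API looks roughly like this
> (remoteAudio is an assumed <audio> element on the receiving page):
>
>   // Inside an async function:
>   const pc = new RTCPeerConnection(config);
>   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
>   const [track] = stream.getAudioTracks();
>   const sender = pc.addTrack(track, stream); // an RTCRtpSender
>   pc.ontrack = (e) => {                      // fires on the receiving side
>     remoteAudio.srcObject = new MediaStream([e.track]);
>   };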
>
> For real-time translation scenarios, it is envisioned that audio tracks –
> or copies of audio tracks – can be routed through one or more local
> components and remote services such that the resultant output can be either
> sent to a remote side or multicast to multiple other peers. In particular,
> for scenarios which utilize remote services, audio tracks to be translated
> may travel outside of the envelopes for other tracks. Translated content
> should rejoin the other tracks on the remote side for synchronized
> presentation or processing.
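>
> A minimal sketch of forking a local track so that a copy can be routed to
> a translation service over a second peer connection (the service endpoint
> and its signaling are assumed):
>
>   const copy = track.clone();
>   const pcToService = new RTCPeerConnection();
>   pcToService.addTrack(copy, new MediaStream([copy]));
>   // ... negotiate with the translation service; its translated output
>   // must then rejoin the other tracks at the remote peer.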
>
> Real-time audio-to-audio translation is one scenario. Another scenario is
> real-time audio-to-subtitles translation, where the results of real-time
> translation should arrive as a subtitles track. A third scenario is one
> where translation results should arrive as data, for example to appear
> on-screen per the formatting and layout of a web application. The output
> from one or more interconnected components and services which perform
> real-time translation could then include: (1) audio, (2) subtitles, or
> (3) data, as sketched below.
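>
> For the subtitles and data cases, translated text arriving on a data
> channel could be surfaced as a subtitles track (sketch only; the message
> format is an assumed application-defined JSON shape):
>
>   const textTrack = remoteVideo.addTextTrack('subtitles', 'French', 'fr');
>   textTrack.mode = 'showing';
>   dataChannel.onmessage = (e) => {
>     const { start, end, text } = JSON.parse(e.data);
>     textTrack.addCue(new VTTCue(start, end, text));
>   };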
> Use Case: Funny Hats
>
> https://w3c.github.io/webrtc-nv-use-cases/#funnyhats
>
> The capability of routing one or more tracks through one or more local
> components and remote services also facilitates scenarios resembling those
> discussed in the use case: Funny Hats.
>
> Differences include: (1) funny hats scenarios utilize video tracks, and
> (2) real-time translation scenarios, while possible to present as singular
> services, may involve daisy-chaining or pipelining a number of local
> components or remote services: speech recognition, translation and speech
> synthesis.
>
> As with real-time translation, we can envision free as well as priced
> video processing services.
>
>
>
>
>
> Best regards,
>
> Adam Sobieski
>
>
> ------------------------------
> *From:* rtcweb <rtcweb-bounces@ietf.org> on behalf of Adam Sobieski
> <adamsobieski@hotmail.com>
> *Sent:* Thursday, September 27, 2018 8:50:02 PM
> *To:* Bernard Aboba; ted.ietf@gmail.com
> *Cc:* RTCWeb IETF
> *Subject:* Re: [rtcweb] WebRTC and Real-time Translation
>
>
> Bernard Aboba,
> Ted Hardie,
> Client-side Transcription and Translation
>
> With respect to client-side speech recognition, transcription, translation
> and speech synthesis scenarios, we can consider GPGPU approaches.
>
> HYDRA [1][2] is a “hybrid GPU/CPU-based speech recognition engine that
> leverages modern GPU-based parallel computing architectures to realize
> accurate real-time recognition with extremely large models.” In 2012,
> Professor Ian Lane indicated that HYDRA performs 20x faster than other
> approaches [3].
>
> Deep Speech [4][5][6] is a deep-learning-based approach to speech
> recognition which “outperforms previously published results on the widely
> studied Switchboard Hub5'00, achieving 16.0% error on the full test set”
> and “handles challenging noisy environments better than widely used,
> state-of-the-art commercial speech systems....”
>
> Articulatory synthesis can be accelerated by graphics cards [7].
>
> WaveNet [8][9] is a deep generative model of raw audio waveforms including
> speech audio.
>
> Facebook AI Research recently advanced machine translation [10], improving
> performance metrics by 10 BLEU points.
>
> With respect to desktop-based translation, vendors such as SYSTRAN [11]
> offer desktop-based, server-based and cloud-based solutions.
>
> There are some desktop-based transcription and machine translation
> solutions [12], and it is expected that real-time client-side solutions for
> transcription and translation of speech audio will exist in the coming
> years, at least for desktop computing if not for mobile computing.
> On-premises Transcription and Translation
>
> In addition to client-side solutions, on-premises solutions can deliver
> lower latency and enhanced privacy.
> Server-side and Cloud-based Transcription and Translation
>
> For a number of scenarios including mobile computing, server-side and
> cloud-based transcription and translation services make sense.
>
> Major software vendors such as Amazon, Facebook, Google, IBM and Microsoft
> offer priced cloud-based services which include speech recognition, machine
> translation and speech synthesis.
> Post-text Speech Technology
>
> I am an advocate of post-text speech technologies. Speech-to-text is too
> lossy: information pertaining to prosody, intonation, emphasis and pauses
> is discarded in text output. Such information can be useful, for example,
> to inform machine translation components and services. In addition to
> speech-to-SSML speech recognition and SSML-to-SSML machine translation
> scenarios, we can envision new, intermediate data formats beyond SSML.
>
> The inputs and outputs of speech recognition, translation and speech
> synthesis components and services could take multiple formats – formats
> other than text.
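>
> As an illustration (the output shape is assumed), a speech-to-SSML
> recognizer might emit markup preserving emphasis, pausing and rate that
> plain text would discard:
>
>   const ssml = `<speak>
>     <s>I <emphasis level="strong">did</emphasis> say that,
>     <break time="300ms"/> <prosody rate="slow">eventually</prosody>.</s>
>   </speak>`;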
> API Sketch: Dataflow Graphs
>
> API sketches include the declarative construction of dataflow graphs which
> interconnect abstract components. Such APIs can abstract away whether the
> interconnectable components are client-side, on-prem, server-side,
> third-party or cloud-based, and whether they are free or priced to end
> users. Considerations for such APIs include the data formats and stream
> specifications of the components’ various inputs and outputs to be
> interconnected.
>
> Dataflow graphs can be an intuitive abstraction layer, one which provides
> convenient programming while interconnecting arbitrary numbers of
> components and services. Dataflow graphs can interconnect client-side and
> remote speech recognition, translation and speech synthesis components, as
> well as any other components which could reasonably be interconnected or
> pipelined, as in the sketch below.
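>
> A purely hypothetical API sketch (no such interfaces are specified
> anywhere) of building and wiring such a graph:
>
>   const graph = new DataflowGraph();
>   const asr = graph.addNode({ kind: 'speech-recognition', output: 'ssml' });
>   const mt  = graph.addNode({ kind: 'translation', from: 'en', to: 'ja' });
>   const tts = graph.addNode({ kind: 'speech-synthesis', input: 'ssml' });
>   graph.connect(localAudioTrack, asr);
>   graph.connect(asr, mt);
>   graph.connect(mt, tts);
>   graph.connect(tts, pc.addTransceiver('audio').sender);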
>
> When such dataflow graphs are prepared for activation, it is envisioned
> that users will be provided with notifications, requests for permissions
> and options for payment.
> Potential IETF Work Items
>
> When such dataflow graphs are activated, it is envisioned that computer
> networking protocols will be utilized to notify remote components or
> services of proper data routings, e.g. daisy-chain or pipeline
> configurations, in a secure manner.
>
> That is, there may be new protocols and computer networking topics with
> regard to implementing the APIs for interconnecting WebRTC peers with
> speech recognition, translation and speech synthesis components and
> services.
> Conclusion
>
> Tight WebRTC integration is important for envisioned efficient,
> low-latency, high-performance, scalable real-time translation scenarios.
>
> While there exist some ad hoc approaches to providing real-time
> translation with WebRTC, standardizing new APIs and protocols can benefit
> developers and end users, and create new markets with respect to real-time
> translation scenarios.
>
> Thank you for considering adding real-time translation to the use cases
> for a next version of WebRTC. I look forward to any discussion on these
> topics.
> References
>
> [1] http://www.cs.cmu.edu/~ianlane/hydra/
> [2] https://www.youtube.com/watch?v=73rQ0lRx2aY
> [3] https://www.youtube.com/watch?v=Y7Jlj7QYrcg
> [4] https://arxiv.org/abs/1412.5567
> [5]
> https://devblogs.nvidia.com/deep-speech-accurate-speech-recognition-gpu-accelerated-deep-learning/
> [6] https://github.com/mozilla/DeepSpeech
> [7] https://open.library.ubc.ca/media/stream/pdf/24/1.0348751/3
> [8] https://deepmind.com/blog/wavenet-generative-model-raw-audio/
> [9] https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/
> [10]
> https://www.forbes.com/sites/williamfalcon/2018/09/01/facebook-ai-just-set-a-new-record-in-translation-and-why-it-matters/#205b9e5b3124
> [11] https://store.systran.us/lp/storeSystran?Langue=en_US
> [12]
> https://en.wikipedia.org/wiki/Comparison_of_machine_translation_applications
>
>
>
>
>
> *From:* Bernard Aboba <bernard.aboba@gmail.com>
> *Sent:* Thursday, September 27, 2018 12:58 AM
> *Subject:* Re: [rtcweb] WebRTC and Real-time Translation
>
>
>
> One of the key questions for "Next Version Use Cases" is what WebRTC
> deficiencies are preventing these use cases from being satisfactorily
> implemented today.
>
>
>
> For example, speech transcription cloud services have been implemented
> over WebSockets, where a snippet of speech is uploaded and a transcription
> is provided in reply.  The latency is satisfactory for some use cases.
>
> Improvements can perhaps be made by sending an audio stream and receiving
> a transcription via the data channel, but this is also within the
> capabilities of the existing RTCWEB protocols and WebRTC-PC API.
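>
> A minimal sketch of that WebSocket pattern, uploading encoded audio
> snippets and reading transcriptions back (the endpoint, message format
> and showTranscript() are assumed):
>
>   const ws = new WebSocket('wss://transcribe.example.com/stream');
>   const recorder = new MediaRecorder(micStream, { mimeType: 'audio/webm' });
>   recorder.ondataavailable = (e) => ws.send(e.data); // encoded audio snippets
>   ws.onmessage = (e) => showTranscript(JSON.parse(e.data).text);
>   recorder.start(1000); // emit a chunk roughly every second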
>
>
>
> What seems to differentiate *next version* scenarios are situations where
> the processing is best done on the device, in order to lower latency or
> enhance privacy.  On-device processing brings in discussion of
> workers/worklets, access to raw audio/video, etc.  However, so far I'm not
> aware of on-device implementations of transcription or translation.
>
>
>
> On Wed, Sep 26, 2018 at 6:10 PM Adam Sobieski <adamsobieski@hotmail.com>
> wrote:
>
> IETF RTCWEB Working Group,
>
>
>
> Greetings. I opened an issue on *WebRTC and Real-time Translation* at the
> GitHub repository for WebRTC next-version use cases (
> https://github.com/w3c/webrtc-nv-use-cases/issues/2).
>
>
> Introduction
>
> Real-time translation is both an interesting and important use case for a
> next version of WebRTC.
> Speech Recognition, Translation and Speech Synthesis
>
> Approaches to real-time speech-to-speech machine translation include those
> which interconnect speech recognition, translation and speech synthesis
> components and services. In that regard, we can consider client-side,
> on-prem, server-side, third-party and cloud-based components and services.
> In that regard, we can also consider both free and priced components and
> services.
>
> We can envision *post-text* speech technology and machine translation
> components and services. Speech recognition need not output text; we can
> consider speech-to-SSML. Machine translation need not take text as input
> nor produce it as output; we can consider SSML-to-SSML machine translation.
> Components and services may provide various options with respect to their
> input and output data formats.
> Connecting Components and Services by Constructing Graphs
>
> We can consider APIs which facilitate the construction of graphs which
> represent the flow of data between components and services. As these graphs
> are constructed, users could be apprised of relevant notifications,
> requests for permissions and options for payments. As these constructed
> graphs are activated, a number of protocols could be utilized to
> interconnect the components and services which, together, provide users
> with real-time translation.
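>
> Continuing the hypothetical graph API sketched above (every name here is
> assumed for illustration), activation could be the natural point for
> notifications, permission prompts and payment consent:
>
>   const graph = buildTranslationGraph(localAudioTrack, 'en', 'de');
>   const approval = await graph.requestActivation(); // user agent may prompt
>   if (approval.granted) await graph.activate();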
> Hyperlinks
>
> WebRTC Translator Demo <https://www.youtube.com/watch?v=Tv8ilBOKS2o>
> Real Time Translation in WebRTC
> <https://www.youtube.com/watch?v=EPBWR_GNY9U>
>
>
>
>
>
> Best regards,
>
> Adam Sobieski
>
> http://www.phoster.com/contents/