Re: [rtcweb] WebRTC and Real-time Translation

Adam Sobieski <adamsobieski@hotmail.com> Fri, 28 September 2018 00:50 UTC

From: Adam Sobieski <adamsobieski@hotmail.com>
To: Bernard Aboba <bernard.aboba@gmail.com>, "ted.ietf@gmail.com" <ted.ietf@gmail.com>
CC: RTCWeb IETF <rtcweb@ietf.org>
Date: Fri, 28 Sep 2018 00:50:02 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtcweb/pGoYRb6y6kpzXNKzqFbtJ5cpIps>

Bernard Aboba,
Ted Hardie,

Client-side Transcription and Translation

With respect to client-side speech recognition, transcription, translation and speech synthesis scenarios, we can consider GPGPU approaches.

HYDRA [1][2] is a “hybrid GPU/CPU-based speech recognition engine that leverages modern GPU-based parallel computing architectures to realize accurate real-time recognition with extremely large models.” In 2012, Professor Ian Lane indicated that HYDRA performs 20x faster than other approaches [3].

Deep Speech [4][5][6] is a deep-learning-based approach to speech recognition which “outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set” and “handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.”

Articulatory synthesis can be accelerated by graphics cards [7].

WaveNet [8][9] is a deep generative model of raw audio waveforms including speech audio.

Facebook AI Research recently advanced machine translation [10], improving results by 10 BLEU points.

With respect to desktop-based translation, vendors such as SYSTRAN [11] offer desktop-based, server-based and cloud-based solutions.

There are some desktop-based transcription and machine translation solutions [12], and real-time client-side solutions for transcribing and translating speech audio are expected to exist in the upcoming years, at least for desktop computing if not mobile computing.

On-premises Transcription and Translation

In addition to client-side solutions, on-premises solutions can deliver lowered latency and enhanced privacy.

Server-side and Cloud-based Transcription and Translation

For a number of scenarios including mobile computing, server-side and cloud-based transcription and translation services make sense.

Major software vendors such as Amazon, Facebook, Google, IBM and Microsoft offer priced cloud-based services which include speech recognition, machine translation and speech synthesis.

Post-text Speech Technology

I am an advocate of post-text speech technologies. Speech-to-text is too lossy. Information pertaining to prosody, intonation, emphases and pauses is discarded in text output. Such information can be useful, for example to inform machine translation components and services. In addition to speech-to-SSML speech recognition and SSML-to-SSML machine translation scenarios, we can envision new, intermediate data formats beyond SSML.

The inputs and outputs of speech recognition, translation and speech synthesis components and services could be multiple formats – formats other than text.
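
To make the lossiness point concrete, here is a minimal Python sketch. The SSML fragment is an invented example (the sentence and the attribute values are assumptions, not output from any real recognizer); it uses the standard SSML elements emphasis, break and prosody, and shows what a text-only pipeline stage would throw away when it flattens the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of a speech-to-SSML recognizer (illustrative values only).
SSML = (
    '<speak version="1.1" xml:lang="en-US">'
    'I <emphasis level="strong">never</emphasis> said that.'
    '<break time="400ms"/>'
    '<prosody rate="slow" pitch="+10%"> Did you?</prosody>'
    '</speak>'
)


def to_plain_text(ssml):
    """Flatten SSML to text, as a text-only pipeline stage would."""
    root = ET.fromstring(ssml)
    # Join all text nodes, then normalize whitespace.
    return " ".join("".join(root.itertext()).split())


def discarded_annotations(ssml):
    """List the elements (with attributes) that flattening discards."""
    root = ET.fromstring(ssml)
    return [(el.tag, dict(el.attrib)) for el in root.iter() if el is not root]


text = to_plain_text(SSML)
lost = discarded_annotations(SSML)

print("text only:", text)
for tag, attrs in lost:
    print("discarded:", tag, attrs)
```

A translation component that receives only the flattened text has no way to know which word was stressed or where the speaker paused; one that receives the SSML does.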

API Sketch: Dataflow Graphs

Sketches with respect to APIs include the declarative construction of dataflow graphs which interconnect abstract components. Such APIs can abstract away whether the interconnectable components are client-side, on-prem, server-side, third-party or cloud-based. Such APIs can also abstract away whether the interconnectable components are free or priced to end users. Considerations for such APIs include the data formats and stream specifications of the components’ various inputs and outputs to be interconnected.

Dataflow graphs can be an intuitive abstraction layer, one which provides intuitive and convenient programming while interconnecting arbitrary numbers of components and services. Dataflow graphs can interconnect client-side and remote speech recognition, translation and speech synthesis components as well as any other components which could reasonably be interconnected or pipelined.

When such dataflow graphs are prepared for activation, it is envisioned that users will be provided with notifications, requests for permissions and options for payment.
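
One way to picture such an API is the following minimal Python sketch. Everything in it is hypothetical: the class names (Component, DataflowGraph), the field names, the format labels ("audio", "ssml", "text") and the prompt strings are assumptions made for illustration, not part of WebRTC or of any existing library. It shows three ideas from the paragraphs above: components are described abstractly, connections are checked against input and output formats, and the notifications, permission requests and payment options can be derived from the graph before it is activated.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """An abstract processing stage; all field names are hypothetical."""
    name: str
    accepts: str              # input data format, e.g. "audio", "ssml", "text"
    produces: str             # output data format
    location: str = "client"  # "client", "on-prem", "server", "cloud", ...
    priced: bool = False      # True if the end user would be charged


@dataclass
class DataflowGraph:
    components: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, component):
        self.components[component.name] = component
        return self  # allow declarative chaining

    def connect(self, source, sink):
        src, dst = self.components[source], self.components[sink]
        # Refuse to wire an output to an input of a different format.
        if src.produces != dst.accepts:
            raise ValueError(
                f"{source} produces {src.produces!r} "
                f"but {sink} accepts {dst.accepts!r}"
            )
        self.edges.append((source, sink))
        return self

    def prompts_before_activation(self):
        """Collect what the user should be asked before activation."""
        prompts = []
        for c in self.components.values():
            if c.location != "client":
                prompts.append(f"permission: send data to {c.name} ({c.location})")
            if c.priced:
                prompts.append(f"payment: {c.name} is a priced service")
        return prompts


graph = (
    DataflowGraph()
    .add(Component("recognizer", accepts="audio", produces="ssml"))
    .add(Component("translator", accepts="ssml", produces="ssml",
                   location="cloud", priced=True))
    .add(Component("synthesizer", accepts="ssml", produces="audio"))
    .connect("recognizer", "translator")
    .connect("translator", "synthesizer")
)

prompts = graph.prompts_before_activation()
for line in prompts:
    print(line)
```

The calling code reads the same whether the translator runs on the client or in the cloud; only the component description changes, and with it the prompts shown to the user.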

Potential IETF Work Items

When such dataflow graphs are activated, it is envisioned that computer networking protocols will be utilized to notify remote components or services of proper data routings, e.g. daisy-chain or pipeline configurations, in a secure manner.

That is, there may be new protocols and computer networking topics with regard to implementing the APIs for interconnecting WebRTC peers with speech recognition, translation and speech synthesis components and services.
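
As a rough illustration of what "notifying remote components of proper data routings" could involve, the following Python sketch derives one routing message per service for a daisy-chain configuration. The message fields, the session identifier and the example.invalid endpoints are all invented for this sketch; no such protocol exists today. Security is assumed rather than shown: the messages would need to travel over an authenticated, encrypted channel.

```python
import json


def routing_messages(session_id, hops):
    """Build one routing message per hop for a daisy-chain configuration.

    `hops` is an ordered list of (service_name, endpoint) pairs. Each service
    is told where its input comes from and where to forward its output.
    None means "the WebRTC peer at that end of the chain".
    """
    messages = []
    for i, (name, endpoint) in enumerate(hops):
        upstream = hops[i - 1][1] if i > 0 else None
        downstream = hops[i + 1][1] if i + 1 < len(hops) else None
        messages.append({
            "session": session_id,       # hypothetical field names throughout
            "service": name,
            "endpoint": endpoint,
            "receive_from": upstream,
            "forward_to": downstream,
        })
    return messages


# Placeholder endpoints; example.invalid never resolves.
HOPS = [
    ("recognizer", "wss://asr.example.invalid/stream"),
    ("translator", "wss://mt.example.invalid/stream"),
    ("synthesizer", "wss://tts.example.invalid/stream"),
]

messages = routing_messages("session-1", HOPS)
print(json.dumps(messages, indent=2))
```

Open questions for a real protocol would include who is authorized to send such messages, how each hop authenticates its neighbours, and how the chain is torn down when a peer leaves.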

Conclusion

Tight WebRTC integration is important for the envisioned efficient, low-latency, high-performance, scalable real-time translation scenarios.

While there exist some ad hoc approaches to providing real-time translation with WebRTC, standardizing new APIs and protocols would benefit developers and end users and could create new markets for real-time translation.

Thank you for considering adding real-time translation to the use cases for a next version of WebRTC. I look forward to any discussion on these topics.

References

[1] http://www.cs.cmu.edu/~ianlane/hydra/
[2] https://www.youtube.com/watch?v=73rQ0lRx2aY
[3] https://www.youtube.com/watch?v=Y7Jlj7QYrcg
[4] https://arxiv.org/abs/1412.5567
[5] https://devblogs.nvidia.com/deep-speech-accurate-speech-recognition-gpu-accelerated-deep-learning/
[6] https://github.com/mozilla/DeepSpeech
[7] https://open.library.ubc.ca/media/stream/pdf/24/1.0348751/3
[8] https://deepmind.com/blog/wavenet-generative-model-raw-audio/
[9] https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/
[10] https://www.forbes.com/sites/williamfalcon/2018/09/01/facebook-ai-just-set-a-new-record-in-translation-and-why-it-matters/#205b9e5b3124
[11] https://store.systran.us/lp/storeSystran?Langue=en_US
[12] https://en.wikipedia.org/wiki/Comparison_of_machine_translation_applications


From: Bernard Aboba <bernard.aboba@gmail.com>
Sent: Thursday, September 27, 2018 12:58 AM
Subject: Re: [rtcweb] WebRTC and Real-time Translation

One of the key questions for "Next Version Use Cases" is what WebRTC deficiencies are preventing these use cases from being satisfactorily implemented today.

For example, speech transcription cloud services have been implemented over WebSockets, where a snippet of speech is uploaded, and a transcription is provided in reply. The latency is satisfactory for some use cases.
Improvements can perhaps be made by sending an audio stream and receiving a transcription via the data channel, but this is also within the capabilities of the existing RTCWEB protocols and WebRTC-PC API.

What seems to differentiate *next version* scenarios are situations where the processing is best done on the device, in order to lower latency or enhance privacy.  On-device processing brings in discussion of workers/worklets, access to raw audio/video, etc.  However, so far I'm not aware of on-device implementations of transcription or translation.

On Wed, Sep 26, 2018 at 6:10 PM Adam Sobieski <adamsobieski@hotmail.com> wrote:
IETF RTCWEB Working Group,

Greetings. I opened an issue on WebRTC and Real-time Translation at the GitHub repository for WebRTC version next use cases (https://github.com/w3c/webrtc-nv-use-cases/issues/2).

Introduction

Real-time translation is both an interesting and important use case for a next version of WebRTC.

Speech Recognition, Translation and Speech Synthesis

Approaches to real-time speech-to-speech machine translation include those which interconnect speech recognition, translation and speech synthesis components and services. In that regard, we can consider client-side, on-prem, server-side, third-party and cloud-based components and services. In that regard, we can also consider both free and priced components and services.

We can envision post-text speech technology and machine translation components and services. Speech recognition need not output to text; we can consider speech-to-SSML. Machine translation need not input from nor output to text; we can consider SSML-to-SSML machine translation. Components and services may provide various options with respect to their input and output data formats.

Connecting Components and Services by Constructing Graphs

We can consider APIs which facilitate the construction of graphs which represent the flow of data between components and services. As these graphs are constructed, users could be apprised of relevant notifications, requests for permissions and options for payments. As these constructed graphs are activated, a number of protocols could be utilized to interconnect the components and services which, together, provide users with real-time translation.

Hyperlinks

WebRTC Translator Demo <https://www.youtube.com/watch?v=Tv8ilBOKS2o>
Real Time Translation in WebRTC <https://www.youtube.com/watch?v=EPBWR_GNY9U>


Best regards,
Adam Sobieski
http://www.phoster.com/contents/
