CNBC AI News, July 24th — ByteDance today announced the official launch of its end-to-end simultaneous interpretation model, Seed LiveInterpret 2.0.
According to ByteDance, this marks a significant leap forward in AI-powered translation. This system boasts extremely low latency while maintaining state-of-the-art (SOTA) translation quality in Chinese-English simultaneous interpretation.
The model is built upon a full-duplex end-to-end speech generation and understanding framework, facilitating bidirectional Chinese-English translation.
It processes multi-person speech input in real-time, mirroring the “listen and speak” capability of human interpreters with minimal delay. The system can simultaneously receive source language audio and output translated speech in the target language.
Furthermore, Seed LiveInterpret 2.0 supports zero-shot voice cloning, promising more natural and seamless communication.
Currently, the model primarily focuses on Chinese-English translation.
Here’s how Seed LiveInterpret 2.0 sets itself apart from traditional machine interpretation systems:
Translation Accuracy Approaching Human-Level Interpretation
The system achieves over 70% accuracy in bidirectional Chinese-English translation in complex scenarios like multi-person conferences, and exceeds 80% accuracy in single-person speeches, rivaling professional human interpreters.
Ultra-Low Latency “Listen and Speak” Capability
The translation delay can be as low as 2-3 seconds, a reduction of over 60% compared to traditional machine interpretation systems.
Zero-Shot Voice Cloning
By sampling real-time speech signals, the system can extract voice characteristics and “speak” the foreign language in the speaker’s own voice.
Intelligent Balancing of Translation Quality, Latency, and Speech Output Cadence
The model intelligently adjusts the output pace based on speech clarity, fluency, and complexity, adapting to the nuances of different languages.
Model evaluation results indicate that in speech-to-text simultaneous interpretation tasks, Seed LiveInterpret 2.0 achieved an average human evaluation score of 74.8 for Chinese-English translation quality (assessing translation accuracy, with a maximum score of 100), surpassing the second-ranked benchmark system (47.3 points) by 58%.
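The reported 58% margin follows from simple relative-improvement arithmetic over the two published scores; a minimal sketch (variable names are illustrative, not from the evaluation itself):

```python
# Published human-evaluation scores (max 100) for Chinese-English
# speech-to-text simultaneous interpretation quality.
seed_score = 74.8       # Seed LiveInterpret 2.0
runner_up_score = 47.3  # second-ranked benchmark system

# Relative improvement: how much higher Seed scores versus the runner-up,
# expressed as a percentage of the runner-up's score.
relative_improvement = (seed_score - runner_up_score) / runner_up_score * 100
print(f"{relative_improvement:.1f}%")  # prints "58.1%"
```

This matches the article's "surpassing ... by 58%" framing, i.e. the gap is measured relative to the lower-scoring system rather than as an absolute difference in points.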
In speech-to-speech tasks, where only three systems in the industry support this capability, Seed LiveInterpret 2.0 achieved an average translation quality score of 66.3 for Chinese-English translation (evaluating translation accuracy, speech output latency, speech rate, pronunciation, fluency, and other indicators, with a maximum score of 100), significantly exceeding other benchmark systems and approaching the level of professional human simultaneous interpretation. This marks a pivotal step and raises the bar for future developments in the AI translation space.
Furthermore, most benchmark systems do not support the voice cloning function.
Regarding latency performance, Seed LiveInterpret 2.0 has an average first-word output latency of only 2.21 seconds in speech-to-text scenarios and only 2.53 seconds in speech-to-speech scenarios, striking a balance between translation quality and real-time performance. These results signal a technological advancement that could fundamentally alter the global communications landscape.
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/5530.html