CNBC AI News, July 24th — ByteDance today announced the official launch of its end-to-end simultaneous interpretation model, Seed LiveInterpret 2.0.
According to ByteDance, this marks a significant leap forward in AI-powered translation. This system boasts extremely low latency while maintaining state-of-the-art (SOTA) translation quality in Chinese-English simultaneous interpretation.
The model is built upon a full-duplex end-to-end speech generation and understanding framework, facilitating bidirectional Chinese-English translation.
It processes multi-person speech input in real-time, mirroring the “listen and speak” capability of human interpreters with minimal delay. The system can simultaneously receive source language audio and output translated speech in the target language.
Furthermore, Seed LiveInterpret 2.0 supports zero-shot voice cloning, promising more natural and seamless communication.
Currently, the model primarily focuses on Chinese-English translation.
Here’s how Seed LiveInterpret 2.0 sets itself apart from traditional machine interpretation systems:
Translation Accuracy Approaching Human-Level Interpretation
The system achieves over 70% accuracy in bidirectional Chinese-English translation in complex scenarios like multi-person conferences, and exceeds 80% accuracy in single-person speeches, rivaling professional human interpreters.
Ultra-Low Latency “Listen and Speak” Capability
The translation delay can be as low as 2-3 seconds, a reduction of over 60% compared to traditional machine interpretation systems.
Zero-Shot Voice Cloning
By sampling real-time speech signals, the system can extract voice characteristics and “speak” the foreign language in the speaker’s own voice.
Intelligent Balancing of Translation Quality, Latency, and Speech Output Cadence
The model intelligently adjusts the output pace based on speech clarity, fluency, and complexity, adapting to the nuances of different languages.
Model evaluation results indicate that in speech-to-text simultaneous interpretation tasks, Seed LiveInterpret 2.0 achieved an average human evaluation score of 74.8 for Chinese-English translation quality (assessing translation accuracy, with a maximum score of 100), surpassing the second-ranked benchmark system (47.3 points) by 58%.
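The reported 58% margin follows from simple relative-improvement arithmetic over the two published scores; a minimal sketch (variable names are illustrative, not from the evaluation itself):

```python
# Published human-evaluation scores (max 100) for Chinese-English
# speech-to-text simultaneous interpretation quality.
seed_score = 74.8       # Seed LiveInterpret 2.0
runner_up_score = 47.3  # second-ranked benchmark system

# Relative improvement: how much higher Seed scores versus the runner-up,
# expressed as a percentage of the runner-up's score.
relative_improvement = (seed_score - runner_up_score) / runner_up_score * 100
print(f"{relative_improvement:.1f}%")  # prints "58.1%"
```

This matches the article's "surpassing ... by 58%" framing, i.e. the gap is measured relative to the lower-scoring system rather than as an absolute difference in points.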
In speech-to-speech tasks, where only three systems in the industry support this capability, Seed LiveInterpret 2.0 achieved an average translation quality score of 66.3 for Chinese-English translation (evaluating translation accuracy, speech output latency, speech rate, pronunciation, fluency, and other indicators, with a maximum score of 100), significantly exceeding other benchmark systems and approaching the level of professional human simultaneous interpretation. This marks a pivotal step and raises the bar for future developments in the AI translation space.
Furthermore, most benchmark systems do not support the voice cloning function.
Regarding latency performance, Seed LiveInterpret 2.0 has an average first-word output latency of only 2.21 seconds in speech-to-text scenarios and only 2.53 seconds in speech-to-speech scenarios, striking a balance between translation quality and real-time performance. These results signal a technological advancement that could fundamentally alter the global communications landscape.
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/5530.html