Alibaba Cloud is investing in a class of artificial intelligence that goes beyond the text-centric capabilities of large language models (LLMs) such as those behind OpenAI’s ChatGPT. The focus is shifting towards “world models” that aim to simulate the real world more accurately by incorporating visual, auditory, and even tactile data from physical scenarios and video feeds.
This pivot reflects an inherent limitation of LLMs: they are trained primarily on vast amounts of text. World models, by contrast, promise AI that can better understand and interact with the physical environment.
To capitalize on this trend, Alibaba Cloud has led a 2 billion yuan ($290 million) investment in ShengShu, the startup behind the AI video generation tool Vidu. TAL Education and Baidu Ventures also participated in the round. It follows ShengShu’s earlier Series A funding, in which the company secured over 600 million yuan from investors including Qiming Venture Partners.
ShengShu says it aims to develop a “general world model” that bridges the digital and physical realms by integrating AI-generated video and gaming environments with practical applications in autonomous driving and robotics. The company asserts that a world model built on multimodal data represents how the physical world operates more intuitively and accurately than an LLM does.
Zhu Jun, the founder of ShengShu, said the goal is “connecting perception and action,” so that AI systems can model and predict real-world behavior more consistently. ShengShu’s latest model, Vidu Q3 Pro, was released in January and ranks among the top AI models for text- and image-to-video generation, according to Artificial Analysis. ShengShu’s global launch of Vidu also preceded OpenAI’s wider release of its now-defunct Sora tool for AI video generation, making the Chinese startup an early mover in the field. Other major Chinese tech companies, including Kuaishou and ByteDance, have released competing AI video generation tools.
The investment in ShengShu is part of a broader push by Alibaba into AI startups focused on embodied intelligence and world models. Earlier this year, Alibaba Cloud and Baidu Ventures participated in a $50 million investment in Tripo AI, a platform that uses AI to rapidly generate 3D digital models from photographs. Tripo AI is also shifting its focus away from language-model-centric techniques towards AI tools grounded in physical space, a sign that others in the industry are moving in the same direction.
Alibaba also led a $60 million investment in PixVerse, which earlier this year unveiled an AI world model that lets users direct video generation in real time. Alibaba’s own work in this area includes the release of free, open-source AI models for video generation and a dedicated model for powering robots.
ShengShu has also formed partnerships with companies developing embodied AI (systems such as humanoid robots that are designed to interact with the physical world), which highlights the role of world models in advancing robotics. Experts such as Kevin Kelly, co-founder of Wired, have pointed out that for AI to replicate human intelligence it will need three things: reasoning, an understanding of the physical world, and continuous learning. LLMs have established the knowledge base, and continuous learning remains an open frontier, which leaves world models as the next crucial area for breakthrough innovation.
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/20540.html