🪩Alibaba is Bankrolling China’s AI Video Race—Then Racing Against It
Two Chinese video AI startups just raised roughly $300 million each. Their biggest investor is also their biggest rival.
Within four weeks this spring, two Chinese AI startups each raised roughly $300 million to build video generation technology.
PixVerse’s developer AISphere closed a $300 million Series C in March led by CDH Investments. Days later, ShengShu Technology, the Beijing-based startup behind the Vidu video generator, secured a $290 million Series B led by Alibaba Cloud.
Both companies are betting that video generation is not the destination: it is the on-ramp to world models, which many believe are the next paradigm in AI. And both, in different ways, are entangled with the same backer and competitor: Alibaba.
AISphere and ShengShu Technology
AISphere and its PixVerse platform are the more consumer-facing of the two. Founded in April 2023 by Wang Changhu, a veteran of Microsoft Research Asia and ByteDance, AISphere launched PixVerse to global users in January 2024. The platform lets creators generate videos from text prompts or images, and it caught fire through viral templates, including a “Venom transformation” effect that drew over one billion views in late 2024.
By the time of its Series C, PixVerse had surpassed 100 million users across 177 countries, with 16 million monthly active users and over $40 million in annual recurring revenue, according to the company.
The company calls PixVerse the “Canva for video generation.” Canva won by making design simple enough that non-designers stopped needing designers. PixVerse is making the same bet on video. Its latest C1 model claims film-grade quality and the ability to turn storyboards directly into video, and PixVerse V5.6 ranks among the top ten video generators on the Artificial Analysis leaderboard.
The outputs lean toward animation rather than photorealism, but the motion quality and scene consistency are genuinely impressive. Some of its best generations deliver something harder to quantify: a creative and artistic quality that catches you off guard.
Its more ambitious model is R1, a real-time interactive world model launched in January 2026. According to the company, users can issue commands during video playback, such as changing the lighting, replacing the background, or redirecting a character, with a response latency of around two seconds and output in 1080p.
After R1 launched, the company’s co-founder said the majority of inbound enterprise interest came from the gaming industry. The investor composition of AISphere’s Series C also reflects this: the round included Ruyi Holdings, a film and TV content company, and 37 Interactive Entertainment, a games company.
Vidu, by contrast, is playing a more technical and enterprise-focused game. ShengShu was founded in March 2023 by Zhu Jun, a Tsinghua University professor who serves as its chief scientist. The company launched Vidu globally before OpenAI made its now-shuttered Sora widely available. Its latest model, Vidu Q3 Pro, supports up to 16 seconds of synchronized audio and video generation with multi-shot composition and camera control. The company reported more than tenfold growth in both users and revenue in 2025, though it declined to share specific figures.
Where AISphere’s R1 targets the intersection of video and gaming, ShengShu’s goals are more grounded in embodied intelligence. The company has developed Motus, an embodied AI model designed to enable robots to perform actions. The $290 million will fund a general world model that bridges generated video with real-world use cases like industrial automation and robotics. Founder Zhu Jun has described the goal as connecting perception and action, building AI that can model and predict real-world behavior consistently.
In my view, AISphere loosely resembles MiniMax in its consumer-first, product-driven growth strategy, while ShengShu more closely mirrors Zhipu in its academic origins and enterprise focus.
What is a world model, and why now?
Both AISphere and ShengShu are betting on world models, though they are applying the idea toward different ends.
The simplest way to understand a world model is by contrast with what an LLM does. An LLM predicts the next token. A world model predicts what happens next in the world: how objects move, how physics behaves, how cause and effect unfold over time. This is why an AI-generated video clip can still let a dog’s collar disappear when it runs behind a couch, or turn a love seat into a sofa mid-shot. The model has no stable internal representation of the scene. It is guessing frame by frame what is statistically plausible.
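To make the contrast concrete, here is a toy sketch in Python. Every name in it is hypothetical and invented for illustration, not any lab’s actual API: an LLM-style step maps text to a plausible next token, while a world-model-style step advances a persistent representation of the scene through time, so object properties like the dog’s collar survive occlusion.

```python
from dataclasses import dataclass, replace

# Toy illustration only: real LLMs and world models are large neural networks,
# not lookup tables or hand-written physics rules.

def llm_next_token(context: list[str]) -> str:
    """LLM-style step: pick a statistically plausible next token from the text so far."""
    bigrams = {"the": "dog", "dog": "runs", "runs": "behind"}
    return bigrams.get(context[-1], "<unk>")

@dataclass
class SceneState:
    dog_x: float          # dog's position in the scene
    dog_has_collar: bool  # a property of the object, not of any single frame
    behind_couch: bool    # whether the dog is currently occluded

def world_model_step(state: SceneState, dt: float) -> SceneState:
    """World-model-style step: advance a persistent scene state through time.
    The collar survives occlusion because the object is tracked, never re-guessed per frame."""
    new_x = state.dog_x + 1.0 * dt
    return replace(state, dog_x=new_x, behind_couch=2.0 < new_x < 4.0)

if __name__ == "__main__":
    print(llm_next_token(["the", "dog"]))  # -> "runs"
    state = SceneState(dog_x=0.0, dog_has_collar=True, behind_couch=False)
    for _ in range(5):
        state = world_model_step(state, dt=1.0)
        print(state)  # dog_has_collar stays True even while behind the couch
```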
The world model thesis, that video is the training ground for AI that understands physical reality, has attracted serious believers. Yann LeCun left Meta to pursue it. Google DeepMind’s Genie 3 simulates real-time 3D worlds. Nvidia’s Cosmos platform trained on 20 million hours of real-world data to support physical AI development. Companies that have already spent years building video generation infrastructure are naturally positioned to make this leap.
Alibaba invests in everyone, competes against everyone
Here is where the story gets interesting. The lead investor in ShengShu’s $290 million round was Alibaba Cloud. Alibaba also led AISphere’s $60 million Series B in September 2025. That makes Alibaba the principal backer of the two best-funded video AI startups in China.
However, the e-commerce and cloud giant is simultaneously building and deploying its own video models internally. In the past week, the company’s newly formed AI unit Alibaba Token Hub, or ATH, released a video model called HappyHorse-1.0. The model debuted anonymously on the Artificial Analysis benchmark, climbed to the top of both text-to-video and image-to-video rankings, and triggered a wave of speculation about its origins before Alibaba confirmed its ownership.
The competition here is essentially Alibaba versus ByteDance. ByteDance’s Seedance 2.0 had been a dominant model on the video AI leaderboards; HappyHorse was the first model to challenge and displace it. My early read is that HappyHorse’s photorealistic visual output is a clear step forward, though Seedance 2.0 keeps an edge in audio-visual consistency and multi-shot camera control. Against the AI video startups Alibaba backs, however, HappyHorse outperforms both Vidu and PixVerse on the benchmarks.
This mirrors the Chinese LLM ecosystem, where Alibaba’s open-source Qwen family has consistently competed with ByteDance’s Seed series for developer mindshare even as Alibaba backs leading LLM startups, Zhipu AI, MiniMax, and Moonshot AI among them. In both cases, Alibaba’s strategy combines internal model development with external investment in the most promising labs, a dual-track approach that keeps it relevant at every layer of the stack regardless of which specific model wins.
Alibaba Cloud’s revenue grew 36% last quarter, driven by AI workloads, according to the company’s Q3 FY2026 earnings call in March. Video generation is among the most compute-intensive workloads. Every inference request, every API call from a PixVerse or Vidu user represents GPU-hours that, if those companies run on Alibaba Cloud, translate directly into revenue.
The US-China race in video generation
Zoom out to the global AI video landscape: Runway raised $315 million in February 2026 at a $5.3 billion valuation. AISphere raised $300 million in March at a reported $1 billion-plus valuation. ShengShu raised $290 million in April at an undisclosed valuation. Three companies from the US and China, three rounds within eight weeks, all within $25 million of each other in size.
In LLMs, the gap between Chinese and US companies has been real, shaped by the first-mover advantages that OpenAI and Anthropic built through years of foundational research, by US export restrictions on advanced chips, and by differing receptiveness to subscription-based services. Video generation is a different story.
Chinese labs entered this race at the same time as Western ones, and in some cases earlier. Vidu launched globally before Sora was widely available. Kling AI from Kuaishou has been consistently competitive on benchmarks; it generated $150 million in full-year 2025 revenue and crossed $300 million in ARR by January 2026.
Part of what drives this is structural advantage. China has the world’s most sophisticated short-video ecosystem, with hundreds of millions of daily active users who have been consuming, producing, and sharing short-form video for nearly a decade. That is both a massive training data advantage and a ready consumer base that is uniquely receptive to AI video tools. The regulatory environment helps too. Compared with the US and Europe, China’s looser approach to IP and copyright creates more room to train on existing content and generate derivative work without the legal friction that has slowed some Western labs.
The field is also growing more specialized. A year or two ago, frontier LLM companies on both sides, such as OpenAI, Zhipu AI, and MiniMax, were hedging their bets by developing video generation models alongside their core LLMs. Most have since pulled back to concentrate on what they do best. The companies that stayed in video have sharpened considerably.
What remains to be seen is whether world models become the unifying layer that both sides have been racing toward, or whether the video and physical AI tracks diverge further, producing different winners for different applications.
Either way, China’s video AI companies have earned a seat at that table.

"In LLMs, the gap between Chinese and US companies has been real, shaped by the first-mover advantages that OpenAI and Anthropic built through years of foundational research, by US export restrictions on advanced chips, and by differing receptiveness to subscription-based services. Video generation is a different story. Chinese labs entered this race at the same time as Western ones, and in some cases earlier."
It’s true. Sora is just a tool; there’s no way for YouTube or Instagram to optimize its feed around a Sora-generated video. In fact, it’s usually the opposite: TikTok and Instagram often de-prioritize AI videos, and YouTube doesn’t actively promote them either. But Kling AI and Jimeng are backed by Kuaishou and ByteDance, which can tune the recommendation algorithms on their own platforms to favor AI-generated content. So the ROI is huge.
And as for China’s looser IP environment: it does pose a significant challenge for artists, but it has also helped AI startups make remarkable progress in a short time, even posing a threat to their US counterparts. It really depends on the angle from which you look at it.
And we can’t discuss AI video in China without bringing up vertical short dramas. In fact, AI-generated video has already become a major force in vertical short-drama production.
You also forgot MiniMax’s Hailuo.