🤖Zhipu AI Chief Scientist Tang Jie: Making Machines Think Like Humans
Tang looks back on the evolution of Zhipu AI, the maker of GLM-5.2, and shares his insights on the future of AI.
The viral momentum of Zhipu AI’s latest LLM GLM-5.2 feels quite familiar. GLM-5.2 isn’t even a major generational model but an iterative refinement of the existing GLM-5 model. Yet, it is generating the kind of market buzz that echoes the rise of DeepSeek R1. Early expert consensus suggests the model is already performing on par with Anthropic’s Claude Opus 4.7 in coding. The model release has sent Zhipu AI’s market cap skyrocketing.

GLM-5.2’s breakthrough isn’t just about raw benchmarks. On one side, Washington is tightening its grip on closed-source frontier AI. Just last week, the U.S. government stepped in to halt the release of OpenAI’s upcoming GPT-5.6 model. As Western closed models stall under regulatory pressure, the global market is aggressively hunting for viable, open alternatives.
On the other side, cost is becoming an increasingly important differentiator. The CEO of Snowflake recently said on X their team benchmarked Opus 4.7 against GLM-5.2 and discovered that the latter completed the same level of complex tasks at half the cost.
Adding fuel to the news cycle is a recent public exchange on X. When Elon Musk estimated that it would take until the first quarter of 2027 for Chinese LLMs to catch up with Anthropic’s flagship Fable 5, Tang Jie, co-founder and chief scientist of Zhipu AI, replied: "Won't take that long."
While Zhang Peng steers the company as CEO, insiders know that the soul of Zhipu AI belongs to Tang Jie. Tang is notoriously low-profile, rarely giving interviews and avoiding the media spotlight (though he’s quite active on social media like X and Weibo). This makes his speech in January 2026, right after Zhipu AI’s IPO in Hong Kong, particularly noteworthy. It offers a rare window into his personal journey and his vision for the future of AI.
(The piece was translated via AI with minor edits and proofreading from me. For reference, the original Chinese transcript of his lecture can be found here and here. The images below are credited to Leinews.com)
Making Machines Think Like Humans
Speaker: Tang Jie (Chief Scientist at Zhipu AI; Professor at Tsinghua University)
Today’s event is more of an academic gathering, so we’ve cut out most of the preliminaries and will go straight into the talks.
I myself asked everyone—asked our team—to do without a host this time; we don’t need one. We’re heading into the age of AI, after all, so let’s have AI host. AI can’t quite do that yet, so I’ll host myself first. For the second talk, Kimi can just come straight up; Junyang (Justin Lin, former tech lead of Alibaba Qwen) too. After that comes the panel. Let me begin my talk.
The title of my talk serves two purposes: on one hand, to report on some of the work our foundational lab is doing now, and on the other, to share some ideas with you and some views on the future. My title is “Making Machines Think Like Humans.” Why do I put it this way? Actually, the first time I proposed this title, Academician Zhang Bo objected to me—he said you can’t keep saying you want machines to think like humans. But I added quotation marks, so perhaps now I’m allowed to say it with the quotes.
The Origins and Spirit of Zhipu
We began thinking back in 2019 about whether we could get machines to truly do even a tiny bit of genuine thinking. So in 2019 we spun the work out of Tsinghua as a commercialization of research results, and with the university’s strong support, we founded a company called Zhipu, where I now serve as Chief Scientist. We’ve also open-sourced a great deal—you can see many open-source projects here, with quite a few things related to LLM API calls over on the left.
I’ve been at Tsinghua for about 20 years; I graduated in 2006, so this year marks exactly 20. The thing is, what I’ve been working on all along really boils down to just two things: first, building the AMiner system back in the day; and second, the LLMs I’m working on now.
I’ve always held one view—one that has shaped me considerably—which I call doing things with a “coffee-like” spirit. That actually has a lot to do with one of our guests here today, Professor Yang Qiang. I remember when I’d just graduated and went to HKUST. Anyone who’s been there knows HKUST is essentially one building—the meeting rooms are inside, the classrooms are inside, the labs are inside, the café is inside; people eating, people playing basketball, all in this one building. So we ran into each other often. Once, after bumping into him at the café, I said I’d been drinking a lot of coffee these past couple of days and wondered whether I should cut back, since it might not be good for my health. Professor Yang’s first response was, “Yes, you should cut back.” Then he said—actually, no: if we could get addicted to research the way you’re addicted to coffee, wouldn’t our research turn out wonderfully?
That idea of being “addicted to coffee” struck me deeply, and it has influenced me from 2008 right up to now—namely, that perhaps the way to do things is to stay focused and keep at it. This time I’ve been fortunate to run into this thing called AGI, which is exactly the kind of endeavor that requires long-term investment and long-term commitment. It’s not a quick win—do it today, see it bloom tomorrow, and wrap up the day after. It’s very much a long game, and that’s precisely why it’s worth investing in.
Back in 2019, our lab was doing reasonably well internationally in graph neural networks and knowledge graphs. But at the time we resolutely paused both directions—set them aside for the time being—and everyone pivoted to LLMs; everyone began research related to LLMs. And here we are today, having accomplished a little something.
The Evolution of LLM Intelligence
As you all know, with globalization—this chart is actually from February 2025—across the entire history of LLM development, what we call the level of intelligence has risen dramatically.
In the early days around 2020, we saw some very simple problems like MMLU and QA, which were already quite impressive at the time, and today we can essentially achieve near-perfect scores. Gradually, from those earliest simple problems, we moved into 2021 and 2022, when we started tackling math problems: problems requiring reasoning, where you have to actually do the arithmetic to get them right. Here you can see that through post-training, models gradually filled in these gaps, with their capabilities greatly improved.
Then into 2023 and 2024, you can see the models develop from merely memorizing knowledge, to simple mathematical reasoning, to something more complex: they can even handle graduate-level problems and have begun to answer real-world questions. For instance, on SWE-bench, they’ve already handled many real-world programming problems. At this point you can see the models’ capabilities, their level of intelligence, growing ever more complex, just like a person growing up. At first we read a lot of books in primary school, then gradually do math problems, then move on through junior and senior high to complex, graduate-level reasoning problems. And after graduation, we begin to take on problems from work, harder problems.
This year, you can see, there’s HLE (Humanity’s Last Exam), a task that’s especially hard. If you look into HLE, some questions can’t even be found on Google—something like a specific part of a specific bone of a specific bird somewhere in the world; even Google can’t surface that page, so the model has to generalize it. How is this to be done? There’s no answer yet, but you can see its capabilities climbing rapidly in 2025.
From Scaling to Generalization
On another front, we can look at this notion of “from scaling to generalization.” What does that mean? We humans have always wanted machines to have the ability to generalize—I teach it just a little, and it can draw broad inferences, just like a person. When we teach a child, we always hope that after teaching three problems, the child will get the fourth, the tenth, and even ones we never taught at all. How do we go about this?
To this day, our goal is to use scaling to give models stronger generalization, but even now that generalization still has a long way to go. We’re improving it at different levels.
In the earliest days, we trained a model with Transformers to memorize all knowledge. The more data we trained on and the more compute we used, the stronger its long-term knowledge retention became—meaning it had memorized essentially all the world’s knowledge, with a degree of generalization, able to abstract and do simple reasoning. So when you ask, “What is the capital of China?”, the model doesn’t need to reason—it simply retrieves it from its knowledge base.
The second layer is to align this model and have it reason, giving it more complex reasoning ability and an understanding of our intent. This requires continually scaling SFT (Supervised Fine-Tuning) and even reinforcement learning. Through large volumes of human data feedback, we scale the feedback data, making the model smarter and more accurate.
This year is the breakout year for RLVR (Reinforcement Learning from Verifiable Rewards). Why was this hard before? Because previously we could only rely on human feedback data, which is very noisy and covers a very narrow range of scenarios. But if we have a verifiable environment, the machine can explore on its own, discover its own feedback data, and grow on its own.
The hardest part here, as you can see immediately, is: what does “verifiable” mean? Math may be verifiable, programming may be verifiable, but broader cases may not be easy to verify. Say we build a web page: is it attractive? That needs a human to judge. So the problem we now face with RLVR is this: the verifiable scenarios may gradually be running out. Can we move into semi-automatically verifiable, or even non-verifiable, scenarios to make the model more general? That’s a challenge we face.
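To make “verifiable” concrete, here is a minimal sketch of two automatic reward checks, one for math answers and one for generated code. The function names, the exact-match rule, and the pass-rate scoring are illustrative assumptions, not a description of Zhipu’s actual reward setup.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer equals the reference as an exact rational number."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable output earns no reward
    return 1.0 if same else 0.0


def code_reward(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> float:
    """Return the fraction of test cases the generated function passes."""
    if not cases:
        return 0.0
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: a real system would sandbox this
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not even load earns no reward
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(cases)
```

Both checks run without a human in the loop, which is what lets the model explore and collect its own feedback. A question such as whether a web page is attractive has no comparable check, which is the gap the talk points to.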
Going forward, machines will gradually begin performing real tasks in the physical world. For these real tasks, how do we build the environments for the agents? These are even greater challenges. You can see that over the past few years, AI has been advancing along these several lines—not just simple Transformers; the whole of AI has become a large system, an intelligent system.
From Chat to Doing: A New Paradigm Opens
We’ve moved from mostly STEM-style reasoning—from simple primary, junior, and senior high problems, to more complex GPQA physics/chemistry/biology problems, to harder ones, even Olympiad gold-medal problems—to this year’s HLE, an extremely difficult benchmark for evaluating intelligence, which is now improving rapidly.
On another front, in real-world settings, many people are saying today that coding ability has become especially strong and that models can complete plenty of real code. But in fact, code models already existed in 2021; back then we collaborated a lot with Junyang and Kimi’s Yang Zhilin, and we built many such models. Those coding models could already program, but their coding ability was far inferior to today’s: back then you might write ten programs and get one right, whereas now you might write one program and very often have it just run, even for a quite complex task. Today we’ve already begun using code to help senior engineers complete even more complex tasks.
You might ask: as intelligence keeps growing, can’t we just keep training the model nonstop? Actually, no. You all know what happened in early 2025—DeepSeek came out, and people often describe it as “bursting onto the scene out of nowhere.” I think that phrase fits well; it truly did burst onto the scene. For our research community, for industry, even for many individuals—because no one in academia or industry had anticipated DeepSeek would suddenly appear, and its performance really was strong—it left many people stunned.
Later, in early 2025, we found ourselves pondering a question: perhaps under DeepSeek’s paradigm, the chat era was more or less solved. That is, no matter how well we did, on chat problems we might in the end only match DeepSeek; maybe we could personalize a bit more, make it a chat with emotion, or make it a little more sophisticated. But broadly speaking, this paradigm was probably nearing its ceiling, and what remained was mostly engineering and technical issues.
At that point we faced a choice: in what direction should we push this AI next? Our thinking at the time was that perhaps the new paradigm is enabling everyone to use AI to do something. That might be the next paradigm—it used to be chat, now it’s actually getting things done. So a new paradigm has opened.
Choosing a Technical Path: Thinking + Agentic + Coding
There was another choice to make, because once this paradigm opens, there are many ways to open it. You may recall, at the start of the year, there were two questions: one was simple programming—doing coding, doing agents; the other was using AI to help us do research, something like DeepResearch, even writing a complex research report. These two lines of thinking are rather different, and it came down to a choice. One direction is thinking, with some coding scenarios layered on; the other is interacting with the environment to make the model more interactive and lively. How to do it?
In the end we chose the left-hand path, giving it thinking ability. But we didn’t abandon the right-hand path either. Around July 28th we did something that turned out relatively successful: we integrated coding, agentic, and reasoning capabilities together. Integrating them wasn’t easy. Normally, when people build models, coding is often handled separately: coding stays coding, reasoning stays reasoning, and sometimes math is even kept as math; but this approach tends to sacrifice the other capabilities. So we essentially fused all three so they’d be relatively balanced, and on July 28th we released version 4.5. At the time, across 12 benchmarks covering agentic, reasoning, and code tasks, it turned in a pretty solid result. Among all the models domestically, including Qwen and Kimi here today, we’re all neck and neck, sometimes one ahead, sometimes another; on that particular day, we came out in front.
Challenges and Breakthroughs in Real Environments
But we quickly opened up 4.5 for everyone to use—go ahead and code with it; our capabilities are pretty good now. Since we’d chosen coding and agents, it could handle plenty of programming tasks, so we had it tackle some very complex scenarios. As it turned out, users gave us feedback—for example, that when we tried to have it code a Plants vs. Zombies game, the model couldn’t pull it off.
That’s because real environments are often very complex. This game was auto-generated from a single prompt, the whole thing, fully playable: users can click to score, choose which plants to use and how to fight the zombies as they march in from the right, and everything, including the interface and the back-end logic, is written automatically by the program from a single sentence. Here, 4.5 couldn’t do it in this scenario and produced many bugs. What was going on?
We later discovered that in real programming environments there are many problems—for instance, in an editing environment like this, many issues need solving—and this is exactly where RLVR’s verifiable reinforcement-learning environment comes in. So we gathered a large number of programming environments and used them as reinforcement, plus some SFT data, so the two sides could interact and improve the model’s performance. On another front, we did some work on the Web side too, leveraging Web environments along with feedback and verifiable environments. In short, by exploring through verification, we obtained a very good score on SWE-bench at the time, including some excellent scores recently.
But a benchmark score is just a benchmark score; getting that capability into the main model is a very big challenge. Many people have a benchmark and say their score is high, but actually moving that capability into the main model faces even more challenges, and in terms of real user experience, the results aren’t necessarily good.
Another challenge: with such a large volume of RL tasks, how do you train them all together in a unified way? Different tasks have different lengths, and different time durations. So at the time we developed a fully asynchronous reinforcement-learning training framework. How to get it running asynchronously was part of another framework we open-sourced this year. This greatly improved the agent and coding capabilities, and the end result is our recently released 4.7, which—compared with the earlier 4.6 and 4.5—is vastly improved in agent and coding.
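The idea of asynchronous training over tasks of different lengths can be sketched with Python’s `asyncio`. This is a toy illustration under stated assumptions, not the open-sourced framework mentioned above: `rollout`, `trainer`, and `run` are hypothetical names, and `asyncio.sleep` stands in for real environment and model steps.

```python
import asyncio


async def rollout(task_id: int, steps: int, queue: asyncio.Queue) -> None:
    """Simulate one episode; tasks with more steps take longer to finish."""
    for _ in range(steps):
        await asyncio.sleep(0.001)  # stand-in for one environment/model step
    await queue.put((task_id, steps))


async def trainer(queue: asyncio.Queue, total: int) -> list[int]:
    """Consume trajectories in completion order instead of waiting for stragglers."""
    order = []
    for _ in range(total):
        task_id, _steps = await queue.get()
        order.append(task_id)  # a real trainer would apply an update here
    return order


async def run(task_lengths: dict[int, int]) -> list[int]:
    """Start every rollout concurrently and return the order the trainer saw them."""
    queue: asyncio.Queue = asyncio.Queue()
    workers = [
        asyncio.create_task(rollout(task_id, steps, queue))
        for task_id, steps in task_lengths.items()
    ]
    order = await trainer(queue, len(task_lengths))
    await asyncio.gather(*workers)
    return order
```

Calling `asyncio.run(run({0: 50, 1: 5, 2: 20}))` returns the task ids in the order they finished, so the short task reaches the trainer first. A synchronous batch would instead leave the trainer idle until the 50-step task completed.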
The felt experience matters even more, and here’s why: once you actually release a coding model to the public, what users do with it isn’t quite the same as your benchmark. Today it might be their own program—say, a sorting algorithm running on their data—and what matters is whether it works well, whether it feels good; they’re using that outcome, not how high your score is. So for the real-world scoring, we also conducted detailed evaluations done entirely by humans, recruiting a great many programming experts to evaluate. Of course, there are still unsolved issues and many problems yet to address.
Finally, we integrated these capabilities together, and at the end of 2025 we posted a pretty good score on the Artificial Analysis leaderboard—a respectable result.
Device Use: From Coding to Operating Devices
On another front, as we developed further, you want to truly deploy this at scale in agent environments. You can think of the most fundamental capability of an agent—what is that? It’s programming: once the computer finishes writing the program, it can execute it, equivalent to one or two actions within an agent. But if you want to do something more complex—on the left is the computer use that Claude released, in the middle is Doubao’s phone agent, and on the right is the asynchronous, ultra-long tasks that Manus does.
Suppose you want the machine to do dozens or even hundreds of steps for you. You might say, “Please gather all of today’s discussions about Tsinghua University on Xiaohongshu, and once that’s done, compile everything about so-and-so and generate the relevant document for me.” Here the AI has to monitor Xiaohongshu for a day. It’s automatic and fully asynchronous—you can’t sit there with your phone open watching it; it’s asynchronous, and it’s a very complex task. For such a complex task, in short, the earlier problem becomes a matter of device use—that is, how do we operate across the entire device?
A bigger challenge here—some people say it’s mainly about collecting data. But the bigger problem is that many applications have no data at all; it’s all code, all cold-start. What do you do then? Of course, we’d prefer that with this data we could suddenly generalize outward.
So at first we did indeed collect a large amount of data, thousands of data points, and integrated them, including SFT and reinforcement in specific domains, so it could perform well in certain areas. But you soon find that the original “iPhone use” was all button-tapping designed for a human, whereas the one doing the interacting is often an AI, not a human. We originally treated AI as a person, asking whether AI could operate the phone for us. But if you think about it, this AI doesn’t actually need to operate the phone; it’s more a matter of APIs. Yet you can’t turn the phone into a pure API system without the buttons, so what do you do?
We adopted a hybrid approach, mixing API and GUI together: where it’s AI-friendly, use the API approach; where it’s human-friendly, have the AI simulate a human and perform GUI operations. By integrating the two, we extracted large amounts of data across many environments and ran fully asynchronous reinforcement learning, integrating everything so the AI has a degree of generalization. I keep saying “a degree of” generalization because even today that generalization is still very far short—still nowhere near enough—but it does have some now.
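A minimal sketch of the hybrid idea: try a direct API handler when one exists, and fall back to simulated GUI operations otherwise. The class, the handler registry, and the string results are hypothetical placeholders for illustration, not AutoGLM’s actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HybridExecutor:
    # Hypothetical registry: action name -> direct API handler.
    api_handlers: dict[str, Callable[[dict], str]] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def gui_fallback(self, action: str, args: dict) -> str:
        """Stand-in for simulated taps and typing on the app screen."""
        return f"gui:{action}"

    def execute(self, action: str, args: dict) -> str:
        """Prefer the API path; use the GUI path when no handler is registered."""
        handler = self.api_handlers.get(action)
        result = handler(args) if handler else self.gui_fallback(action, args)
        self.log.append(result)
        return result
```

With a handler registered only for a hypothetical `search_route` action, `execute("search_route", {})` takes the API path while `execute("book_ticket", {})` falls back to the GUI path. The log records which path each step used, which is the kind of trace that could later serve as training data.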
More importantly, how do we overcome the problems that come with cold-start? For instance, if we don’t have enough data, reinforcement learning may lead it into a trap. By the end of the RL process, once it has fully learned, the model can be like someone fixated on a dead end—stubbornly insisting “I’ll do it this way”—and the results veer off course. How do you pull it back? So we inserted an SFT step in the middle: reinforce for a while, then do some SFT, then reinforce a bit more, alternating between the two, giving it a degree of fault tolerance and the ability to be pulled back—turning it into a scalable training algorithm. In the mobile environment, we achieved solid improvements on Android.
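The alternation described here can be written as a simple schedule. This sketch only shows the interleaving pattern; the block lengths are assumptions chosen for illustration, since the talk gives no ratios, and the function name is hypothetical.

```python
from typing import Iterator


def alternating_schedule(total_steps: int, rl_steps: int, sft_steps: int) -> Iterator[str]:
    """Yield "rl" or "sft" for each training step, cycling an RL block then an SFT block."""
    if rl_steps <= 0 or sft_steps <= 0:
        raise ValueError("block lengths must be positive")
    cycle = rl_steps + sft_steps
    for step in range(total_steps):
        # Positions inside the first rl_steps slots of each cycle are RL steps.
        yield "rl" if step % cycle < rl_steps else "sft"
```

A training loop would dispatch each step to an RL update or an SFT update based on the yielded label. The periodic SFT blocks play the role of pulling the policy back when RL has fixated on a dead end.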
On multi-task LLM reinforcement learning, we also did some work. Algorithmically, we mainly used multi-turn reinforcement learning; on the engineering side, it’s essentially scaling—pushing it to ever-larger scale.
Open-Sourcing AutoGLM
Around December last year we open-sourced AutoGLM, releasing everything in it. Note that the model we open-sourced is a 9B model, not a super-large model. The reason is that a 9B model is especially fast in human-machine interaction—its execution speed is very fast. If it were very large, its execution speed would be slow. So we open-sourced a 9B model, and once it was out, it immediately drew over 20,000 stars—more than 10,000 within three days—which is pretty good.
Here’s an example. Say we’re going to Changchun next week for fun—help us summarize some recommended attractions on the current page, then save these few spots on Amap, including checking ticket prices, then go book a 10 a.m. high-speed rail ticket from Beijing to Changchun on 12306, and compile the relevant information for me. In the background this model will execute 40 steps: it calls different apps, opens each one, enters the relevant information, runs the queries and executes them, and once the full 40-step operation is done, hands everything back to you. In effect, the AI does something like a secretary’s job, carrying out the whole thing end to end.
More importantly, across all the Device-use leaderboards—including OSWorld, Browser-use, Mobile-use and related benchmarks—we achieved very good results. You can think of this model as having been trained on a lot of Agent data; we trained a 9B model on a great deal of Agent data, which actually reduced much of its original language and reasoning ability. That is, it’s no longer a purely general model—it may be quite strong on the Agent side but weakened elsewhere. This brings us a new problem: in the future, on ultra-large-scale Agent models, how do we keep this from degrading? That becomes a new question.
2025: A Year of Open-Source GLM and China’s Contribution to Open-Source Models
2025 was also a year of open-source GLM for us. From roughly January through December we open-sourced many models, including language models, agent models, and our multimodal models—GLM-4.6, 4.6V, 4.5V, and others.
More importantly, we can see China’s contribution to open-source models in 2025. Here the blue ones are open-source models and the black ones are closed-source. On Artificial Analysis, the top five in blue are essentially all Chinese models, meaning China has made significant contributions to open-source LLMs. In early 2025, that is, coming out of 2024, the U.S. side still held an absolute advantage in open source, including Meta’s LLaMA. Over the course of a year, China gradually moved into the top five, which is now basically all Chinese models. The chart on the right is the LLM blind-test leaderboard, with results from human evaluation, which I’ve screenshotted in.
A Clear-Eyed View: The Gap May Still Be Widening
The next question: can we keep scaling going forward? What is our next AGI paradigm? We face even more challenges.
We’ve just done some open-sourcing, and some people may feel excited, thinking China’s LLMs seem to have already surpassed the U.S. But the truer answer may be that the gap is still widening—because the U.S. side’s LLMs are mostly still closed-source, while we’ve been playing in open source and pleasing ourselves; the gap hasn’t actually narrowed the way we imagine. We may be doing well in some places, but we still have to acknowledge the challenges and gaps we face.
Looking Ahead: Referencing the Learning Process of Human Cognition
What should we do next? Here I have a few simple thoughts. I think the entire history of LLM development is essentially a reference to the learning process of human cognition. From the earliest LLMs—memorizing all the world’s long-term knowledge—just like a child who first reads books and memorizes all the knowledge, then gradually learns to reason, learns math, and learns more deduction and abstraction.
The same holds for the future. In terms of human cognitive learning, what capabilities lie ahead that LLMs still lack but where humans far exceed them?
First, 2025 may be the year of multimodal adaptation. Why do I say this? Apart from a small handful of models worldwide that suddenly drew a lot of attention, many multimodal models—including ours—haven’t attracted much notice. More people are working on improving textual intelligence. For LLMs, how do we gather multimodal information and perceive it in a unified way—what we often call a natively multimodal model? Thinking about it more, native multimodal models are quite similar to human “sensory integration”: I collect some visual information here, some auditory information, some tactile information, and how do I integrate these into a unified perception of something? Just as our brains sometimes have problems—often when sensory integration is insufficient, sensory integration dysfunction causes issues. For models, how do we build the next multimodal sensory-integration capability?
Second, models’ current memory and continual-learning abilities aren’t yet sufficient. Humans have several tiers of memory: short-term memory, working memory, long-term memory. And I’ve even said in conversations with my students and lab members that a person’s long-term memory doesn’t seem to equal knowledge—why? Because we humans only truly possess knowledge when it’s actually recorded. For me, for instance, if my knowledge can’t be recorded on Wikipedia, then 100 years from now I’ll have vanished, contributing nothing to the world; it doesn’t quite count as knowledge—and in the future, when training a human-scale model, my knowledge would be useless, all of it noise. How do we move our entire memory system from a single person’s three tiers to a fourth tier that records all of humanity? Building out this entire memory system is something we humans must construct for LLMs in the future.
Finally, reflection and self-awareness. Models already have a degree of reflective ability now, but future self-awareness is a very hard problem. Many people doubt whether LLMs can have self-awareness. Among those present are many experts from foundational model labs—some support the idea, some oppose it. I lean somewhat toward supporting it; I think it’s possible, and it’s worth exploring.
System 1 and System 2
Human cognition is a dual system: System 1 and System 2.
System 1 handles 95% of tasks. For example, when someone asks, “What is the capital of China?”, your answer comes from System 1, because you’ve memorized it. Or if asked, “Are you eating dinner tonight?” and you say “yes”—also System 1; these are all memorized in System 1. Only more complex reasoning problems—say, “Tonight I want to treat a friend from Sichuan to a big feast; where should we go?”—become System 2. Then you have to consider where this Sichuan friend is from, and where to go for a big feast; that’s the work of System 2. In daily life, System 2 accounts for only 5%.
The same logic applies to LLMs. Back in 2020 we drew a diagram like this; we said: what should an AI system modeled on humans look like? It should have a human-like System 1, a human-like System 2, and self-learning as well.
Why did we think of self-learning back then? My thinking was this: first, System 1 can be built as an LLM that answers based on matching, solving System 1 problems; System 2 can add some knowledge fusion, such as instruction fine-tuning and chains of thought; and third—for those who’ve studied cognition—the human brain learns unconsciously while we sleep at night, and if a person never sleeps at night, they won’t get smarter. So back in 2020 we said there must be an AI self-learning mechanism and self-learning chain of thought, but we didn’t know how it would learn—we just put the question out there first.
For System 1, we keep scaling. If we keep scaling data, this raises the upper bound of intelligence. At the same time we’re also scaling inference, so the longer the machine thinks, the more compute and search it uses to find more accurate solutions. The third aspect is that we’re scaling the self-learning environment, giving the machine more opportunities to interact with the outside world and obtain more feedback.
So through these three kinds of scaling, we can let the machine model human learning paradigms and gain more opportunities to learn.
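The second kind of scaling, spending more compute at inference time, can be illustrated with best-of-N sampling: draw several candidates and keep the one a scorer rates highest. This is one common form of inference-time scaling, swapped in here as an illustration rather than the method used in GLM. The toy task (guessing the square root of 2) and all names are assumptions.

```python
import random


def propose(rng: random.Random) -> float:
    """Toy stand-in for sampling one candidate answer from a model."""
    return rng.uniform(0.0, 2.0)


def score(x: float) -> float:
    """Toy verifier: higher is better, best when x * x equals 2."""
    return -abs(x * x - 2.0)


def best_of_n(n: int, seed: int = 0) -> float:
    """Sample n candidates from a seeded stream and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)
```

Because the candidates come from the same seeded stream, `best_of_n(64)` searches a superset of what `best_of_n(1)` sees, so its score can never be worse. More thinking buys a better answer only to the extent that the scorer can tell good candidates from bad ones.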
The Challenges of Transformers and New Architectures
For System 1: now that we have Transformers, does it mean we just keep adding data and we’re done—just add bigger parameters and we’re done? If 30T isn’t enough, then 50T? If 50T isn’t enough, then 100T, and finally add parameters from 100B to 1T to 3T to 5T or even larger.
But now we face another problem. The Transformer’s computational complexity is O(N²) in the context length, so as we increase the context length, memory usage grows and inference efficiency gets worse and worse, raising many issues. Recently there have been some new types of models, including some linear models that try to use linear methods, modeled on the human brain, which stores more knowledge with a smaller capacity. An even more fundamental question is whether smaller models are possible at all, because the original Transformer kept getting bigger the more it trained; early on, when we discussed it, we never said we had to make models smaller, and getting larger came first.
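A quick calculation shows what quadratic growth means in practice. The sketch below counts the bytes needed to hold the full N by N attention score matrix for one head in one layer, assuming 2 bytes per value (16-bit floats). The function name and defaults are illustrative assumptions.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed for the full seq_len x seq_len attention score matrix in one layer."""
    return num_heads * seq_len * seq_len * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = attention_score_bytes(n) / 2**30
    print(f"N={n:>7,}: {gib:.3f} GiB per head per layer")
```

Multiplying the context length by 10 multiplies this figure by 100. Fused attention kernels can avoid storing the whole matrix at once, but the number of pairwise scores to compute still grows quadratically, which is the motivation for the linear alternatives mentioned above.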
But recently I’ve also been reflecting: can we find better methods of knowledge compression, compressing knowledge into a smaller space? This is a new problem.
There are two problems here: first, is there an engineering solution? Second, is there a methodological solution? So recently, many people are exploring whether LLMs may need to return to research, rather than simply scaling as before. Scaling is a great method, but scaling may be the easiest method—it’s our human way of being lazy. We simply scale up, and that’s the lazy approach. But for a more fundamental method, perhaps we need to find something new.
The second point is a new scaling paradigm. Scaling may be a very important path, but how do we find a new paradigm that gives the machine opportunities to scale? Reading is one opportunity; conversing with people is another. We need to find a new way for the machine to scale independently. Some will say we increase the data—but increasing data is something we humans impose on it. The machine must find its own way through, define its own reward functions, define its own interaction methods and even training tasks to do the scaling—that’s the work of System 2.
More importantly, once we have those two, we still have to complete even more ultra-long tasks in real-world scenarios. How do we do that? We need the machine to plan like a human—do a bit, check, then give feedback; humans work this way. Can the machine possibly do the same? How does it complete an ultra-long task?
For example, this year we’ve already made a little progress toward producing a paper. At the start of the year I told my team members that by year’s end you must write me a paper, but it didn’t happen; it didn’t get done in the end. In any case, by now, as you know, some papers have begun to attempt this: the idea is model-generated, the experiments are model-run, the report is model-written, and you can ultimately submit it to a workshop, though in fact it still hasn’t been fully achieved. This gives a real example of a task in an ultra-long environment. On this basis, we hope to define what future AI will look like; these are some of our thoughts.
The Five Levels of Intelligence
Before LLMs, most machine learning meant learning a function F that maps X to Y: given a sample X, F(X) should produce Y. After LLMs arrived, we turned this into a mapping from X to X (maybe not strictly X), using fully self-supervised learning for multi-task self-learning.
At the second level, after adding this data, we have these models learn how to reason, how to activate the underlying intelligence.
Further on, we’re teaching the machine to have self-reflection and self-learning abilities, so that through continual self-criticism it can learn which things it should do and which it can do better.
In the future, we’ll also teach the machine to learn more—for instance, to learn self-awareness, so that the machine can, say, self-explain the large amount of content AI generates: why I generate this content, what I am, what my goals are. And ultimately, perhaps one day, AI will also have consciousness.
We define roughly five levels of thinking along these lines.
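The first shift described above, from learning X to Y to learning X to X, can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function names and toy data are hypothetical and do not come from the talk, and next-token prediction is used here as one common form of self-supervised learning, not necessarily the exact setup Tang has in mind.

```python
from typing import List, Tuple


def supervised_pairs(samples: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """X -> Y: each input is paired with a label that a human supplied."""
    return list(zip(samples, labels))


def self_supervised_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """X -> X: each prefix of the text is an input; the next token is its target.

    No external labels are needed because the targets come from the data itself.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    print(supervised_pairs(["great movie", "dull plot"], ["positive", "negative"]))
    print(self_supervised_pairs("machines learn from raw text".split()))
```

In the second function, the training signal is built from the text alone, which is why this style of learning can cover many tasks without a separate labeled dataset for each one.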
The Three Core Capabilities of Computers
From the computer’s perspective, a computer wouldn’t define things so elaborately. As I see it, a computer has three capabilities:
First, representation and computation. It represents data and can compute on it.
Second, programming. Programming is the only way a computer interacts with the outside world.
Third, fundamentally, search.
Now layer these capabilities together. First, representation and computation give the computer storage ability far beyond that of humans. Second, programming lets it produce logic more complex than humans can. Third, search lets it do things faster than humans. Layering these three capabilities together may produce what’s called “superintelligence,” perhaps surpassing some human capabilities.
AGI-Next 30: A Vision for the Next 30 Years
I suddenly recall 2019. This PPT actually dates back to a collaboration with Alibaba, when they asked me to give a single slide; what I gave was this one slide—AGI-Next 30—on what we should do over the next 30 years.
I’ve screenshotted this diagram—Next AI. Back in 2019, we said that over the next 30 years, we should make machines capable of reasoning, of memory, and of consciousness. We’ve now achieved a certain degree of reasoning ability—there’s probably some consensus on that. We have part of the memory ability, but consciousness is not yet there; that’s what we’re working toward.
Going forward, we’re also reflecting: if we take human cognition as a reference, future AI may grapple with what “I” is and why it is “I,” with building a system of meaning for the model, and with the goals of a single agent and of an entire population of agents, so that we can realize the exploration of the unknown.
Some may say this is utterly impossible, but remember: humanity’s ultimate purpose is our ceaseless exploration of unknown knowledge. The very things we deem impossible may be precisely what we must explore on the road to AGI.
Outlook for 2026
For 2026, what matters more to me is to stay focused and do some genuinely new things.
First, scaling. We’ll likely keep doing it, but there’s scaling the known—continually adding data and pushing the upper limit—and there’s scaling the unknown, which is the new paradigm we don’t yet know.
Second, technical innovation. We’ll pursue entirely new model-architecture innovations to solve ultra-long context, along with more efficient knowledge compression. And we’ll work toward knowledge memory and continual learning—these two together may be an opportunity for the machine to become a little bit more capable than humans.
Third, multimodal sensory integration, a hot spot and focus this year. Only with this capability can AI take on long-horizon tasks inside machines and inside our work environments, on phones and on computers. And once it completes our long tasks, AI becomes a kind of job role: AI becomes like us, able to help us get things done. Only then can AI achieve embodiment, and only then can it enter the physical world.
I believe this year may be a breakout year for AI for Science, because with so many capabilities greatly improved, we can do far more.
