Note: This post was co-written with Gemini 1.5 Pro. If you’re a writer and aren’t using these models for editing, you’re missing out! You can try stuff out here.
Introduction
There’s been so much excitement, energy and movement around LLMs that I’ve certainly found it somewhat disorienting. Social media allows for the rapid dissemination of information, so that’s been a factor. However, LLMs also represent a genuinely profound inflection point in how we participate in and relate to technology. It’s likely many people feel that intuitively, if not consciously.
So everyone’s been wondering - what’s going to happen next?
Before we can understand where we’re going, we need to understand where we currently are. Unfortunately, there’s a lot of confusion and noise that obscures such analysis.
Most of what we now call the “frontier labs” (OpenAI, Anthropic, Google, Meta, etc.) have stopped publishing meaningful details about their modeling. This makes it difficult to understand current model capabilities, and it leads to baseless and unsophisticated speculation about where the modeling is heading. For example, all the memes/tweets about “What did Ilya see?!?!?”.
The gold rush has encouraged a lot of new entrants who might not have a lot of context about what has come before.
There’s a seemingly endless store of private capital that’s getting deployed. This creates incentives for fundraisers to provide distorted narratives.
Humans (including myself) are generally bad at dealing with nebulosity. Most people (again, including myself) find it uncomfortable. People often prefer to proactively and perhaps prematurely collapse the nebulosity rather than sitting in that discomfort, waiting to watch reality unfold. This technology induces a very real dissonance where on the one hand it seems quite revolutionary, but on the other seems too early. This is exacerbated by the models having a “jagged frontier” of capabilities.
I’ve been thinking about a lot of this stuff for a while. A lot of concerns are quite interconnected. My project over the next few months is to attempt to straighten them out. Rather than viewing this post as The Final Answer, I’d invite you to view it as my first “sketch”. It’s quite vague and hand-wavy in some places. As I continue pulling these threads over the coming months, I might often change my line of argumentation. This document is also quite incomplete. There are many things it doesn’t cover, and this post was already getting quite long.
A reflection on the last three years
It’s frankly astonishing how much things have changed in the last three years. I’ve spent most of my time on code generation. MBPP and HumanEval are coding benchmarks that models used to struggle with; they’ve now been eclipsed in complexity by the likes of SWE-Bench. The SWE-Bench leaderboard shows that we’ve gone from <5% of issues resolved in late 2023 to ~48% in late 2024. Again, we can have a separate conversation about whether SWE-Bench is “complex enough” to capture the real activities of software engineers. But it’s undeniable that it’s at a very different level of qualitative complexity compared to MBPP/HumanEval.
The models have been rapidly improving along every other dimension too, not just coding. It’s easy to forget that the original PaLM model only had a context length of 2048 tokens, while Gemini 1.5 has a context length of 2M tokens. The original PaLM paper was published in April 2022, and Gemini 1.5 was released in early 2024. Ditto for factuality, instruction following, latency, etc.
There’s been some argumentation on social media about whether or when we’ll hit a “wall”. FWIW I haven’t seen a clear consensus on what this “wall” means in very concrete terms. Most technological improvements seem to follow sigmoids rather than unbounded exponentials. But my gut says that this is just the beginning of this specific arc. Inference will continue to become radically cheaper, model performance will continue to improve and latency will continue to fall. I tend to look at the curves, rather than any specific point along the curve. I personally know so many smart and kind people working in this space. I’m extremely optimistic that this is just the beginning. The vibes seem good, and I’m an extremely vibes-driven human being.
Moreover, the rest of this document is built on the premise that the models will continue to rapidly improve, although I’m not sure what the overall gradient of the curve will be, or whether “harder” benchmarks like SWE-Bench will get solved in one, five or ten years. Even if this is essentially as good as the models get for the next ~10 years, the world nevertheless ends up looking like a very different place than the one pre-ChatGPT. I may write a separate post on what that world might look like, but it’s not my focus right now.
The APIs seem largely undifferentiated
The LLM APIs seem to be a largely undifferentiated commodity. It’s true that some models seem better at some things than other models. But I’m not sure if that’s currently very meaningful.
The other participants in the market can rapidly replicate the modeling gains they see elsewhere; for example, look at o1 and DeepSeek. It seems that the folks at the frontier labs (and in the broader global ML community) are drinking from the same overall “soup” of ideas. For example, I don’t see anyone attempting to scale up training using alternatives to backprop (not that I’m claiming that’s a good idea!). The moment a given group releases an amazing novel capability, the mere existence proof seems enough to convince the others to search in that direction. Given this homogeneity in the space of ideas, it’s not very “hard” to replicate those results, sans some delta in performance, cost, etc., in the sense that a well-funded and well-run lab can usually figure it out after some delay.
o1’s inference-time scaling was rapidly reproduced by others, like Gemini’s thinking model and the latest DeepSeek model. IIRC Claude was the first to offer a 100k context window; that’s since been reproduced by 4o and far surpassed by Gemini’s 2M window. It’s also telling that many providers now expose OpenAI-compatible APIs in their SDKs.
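To make the commoditization concrete, here’s a minimal sketch of what this standardization looks like in practice. The base URLs and model names below are illustrative rather than authoritative, and you’d need valid API keys for each provider, but the point is that switching providers is often just a matter of changing a couple of strings:

```python
# A minimal sketch: the same OpenAI-style client can talk to several providers
# that expose OpenAI-compatible endpoints. Base URLs and model names are
# illustrative; check each provider's docs before relying on them.
from openai import OpenAI

PROVIDERS = {
    "openai":   {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
    "deepseek": {"base_url": "https://api.deepseek.com",  "model": "deepseek-chat"},
    # Many other providers advertise similar OpenAI-compatible endpoints.
}

def ask(provider: str, api_key: str, prompt: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=api_key)
    response = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Switching providers is a one-line change at the call site:
# ask("openai", OPENAI_KEY, "Summarize this ticket...")
# ask("deepseek", DEEPSEEK_KEY, "Summarize this ticket...")
```

That this sketch is even possible is itself evidence of how interchangeable the underlying APIs have become.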
Additionally, the underlying dynamics of a typical cloud business likely encourage a certain level of standardization and commoditization. Customers don’t want to get locked into some idiosyncratic cloud offering unless it’s orders of magnitude better than the next best alternative. The most serious customers have a strong incentive to construct rigorous quantitative evals, and the frontier labs orient around and advertise a small number of academic evals. Put together, this likely places substantial constraints on model behaviour and leads to a sort of regression to the mean, i.e. a standardization of behaviour across models.
Has this happened before in history?
It’s always dangerous to make such comparisons. But the current moment seems comparable to the late 90s with respect to the growth of the internet.
We’re very early, and we haven’t seen anything yet. In this metaphor, we’re currently quibbling about sending payloads across the internet. But I’m pretty sure that Mosaic and Yahoo haven’t been invented yet.
Moreover, because of ChatGPT, the world’s gotten locked into a framing of LLMs as “chatbots”. But these models are much more profound than that. They’re truly general “anything machines” that can follow human instructions. We’re just beginning to see novel use-cases like this one or this one, which push the boundaries of the sorts of products that can be built as inference becomes cheaper. There are many more examples where those came from. We’re unequivocally at the very beginning of all this.
Rapidly decreasing costs will unlock people’s imagination. Making prototypes during the PaLM era involved substantial toil in crafting the most efficient context window. Inference was expensive, so one tried to be very judicious with what was put in there. As inference costs have fallen by orders of magnitude, research projects like
https://www.docetl.org/
have suddenly become much more feasible. DocETL treats the underlying inference API call as the commodity that it truly is, and imagines a robust system built on top of it. It explores what I consider a “horizontal” scaling of inference rather than the “vertical” scaling of o1, etc. We’re unequivocally at the very beginning of all this.
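To illustrate what I mean by “horizontal” scaling, here’s a minimal sketch (not DocETL’s actual API, just the underlying idea): treat each LLM call as a cheap, fungible unit of work and fan a simple extraction prompt out over a large pile of documents. The model name and prompt are placeholders.

```python
# A minimal sketch of "horizontal" inference scaling: fan one cheap prompt out
# over many documents, rather than spending more compute on a single query.
# This is not DocETL's API; it's just the underlying idea. Model name is a placeholder.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_key_points(doc: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"List the key claims made in this document:\n\n{doc}",
        }],
    )
    return response.choices[0].message.content

def map_over_corpus(docs: list[str], max_workers: int = 16) -> list[str]:
    # Each call is independent, so the corpus can be processed in parallel;
    # as per-token costs fall, the economics of this kind of pipeline improve.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_key_points, docs))
```

A few years ago, the cost of the calls inside that loop would have made this design a non-starter; today it’s mostly a question of systems engineering.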
Tools like Replit make it trivial for people to build simple webapps and host them frictionlessly, largely driven via prompting. Similar to how “non-serious” hobbyists started hosting their own personal webpages, I suspect that “non-serious” hobbyists will start hosting their own personal SaaS apps.
At the dawn of the internet, sending a payload of bytes across a network was quite expensive. As infrastructure became cheaper, the field of “networking” evolved into the field of “systems”, spanning more sophisticated capabilities like server software, large-scale databases, etc. This article from Jeff Atwood about Google’s history seems particularly relevant here. Instead of buying expensive high-end servers from Sun Microsystems, Google figured out how to run their systems on commodity off-the-shelf hardware, designing novel software to work around its underlying unreliability. Perhaps something similar will happen with LLMs too. As inference costs become “too cheap to meter”, it’ll be easier to compose systems on top of plentiful inference. For example, will we end up with something like another layer to the OSI model involving LLMs? I don’t know. But we’re unequivocally at the very beginning of all this.
Open vs closed models
Again, I won’t pretend to provide a full account of what’s driving the dynamics that are currently commoditizing LLM inference. This section is a series of observations that I plan to investigate more deeply over time.
A specific dataset, learning algorithm, optimizer and architecture seem like very strong constraints on the set of functions that one can learn. Assuming those are fixed, model performance seems to be a function of how well the experiments are designed and the amount of compute available to run lots of sweeps. Therefore, if there isn’t some defensive moat around those underlying ingredients, it’s natural for the models to become undifferentiated commodities. Again, I’m oversimplifying, because there’s a lot of work that goes into getting those inputs right. But there are a lot of smart people out there who can evidently figure it out.
With all that said, a missing variable is time. Even with those raw ingredients, the human capital to effectively use them isn’t always uniformly distributed. It seemed pretty clear that all the LLM providers would end up in this sort of race to the bottom dynamic eventually. But I wasn’t sure how fast it would happen. And I certainly didn’t anticipate Zuck regularly YOLO’ing releases of strong llama checkpoints every few quarters.
Multiple labs seem to have developed models comparable to the original GPT-4. The best OSS models are perhaps comparable with the original GPT-4, while the overall SOTA has moved extremely far beyond it. This competitiveness of the OSS models creates a very interesting dynamic.
It means that every model provider now has to compete with “free”. The open models aren’t actually free, since you still have to pay for the accelerators to do inference, and you have to put in the work to host them on whichever cloud provider you use. But it’s quite a compelling threat, and Zuck has accelerated the dynamics of the race to the bottom. It’s a brilliant move for Meta. They certainly weren’t “winning” the race, but by releasing the llama weights they get to commoditize their complements and threaten their competitors. Meta doesn’t have a cloud business; by releasing the models for free, they get to benefit from the ecosystem’s improvements to them, and centralizing the ecosystem around their design decisions could help reduce their own costs. Zuck has openly said this was inspired by their Open Compute project, and Meta already has a very sticky business that certainly benefits from lower compute costs. It also has the additional benefit of endearing the academic ecosystem towards them.
This sets the stage for a fascinating capex race. Let’s say that the OSS world lags the closed labs by ~6-12 months (I pulled that number out of my ass). Every time a new OSS checkpoint gets released, it applies downward pressure on the prices that the closed labs can command for their APIs. To be clear, I’m not sure how strong this effect currently is, and I’d like to understand it better; there’s a wide spectrum of workloads with different trade-offs for cost and complexity. Every ~6-12 months, the closed labs have to release some groundbreaking capabilities to justify their existing margins on the API. Similarly, llama has to keep catching up to justify adoption of a “worse” model. I use the word “capabilities” quite broadly here: I don’t just mean raw model performance, but also considerations like latency and cost-per-token. At some point, the open models will likely start to become “good enough” for a wider array of workloads.
The overall result is that with every passing year, the models keep improving.
LLMs are likely part of a broader economic trend which defines the late 20th century and perhaps the rest of the 21st century. Moore’s Law was about how transistor density doubled every two years. Perhaps LLMs are a sort of generalization of that insight. That is, perhaps “intelligence per dollar” is exponentially increasing at some rate, or something like that. LLMs are likely going to be at least as big as the internet.
Implications for the machine learning industry
I really liked this blog post by Kyunghyun Cho, and I agreed with a lot of it. A lot of what he describes affects not just academia, but industry too.
An LLM is essentially an arbitrage of ML talent. Pre-LLMs, there was a healthy market for “ML Engineers”. That is, folks that could curate a dataset, run a bunch of experiments, construct good evals, etc. and then finally give the product teams something to deploy. Most of that work has been reduced to writing clear prompts and evaluating their inferences properly. The barrier to deploying “machine learning intelligence” within a product has totally evaporated compared to what it used to be. As inference becomes cheaper, the specific sets of skills for deploying intelligence will be closer to systems design and classical software engineering concerns than the sort of maths-ey skills required to do ML research. Most people prompting models shouldn’t really care about what SGD is, nor even know that backpropagation exists. As the models get increasingly more commoditized, a company calling itself an “AI company” will seem as banal as a company calling itself an “internet company”. Today, we totally take it for granted that every company we interact with uses the internet in some form. Similarly, we’ll take it for granted that every company we interact with will use LLMs in some form.
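As a concrete and deliberately simplified illustration of that arbitrage: a task that once required curating a labeled dataset, training a classifier and maintaining a serving stack can often now be reduced to a prompt plus a small eval set. The sketch below is hypothetical (the labels, examples and model name are placeholders):

```python
# A deliberately simplified sketch of the "arbitrage": a classification task
# that once needed a curated dataset and a trained model becomes a prompt
# plus a small hand-written eval set. Labels and examples are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

LABELS = ["billing", "bug_report", "feature_request"]

def classify_ticket(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Classify this support ticket as one of {LABELS}. "
                f"Reply with the label only.\n\nTicket: {text}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# The "ML engineering" that remains is mostly writing a clear prompt and
# checking it against a small labeled eval set.
EVAL_SET = [
    ("I was charged twice this month.", "billing"),
    ("The export button crashes the app.", "bug_report"),
]

def accuracy() -> float:
    correct = sum(classify_ticket(text) == label for text, label in EVAL_SET)
    return correct / len(EVAL_SET)
```

Nothing in that sketch requires knowing what SGD is, which is exactly the point.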
We might end up in a world where the number of people actually training models is very small, but the number of people prompting models is very large. It’s possible that as the OSS ecosystem matures, there will be increasingly more people post-training models for their idiosyncratic use-cases. But I’m not sure if I have a strong position on this yet.
If I were starting a PhD in ML right now, here’s what I would personally do. What I say below is probably going to piss off some people. I’d like to stress that this isn’t advice. I’ve no qualifications to offer such advice. These are merely my reflections.
I’d largely abandon “methods” based research. For example, I’d abandon working on designing optimizers, architectures, etc. This isn’t to say that such work isn’t important! But doing this sort of work in academia right now seems tough. If I were really good at producing non-consensus results in this space with minimal resources, I’d pursue it. But I wouldn’t bet on myself to do that. The biggest reason I’d go in this direction would be to get a job at a frontier lab doing something similar. But I suspect there are better pathways for doing that.
I’d do research similar to
https://www.docetl.org/
which investigates systems constructed with plentiful LLM inference.
I’d go deep into cognitive science, and examine the cognitive capabilities of these models. There are hundreds, if not thousands, of years of philosophy on ideas like theory of mind, etc. I’d attempt to ground that in increasingly sophisticated evals and study these capabilities within the existing SOTA models.
My gut feeling is that the most leveraged position for academic impact right now is crafting extremely sophisticated and interesting evals. But that’s just my $0.02.
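To gesture at what grounding theory of mind in evals might look like, here’s a minimal sketch of a single false-belief (“Sally-Anne” style) eval item. The scenario, scoring rule and harness are deliberately naive placeholders; a serious eval would need many items, careful controls and more robust scoring.

```python
# A minimal sketch of a false-belief ("Sally-Anne" style) theory-of-mind eval
# item. The scenario, scoring rule and model name are naive placeholders;
# a serious eval needs many items, controls, and more robust scoring.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

ITEM = {
    "scenario": (
        "Sally puts her marble in the basket and leaves the room. "
        "While she is away, Anne moves the marble into the box. "
        "Sally comes back. Where will Sally look for her marble first?"
    ),
    # Answering correctly requires modeling Sally's (false) belief, not reality.
    "expected_keyword": "basket",
}

def run_item(item: dict, model: str = "gpt-4o-mini") -> bool:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": item["scenario"]}],
    )
    answer = response.choices[0].message.content.lower()
    # Naive keyword scoring; a real eval would use stricter judging.
    return item["expected_keyword"] in answer

if __name__ == "__main__":
    print("passed" if run_item(ITEM) else "failed")
```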
Implications for software engineering
I really liked this excellent doc by Chris Paik exploring the implications of increasingly cheaper/better inference.
It seems that the supply of bespoke software will substantially increase, which will likely substantially increase the demand for bespoke software. However, in this new world, it’s not clear if the prices commanded by individual software engineers will substantially increase. If anything, I suspect we will see a much more long-tailed dynamic in SWE compensation reminiscent of the music industry.
A “software engineer” is really just a bundle of roles and responsibilities within the context of a modern organization. This bundle wasn’t handed down by God; it emerged due to cultural and economic drivers. Ditto for “product managers” and “UX designers”. The overall roles and responsibilities in an organization will likely start to become a lot more fluid than they are today, and this fluidity will likely be proportionate to improvements in model performance. For example, what does the modern tech organization look like when PMs/UXers can rapidly generate prototypes before pitching them to the Eng leads? Or when engineers can rapidly perform sophisticated market analysis or UX design that would otherwise be totally inaccessible to them?
No longer constrained by these previous limitations, the boundaries between engineering, product management and UX will become increasingly fluid, again proportionate to the underlying model performance. Assuming access to plentiful inference is largely ubiquitous, tech employees will have to find other ways of differentiating themselves beyond their job roles. This is especially true if they want to continue to command high total compensation packages.
OSS developments like the LAMP stack led to the emergence of “full-stack engineers”: generalists who could move effectively between the different layers of the tech stack while specializing in some aspect of it. Aided by LLMs, we’re already seeing the emergence of “full-stack employees”, rather than just “full-stack engineers”. It’s been common career advice that one should cultivate a T-shaped model of competency, that is, to become a generalist across a wide base and a specialist along some narrow dimension. But soon, people will be able to achieve a substantially wider base, a substantially deeper stem, and potentially have time to cultivate various additional stems at lower depth.
For example, it’s been years since I’ve done frontend web-development. Cursor, Claude, Vercel v0, Lovable, etc. have helped me achieve in hours what would have taken me days. I’ve been able to delve into complex technical topics whose exploration would have been simply time-prohibitive in the past. Even this document was co-edited with Gemini 1.5. Pragmatically, I can do things now that would have been unavailable ~10 years ago.
It’s not clear what the best way of organizing large groups of “full-stack employees” is likely to be. To borrow the way David Chapman uses the word “nebulous”, organizations are likely to face far more nebulosity in terms of how they organize themselves. For example, typical org charts are grouped together around “functions” like engineering, product management, UX, etc. It’s an open question whether such groups still make sense in this new world. I don’t know, but I think a lot of companies are going to find out. This presents interesting challenges to companies that are currently extremely large but not LLM-first in this way. I suspect that their transformation into this new world might be uncomfortable and painful.
Implications of “full-stack employees”
As the models get better, it’ll become easier and easier to copy a competitor’s tech stack. Branding, marketing and trust-building will likely play a far larger role in this new world. Both for B2C and B2B. It’s possible that “authenticity” will play a much larger role, which seems like a continuation of an existing trend on social media.
I wonder if high-margin software will increasingly seem “artisanal” or bespoke in nature. Perhaps enterprises will demand increasingly bespoke software to meet the idiosyncrasies of their organizations. There’s a famous line that “nobody ever got fired for buying IBM”. Although IBM has become increasingly irrelevant, this dynamic persists for many forms of software (e.g. Salesforce, SAP, etc). Perhaps soon, enterprises will weigh the overall brand of a specific software vendor as much as the actual software itself.
Perhaps consumer software will see substantially more influencer-driven growth, with economic returns similar to the music industry’s. That is, Taylor Swift can easily move hotel prices just by doing a concert in a given city, while a substantial number of musicians today can’t afford to do music full-time. For want of a better word, I wonder if “aesthetics” will matter far more to capturing value via software than ever before. This is actually why I find Nikita Bier’s explode app so fascinating. I haven’t used it yet, but the X commentary has been amusing.
It’s also possible to imagine that existing influencers will have many more degrees of freedom to differentiate. For example, suppose that you’re an influencer with ~500k Instagram followers. In the abstract, that’s not a lot for the juiciest monetization opportunities. But if it’s increasingly trivial to generate bespoke digital surfaces, perhaps such an influencer would opt to create custom software platforms to better monetize their followings outside the clutches of Instagram/TikTok/Snapchat, etc.
I wonder if the existing ontology of engineering, product, UX, etc. will soon seem antiquated. That is, perhaps some non-trivial competency of these functions will become table stakes, and new avenues of differentiation will emerge. These new roles too are likely to follow some sort of power law.
We’ll see far smaller companies meeting economic metrics (e.g. ARR) that would have required far larger teams in the past. I already hear of startups with ~2-3 people saying that they can achieve what used to take ~5-6 people in the past.
There are also some interesting questions here around existing labor policy. For example, spending more money on training or inference is currently a business expense that can be a tax write-off, whereas employing humans incurs payroll and other taxes. A lot of foundational assumptions underpinning capitalism will likely need to be revisited to allow markets to continue functioning well.
Implications for value creation in the world of bits
So in this future world of cheap software generation, how does one create value?
Let’s zoom out for a second. Every problem faced by a tech company has a “why”, a “what” and a “how”. The “why” is the overall motivation for why the feature should get built, nested within the normative structure of the organization and the individual’s existential goals. The “what” encapsulates the specific artifacts that the firm’s employees would need to create to deliver value, and the “how” represents the code, UX assets, etc. that would need to be generated to actually deliver that value to the customer.
In this new world, the “how” is on a trajectory to get substantially commodified. Of the three, it has the most ground truth available. For example, the code either works or it doesn’t; the unit tests describing it either pass or they don’t.
The “what” has somewhat less ground truth available. Nevertheless, the set of viable activities is often deeply constrained by the organization’s norms. For example, posting ads for counterfeit goods, firearms, etc. isn’t necessarily illegal in the US, yet many digital platforms prohibit advertisers from posting such ads.
The “why” has the least amount of ground truth available and ultimately involves deep questions about existential motivation. I realize that’s a very abstract thing to say. Suppose I’m the CEO of some company. Why would I want to launch a given feature? Perhaps because I want my business to succeed. Perhaps I also want to be a good role model for my children. Perhaps I’d also like to be a good citizen by solving some real problems in the world, including the problem of continuing to provide employment for many people. Nested inside each “why” question are more “why” questions that ultimately take you to a question akin to “how can I live a good life in this moment?” There’s no optimal or objectively right answer to such a question. Moreover, all answers are deeply dependent on context. They’re all based on the lived concerns and life path of the relevant individuals involved.
Currently, companies create value by differentiating along the “why” and providing customers with more choices than they’d otherwise have had. They then construct moats for that differentiation using the “what” and “how”. As the cost of generating software continues to fall, it’ll be harder and harder to maintain a moat purely via the “what” and “how”. Companies will be forced to differentiate along the “why” in an increasingly sophisticated manner, at faster rates than ever before. Moreover, they’ll need to rapidly translate that “why” into a “what” and a “how”. This again ties back to my earlier point that software might start to increasingly resemble the music industry.
Taking this even further, the commoditization of “knowledge” leads to “wisdom” becoming an increasingly powerful differentiator. That is, the capacity to rapidly home in on the right information at the right time, the capacity to frame the “right” “why” questions, and a sound, embodied strategy for arriving at good answers to them.
Pursuing the proactive cultivation of wisdom has always been a good idea throughout history. Perhaps in the world of AI, it may end up becoming the good idea.
How can we use AI to accelerate the cultivation of wisdom?
Honestly, I don’t know. This is what I’m most preoccupied with right now. Specifically, I want to understand if it’s possible for me to grow wiser and more adaptive at a rate proportional to the underlying developments in AI.
There are many doorways that lead towards the cultivation of wisdom. Many religions could be interpreted as millennia-long conversations on this very question. One framing in particular stands out for integrating AI and the cultivation of wisdom, and that’s the pursuit of wisdom via inspired action. In the East, we might call this karma yoga; IIUC there are comparable ideas in the West in Neoplatonism, but I haven’t yet had a chance to explore them. Essentially, one engages in increasingly adaptive actions that reciprocally open the world, afford more agency and, in the process, transform us. One’s inspired work becomes a doorway towards profound personal transformation.
This post is already getting quite long. I plan on writing a few more posts in the future to explore this overall direction. At a high level, I often find my agency, and therefore my capacity for skillful action, curtailed by my ADHD. I’d like to build a series of digital product surfaces and tooling (i.e. “Wise AI”) that would collectively act as a sort of cognitive prosthetic. These tools would sit with me on every digital surface and afford me increased possibilities to engage in reciprocal opening within those surfaces. Rather than monetizing and capturing my agency, they’d seek to constantly increase it. As I co-create my world with this AI, the developers of the AI can use my interactions with it to improve the AI. This in turn would reciprocally afford me increased possibilities to increase my agency and reach my aspirational goals.
Acknowledgments
Thank you so much to rif, Dan Tse and Aedan Pope for their helpful comments and feedback on the manuscript!