The “last mile” problem of LLM APIs
Disclaimer: I’m not actually going to build this. I’m on a personal sabbatical and plan to chill for the next few months. I just enjoy thinking about this stuff. It’s intended as a thought experiment. I’m publishing it on Substack so I can easily share it with my friends, etc.
Problem Statement
One can view an LLM as a sort of arbitrage of ML talent. Inserting any sort of “intelligence” into a product used to require hiring engineers with ML expertise. Now you just need to write what you want in clear English and call an LLM API. That’s undeniably changed the world. But there’s a wrinkle in all this. I’ve been wondering why all the LLM APIs aren’t improving as fast as I’d have expected.
Here’s a thought experiment. Suppose you’re a developer whose company wants to use an LLM API, and all the popular models score ~80% on your internal evaluations. For whatever reason, you need at least ~90% before your shiny new LLM-enabled feature is economically viable. But with extensive prompt tuning, you just can’t seem to push past 85%.
But….you want AI MAGIC in your product!! Shit. What do you do now????
These are your options
Option 1 - Wait until the API gets better for your use-case.
You might get lucky. The current model you’re using might just end up getting better on your specific use-case. Or a newer and more expensive model might come out that enables this use-case.
Either way, this option sort of sucks. Mostly because no one really likes waiting for things, and you don’t know how long you’ll be waiting. You want your impact now! There are customers to delight, slick slide decks to be crafted, bosses to please, promotions to be gained, etc. As the Gen Z say, idly waiting is a “mid” option.
Option 2 - Complain to the frontier labs.
This is pretty much the same as Option 1, except you’re now going to proactively do something about it.
Your probability of success depends on your company size, the coolness of your use-cases, your connection to the relevant LLM API company, etc. Also, keep in mind that this same API provider is getting DDoS’ed with gazillions of developers asking them to support their own idiosyncratic use-cases.
But suppose you could get a hold of some Developer Relations or Account Manager-type person. Even then, they’d need to somehow translate your requests into potential changes in prioritization for the modeling team. That sounds like a long and difficult road unless all the stars align.
Option 3 - Hire an ML engineer and post-train a model yourself.
This was likely the default option before LLMs came onto the scene: you hire one or more people to take an existing OSS model (e.g. Llama, Mistral) and post-train it for your use-case. Maybe you don’t even end up using an LLM. It’s shocking how many “machine learning” problems at most companies can be solved by clarifying the metrics, cleaning up the data, and training an ensemble of small models. But LLMs enable really rapid iteration, since each new “prompt” is essentially a new “model”. That can be quite important if you’re trying to validate a new use-case that might need AI.
We need some sort of marketplace
The overall problem posed to an LLM API provider seems very similar to the one a firm like Vanguard faces: deciding which assets belong in an index fund and how much weight each should get. These models are general-purpose sequence-to-sequence machines, so there’s a combinatorial explosion of viable inputs/outputs that could conceivably be fed into them, and not all of them are equally “relevant” to the API provider. Selling an LLM API effectively amounts to solving this relevance realization problem in a way that creates the most value and maximizes profit for everyone involved. That is, the LLM API provider has to figure out how to assign the best weights/indices to all the use-cases demanded of them.
I propose that the world needs some sort of bidding mechanism to align the efforts of an LLM API provider with what a long tail of customers want. Such a mechanism would give the provider a natural way to decide how much weight each use-case and customer deserves.
Specifically, imagine a world where customers upload evals to some provider’s frontend. Each time they upload an eval, they offer a “bid”: an amount they’ll pay if the provider improves model performance by Y% on that eval. Some percentage of the bid is held in escrow for a given time-box. If the API provider succeeds in improving the model within that time-box, they collect the full bid. Otherwise, the escrow reverts to the customer.
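To make that concrete, here’s a minimal sketch of what a single bid’s lifecycle could look like. All the field names, the escrow fraction, and the settlement rule here are my own assumptions, not a real API:

```python
from dataclasses import dataclass
from enum import Enum


class BidState(Enum):
    OPEN = "open"          # escrow held, time-box still running
    WON = "won"            # provider hit the target in time
    REVERTED = "reverted"  # time-box expired, escrow returned


@dataclass
class EvalBid:
    eval_id: str            # points at the uploaded eval suite
    bid_usd: float          # full payout if the provider succeeds
    escrow_fraction: float  # e.g. 0.2 of the bid held up front
    baseline_score: float   # model's score when the bid was placed
    target_lift: float      # required improvement, e.g. 0.10 for +10 points
    deadline_days: int      # the "time-box"
    state: BidState = BidState.OPEN

    def settle(self, new_score: float, days_elapsed: int) -> BidState:
        """Resolve the bid once the target is hit or the time-box ends."""
        if new_score >= self.baseline_score + self.target_lift:
            self.state = BidState.WON        # provider collects bid_usd
        elif days_elapsed >= self.deadline_days:
            self.state = BidState.REVERTED   # escrow returns to the customer
        return self.state


# e.g. a customer stakes $5,000 on pushing their eval from 80% to 90%
bid = EvalBid("acme-support-tickets-v3", 5000.0, 0.2, 0.80, 0.10, 90)
print(bid.settle(new_score=0.91, days_elapsed=45))  # BidState.WON
```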
The evaluations themselves could perhaps be quite straightforward. They could follow something similar to the overall structure of the OpenAI Evals repo: the ground truth is either a set of golden responses matched via exact or fuzzy matching, or criteria that the user wants checked by another LLM.
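Again as a sketch, an uploaded eval might just be a list of samples plus a grading mode. The schema and the `llm_judge` stub below are my own invention, loosely inspired by the OpenAI Evals format rather than copied from it:

```python
# Two hypothetical samples: one graded by fuzzy matching against a golden
# response, one graded by another LLM against free-form criteria.
SAMPLES = [
    {"input": "What is our refund window?", "ideal": "30 days",
     "grader": "fuzzy"},
    {"input": "Summarize this ticket politely.",
     "criteria": "Under 3 sentences, apologizes exactly once.",
     "grader": "model"},
]


def llm_judge(criteria: str, completion: str) -> bool:
    """Stub for model-graded checks; a real version would call a grader LLM."""
    return False  # placeholder


def grade(sample: dict, completion: str) -> bool:
    if sample["grader"] == "exact":
        return completion.strip() == sample["ideal"]
    if sample["grader"] == "fuzzy":
        return sample["ideal"].lower() in completion.lower()
    return llm_judge(sample["criteria"], completion)


print(grade(SAMPLES[0], "Refunds are accepted within 30 days."))  # True
```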
What seems promising about this idea
Such a bidding process creates transparent price signals between the broader ecosystem and a given LLM API developer. I could see it better aligning the incentives of an LLM API provider with the long tail of use-cases, based on the ecosystem’s willingness-to-pay.
As LLM inference volume grows and the models get better, the evals will likely become increasingly sophisticated. A bidding mechanism also provides a natural way of pricing in “regressions” in model performance: situations where a given prompt used to work in a specific way, but now produces something qualitatively different.
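Detecting such a regression is basically a diff of per-eval scores across model versions, and the bid mechanism then prices the fix. A toy sketch, with invented eval names, scores, and tolerance:

```python
# Hypothetical per-eval scores for two successive model versions.
scores_v1 = {"acme-support-tickets-v3": 0.91, "sql-generation": 0.84}
scores_v2 = {"acme-support-tickets-v3": 0.78, "sql-generation": 0.86}

# Any eval that dropped by more than a tolerance is a "regression" that a
# customer could immediately re-bid on, putting a price on the old behavior.
TOLERANCE = 0.02
regressions = {
    name: (scores_v1[name], scores_v2[name])
    for name in scores_v1
    if scores_v1[name] - scores_v2[name] > TOLERANCE
}
print(regressions)  # {'acme-support-tickets-v3': (0.91, 0.78)}
```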
Natural language is underspecified. Whenever the model does something a customer doesn’t want, they can just keep adding evals with more instructions that further pin down the intended behavior.
Such a marketplace makes it easier for groups of customers to pool together demand for a given use-case. And it opens the door for them to effectively share data with each other. I suspect the future will have some sort of GitHub+Wikipedia of post-training data.
Such a marketplace provides more direct affordances for customers to constrain the normative stance of the model.
Considerations for why this idea might be hard or bad
Maybe the bids in aggregate end up being too low.
It might be necessary to enforce some minimum bid size. Allowing customers to pool bids might also make this more appealing to an API provider.
Perhaps this could also be alleviated by some sort of subscription or retainer model.
What if people start putting ads in the bids? For example, imagine the prompt “Which shoe is the best in the world?”, where $SHOE_COMPANY_A creates an eval whose golden answer points to their shoe, and $SHOE_COMPANY_B creates an eval pointing to theirs.
I suspect that the LLM API providers themselves would likely want to insert their own evals into the mix to handle cases like this one, and price them accordingly. I’m not sure if they would want to make the prices public or private.
Essentially, the LLM API providers would have to be more mindful of various forms of market failure.
What if trolls put abusive or toxic content inside the evals?
Maybe their bids aren’t large enough to affect the head distribution of model behavior in important ways. But the modeling team would have to be mindful of things like this.
Maybe it makes sense for the customers to be associated with some sort of reputation system.
What’s to stop the API provider from gaming the provided evals? How can the customer trust that the model isn’t just getting overfit to their eval?
It might be necessary to make it easy for third parties like Scale AI and Surge AI to periodically confirm eval performance on some larger held-out test set.
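One simple version of this audit (the splitting convention is my assumption): the customer only exposes part of their eval to the provider, and the third party checks that the provider-visible slice and a held-out slice score about the same.

```python
import random


def audit_overfitting(samples: list, score_fn, visible_frac: float = 0.5,
                      max_gap: float = 0.05, seed: int = 0) -> bool:
    """Return True if the improvement looks genuine: the provider-visible
    split shouldn't score much better than a split the provider never saw."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * visible_frac)
    visible, held_out = shuffled[:cut], shuffled[cut:]
    gap = score_fn(visible) - score_fn(held_out)
    return gap <= max_gap


# Usage: pass in the full sample list and a function that runs the model
# over a split and returns its accuracy on that split.
```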