Introduction
Early preview Chrome Prompt API.
I’ve recently been invited into the early preview program for Chrome Built-in AI (the Prompt API). The built-in AI is exploratory work for what may become a cross-browser standard for embedded AI. It leverages Gemini Nano on-device, which means the model is bundled with your web browser and LLM generation happens in your local browser environment.
Benefits
The Good, the Easy, the Fast, and the Free.
There are three primary reasons to want embedded AI in our browsers: speed, cost, and usability. As a native browser API, it is easy to use. Accessing the Prompt API is as simple as a couple of lines of code.
const session = await window.ai.createTextSession();
const result = await session.prompt(
  "Tyingshoelaces.com are writing a really cool blog about you. What do you think about that then?"
);
It couldn’t be easier to get Generative AI results where we need them in the browser. I ran a few tests to check the execution time. Although I was disappointed that we were restricted to a single session (no concurrency), the performance for complicated long text generation was good.
Remember, there is no network latency either, so the execution time runs literally from the millisecond we make the request in the browser to the moment the result is available in our code.
VM975:32 Execution Time 1: 0h 0m 3s 47ms
VM975:32 Execution Time 2: 0h 0m 3s 870ms
VM975:32 Execution Time 3: 0h 0m 2s 355ms
VM975:32 Execution Time 4: 0h 0m 3s 176ms
VM975:32 Execution Time 5: 0h 0m 7s 103ms
VM975:44 Average Session Execution Time: 0h 0m 3s 910.1999999999998ms
The average execution time for 5 chained requests to the built-in AI is between 3 and 4 seconds per complete request for long text generation prompts. I ran this several times (the script is included in the GitHub repo), and although this varies by device, I’d also expect it to improve as the API is optimized. I’ve noticed that shorter JSON generation tasks are much quicker (200-400ms).
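For reference, the numbers above came from a loop along these lines. This is a simplified sketch rather than the exact script from the repo, and the prompt text is a placeholder:
// Simplified timing sketch (not the exact repo script); the prompt is a placeholder.
const session = await window.ai.createTextSession();
const timings = [];
for (let i = 1; i <= 5; i++) {
  const start = performance.now();
  await session.prompt("Write a long blog introduction about embedded browser AI.");
  const elapsed = performance.now() - start;
  timings.push(elapsed);
  console.log(`Execution Time ${i}: ${Math.round(elapsed)}ms`);
}
const average = timings.reduce((sum, t) => sum + t, 0) / timings.length;
console.log(`Average Session Execution Time: ${Math.round(average)}ms`);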
This is more than acceptable for most use cases. We’ve also crowdsourced the issue of scale for our LLMs: where industrial-scale API usage is infamously expensive, here every LLM request is handled locally in the user’s browser via an experimental API. It feels really nice and opens up a world of possibilities.
By having Chrome users embed the model into their browser, we have a distribution mechanism with preloaded generative AI models at the point of use and without the need for large servers. This is similar to WebLLM, but with the significant advantage that the model is preloaded and bundled into our browsers.
This means that we can download a single model for use across ‘the internet’ rather than being forced to download a vendor-specific model.
The huge positives for this experimental browser API are strong arguments for adoption; it’s fast, it’s free (or paid for by the consumer), and really easy to use.
But what are the tradeoffs?
Costs
Fast and free. But what cost?
The API is unapologetically intended for experimentation only, not production usage. As a result, a lot of the output is less refined than we would expect from more mature, hosted models. The limitations on size, alongside the generalist nature of the model, mean that we don’t get polished output.
This leads to frustrations that take us back to the early days of Generative AI APIs. I found myself using a lot of prompt engineering and validation logic to get reliable JSON responses. Every few requests, the API becomes unresponsive, and it’s quite easy to confuse the model, in which case it bombs out.
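The workaround I kept reaching for looks roughly like this: a sketch of prompt-and-validate retry logic. The retry count and the example schema are my own illustrative choices, not anything the API mandates:
// Sketch: re-prompt until the model returns parseable JSON (illustrative only).
async function promptForJson(session, instruction, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const raw = await session.prompt(
      `${instruction}\nRespond with valid JSON only, no commentary.`
    );
    try {
      return JSON.parse(raw); // validate before trusting the output
    } catch {
      // Malformed JSON: loop round and try again.
    }
  }
  throw new Error("Model never returned valid JSON");
}

const session = await window.ai.createTextSession();
const sentiment = await promptForJson(
  session,
  'Classify the sentiment of "I love these shoes" as {"sentiment": "positive" | "neutral" | "negative"}.'
);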
There is also mention that, because this model is embedded in the browser, it offers some value as a ‘private’ model. I’m not sure this is relevant to most use cases, as public-facing websites will still be interacting with their servers, and for the average user it is hard to be certain that data never leaves the local environment. Having said that, for internal, non-public-facing systems that operate via a browser (corporate environments, for example), this could be a bonus point.
The lack of sophistication in the responses owing to the smaller model means that we have to be very careful about the tasks that we use this for. Architectures of the future will optimize their generative AI implementations to use the right weight (and therefore, cost) for the right task. I envisage multiple small, highly tuned, and task-oriented LLMs, each being used for a specific output.
Having said that, all is forgivable especially as the API is explicitly designed for experimentation, not production usage.
The good
- Cost
- Scale
- Speed
- Usability
- Privacy
The bad
- Sacrifice in quality
- Implementation cost
As an example, if we wanted a deep analysis of current affairs, we would need a large context window and sophisticated RAG flow to inform the output; embedded AI is almost certainly not the right approach. Google alludes to this in its resources.
But I have a theory that I wanted to put to the test: a harebrained, mad, and tremendously fun theory; and a micro, browser-hosted LLM was the perfect place to do so.
A New Way of Thinking
Neurons, not brain
There has been a little itch I’d been wanting to scratch for a while. What if we are using LLMs all wrong? In fact, what if we’ve got the conceptual model wrong?
As we race for ever-larger context windows with expanding training data, we are trying to scale Generative AI vertically. Bigger, stronger, faster, better. My jaw drops as I see people kindly asking for context windows large enough to plug in the entire internet, and then asking the algorithm in the middle to please pick out exactly the information and output that we want from this enormous lake. And faster.
We treat every input into an LLM as an API, text goes in, magic happens, and text comes out. This magic in the middle we call intelligence. The more text in, the louder the magic, and the better the result. This is our current path forward.
I can’t help wondering if we are focused on the wrong scale or zoom, an erroneous interpretation of cognition.
The thing about thinking in general, especially creative output (which is exactly what text generation is), is that it isn’t such a simple process. It’s not a single thread. We are already seeing this in the newer models; for example in my breakdown of the Claude 3.5 Sonnet system prompt, we see that many of the recent advances in LLM output are probably not to do with the algorithm itself, but the infrastructure, systems and tuning that contextually guide the output.
I’ve been wanting to try out a concept of tiny, fast connections meshed together to build something bigger. In the end, a context window of 100k is the same as a 1k window, 100 times over. I suspect that even as we focus on the grandiose, the key is in small and precise details meshed together to form something larger. This fits in with my mental paradigm of intelligence much more than a sentient machine ‘brain’ does.
This hasn’t been possible until now due to the relative inefficiency of models in general and the prohibitive cost. Imagine Bob in accounts as we tell him we are going to 100x the number of requests to ChatGPT as we theorize that microtransactions in a mesh architecture will improve the quality of our AI systems. I don’t think Bob works at OpenAI, but for the rest of us, it just isn’t feasible.
Even a small and efficient embedded model in the browser isn’t really ready to handle my theorizing. It’s not quite fast enough and doesn’t enable concurrent requests (concurrent thoughts!), but it is a step in the right direction, and we’ve come far from cloud-hosted APIs charging massive fees for each request. I can’t see the functional architecture, but I can see the path towards it.
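In spirit, the mesh I have in mind is nothing more exotic than many tiny, focused prompts whose answers feed one final prompt. Here is a minimal sketch of the pattern, sequential for now because the preview API only exposes a single session; the micro-questions are invented purely for illustration:
// Minimal sketch of the mesh idea: many tiny, focused prompts instead of one giant one.
// Sequential for now, since the preview API only exposes a single session.
const session = await window.ai.createTextSession();
const microQuestions = [
  "In one sentence, what mood does this message convey: 'where are my shoes??'",
  "In one sentence, is this visitor browsing or buying: 'where are my shoes??'",
  "In one sentence, which product category is implied: 'where are my shoes??'",
];

const microAnswers = [];
for (const question of microQuestions) {
  microAnswers.push(await session.prompt(question));
}

// Mesh the small answers together into one better-informed final prompt.
const combined = await session.prompt(
  `Given these observations:\n${microAnswers.join("\n")}\nSuggest the single best next action for the website.`
);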
To test out this theory, I dusted off my programming gloves, opened up a browser, and started my epic journey to a mesh architecture with 1000 multithreaded requests.
The results were magical.
Your Brain, Not Theirs
A brain is local; so should our APIs be.
I love voice. I think keyboards and mice have become extensions of our monkey brains, but they are human contraptions and are therefore limited as an interface more holistically. As technology advances, so will interfaces, and at some point, keyboards, mice, and even screens will be as obsolete to our descendants as oil lamps and carrier pigeons are to us.
So, whatever I wanted to build had to be voice-controlled. Luckily, there’s a browser API for that.
- Speech Recognition API (Speech to Text)
- Speech Synthesis API (Text to Speech)
- Prompt API
- Internet (Accessed via a browser)
What I wanted to build was a browser-controlled voice interaction demo: an intelligent website that navigates, responds, and changes based on the browser context and the input, using nothing other than my voice. No keyboard. No mouse. “Me, my voice, a browser, and the Prompt API.” Sounds like the worst children’s story I’ve ever heard. I’ve probably written worse.
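Stripped to its bones, the wiring is something like this. It is a sketch rather than the demo code, and it assumes the reply is spoken back via the Speech Synthesis API; webkitSpeechRecognition is the prefixed Chrome implementation of the Speech Recognition API:
// Sketch: voice in, Prompt API in the middle, voice out. Not the actual demo code.
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new Recognition();
recognition.lang = "en-US";

recognition.onresult = async (event) => {
  const transcript = event.results[0][0].transcript; // what was actually said
  const session = await window.ai.createTextSession();
  const reply = await session.prompt(
    `The user said: "${transcript}". Reply conversationally in one or two sentences.`
  );
  speechSynthesis.speak(new SpeechSynthesisUtterance(reply)); // speak the reply back
};

recognition.start(); // no keyboard, no mouse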
Conceptually, very similar to the Rabbit device or the Humane AI pin. These are both ambitious ventures, but the problem they share is that they are trying to build an ‘AI OS’. A new AI-powered interface into the software. I find the goal too grandiose, essentially trying to build a new interface into the internet with a sprinkling of AI.
Innovation is about iteration, and the internet in 2024 is ubiquitous and fundamentally intertwined with the browser. Trying to invent a human-friendly AI OS interface is a similar endeavor to trying to reinvent the internet. Folks are already asking, ‘What can this do that I can’t already do with my mobile phone, but better?’...
Innovation requires a blending of the new and untested but with solid and proven foundations. Too much instability and the results will be mad scientist territory, but get the balance of the proven and the experimental just right, and sometimes, just sometimes, something special happens.
The cognitive paradigm we have gotten wrong in most LLM use cases is that we treat an engagement as a handshake: Input ← LLM → Output. Input in, output out. However, with real human interactions, we have multidimensional processes that can be broken down into different thoughts and actions.
Store Attendant greets customer ->
[Thoughts]
What are they wearing, and how does their style influence their buying patterns?
What is their demographic, and how does their age influence their buying patterns?
How will their gender influence their buying patterns?
What kind of mood/social signals are they giving off?
What have they actually said that will influence their choices?
[Action]
Good morning sir, how are you?

Customer greets attendant ->
[Thoughts]
Hurry up, I’m busy.
Hope they have what I want (by reading my mind!).
Will they accept returns?
[Action]
Good morning, I’m looking for a pair of shoes.
We’ve gone so deep into computer science that our thought processes around the discipline have become binary. We think of inputs and outputs, true and false. The truth is that human interaction and thought are complicated and nuanced; we can’t reduce or simplify them to binary.
But what we can do is mesh this wonderful technology in new and creative ways, to break down the barriers that are homogenizing the output and turning the internet into slurry.
Many of One, One of Many
Let’s make Gen AI interactions multi-threaded and nuanced
My proposal for experimentation uses the built-in AI to mirror social and human interactions. Let’s use an example that I have muscle memory of: building a recommendation algorithm for e-commerce, broken down into threads of cues (a rough code sketch follows the list below).
Thread 1: Social Cues, sentiment analysis
– How long has it taken for the user to interact?
– Is their browsing behavior aggressive, slow, calm, or controlled?
– Have they arrived from a particular source, or are they looking for something specific?
Thread 2: Behavior Cues, interpretation of user input
– How have they begun the conversation? With a greeting?
– What tone are they using?
Thread 3: User context, data we have about similar demographics and their preferences
– What age group do they belong to? How does this influence preferences?
– How do they identify? How does this influence preferences?
Thread 4: Site context, data we have on how other users are using the site, and trends
– What are the trending products?
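As promised above, here is a sketch of how those threads might feed a single session. Every cue value below is mocked by hand; in a real system they would come from analytics and data pipelines, not be hand-written:
// Sketch: each thread resolves its cues first, then the results are meshed
// into one context-rich prompt. All cue values here are mocked.
const cues = {
  social: "Took 40 seconds before interacting; browsing is slow and calm.",
  behaviour: "Opened with a polite greeting in a neutral tone.",
  userContext: "Age group 25-34; similar users prefer minimalist designs.",
  siteContext: "Trending this week: white trainers and canvas high-tops.",
};

const session = await window.ai.createTextSession();
const recommendation = await session.prompt(
  "You are a shop assistant. Observations:\n" +
  Object.values(cues).join("\n") +
  "\nSuggest a greeting and one product category to highlight."
);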
There is no silver bullet for interpreting so many data points, and there never will be. LLMs are not plug-in “sentiment analyzer, entity classifier, jack of all trades” tools. LLMs are generative algorithms that can creatively and logically interpret inputs. Notice that each of the cues in the threads is not an output; it is a question.
To inform thought and generative AI, we need to ask far more questions than we provide answers. We need to be sophisticated about how we gather all our data points and structured in the way we feed these into our LLMs. So, to use behavior and social cues as an example, we’d need to do the following:
- Sentiment analysis
- Data analysis for browser behavior vs site and global averages
- Extract referral data from requests
All of this data would be prepared and processed long before it goes to our LLM. But, once prepared, we can inform the model with a prompt like:
User A is a return visitor showing signs of being slightly upset. Remember this as you deal with them, make sure to reassure them we have a returns system. [Action]: Link to our returns policy and popular products.
An alternative would be:
User B is showing signs of being impatient and has arrived looking directly for Product X. Take them to the product page and offer to add to cart. [Action]: Navigate direct to page X and add the product to the cart.
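In the prototype, that trailing [Action] is the part the front end actually executes. A sketch of that step follows; the [Action]: tag convention and the parsing are my own, enforced purely through the prompt, not part of the API:
// Sketch: split the model's reply into free-form "thoughts" and an executable [Action].
// The [Action]: tag convention is mine, enforced purely through the prompt.
const session = await window.ai.createTextSession();
const reply = await session.prompt(
  "User B is impatient and looking directly for Product X. " +
  "Explain your reasoning, then finish with one line starting with [Action]: describing what the site should do."
);

const actionLine = reply.split("\n").find((line) => line.trim().startsWith("[Action]:"));
console.log("Thoughts:", reply);
console.log("Action to execute:", actionLine ?? "none detected");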
LLMs, in this sense, are our agents and interpreters, but the mistake people are making is assuming the “algorithm” is the solution for quality output. Just like with real agents, their judgment is only as reliable as the data and the cues we give them to inform it. Ask more questions than you provide answers.
This is an inalienable social truth, and it is why our current expectations of LLMs are so off-kilter and why agents are leading many into the trough of disillusionment. Rubbish in, rubbish out. It doesn’t matter how good the algorithm is.
Just to get two groups of cues for our recommendation algorithm, we’d need to rely on an array of specialist tools and AI infrastructure that is beyond the capabilities of all but a few platforms on the planet. But we can get there iteratively by building nuance, threads, and sophistication into the infrastructure that is feeding our LLMs.
And now, they are in the browser; the future has never been so near.
I built nothing but a simple prototype mocking social cues and inputs. I sprinkled in a bit of user data and then asked the Prompt API to respond to my voice with a combination of thoughts and actions. It’s nothing more than a vision of something that ‘might’ work. But by providing granular, detailed, and controlled inputs into our Prompt API, we get intelligent, thoughtful, and controlled feedback. It’s a vision of a mesh infrastructure in which micro-threads can dynamically learn, reinforce, and inform each other.
It won’t work yet. But it might work someday, and the prompt engineering with voice input feels magical. It’s a destination worth driving towards.
Conclusion
The future is nearer than ever.
We are still in the early stages of LLMs, and I predict that advances will be slower than expected and that AGI (by any reasonable definition) won’t arrive for generations. But with each step on the road, a world of opportunities arises. Building highly efficient, well-thought-out, and well-defined infrastructure massively improves the quality of output from our LLMs, irrespective of model size or algorithm quality.
Moving LLMs to the browser can also be understood as moving LLMs to the internet. It will be cheap and easy to play with, use, and experiment on. Forcing folks to think smaller, to build more efficiently, and to add depth and nuance to their solutions is a good thing, so I’m not even too worried about ‘micro’ models. The sophistication is in the usage, not just the tool itself, so this is a giant leap forward.
I have attached my demo; it is throwaway proof-of-concept code, built on an exploratory AI that is only suitable for demo purposes.
And it only works sometimes.
Yet, it is a wonderful vision of the future.
Links
More resources.