I have some data science background, and I kinda understand how LLM parameter tuning works and how a model generates text.
To simplify and phrase my understanding, an LLM works like this: given a prompt ("Write a program to check if the input is an odd number"), it converts the prompt to embeddings, then plays a dice/probability game: given the prompt so far, generate a set of new tokens.
Now my question is, how are current LLMs able to parse through a bunch of search results and play the above dice game? Like, at times it reads through, say, 10 URLs and generates results; how are they able to achieve this? What's the engineering behind generating such huge volumes of text? Because I always argue about the theoretical limitations of LLMs, but now that these "agents" are able to manage huge volumes of text I don't seem to have a good argument. So what exactly is happening? And what is the non-theoretical limit of AI?
Edit
I don't have the answer, but hopefully this does. It's a step-by-step guide to creating your own LLM: https://karpathy.github.io/2026/02/12/microgpt/
I think you may be mixing a couple of things together, but I’ll take a crack at this.
When you get an AI-generated response from a search engine, this is usually a modified RAG (retrieval augmented generation) approach. How this works is that the content from web pages is already pre-processed into embeddings (numerical representations of the text). When you perform a search, your search text is turned into an embedding and compared (by numerical similarity) to the websites to get the content most related to your search. That means that the LLM only parses and processes a very small subset of the returned websites to generate its response.
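As a rough sketch of that retrieval step, the "numerical similarity" is typically cosine similarity over pre-computed embeddings. The URLs and vectors below are made up for illustration; a real system would get embeddings from an embedding model:

```python
import math

# Toy embeddings: in a real system these come from an embedding model.
# The URLs and vector values here are hypothetical.
page_embeddings = {
    "https://example.com/odd-numbers": [0.9, 0.1, 0.0],
    "https://example.com/cooking":     [0.0, 0.2, 0.9],
    "https://example.com/modulo-op":   [0.8, 0.3, 0.1],
}

def cosine_similarity(a, b):
    """Measure how closely two embedding vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_embedding, k=2):
    """Return the k pages whose embeddings are most similar to the query."""
    ranked = sorted(page_embeddings.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [url for url, _ in ranked[:k]]

# A query embedding near the math-related pages retrieves them, not the cooking page.
top = retrieve([0.85, 0.2, 0.05])
```

Only the top-k pages (here 2 of 3) ever reach the LLM's context, which is the whole point of the approach.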
Another element you might be asking about is how can these agentic AI systems handle larger tasks (things like OpenClaw). That is a bit more complicated and dependent on the systems design, but basically boils down to two things. The first is the “reasoning models” first break concepts into smaller tasks meaning the LLM only has to worry about a subset of a larger task. Secondly, a lot of these systems will periodically merge all past context into a compressed state that the LLM can handle (basically summaries of summaries) or add them to a database for future/faster reference.
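A toy sketch of that decomposition idea. `fake_llm` is a stand-in for real model calls, and the subtask list is supplied by hand here, where a real system's planner call would produce it:

```python
def fake_llm(prompt):
    # Stand-in: a real system would call a model here.
    return f"result({prompt})"

def run_task(task, subtasks):
    # Each subtask gets its own short-context call, so no single call
    # has to hold the whole problem in its context window.
    partial_results = [fake_llm(f"{task}: {sub}") for sub in subtasks]
    # A final call merges the partial results into one answer.
    merged = fake_llm("combine: " + "; ".join(partial_results))
    return merged

answer = run_task("summarize report", ["section 1", "section 2"])
```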
At the end of the day, your understanding of the limits of LLMs is correct: all the progress we've really seen with LLMs over the past couple of years has been the creation of systems to work around their limitations. The base technology isn't getting much better, but the support around it is.
Thanks.
And to clarify: corporate greed aside, is there any actual use case for these workarounds? I mean, if the building materials aren't strong enough, there's only so much you can achieve with a beautiful paint job (my current understanding, and I may be wrong).
The underlying issue with LLMs, in my opinion, is their nondeterministic nature. Even after zeroing out the temperature (the randomness of outputs), you can get significantly different results from two almost identical texts.
However, building out an ecosystem to support a new technology is a fairly common progression. If you compare it to the internet, things like browser caches, CDNs (content delivery networks), code minifiers, etc. are all ways to combat latency (a fundamental problem for the internet).
As for the effectiveness of these solutions, RAGs do help a lot when generating text against a select corpus. It's what enables the linked sources in things like ChatGPT and Google's AI results. It's also what a lot of companies are using for searching their support pages and the like. It's maybe not quite as good as speaking to a person, but it's faster.
Similarly, the reasoning models and the management of a model's "context" have both shown demonstrable improvements in benchmarking.
I’m not sure I personally believe this makes LLMs a replacement for humans in most situations, but it at least demonstrates forward progress for GenAI.
Others have explained it well: splitting calls up into parallel subjobs, and programmatic prompt engineering.
And what is the non-theoretical limit of AI?
Shrug.
But practically, transformer models are kinda hitting an “innovation” wall. Big companies aren’t taking risks to try and fix (say) the necessity of temperature to literally randomize outputs, or splitting instructions/context/output, or self correction (like an undo token), adaptation on the fly, anything.
All this has been explored in research papers, yet they aren’t even trying it at larger scales. They’re simply scaling up what they have, or (in the case of the Chinese labs) focusing on lowering resource usage.
Basically, corporate LLM development is far more conservative than you've been led to believe, and that's the wall LLMs are smacking into.
That's been my issue, i.e. somewhere I know all this LLM-led AI is a bubble. But the corporates either increase the context window or release something that does better parallel subjobs after 3 months, and now all of a sudden this LLM-led AI is the "future" and it can perform "agentic" tasks.
It kinda makes it impossible to make people (friends who are developers, colleagues) look past the marketing gimmicks.
I mean, even as-is, it’s a very useful tool. Especially as the capabilities we have get exponentially cheaper.
What people don’t get is AI is about to become a race to the bottom, not to the top. It’s a utility to sift through millions of documents, or run simple bots, or operate work assistants, or makeshift translators or whatever; you know, oldschool language modeling. And that’s really neat as the cost approaches “basically free.”
Basically, imagine running Claude Code on your iPhone, and Claude Code itself not really changing all that much. Imagine the economic implications for the big AI houses.
As for the marketing, I want some of what those tech execs are smoking.
The LLM just predicts probabilities for the single next token based on all previous tokens in the context window (its own and the ones entered by the user, the system prompt, or tool calls). The inference engine / runtime decides which token is selected, usually one with a high probability, but that's configurable.
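That selection step can be sketched concretely. The temperature knob mentioned elsewhere in the thread rescales the model's raw scores before they become probabilities; the tokens and logits below are made up for illustration:

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw model scores into a probability distribution.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature=1.0, rng=random):
    """The 'dice game': pick one token weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Toy vocabulary and made-up scores for the next token.
tokens = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.1)  # near-greedy
warm = softmax_with_temperature(logits, temperature=2.0)  # much flatter
```

With a low temperature the top-scoring token is picked almost every time; with a high temperature the dice get noticeably fairer.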
The LLM can also generate (predict) special tokens like “end of imaginary dialogue” to end its turn (the runtime will give the user a chance to reply) or to call tools (the runtime will call the tool and add the result to the context window).
The LLM does not really care whether the stuff in the context was put there by a user, the system prompt, a tool, or whatnot. It just predicts the next-token probabilities. If you configure the runtime accordingly, it will happily “play” the role of the user or of a tool (you usually don’t want that).
Some of the tool calls are e.g. web searches etc. and the search results will be added to the context window. The LLM can decide to do more calls for further research, save data in “memory” that can be accessed by later “sessions” or call other tools (new tools pop up daily).
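A minimal sketch of that runtime loop, with a scripted stand-in for the model and a fake search tool (all names and the `TOOL:` marker format here are hypothetical; real systems use dedicated special tokens):

```python
def scripted_model(context):
    # Stand-in for the LLM: first asks for a search, then answers.
    if "SEARCH_RESULT" not in context:
        return "TOOL:web_search:longest rivers"
    return "The longest river is the Nile (by most measures)."

def web_search(query):
    # Stand-in for a real search tool.
    return f"SEARCH_RESULT for '{query}': Nile, Amazon, Yangtze..."

def run(prompt, model, tools, max_steps=5):
    context = prompt
    for _ in range(max_steps):
        out = model(context)
        if out.startswith("TOOL:"):
            _, name, arg = out.split(":", 2)
            # The tool result is appended to the context window,
            # and the model is asked to continue from there.
            context += "\n" + tools[name](arg)
        else:
            return out  # a normal reply ends the turn
    return "step limit reached"

final = run("What is the longest river?", scripted_model, {"web_search": web_search})
```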
Models tend to get larger context windows with every update (right now usually between 250K and 2M tokens), but model performance usually gets worse as the context window fills up (needle in a haystack).
To keep the window small agentic tools often “compact” the context window by summarizing it and then starting a new session with the compacted context.
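A toy version of that compaction loop. The 4-characters-per-token rule and the `summarize` function are crude stand-ins, not how real tokenizers or summarizer calls work:

```python
MAX_TOKENS = 50  # toy budget; real windows are hundreds of thousands of tokens

def rough_token_count(text):
    # Crude heuristic: ~4 characters per token. Real tokenizers differ.
    return len(text) // 4

def summarize(text):
    # Stand-in: a real system would ask the model for a summary here.
    return "SUMMARY: " + text[:40] + "..."

def append_turn(context, turn):
    """Add a turn; if the budget is exceeded, compact the whole transcript."""
    context = context + "\n" + turn
    if rough_token_count(context) > MAX_TOKENS:
        context = summarize(context)  # over time this yields summaries of summaries
    return context

ctx = "system: you are a helpful assistant"
for i in range(10):
    ctx = append_turn(ctx, f"user: message number {i} with some filler text")
```

The invariant is that the context never exceeds the budget after a turn is appended; the price is that older detail survives only in summarized form.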
Sometimes a task is split into multiple sessions (agents) that each have their own context window. E.g. one extra session for a long context subtask like analysis of a long document with a specific task and the result is then sent to an orchestrator agent in charge of the big picture.
The fact that everything in the context window, regardless of origin, is used to predict the next token is also the reason why it’s so difficult to avoid prompt injection. It all “looks” the same to the LLM, and there is no “hard coded” way of excluding anything.
Its non-deterministic nature is honestly the scariest thing about vibe coding. In its early days, when I was experimenting with several LLMs, it quickly became apparent that I would spend 10 times as much time cleaning up its code as I would writing it myself, because it would just put in completely nonsensical code that did nothing.
I have mixed feelings about it. I wouldn’t trust it with a full production application, but I think it’s sometimes helpful if the LLM is able to generate a prototype or scaffold to get a head start. It removes some of the friction of starting a project.
The fully vibe coded stuff I’ve seen so far were usually unmaintainable dumpster fires.
This is the best explanation of prompt injection I’ve seen
The “agents” and “agentic” stuff works by wrapping the core innovation (the LLM) in layers of simple code and other LLMs. Let’s try to imagine building a system that can handle a request like “find where I can buy a video card today. Make a table of the sites, the available cards, their prices, and how they compare on a benchmark.” We could solve this if we had some code like:
```python
search_prompt = llm(f"make a list of google web search terms that will help answer this user's question. present the result in a json list with one item per search. <request>{user_prompt}</request>")
results_index = []
for s in json.loads(search_prompt):
    results_index.extend(google_search(s))
results = [fetch_url(url) for url in results_index]
summarized_results = [llm(f"summarize this webpage, fetching info on card prices and benchmark comparisons <page>{r}</page>") for r in results]
return llm(f"answer the user's original prompt using the following context: <context>{summarized_results}</context> <request>{user_prompt}</request>")
```

It’s pretty simple code, and LLMs can write code like that, so we can even have our LLM write the code that tells the system what to do! (I’ve omitted all the work needed to make things sane in terms of sandboxing and dealing with the output of the various internal LLMs.)
The important thing we’ve done here is instead of one LLM that gets too much context and stops working well, we’re making a bunch of discrete LLM calls where each one has a limited context. That’s the innovation of all the “agent” stuff. There’s an old Computer Science truism that any problem can be solved by adding another layer of indirection and this is yet another instance of that.
Trying to define a “limit” for this is not something I have a good grasp on. I guess I’d say that the limit here is the same: max tokens in the context. It’s just that we can use sub-tasks to help manage context, because everything that happens inside a sub-task doesn’t impact the calling context. To trivialize things: imagine that the max context is 1 paragraph. We could try to summarize my post by summarizing each paragraph into one sentence and then summarizing the paragraph made out of those sentences. It won’t be as good as if we could stick everything into the context, but it will be much better than if we tried to stick the whole post into a window that was too small and truncated it.
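The paragraph-summarization thought experiment can be sketched as a recursion. `tiny_summary` here just truncates, standing in for an LLM call whose context window is one paragraph:

```python
WINDOW = 100  # max characters a single "call" may see (stand-in for tokens)

def tiny_summary(text):
    # Stand-in: a real system would call the model; here we just truncate.
    return text[:30].rstrip() + "..."

def summarize_recursively(text):
    """If the text fits the window, summarize it directly; otherwise
    summarize each window-sized chunk and recurse on the joined summaries."""
    if len(text) <= WINDOW:
        return tiny_summary(text)
    chunks = [text[i:i + WINDOW] for i in range(0, len(text), WINDOW)]
    joined = " ".join(tiny_summary(c) for c in chunks)
    return summarize_recursively(joined)

long_text = "The gun was loaded in chapter one. " * 20  # far over the window
result = summarize_recursively(long_text)
```

Each level of recursion loses detail, which is exactly the trade-off described above: better than truncation, worse than a window big enough for everything.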
Some tasks will work impressively well with this framework: web pages tend to be a TON of tokens but maybe we’re looking for very limited info in that stack, so spawning a sub-LLM to find the needle and bring it back is extremely effective. OTOH tasks that actually need a ton of context (maybe writing a book/movie/play) will perform poorly because the sub-agent for chapter 1 may describe a loaded gun but not include it in its output summary for the next agent. (But maybe there are more ways of slicing up the task that would allow this to work.)
An LLM reads the previous prompts and replies, plus any base prompts. This is considered the context window. Don’t ask me why it’s not infinite.
The machine will then generate text following the previous text that continues the spirit and intent of the previous text, based on other texts previously digested into weights.
It’s the same thing as your phone’s autocomplete, but with a few gigabytes of weights instead of a few kilobytes.
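For what it's worth, the autocomplete being compared to here is roughly a bigram table: for each word, track which word most often follows it. A minimal sketch (the training text is made up):

```python
from collections import Counter, defaultdict

training_text = "the cat sat on the mat and the cat ran"

def build_bigram_model(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    follows = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows

def autocomplete(model, word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

model = build_bigram_model(training_text)
next_word = autocomplete(model, "the")
```

Whether the analogy holds at LLM scale is disputed further down the thread, but this is what the kilobyte version actually does.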
If the data it’s working with is larger than the context, it will lose it. There’s a chance it’ll hallucinate anyway, because the text generator is non-deterministic. Say you’re working with insurance data. Maybe your data is familiar enough to data it previously ingested. So now it starts using wrong data, but it “feels” right as far as the LLM is concerned, because it’s a text generator, not a truth checker.
You can ask it to look again, but it’s just generating fresh tokens while the context gets more polluted.
Just start looking at the volumes of non-trivial pseudo-information it generates, and try to verify some of the facts it states about your data.
It’s fundamentally not the same thing as autocomplete. Give autocomplete all the data an LLM has, every gig, every terabyte of it, and it still won’t be an LLM. Autocomplete lacks the semantic meaning layer as well as some other parts. People say it’s nothing but autocomplete from a misunderstanding of what a reward function does in backpropagation training (saying “the reward function is to predict the next word” is not even close to the equivalent of “it’s doing the same thing as autocomplete”).
I’m writing this short reply with hopes that when I have more time in the next two days or so I’ll come back with a more complete explanation, (including why context windows have to be limited).
Disclaimer: All of my LLM experience is with local models in Ollama on extremely modest hardware (an old laptop with Nvidia graphics), so I can’t speak to the technical reasons the context window isn’t infinite, or at least larger, on the big players’ models. My understanding is that the context window is basically its short-term memory. In humans, short-term memory is also fairly limited in capacity. But unlike humans, the LLM can’t really see (or hold) the big picture in its mind.
But yeah, all you said is correct. Expanding on that, if you try to get it to generate something long-form, such as a novel, it’s basically just generating infinite chapters using the previous chapter (or as much of the history fits into its context window) as reference for the next. This means, at minimum, it’s going to be full of plot holes and will never reach a conclusion unless explicitly directed to wrap things up. And, again, given the limited context window, the ending will be full of plot holes and essentially based only on the previous chapter or two.
It’s funny because I recently found an old backup drive from high school with some half-written Jurassic Park fan fiction on it, so I tasked an LLM with fleshing it out, mostly for shits and giggles. The result is pure slop that seems like it’s building to something and ultimately goes nowhere. The other funny thing is that it reads almost exactly like a season of Camp Cretaceous / Chaos Theory (the animated kids JP series) and I now fully believe those are also LLM-generated.
You can improve the novel writing by using agents. First you generate just an outline with the plot points to every chapter. Then you chop that up and feed it to several agents to flesh out individual chapters. Finally the generated chapters are verified against the outline and overall plot. If that doesn’t fit, the agents are tasked with a rewrite. Repeat that until you have something serviceable.
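A toy sketch of that outline-then-rewrite loop. `fake_llm` stands in for real model calls, and the "verification" is just a substring check, where a real system would use another LLM call to judge consistency:

```python
def fake_llm(prompt):
    # Stand-in: a real system would call a model here.
    return f"text for [{prompt}]"

def write_novel(plot_points, max_rewrites=3):
    """Draft one chapter per plot point, then verify and rewrite failures."""
    chapters = [fake_llm(f"write chapter from outline: {p}") for p in plot_points]
    for _ in range(max_rewrites):
        # Toy verification: every chapter must mention its own plot point.
        bad = [i for i, (p, c) in enumerate(zip(plot_points, chapters)) if p not in c]
        if not bad:
            break
        for i in bad:
            chapters[i] = fake_llm(f"rewrite chapter from outline: {plot_points[i]}")
    return chapters

book = write_novel(["hero leaves home", "hero finds the sword"])
```

The orchestration logic is ordinary code; only the drafting and judging steps are model calls.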
As you point out, there exists plenty of bad writing in TV series. These often have a number of different authors, who don’t necessarily know the other episodes very well.
I will say that while most of these models are non-deterministic, their training data was very similar, so if you did something like this, I can guarantee that if you churned out enough, you would start to see the common threads.
Sure. Lots of fiction, especially TV, sticks to well-established tropes, regardless of whether a human wrote it or not.
Disclaimer: I am honestly a layman in this field. I may get a bunch of stuff wrong, but am happy to learn from experts. Feel free to point mistakes out and destroy me in the replies.
To simplify and phrase my understanding, an LLM works like this: given a prompt ("Write a program to check if the input is an odd number"), it converts the prompt to embeddings, then plays a dice/probability game: given the prompt so far, generate a set of new tokens.
This feels like an oversimplification. Unfortunately, I can’t think of a good analogy without anthropomorphizing LLMs.
IMO this anime scene works well enough as an analogy at a super high level: anime_irl
“Comprehending what other people are saying is one step” - encoder
“Thinking about how to answer is one more step” - working with the feature representation
“Putting the things that popped into my mind into words is another step” - decoder
Now my question is, how are current LLMs able to parse through a bunch of search results and play the above dice game?
By current LLMs, I am going to assume that you are not referring to the raw models, but platforms like ChatGPT, Perplexity, etc with UIs for you to interact with the underlying models.
There are fundamentally two different problems here: searching the web for answers, and putting the answers into words.
Like, at times it reads through, say, 10 URLs and generates results; how are they able to achieve this?
If I ask you: “What is the colour of fire engines?”, I imagine you would answer “Red”, sometimes “Yellow”, off the top of your head.
What if I ask you “What are the 10 longest rivers in the world”? I believe you won’t be able to give me an answer right away. What you can do is a web search, find the answer, then present the results to me. You can give it to me in 10 short bullets points, or you can come up with an essay with paragraphs describing each river.
You probably got my point by now, but to make it explicit: finding an answer and putting it into words are two different processes. They are independent of each other, so the final text output can be as long or as short as need be.
For these LLM platforms, when the model “doesn’t know” the answer, they probably have a subroutine that searches the web, then feeds the results to the underlying model. The model then packages the search results into readable form (words instead of vectors) for you.
What’s the engineering behind generating such huge volumes of text?
Sorry but I can’t think of a good answer to this at the moment; leaving it to others for now - unless I managed to think of something good.
Because I always argue about the theoretical limitations of LLMs, but now that these “agents” are able to manage huge volumes of text I don’t seem to have a good argument. So what exactly is happening? And what is the non-theoretical limit of AI?

Same for this question.
Hope the partial answer helps; tried my best to ELI5.