I have some data science background, and I kinda understand how LLM parameter tuning works and how a model generates text.
To simplify and phrase my understanding, an LLM works like this: given a prompt ("Write a program to check if the input is an odd number"), it converts the prompt to embeddings, then plays a dice/probability game: given the prompt so far, generate a set of new tokens.
Now my question is, how are current LLMs able to parse through a bunch of search results and play the above dice game? Like, at times it reads through, say, 10 URLs and generates results; how are they able to achieve this? What's the engineering behind generating such huge volumes of text? Because I always argue about the theoretical limitations of LLMs, but now that these "agents" are able to manage huge volumes of text I don't seem to have a good argument. So what exactly is happening? And what is the non-theoretical limit of AI?
Edit
I don't have the answer, but hopefully this does. It's a step-by-step guide to creating your own LLM: https://karpathy.github.io/2026/02/12/microgpt/
I think you may be mixing a couple of things together, but I’ll take a crack at this.
When you get an AI-generated response from a search engine, this is usually a modified RAG (retrieval augmented generation) approach. How this works is that the content from web pages is already pre-processed into embeddings (numerical representations of the text). When you perform a search, your search text is turned into an embedding and compared (by numerical similarity) to the websites to get the content most related to your search. That means that the LLM only parses and processes a very small subset of the returned websites to generate its response.
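As a rough sketch of that retrieval step, the "numerical similarity" is typically cosine similarity over pre-computed embeddings. The URLs and vectors below are made up for illustration; a real system would get embeddings from an embedding model:

```python
import math

# Toy embeddings: in a real system these come from an embedding model.
# The URLs and vector values here are hypothetical.
page_embeddings = {
    "https://example.com/odd-numbers": [0.9, 0.1, 0.0],
    "https://example.com/cooking":     [0.0, 0.2, 0.9],
    "https://example.com/modulo-op":   [0.8, 0.3, 0.1],
}

def cosine_similarity(a, b):
    """Measure how closely two embedding vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_embedding, k=2):
    """Return the k pages whose embeddings are most similar to the query."""
    ranked = sorted(page_embeddings.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [url for url, _ in ranked[:k]]

# A query embedding near the math-related pages retrieves them, not the cooking page.
top = retrieve([0.85, 0.2, 0.05])
```

Only the top-k pages (here 2 of 3) ever reach the LLM's context, which is the whole point of the approach.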
Another element you might be asking about is how can these agentic AI systems handle larger tasks (things like OpenClaw). That is a bit more complicated and dependent on the systems design, but basically boils down to two things. The first is the “reasoning models” first break concepts into smaller tasks meaning the LLM only has to worry about a subset of a larger task. Secondly, a lot of these systems will periodically merge all past context into a compressed state that the LLM can handle (basically summaries of summaries) or add them to a database for future/faster reference.
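A toy sketch of that decomposition idea. `fake_llm` is a stand-in for real model calls, and the subtask list is supplied by hand here, where a real system's planner call would produce it:

```python
def fake_llm(prompt):
    # Stand-in: a real system would call a model here.
    return f"result({prompt})"

def run_task(task, subtasks):
    # Each subtask gets its own short-context call, so no single call
    # has to hold the whole problem in its context window.
    partial_results = [fake_llm(f"{task}: {sub}") for sub in subtasks]
    # A final call merges the partial results into one answer.
    merged = fake_llm("combine: " + "; ".join(partial_results))
    return merged

answer = run_task("summarize report", ["section 1", "section 2"])
```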
At the end of the day, your understanding of the limits of LLMs is correct: all the progress we've really seen with LLMs over the past couple of years has been the creation of systems to work around their limitations. The base technology isn't getting much better, but the support around it is.
Thanks.
And to clarify: corporate greed aside, is there any actual use case for these workarounds? I mean, if the building materials aren't strong enough, there's only so much you can achieve with a beautiful paint job (my current understanding, and I may be wrong).
The underlying issue with LLMs, in my opinion, is their nondeterministic nature. Even after zeroing out the temperature (the randomness of outputs), you can get significantly different results from two almost identical texts.
However, building out an ecosystem to support a new technology is a fairly common progression. If you compare it to the internet, things like browser caches, CDNs (content delivery networks), code minifiers, etc. are all ways to combat latency (a fundamental problem for the internet).
As for the effectiveness of these solutions, RAGs do help a lot when generating text against a select corpus. It's what enables the linked sources in things like ChatGPT and Google's AI results. It's also what a lot of companies are using for searching their support pages and the like. It's maybe not quite as good as speaking to a person, but it's faster.
Similarly, the reasoning models and the management of a model's "context" have both shown demonstrable improvements in benchmarking.
I’m not sure I personally believe this makes LLMs a replacement for humans in most situations, but it at least demonstrates forward progress for GenAI.
Others have explained it well: splitting calls up into parallel subjobs, and programmatic prompt engineering.
And what is the non-theoretical limit of AI?
Shrug.
But practically, transformer models are kinda hitting an “innovation” wall. Big companies aren’t taking risks to try and fix (say) the necessity of temperature to literally randomize outputs, or splitting instructions/context/output, or self correction (like an undo token), adaptation on the fly, anything.
All this has been explored in research papers, yet they aren’t even trying it at larger scales. They’re simply scaling up what they have, or (in the case of the Chinese labs) focusing on lowering resource usage.
Basically, corporate LLM development is far more conservative than you've been led to believe, and that's the wall LLMs are smacking into.
That's been my issue, i.e. somewhere I know all this LLM-led AI is a bubble. But the corporates either increase the context window or release something that does better parallel subjobs after 3 months, and now all of a sudden this LLM-led AI is the "future" and it can perform "agentic" tasks.
It kinda makes it impossible to make people (friends who are developers, colleagues) look past the marketing gimmicks.
I mean, even as-is, it’s a very useful tool. Especially as the capabilities we have get exponentially cheaper.
What people don’t get is AI is about to become a race to the bottom, not to the top. It’s a utility to sift through millions of documents, or run simple bots, or operate work assistants, or makeshift translators or whatever; you know, oldschool language modeling. And that’s really neat as the cost approaches “basically free.”
Basically, imagine running Claude Code on your iPhone, and Claude Code itself not really changing all that much. Imagine the economic implications for the big AI houses.
As for the marketing, I want some of what those tech execs are smoking.
The LLM just predicts probabilities for the single next token based on all previous tokens in the context window (its own and the ones entered by the user, the system prompt, or tool calls). The inference engine / runtime decides which token is selected, usually one with a high probability, but that's configurable.
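That selection step can be sketched concretely. The temperature knob mentioned elsewhere in the thread rescales the model's raw scores before they become probabilities; the tokens and logits below are made up for illustration:

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw model scores into a probability distribution.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature=1.0, rng=random):
    """The 'dice game': pick one token weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Toy vocabulary and made-up scores for the next token.
tokens = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.1)  # near-greedy
warm = softmax_with_temperature(logits, temperature=2.0)  # much flatter
```

With a low temperature the top-scoring token is picked almost every time; with a high temperature the dice get noticeably fairer.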
The LLM can also generate (predict) special tokens like “end of imaginary dialogue” to end its turn (the runtime will give the user a chance to reply) or to call tools (the runtime will call the tool and add the result to the context window).
The LLM does not really care whether the stuff in the context was put there by a user, the system prompt, a tool, or whatnot. It just predicts the next-token probabilities. If you configure the runtime accordingly, it will happily “play” the role of the user or of a tool (you usually don’t want that).
Some of the tool calls are e.g. web searches etc. and the search results will be added to the context window. The LLM can decide to do more calls for further research, save data in “memory” that can be accessed by later “sessions” or call other tools (new tools pop up daily).
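A minimal sketch of that runtime loop, with a scripted stand-in for the model and a fake search tool (all names and the `TOOL:` marker format here are hypothetical; real systems use dedicated special tokens):

```python
def scripted_model(context):
    # Stand-in for the LLM: first asks for a search, then answers.
    if "SEARCH_RESULT" not in context:
        return "TOOL:web_search:longest rivers"
    return "The longest river is the Nile (by most measures)."

def web_search(query):
    # Stand-in for a real search tool.
    return f"SEARCH_RESULT for '{query}': Nile, Amazon, Yangtze..."

def run(prompt, model, tools, max_steps=5):
    context = prompt
    for _ in range(max_steps):
        out = model(context)
        if out.startswith("TOOL:"):
            _, name, arg = out.split(":", 2)
            # The tool result is appended to the context window,
            # and the model is asked to continue from there.
            context += "\n" + tools[name](arg)
        else:
            return out  # a normal reply ends the turn
    return "step limit reached"

final = run("What is the longest river?", scripted_model, {"web_search": web_search})
```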
Models tend to get larger context windows with every update (right now usually between 250K and 2M tokens), but model performance usually gets worse as the context window fills up (needle in a haystack).
To keep the window small agentic tools often “compact” the context window by summarizing it and then starting a new session with the compacted context.
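A toy version of that compaction loop. The 4-characters-per-token rule and the `summarize` function are crude stand-ins, not how real tokenizers or summarizer calls work:

```python
MAX_TOKENS = 50  # toy budget; real windows are hundreds of thousands of tokens

def rough_token_count(text):
    # Crude heuristic: ~4 characters per token. Real tokenizers differ.
    return len(text) // 4

def summarize(text):
    # Stand-in: a real system would ask the model for a summary here.
    return "SUMMARY: " + text[:40] + "..."

def append_turn(context, turn):
    """Add a turn; if the budget is exceeded, compact the whole transcript."""
    context = context + "\n" + turn
    if rough_token_count(context) > MAX_TOKENS:
        context = summarize(context)  # over time this yields summaries of summaries
    return context

ctx = "system: you are a helpful assistant"
for i in range(10):
    ctx = append_turn(ctx, f"user: message number {i} with some filler text")
```

The invariant is that the context never exceeds the budget after a turn is appended; the price is that older detail survives only in summarized form.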
Sometimes a task is split into multiple sessions (agents) that each have their own context window. E.g. one extra session for a long context subtask like analysis of a long document with a specific task and the result is then sent to an orchestrator agent in charge of the big picture.
The fact that everything in the context window, regardless of origin, is used to predict the next token is also the reason why it’s so difficult to avoid prompt injection. It all “looks” the same to the LLM, and there is no “hard coded” way of excluding anything.
Its non-deterministic nature is honestly the scariest thing about vibe coding. In its early days, when I was experimenting with several LLMs, it quickly became apparent that I would spend 10 times as much time cleaning up its code as I would writing it myself, because it would just put in completely nonsensical code that did nothing.
I have mixed feelings about it. I wouldn’t trust it with a full production application, but I think it’s sometimes helpful if the LLM is able to generate a prototype or scaffold to get a head start. It removes some of the friction of starting a project.
The fully vibe coded stuff I’ve seen so far were usually unmaintainable dumpster fires.
This is the best explanation of prompt injection I’ve seen
The “agents” and “agentic” stuff works by wrapping the core innovation (the LLM) in layers of simple code and other LLMs. Let’s try to imagine building a system that can handle a request like “find where I can buy a video card today. Make a table of the sites, the available cards, their prices, and how they compare on a benchmark.” We could solve this if we had some code like:
```python
search_prompt = llm(f"make a list of google web search terms that will help answer this user's question. present the result in a json list with one item per search. <request>{user_prompt}</request>")
results_index = []
for s in json.loads(search_prompt):
    results_index.extend(google_search(s))
results = [fetch_url(url) for url in results_index]
summarized_results = [llm(f"summarize this webpage, fetching info on card prices and benchmark comparisons <page>{r}</page>") for r in results]
return llm(f"answer the user's original prompt using the following context: <context>{summarized_results}</context> <request>{user_prompt}</request>")
```

It’s pretty simple code, and LLMs can write code like that, so we can even have our LLM write the code that tells the system what to do! (I’ve omitted all the work needed to make things sane in terms of sandboxing and dealing with the output of the various internal LLMs.)
The important thing we’ve done here is instead of one LLM that gets too much context and stops working well, we’re making a bunch of discrete LLM calls where each one has a limited context. That’s the innovation of all the “agent” stuff. There’s an old Computer Science truism that any problem can be solved by adding another layer of indirection and this is yet another instance of that.
Trying to define a “limit” for this is not something I have a good grasp on. I guess I’d say that the limit here is the same: max tokens in the context. It’s just that we can use sub-tasks to help manage context, because everything that happens inside a sub-task doesn’t impact the calling context. To trivialize things: imagine that the max context is 1 paragraph. We could try to summarize my post by summarizing each paragraph into one sentence and then summarizing the paragraph made out of those sentences. It won’t be as good as if we could stick everything into the context, but it will be much better than if we tried to stick the whole post into a window that was too small and truncated it.
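The paragraph-summarization thought experiment can be sketched as a recursion. `tiny_summary` here just truncates, standing in for an LLM call whose context window is one paragraph:

```python
WINDOW = 100  # max characters a single "call" may see (stand-in for tokens)

def tiny_summary(text):
    # Stand-in: a real system would call the model; here we just truncate.
    return text[:30].rstrip() + "..."

def summarize_recursively(text):
    """If the text fits the window, summarize it directly; otherwise
    summarize each window-sized chunk and recurse on the joined summaries."""
    if len(text) <= WINDOW:
        return tiny_summary(text)
    chunks = [text[i:i + WINDOW] for i in range(0, len(text), WINDOW)]
    joined = " ".join(tiny_summary(c) for c in chunks)
    return summarize_recursively(joined)

long_text = "The gun was loaded in chapter one. " * 20  # far over the window
result = summarize_recursively(long_text)
```

Each level of recursion loses detail, which is exactly the trade-off described above: better than truncation, worse than a window big enough for everything.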
Some tasks will work impressively well with this framework: web pages tend to be a TON of tokens but maybe we’re looking for very limited info in that stack, so spawning a sub-LLM to find the needle and bring it back is extremely effective. OTOH tasks that actually need a ton of context (maybe writing a book/movie/play) will perform poorly because the sub-agent for chapter 1 may describe a loaded gun but not include it in its output summary for the next agent. (But maybe there are more ways of slicing up the task that would allow this to work.)
An LLM reads the previous prompts and replies, plus any base prompts. This is considered the context window. Don’t ask me why it’s not infinite.
The machine will then generate text following the previous text that continues the spirit and intent of the previous text, based on other texts previously digested into weights.
It’s the same thing as your phone’s autocomplete, but with a few gigabytes of weights instead of a few kilobytes.
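For what it's worth, the autocomplete being compared to here is roughly a bigram table: for each word, track which word most often follows it. A minimal sketch (the training text is made up):

```python
from collections import Counter, defaultdict

training_text = "the cat sat on the mat and the cat ran"

def build_bigram_model(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    follows = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows

def autocomplete(model, word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

model = build_bigram_model(training_text)
next_word = autocomplete(model, "the")
```

Whether the analogy holds at LLM scale is disputed further down the thread, but this is what the kilobyte version actually does.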
If the data it’s working with is larger than the context, it will lose it. There’s a chance it’ll hallucinate anyway, because the text generator is non-deterministic. Say you’re working with insurance data. Maybe your data is familiar enough to data it previously ingested. So now it starts using wrong data, but it “feels” right as far as the LLM is concerned, because it’s a text generator, not a truth checker.
You can ask it to look again, but it’s just generating fresh tokens while the context gets more polluted.
Just start looking at the volumes of non-trivial pseudo-information it generates, and try to verify some of the facts it states about your data.
It’s fundamentally not the same thing as autocomplete. Give autocomplete all the data an LLM has, every gig, every terabyte of it, and it still won’t be an LLM. Autocomplete lacks the semantic meaning layer as well as some other parts. People say it’s nothing but autocomplete from a misunderstanding of what a reward function does in backpropagation training (saying “the reward function is to predict the next word” is not even close to the equivalent of “it’s doing the same thing as autocomplete”).
I’m writing this short reply with hopes that when I have more time in the next two days or so I’ll come back with a more complete explanation, (including why context windows have to be limited).
Disclaimer: All of my LLM experience is with local models in Ollama on extremely modest hardware (an old laptop with Nvidia graphics), so I can’t speak to the technical reasons the context window isn’t infinite, or at least larger, on the big players’ models. My understanding is that the context window is basically its short-term memory. In humans, short-term memory is also fairly limited in capacity. But unlike humans, the LLM can’t really see (or hold) the big picture in its mind.
But yeah, all you said is correct. Expanding on that, if you try to get it to generate something long-form, such as a novel, it’s basically just generating infinite chapters using the previous chapter (or as much of the history fits into its context window) as reference for the next. This means, at minimum, it’s going to be full of plot holes and will never reach a conclusion unless explicitly directed to wrap things up. And, again, given the limited context window, the ending will be full of plot holes and essentially based only on the previous chapter or two.
It’s funny because I recently found an old backup drive from high school with some half-written Jurassic Park fan fiction on it, so I tasked an LLM with fleshing it out, mostly for shits and giggles. The result is pure slop that seems like it’s building to something and ultimately goes nowhere. The other funny thing is that it reads almost exactly like a season of Camp Cretaceous / Chaos Theory (the animated kids JP series) and I now fully believe those are also LLM-generated.
You can improve the novel writing by using agents. First you generate just an outline with the plot points to every chapter. Then you chop that up and feed it to several agents to flesh out individual chapters. Finally the generated chapters are verified against the outline and overall plot. If that doesn’t fit, the agents are tasked with a rewrite. Repeat that until you have something serviceable.
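A toy sketch of that outline-then-rewrite loop. `fake_llm` stands in for real model calls, and the "verification" is just a substring check, where a real system would use another LLM call to judge consistency:

```python
def fake_llm(prompt):
    # Stand-in: a real system would call a model here.
    return f"text for [{prompt}]"

def write_novel(plot_points, max_rewrites=3):
    """Draft one chapter per plot point, then verify and rewrite failures."""
    chapters = [fake_llm(f"write chapter from outline: {p}") for p in plot_points]
    for _ in range(max_rewrites):
        # Toy verification: every chapter must mention its own plot point.
        bad = [i for i, (p, c) in enumerate(zip(plot_points, chapters)) if p not in c]
        if not bad:
            break
        for i in bad:
            chapters[i] = fake_llm(f"rewrite chapter from outline: {plot_points[i]}")
    return chapters

book = write_novel(["hero leaves home", "hero finds the sword"])
```

The orchestration logic is ordinary code; only the drafting and judging steps are model calls.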
As you point out, there exists plenty of bad writing in TV series. These often have a number of different authors, who don’t necessarily know the other episodes very well.
I will say that while most of these models are non-deterministic, their training data was very similar, so if you did something like this, I can guarantee that if you churned out enough, you would start to see the common threads.
Sure. Lots of fiction, especially TV, sticks to well-established tropes, regardless of whether a human wrote it or not.
Disclaimer: I am honestly a layman in this field. I may get a bunch of stuff wrong, but am happy to learn from experts. Feel free to point mistakes out and destroy me in the replies.
To simplify and phrase my understanding, an LLM works like this: given a prompt ("Write a program to check if the input is an odd number"), it converts the prompt to embeddings, then plays a dice/probability game: given the prompt so far, generate a set of new tokens.
This feels like an oversimplification. Unfortunately, I can’t think of a good analogy without anthropomorphizing LLMs.
IMO this anime scene works well enough as an analogy at a super high level: anime_irl
“Comprehending what other people are saying is one step” - encoder
“Thinking about how to answer is one more step” - working with the feature representation
“Putting the things that popped into my mind into words is another step” - decoder
Now my question is, how are current LLMs able to parse through a bunch of search results and play the above dice game?
By current LLMs, I am going to assume that you are not referring to the raw models, but platforms like ChatGPT, Perplexity, etc with UIs for you to interact with the underlying models.
There are fundamentally two different problems here: searching the web for answers, and putting the answers into words.
Like, at times it reads through, say, 10 URLs and generates results; how are they able to achieve this?
If I ask you: “What is the colour of fire engines?”, I imagine you would answer “Red”, sometimes “Yellow”, off the top of your head.
What if I ask you “What are the 10 longest rivers in the world”? I believe you won’t be able to give me an answer right away. What you can do is a web search, find the answer, then present the results to me. You can give it to me in 10 short bullets points, or you can come up with an essay with paragraphs describing each river.
You probably got my point by now, but to make it explicit: finding an answer and putting it into words are two different processes. They are independent of each other, so the final text output can be as long or as short as need be.
For these LLM platforms, when the model “doesn’t know” the answer, they probably have a subroutine that searches the web, then feeds the results to the underlying model. The model then packages the search results into readable form (words instead of vectors) for you.
What’s the engineering behind generating such huge volumes of text?
Sorry but I can’t think of a good answer to this at the moment; leaving it to others for now - unless I managed to think of something good.
Because I always argue about the theoretical limitations of LLMs, but now that these “agents” are able to manage huge volumes of text I don’t seem to have a good argument. So what exactly is happening? And what is the non-theoretical limit of AI?

Same for this question.
Hope the partial answer helps; tried my best to ELI5.