

It’s like you didn’t listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It’s utterly irrational. You have to be trolling me now.
… because this sign was made before the IBM PC was invented.
Language changes over time.
it’s so good at parsing text and documents, summarizing
No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn’t matter, go ahead and use LLMs.
If you just want some ideas that you’re going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you’re using LLMs effectively.
But if you’re trusting it, you’re doing it very, very wrong and you’re going to get humiliated because other people are going to catch you out in repeating an LLM’s bullshit.
You’re better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. They’re always slower than an LLM, but an LLM gets more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.
It’s a born bullshitter. It knows a little about a lot, but it has no clue what’s real and what’s made up, or it doesn’t care.
If you want some text quickly, that sounds right, but you genuinely don’t care whether it is right at all, go for it, use an LLM. It’ll be great at that.
I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE strikingly fallible, because, buddy, you have your head so far in the sand on this issue it’s weird.
The solution to the problem was to realise that an LLM cannot be trusted for accuracy. Even if the first few results are completely accurate, the bullshit will creep in. Don’t trust the LLM. Check every fucking thing.
In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That’s how formulaic it was. I regretted deeply trying to get an LLM to use data.
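For anyone wondering what a script like that looks like, here is a rough Python sketch of the idea: split each line on tabs, drop the fields into a fixed sentence. The column names and the sentence template are made up for illustration (the real ones are covered by the contract), and `rows_to_sentences` is just a hypothetical helper name.

```python
# Hypothetical column names and template: the real data layout is not public.
FIELDS = ("name", "count", "unit", "date")
TEMPLATE = "{name} recorded {count} {unit} on {date}."


def rows_to_sentences(tsv_text, fields=FIELDS, template=TEMPLATE):
    """Turn tab-separated rows into one formulaic sentence per row."""
    sentences = []
    for line_no, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != len(fields):
            # Fail loudly instead of guessing, which is the whole point.
            raise ValueError(
                f"line {line_no}: expected {len(fields)} fields, got {len(parts)}"
            )
        sentences.append(template.format(**dict(zip(fields, parts))))
    return sentences


if __name__ == "__main__":
    sample = "Widget A\t12\tunits\t2024-01-05\nWidget B\t7\tunits\t2024-01-06\n"
    for sentence in rows_to_sentences(sample):
        print(sentence)
```

Row 700 gets exactly the same treatment as row 1, and a malformed row raises an error rather than getting a plausible-sounding sentence made up for it.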
The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn’t matter how firmly or how often you ask it to be accurate or use the input carefully. It’s going to lie to you before long. It’s an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don’t trust it to tell you the truth any more than you would trust Donald J Trump to.
How do I subscribe to a user or community on piefed.world and see it in my lemmy.world feed?
Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
I agree it was a dumb comparison to start off with.
I wasn’t the one who made it, but the license issue is the logical conclusion if OP insists on the comparison.
Again with dismissing the evidence of my own eyes!
I wasn’t asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I’d told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.
Again, I didn’t read it in a fucking article, I read it on my fucking computer screen, so if you’d stop fucking telling me I’m stupid for using it the way it fucking told me I could use it, or that I’m stupid for believing what the media tell me about LLMs, when all I’m doing is telling you my own experience, you’d sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.
Wow. 30% accuracy was the high score!
From the article:
Testing agents at the office
For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.
They call it TheAgentCompany. It’s a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.
the CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.
⚫ Gemini-2.5-Pro (30.3 percent)
⚫ Claude-3.7-Sonnet (26.3 percent)
⚫ Claude-3.5-Sonnet (24 percent)
⚫ Gemini-2.0-Flash (11.4 percent)
⚫ GPT-4o (8.6 percent)
⚫ o3-mini (4.0 percent)
⚫ Gemini-1.5-Pro (3.4 percent)
⚫ Amazon-Nova-Pro-v1 (1.7 percent)
⚫ Llama-3.1-405b (7.4 percent)
⚫ Llama-3.3-70b (6.9 percent)
⚫ Qwen-2.5-72b (5.7 percent)
⚫ Llama-3.1-70b (1.7 percent)
⚫ Qwen-2-72b (1.1 percent)
“We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks,” the authors state in their paper
Why are you giving it data
Because there’s a button for that.
It’s output is dependent on the input
This thing that you said… It’s false.
If guns are so alike to cars, why not require a license that you get by passing a written test on gun safety and a practical test on basic competence and safe usage?
It’s not completely random, but I’m telling you it fucked up, it fucked up badly, time after time, and I had to check every single thing manually. Its correctness run never lasted beyond a handful. If you build something using some equation it invented you’re insane and should quit engineering before you hurt someone.
The same kind of grill that can be bricked remotely if you stop paying for software updates.
Definitely, but I think that Proud Boys leader who showed he could take a black dildo probably thought he was doing some really clever double bluff thing, but we see you Gavin McInnes. We see you and the insecurities you’re fighting so hard to hide.
Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.
People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.
This is hilarious. I laughed for some time.
“Log back in to continue your OralB brushing experience”
Who thought it would be a good idea to have an online toothbrush, who decided to log customers out after a period of inactivity, and why, for all that is sane in the world, would not being logged in stop you from doing anything at all with your toothbrush!?!
Ah, my bad, you’re right, for being consistently correct, I should have done 0.3^10=0.0000059049
so the chances of it being right ten times in a row are less than one thousandth of a percent.
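If you want to check the arithmetic yourself, here is a quick Python sketch. It assumes each task succeeds independently with the same probability, which is a simplification, and it rounds the benchmark's best score of 30.3 percent down to 0.3. `chance_all_correct` is just a name picked for this example.

```python
def chance_all_correct(per_task_success, tasks):
    """Probability of getting every task right, assuming independent tasks."""
    return per_task_success ** tasks


if __name__ == "__main__":
    p = chance_all_correct(0.3, 10)
    print(f"probability:   {p:.10f}")
    print(f"as a percent:  {p * 100:.6f}%")
    print("below one thousandth of a percent:", p * 100 < 0.001)
```

Swap in a different success rate or number of tasks to see how fast the chance of an unbroken run of correct answers collapses.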
No wonder I couldn’t get it to summarise my list of data right and it was always lying by the 7th row.
I already told you my experience of the crapness of LLMs and even explained why I can’t share the prompt etc. You clearly weren’t listening or are incapable of taking in information.
There’s also all the testing done by the people talked about in the article we’re discussing which you’re also irrationally dismissing.
You have extreme confirmation bias.
Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.