A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.
It also includes outtakes on the ‘reasoning’ models.
The switch you mention (from 4th-gen to 5th-gen GPT) is when they introduced the model router, which created a lot of friction. Basically, it tries to answer your question with as cheap a model as possible, so most of the time you won’t be using the flagship 5.2 but a 5.2-mini or 5.2-tiny, which are seriously dumber. This is done to save money, of course, and the only way to guarantee pure 5.2 usage is to go through the API, where you pay for every token.
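To make the routing idea concrete, here's a minimal sketch of what a cost-based router might look like. Everything here is an assumption: the model names are just the ones from this thread, and the complexity score and capability cutoffs stand in for whatever internal classifier the provider actually runs — this is not OpenAI's real implementation.

```python
# Hypothetical cost-based model router. Model names and cutoffs are
# illustrative only, not any provider's actual routing logic.
MODELS = [
    ("5.2-tiny", 0.1),  # (name, relative cost per token)
    ("5.2-mini", 0.4),
    ("5.2", 1.0),       # flagship, used only when deemed necessary
]
THRESHOLDS = [0.3, 0.7, 1.0]  # assumed capability cutoff per tier

def route(prompt: str, complexity_score: float) -> str:
    """Pick the cheapest model whose assumed capability covers the prompt.

    complexity_score in [0, 1] stands in for the provider's internal
    difficulty classifier; the cheapest tier that clears it wins.
    """
    for (name, _cost), cutoff in zip(MODELS, THRESHOLDS):
        if complexity_score <= cutoff:
            return name
    return MODELS[-1][0]  # fall back to the flagship
```

Under this sketch, `route("what is 2+2?", 0.1)` lands on `"5.2-tiny"` while `route("prove this lemma", 0.9)` escalates to `"5.2"` — which is exactly why most casual traffic never touches the flagship.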
There’s also a ton of affect and personal bias. Humans are notoriously bad at evaluating others’ intelligence, and this is especially true with chatbots, which try to mimic specific personalities that may or may not mesh well with your own. For example, OpenAI’s signature “salesman & bootlicker” personality is grating to me, and I consistently think it’s stupider than it is. I’ve even done a bit of double-blind evaluation on various cognitive tasks to confirm my impression, but the data really didn’t agree with me. It’s smart, roughly as smart as other models of its generation, but it’s just fucking insufferable. It’s like I see Sam Altman’s shit-eating grin each time I read a word from ChatGPT, and that’s why I stopped using it. That’s a property of me, the human, not GPT, the machine.
It goes beyond the problems introduced by the model router, though. I have to work with GPT 5.2 for my job (along with Claude, Gemini, and a few others), and we have enterprise API access to it. So when I select GPT 5.2 as the model, we’re spending tokens on the real thing.
And it’s pretty bad. It’s noticeably worse than the 4.x series. I find myself having to fix its mistakes far more often.
I’ve struggled to reason out an explanation, and model collapse really seems like a contender, especially if you follow the information-theoretic arguments about why training these things is so hard.
As it happens, there’s a new talk about exactly this from George D. Montañez. You might find it interesting: https://youtu.be/ShusuVq32hc