Ollama will almost always be slower than running vLLM or llama.cpp; nobody should be suggesting it for anything agentic. On most consumer hardware, llama.cpp's --cpu-moe flag alone is absurdly good and worth the effort of familiarizing yourself with llama.cpp instead of Ollama.
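For anyone who hasn't tried it, a minimal invocation looks roughly like this (the model path, context size, and port are just placeholders, and --cpu-moe only matters for MoE models):

    llama-server -m /models/some-moe-model.gguf \
        -ngl 99 \
        --cpu-moe \
        -c 16384 \
        --port 8080

The point is that -ngl offloads as many layers as possible to the GPU while --cpu-moe keeps the MoE expert weights in system RAM, which is why a card with modest VRAM can still run a large sparse model at reasonable speed.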
AI Acknowledgement
The joke is worth the slop, imo. “Cpu Moe”. 😂 Find me an anime drawing of a CPU (especially an iconic one) and I’ll use that instead.
In your defense, I’ve thought the same joke every time I’ve seen it lol
I have used Ollama so far and it's indeed quite slow. Can you recommend a good guide for setting up llama.cpp (on Linux)? I have Ollama running in a Docker container with OpenWebUI; that kind of setup would be ideal.
I just run the llama-swap docker container with a config file mounted, set to listen for config changes so I don't have to restart it to add new models. I don't have a guide besides the README for llama-swap.
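Roughly, the setup looks like this; the model entry, paths, and image tag below are just examples from memory, so treat the llama-swap README as authoritative:

    # config.yaml, mounted into the container
    models:
      "qwen3-30b":
        cmd: |
          /app/llama-server -m /models/qwen3-30b.gguf --cpu-moe --port ${PORT}

    # run llama-swap and have it reload the config when it changes
    docker run -d --name llama-swap --gpus all -p 8080:8080 \
        -v ./config.yaml:/app/config.yaml \
        -v ./models:/models \
        ghcr.io/mostlygeek/llama-swap:cuda \
        --config /app/config.yaml --watch-config

Since llama-swap exposes an OpenAI-compatible API, OpenWebUI can point at it the same way it pointed at Ollama: add http://<host>:8080/v1 as an OpenAI API connection and the models defined in the config should show up in the model picker.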