May 17, 2025 - Technology media outlet WinBuzzer published a blog post yesterday (May 16) reporting that Ollama, the open-source tool for running large language models locally, has launched a self-developed multimodal AI engine, removing its direct dependency on the llama.cpp framework.

The llama.cpp project recently integrated full vision support through its libmtmd library, and Ollama's relationship with that project has sparked community discussion.
A member of the Ollama team clarified on Hacker News that Ollama's engine is developed independently in Go rather than borrowed directly from llama.cpp's C++ implementation, and thanked the community for feedback that has helped improve the technology.
In an official statement, Ollama noted that as models such as Meta's Llama 4, Google's Gemma 3, Alibaba's Qwen 2.5 VL, and Mistral Small 3.1 have grown more complex, its existing architecture was struggling to meet demand.
That is why Ollama is launching a new engine, targeting a breakthrough in local inference accuracy, especially when handling large images and generating large numbers of tokens.
The new engine introduces additional metadata for image processing to optimize batching and the management of positional data, avoiding the output-quality degradation that incorrect image splitting can cause, and it applies KV cache optimization techniques to accelerate transformer inference.
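To make the positional-metadata idea concrete, the following minimal Go sketch shows how image-patch tokens might be kept as one contiguous, explicitly positioned block inside an inference batch; the types and function names are hypothetical illustrations, not Ollama's actual internals.

package main

import "fmt"

// Token is a minimal stand-in for an entry in an inference batch.
// These are illustrative types only, not Ollama's API.
type Token struct {
	ID        int  // vocabulary ID for text, or a placeholder for image patches
	Pos       int  // absolute position used for positional embeddings and the KV cache
	FromImage bool // marks tokens produced by the vision encoder
}

// buildBatch interleaves text tokens and image-patch tokens while assigning
// contiguous positions, so an image's patches are never split in a way that
// corrupts the positional data the attention layers rely on.
func buildBatch(prompt []int, imagePatches int, imageAt int) []Token {
	var batch []Token
	pos := 0
	for i, id := range prompt {
		if i == imageAt {
			// Emit all image-patch tokens as one contiguous block.
			for p := 0; p < imagePatches; p++ {
				batch = append(batch, Token{ID: -1, Pos: pos, FromImage: true})
				pos++
			}
		}
		batch = append(batch, Token{ID: id, Pos: pos})
		pos++
	}
	return batch
}

func main() {
	batch := buildBatch([]int{101, 2023, 2003, 102}, 4, 2)
	for _, t := range batch {
		fmt.Printf("pos=%d image=%v id=%d\n", t.Pos, t.FromImage, t.ID)
	}
}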
The engine also significantly improves memory management: a new image caching feature ensures that processed images can be reused rather than discarded prematurely, and Ollama has partnered with hardware vendors including NVIDIA, AMD, Qualcomm, Intel, and Microsoft to detect hardware metadata accurately and refine memory estimation.
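The image caching concept can be sketched as a small LRU cache keyed by image content, so that a re-submitted image is served from memory instead of being re-encoded. The Go sketch below is an assumption-laden illustration of that idea; none of these names correspond to Ollama's internal code.

package main

import (
	"container/list"
	"crypto/sha256"
	"fmt"
)

// imageCache is a small LRU cache keyed by image-content hash.
// Illustrative only; Ollama's internal cache is not public API.
type imageCache struct {
	capacity int
	order    *list.List // most recently used entries at the front
	entries  map[[32]byte]*list.Element
}

type cacheEntry struct {
	key       [32]byte
	embedding []float32 // vision-encoder output kept for reuse
}

func newImageCache(capacity int) *imageCache {
	return &imageCache{capacity: capacity, order: list.New(), entries: map[[32]byte]*list.Element{}}
}

// Get returns a previously computed embedding so the image is not re-encoded.
func (c *imageCache) Get(image []byte) ([]float32, bool) {
	key := sha256.Sum256(image)
	if el, ok := c.entries[key]; ok {
		c.order.MoveToFront(el)
		return el.Value.(cacheEntry).embedding, true
	}
	return nil, false
}

// Put stores an embedding, evicting the least recently used entry when full.
func (c *imageCache) Put(image []byte, embedding []float32) {
	key := sha256.Sum256(image)
	if el, ok := c.entries[key]; ok {
		c.order.MoveToFront(el)
		el.Value = cacheEntry{key: key, embedding: embedding}
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(cacheEntry).key)
	}
	c.entries[key] = c.order.PushFront(cacheEntry{key: key, embedding: embedding})
}

func main() {
	cache := newImageCache(2)
	img := []byte("example image bytes")
	cache.Put(img, []float32{0.1, 0.2})
	if emb, ok := cache.Get(img); ok {
		fmt.Println("reused cached embedding:", emb)
	}
}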
The engine additionally supports techniques such as chunked attention and 2D rotary embedding for models like Meta's Llama 4 Scout, a 109-billion-parameter mixture-of-experts (MoE) model.
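As a rough intuition for 2D rotary embedding, the conceptual Go sketch below rotates one half of a patch vector by its row position and the other half by its column position, following the general pattern of standard RoPE; this is a simplified assumption-based example, not the exact formulation used by Ollama or Llama 4.

package main

import (
	"fmt"
	"math"
)

// apply2DRoPE is a minimal, conceptual sketch of 2D rotary position embedding:
// the first half of each vector is rotated by the patch's row position and the
// second half by its column position.
func apply2DRoPE(vec []float32, row, col int, base float64) []float32 {
	out := make([]float32, len(vec))
	half := len(vec) / 2
	rotate := func(dst, src []float32, pos int) {
		for i := 0; i+1 < len(src); i += 2 {
			// Frequency decreases with pair index, as in standard RoPE.
			theta := float64(pos) / math.Pow(base, float64(i)/float64(len(src)))
			sin, cos := math.Sincos(theta)
			x, y := float64(src[i]), float64(src[i+1])
			dst[i] = float32(x*cos - y*sin)
			dst[i+1] = float32(x*sin + y*cos)
		}
	}
	rotate(out[:half], vec[:half], row) // rotate by vertical position
	rotate(out[half:], vec[half:], col) // rotate by horizontal position
	return out
}

func main() {
	patch := []float32{1, 0, 1, 0, 1, 0, 1, 0}
	fmt.Println(apply2DRoPE(patch, 3, 5, 10000))
}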
Looking ahead, Ollama plans to support longer context lengths, more complex reasoning workflows, and streaming responses for tool calls, further broadening what local AI models can do.