Ollama Launches Self-Developed Multimodal AI Engine: Gradually Shedding the llama.cpp Framework Dependency, Local Inference Performance Soars

May 17, 2025 - Technology media outlet WinBuzzer published a blog post yesterday (May 16) reporting that Ollama, the open-source large language model serving tool, has launched a self-developed multimodal AI engine, moving away from its direct dependency on the llama.cpp framework.


The llama.cpp project recently integrated full vision support through its libmtmd library, and Ollama's relationship to that project has sparked community discussion.

A member of the Ollama team clarified on Hacker News that the engine is developed independently in Golang and does not borrow directly from llama.cpp's C++ implementation, and thanked the community for feedback that helped improve the technology.

In an official statement, Ollama noted that with the increasing complexity of models such as Meta's Llama 4, Google's Gemma 3, Alibaba's Qwen 2.5 VL, and Mistral Small 3.1, the existing architecture was struggling to meet demand.

That is why Ollama is launching the new engine, which targets a breakthrough in local inference accuracy, especially when handling large images and generating large numbers of tokens.

Ollama introduces additional metadata for image processing, optimizing batch processing and positional-data management to avoid the output-quality degradation caused by image segmentation errors. On top of that, it applies KV cache optimization techniques to accelerate transformer-model inference, as sketched below.
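Ollama has not published the internals of its cache in this announcement, but the idea behind KV caching is straightforward: the key and value tensors computed for earlier tokens are stored so that each new decoding step attends over cached entries instead of re-encoding the whole prompt. The following is a minimal, hypothetical sketch in Go (Ollama's implementation language); all type and function names here are illustrative, not Ollama's actual API.

```go
package main

import "fmt"

// kvEntry holds the key and value vectors cached for one token position.
// Illustrative only; not Ollama's actual types.
type kvEntry struct {
	key   []float32
	value []float32
}

// kvCache stores per-layer key/value entries so earlier tokens are not
// re-encoded on every decoding step.
type kvCache struct {
	layers [][]kvEntry // layers[layer] is the sequence of cached entries
}

func newKVCache(numLayers int) *kvCache {
	return &kvCache{layers: make([][]kvEntry, numLayers)}
}

// Append stores the key/value vectors computed for the newest token.
func (c *kvCache) Append(layer int, key, value []float32) {
	c.layers[layer] = append(c.layers[layer], kvEntry{key: key, value: value})
}

// Context returns everything cached for a layer; attention for the new
// token is computed against these entries rather than by re-running the
// model over the entire prompt.
func (c *kvCache) Context(layer int) []kvEntry {
	return c.layers[layer]
}

func main() {
	cache := newKVCache(2)
	cache.Append(0, []float32{0.1, 0.2}, []float32{0.3, 0.4})
	cache.Append(0, []float32{0.5, 0.6}, []float32{0.7, 0.8})
	fmt.Printf("layer 0 holds %d cached positions\n", len(cache.Context(0)))
}
```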

The new engine also dramatically improves memory management: a new image-caching feature ensures that processed images can be reused rather than discarded prematurely (see the sketch below). Ollama has additionally teamed up with hardware vendors including NVIDIA, AMD, Qualcomm, Intel, and Microsoft to improve memory estimation through accurate detection of hardware metadata.
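The announcement does not detail how the image cache works; one plausible design is reference counting, where an image's processed embedding stays cached until every request using it has finished. The Go sketch below is a hypothetical illustration under that assumption; none of these names come from Ollama's codebase.

```go
package main

import (
	"fmt"
	"sync"
)

// imageCache is an illustrative sketch of a cache that keeps processed
// image embeddings alive until every request referencing them is done.
type imageCache struct {
	mu      sync.Mutex
	entries map[string]*imageEntry // keyed by image hash
}

type imageEntry struct {
	embedding []float32
	refs      int
}

func newImageCache() *imageCache {
	return &imageCache{entries: make(map[string]*imageEntry)}
}

// Put stores a freshly processed embedding, holding one reference for
// the request that produced it.
func (c *imageCache) Put(hash string, embedding []float32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[hash] = &imageEntry{embedding: embedding, refs: 1}
}

// Acquire returns the cached embedding and bumps its reference count;
// callers must pair it with Release.
func (c *imageCache) Acquire(hash string) ([]float32, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[hash]
	if !ok {
		return nil, false
	}
	e.refs++
	return e.embedding, true
}

// Release drops one reference; the entry is evicted only once no request
// still needs it, so images are never discarded prematurely.
func (c *imageCache) Release(hash string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[hash]; ok {
		e.refs--
		if e.refs <= 0 {
			delete(c.entries, hash)
		}
	}
}

func main() {
	cache := newImageCache()
	cache.Put("img-abc", []float32{0.1, 0.2, 0.3})
	if emb, ok := cache.Acquire("img-abc"); ok {
		fmt.Println("reused cached embedding of length", len(emb))
		cache.Release("img-abc")
	}
	cache.Release("img-abc") // drop the initial reference; entry evicted
}
```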

The engine also supports techniques such as chunked attention and 2D rotary embedding for models like Meta's Llama 4 Scout, a 109-billion-parameter mixture-of-experts (MoE) model.
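2D rotary embedding extends the 1D RoPE familiar from text models to the two axes of an image: each vision patch's grid position is encoded by rotating one half of its feature vector according to its row index and the other half according to its column index. The Go sketch below illustrates the idea under common conventions (base 10000, pairwise rotation); it is a conceptual example, not Llama 4 Scout's exact formulation.

```go
package main

import (
	"fmt"
	"math"
)

// rope1D applies standard 1D rotary embedding to vec for a given
// position: consecutive pairs (vec[2i], vec[2i+1]) are rotated by an
// angle that depends on the position and the pair index.
func rope1D(vec []float32, pos int, base float64) {
	for i := 0; i+1 < len(vec); i += 2 {
		theta := float64(pos) / math.Pow(base, float64(i)/float64(len(vec)))
		sin, cos := math.Sincos(theta)
		x, y := float64(vec[i]), float64(vec[i+1])
		vec[i] = float32(x*cos - y*sin)
		vec[i+1] = float32(x*sin + y*cos)
	}
}

// rope2D encodes an image patch's (row, col) grid position by applying
// rotary embedding to the first half of the vector with the row index
// and to the second half with the column index.
func rope2D(vec []float32, row, col int) {
	half := len(vec) / 2
	rope1D(vec[:half], row, 10000)
	rope1D(vec[half:], col, 10000)
}

func main() {
	patch := []float32{1, 0, 1, 0, 1, 0, 1, 0}
	rope2D(patch, 3, 5) // patch at grid row 3, column 5
	fmt.Println(patch)
}
```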

Ollama plans to support longer context lengths, complex reasoning processes, and streaming responses for tool calls in the future, further enhancing the versatility of local AI models.
