Today, we're releasing LFM2.5-350M, an improved version of our 350M model with additional pre-training (from 10T to 28T tokens) and large-scale reinforcement learning. Built on the LFM2 architecture, it delivers exceptionally fast inference and runs everywhere, from cloud GPUs to cheap CPUs. It excels at tool use, data extraction, and structured outputs - all at 350M parameters - making it purpose-built for large-scale data processing and function calling at the edge.
To meet developers where they are, we ensure our models run seamlessly on preferred inference engines and hardware across CPUs, NPUs, and GPUs. We are thrilled to welcome Zetic, RunAnywhere, and Mirai to our local AI ecosystem, joining AMD, Qualcomm Technologies, Intel, LM Studio, and Cactus Compute to drive robust edge deployments across vehicles, smartphones, laptops, IoT, and embedded systems. Additionally, our new collaboration with Distil Labs empowers teams to effortlessly fine-tune highly efficient LFMs to replace expensive and slow LLMs using just model traces.
The base (LFM2.5-350M-Base) and post-trained (LFM2.5-350M) models are available today on Hugging Face, LEAP, and our Playground. Check out our docs on how to run and fine-tune them locally.
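For a quick smoke test with Hugging Face transformers, a minimal sketch might look like the following. The `LiquidAI/LFM2.5-350M` repo id is inferred from this announcement, not verified; check the model card for the exact identifier and chat template:

```python
# Minimal sketch: load and prompt LFM2.5-350M with transformers.
# The repo id below is assumed from this announcement, not verified.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-350M"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Extract all dates from: 'Invoice issued 2026-03-01, due 2026-03-31.'"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```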

Benchmarks
We evaluated LFM2.5-350M across ten benchmarks spanning core capabilities and applied tasks. It outperforms models more than twice its size both on core capabilities such as knowledge (GPQA Diamond, MMLU-Pro) and instruction following (IFEval, IFBench, Multi-IF), and on applied tasks such as data extraction (CaseReportBench) and tool use (BFCLv3, BFCLv4, 𝜏²-Bench Telecom and Retail).
Compared to LFM2-350M, three capabilities improve significantly: instruction following (IFBench 18.20 → 40.69), data extraction (CaseReportBench 11.67 → 32.45), and tool use (BFCLv3 22.95 → 44.11).
This makes LFM2.5-350M an ideal fit for large-scale data extraction pipelines and lightweight on-device agents (see the tool-calling sketch below). However, it is not recommended for tasks like math, code, and creative writing.
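To illustrate the tool-use workflow these scores measure, here is a hedged single-turn function-calling sketch. It assumes the model's chat template accepts a `tools` list the way recent transformers templates do; the repo id, the `get_weather` helper, and the exact call format the model emits are all illustrative assumptions:

```python
# Hypothetical tool-calling sketch; get_weather is an illustrative stub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-350M"  # assumed repo id

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 21°C"  # stub result for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "What's the weather in Boston right now?"}]
# transformers converts the function signature and docstring into a JSON
# schema and injects it into the prompt via the model's chat template.
input_ids = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
# Expect a structured tool call naming get_weather with {"city": "Boston"};
# in a real agent loop you would parse it, run the function, append the
# result as a tool message, and generate again.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```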
Fast Inference Everywhere
LFM2.5-350M ships with day-one support across the inference ecosystem:
- LEAP — Liquid's Edge AI Platform for iOS and Android deployment
- llama.cpp — GGUF checkpoints for efficient edge inference (see the sketch after this list)
- MLX — Optimized inference for Apple Silicon
- vLLM — High-throughput GPU serving for production workloads
- SGLang — Alternative GPU-accelerated serving engine for production throughput
- ONNX — Cross-platform inference across diverse accelerators
- OpenVINO — Optimization and deployment toolkit for Intel hardware
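As a concrete example of the GGUF path, here is a sketch using llama-cpp-python, which runs the same CPU-friendly checkpoints llama.cpp uses. The GGUF repo id and quantization filename are assumptions; check the Hugging Face release for the actual artifacts:

```python
# CPU inference from a GGUF checkpoint via llama-cpp-python.
# Repo id and quant filename below are assumed, not confirmed names.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="LiquidAI/LFM2.5-350M-GGUF",  # assumed GGUF repo
    filename="*Q4_K_M.gguf",              # assumed quant; glob picks the file
    n_ctx=4096,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three edge use cases for a 350M model."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```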
CPU inference. Thanks to the efficient LFM2 architecture, LFM2.5-350M is significantly faster than similarly sized models, including SSM hybrids (Granite-4.0-H-1B) and Gated Delta Networks (Qwen3.5-0.8B).

GPU inference. To enable large-scale data processing, we support high-throughput inference engines such as vLLM and SGLang. We measure output throughput (total output tokens / wall time) on a single NVIDIA H100 SXM5 GPU under sustained load: at each concurrency level, we continuously maintain the target number of in-flight requests, sending a new request as soon as one completes.

Each model is benchmarked with SGLang 0.5.9 across concurrency levels from 1 to 4,096, using 1,024 input tokens and up to 256 output tokens (max_tokens=256) in BF16 precision. We repeat each experiment 3 times and report the average throughput. LFM2.5-350M achieves the highest peak throughput among models in this size category, reaching 40.4K output tokens per second at high concurrency, or roughly 3.5B tokens per day (40.4K tokens/s × 86,400 s/day ≈ 3.49B) on a single H100.
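For readers who want to reproduce this style of measurement, below is a simplified sustained-load sketch against any OpenAI-compatible vLLM or SGLang endpoint. The endpoint URL, model id, and token counts are placeholders; the semaphore keeps a fixed number of requests in flight, approximating the replace-on-completion setup described above:

```python
# Simplified sustained-load throughput probe for an OpenAI-compatible
# endpoint (vLLM or SGLang). URL, model id, and sizes are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "word " * 1000          # rough stand-in for a ~1K-token input
CONCURRENCY, TOTAL_REQUESTS = 64, 512

async def one_request() -> int:
    resp = await client.completions.create(
        model="LiquidAI/LFM2.5-350M",  # assumed model id
        prompt=PROMPT,
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    # The semaphore admits a new request the instant one finishes,
    # holding the in-flight count at CONCURRENCY.
    sem = asyncio.Semaphore(CONCURRENCY)

    async def worker() -> int:
        async with sem:
            return await one_request()

    start = time.perf_counter()
    tokens = await asyncio.gather(*(worker() for _ in range(TOTAL_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"output throughput: {sum(tokens) / elapsed:,.0f} tok/s")

asyncio.run(main())
```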
Liquid’s Partner Ecosystem Expands
To ensure the LFM2.5 family runs efficiently and effectively wherever you need it, we are rapidly expanding our hardware and software ecosystem. Today, we are thrilled to welcome a diverse group of partners: optimization runtimes like Mirai, Zetic, and RunAnywhere, and customization partners like Distil Labs, joining silicon leaders such as AMD, Qualcomm Technologies, and Intel. Whether you are fine-tuning for specific workflows or deploying on mobile NPUs and edge devices, our ecosystem empowers developers to choose their preferred environments and hit the ground running.
Customizing for Production: Distil Labs
To push the limits of our new 350M model, we partnered with Distil Labs to benchmark LFM2.5 on real-world agentic workflows. By fine-tuning the model for multi-turn interactions across smart home, banking, and terminal-based use cases, they achieved a massive jump in tool-calling reliability, hitting over 95% accuracy throughout the dialog. This result proves that LFM2.5 isn't just fast; it's highly capable of handling the sophisticated back-and-forth required for production-grade AI agents.
Silicon and Software Optimization Partners
To bring LFM2.5 to the edge, we have collaborated with leading silicon and software partners to optimize the models across CPUs, GPUs, and NPUs.
AMD: Our partnership with AMD ensures LFM2.5-350M is fully optimized for their Ryzen™ AI hardware. By leveraging the Ryzen™ AI inference engine, we deliver a high-efficiency local experience across CPUs, GPUs, and NPUs.
“AMD is proud to provide day zero support for LFM2.5-350M. Our Ryzen™ AI processors provide an ideal platform for local deployment of this compact yet powerful model.” - Ramine Roane, Corporate Vice President of AI Product Management at AMD.
Qualcomm Technologies: To fully harness the power of Qualcomm SoCs, we have partnered with Qualcomm Technologies, Zetic, and RunAnywhere. Zetic's Melange execution layer delivers Day 0 support for LFM2.5-350M on Qualcomm Snapdragon® devices. This allows developers to instantly leverage Qualcomm CPUs, GPUs, and NPUs for rapid on-device inference without needing custom optimization. Alongside Zetic, RunAnywhere further expands our Snapdragon footprint, providing developers with seamless, optimized execution tailored specifically for the Hexagon architecture.
“What Liquid AI has achieved with their LFM2.5-350M model is a strong validation of the on-device AI approach that we have been building at Qualcomm Technologies,” stated Vinesh Sukumar, VP, Product Management and Head of Gen AI/ML, Qualcomm Technologies, Inc. “Their architecture is uniquely suited for the constraints of mobile and edge deployment, and when you run it on the Qualcomm® Hexagon™ NPU, you see that in the numbers: low latency, low memory footprint, and inference quality that competes with models several times its size. This announcement is a clear signal that efficient AI and capable AI are no longer a tradeoff.”
Intel: Native OpenVINO optimization brings accelerated performance to LFM2.5-350M. Designed to maximize inference speed on Intel hardware, these ready-to-run models are available right now on Hugging Face.
“We are excited to bring the power of Liquid AI’s LFM 2.5 350M to the Intel ecosystem. By integrating native OpenVINO optimizations, we’re enabling developers to achieve exceptional inference speeds and efficient memory management across Intel-powered edge and AI PC environments.” - Sudhir Tonse Udupa, Vice President, AI PC Software Engineering, Intel
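A hedged sketch of the OpenVINO path via optimum-intel follows. The repo id is assumed, and Intel's pre-exported checkpoints mentioned above may ship under a different identifier, in which case loading that repo directly (without export=True) would be the simpler route:

```python
# Run LFM2.5-350M through OpenVINO using optimum-intel.
# The repo id is assumed; export=True converts the PyTorch weights on
# the fly if no pre-exported OpenVINO checkpoint is available.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "LiquidAI/LFM2.5-350M"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Summarize in one sentence: OpenVINO accelerates inference on Intel hardware.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```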
Apple Silicon Ecosystem: Powered by the Mirai engine, Liquid AI's LFM2.5-350M delivers blazing-fast, on-device inference across the entire Apple Silicon ecosystem. Benchmarks showcase exceptional generation speeds on hardware ranging from the mobile A18 Pro chip to desktop-class M1 through M5 Max processors.
Low-Memory Edge Devices: Proving that high performance doesn't require high-end hardware, Cactus Compute showcased LFM2.5-350M running exceptionally fast on sub-$300 devices like the iPhone 13 Mini, Google Pixel 6a, and Raspberry Pi 5. The Cactus Engine makes this possible by heavily optimizing RAM usage for constrained edge environments.
Universal Local Deployment: Finally, LM Studio delivers Day 0 support for LFM2.5-350M. Developers can run the model seamlessly on their local devices using llmster, LM Studio's headless daemon, making local experimentation and deployment effortless.
[Figure: inference speed benchmarks for LFM2.5-350M]
Get Started
Start building today with LFM2.5-350M and LFM2.5-350M-Base, available on Hugging Face, LEAP, and our Playground.
With LFM2.5, we're delivering on our vision of AI that runs anywhere. These models are:
- Open-weight — Download, fine-tune, and deploy without restrictions
- Fast from day one — Native support for llama.cpp, NexaSDK, MLX, and vLLM across Apple, AMD, Qualcomm, and NVIDIA hardware
- A complete family — From base models for customization to specialized audio and vision variants, one architecture covers diverse use cases
The edge AI future is here. We can't wait to see what you build.
Citation
Please cite this article as:
Liquid AI, "LFM2.5-350M: No Size Left Behind", Liquid AI Blog, Mar 2026.Or use the BibTeX citation:
@article{liquidAI2026350M,
  author  = {Liquid AI},
  title   = {LFM2.5-350M: No Size Left Behind},
  journal = {Liquid AI Blog},
  year    = {2026},
  note    = {www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind},
}