We are releasing an early checkpoint of LFM2-8B-A1B, our first on-device Mixture-of-Experts (MoE) model with 8.3B total parameters and 1.5B active parameters per token. By activating only a sparse subset of parameters during inference, LFM2-8B-A1B delivers the quality of a larger model at the compute cost of a 1.5B-class model. It trades a modest increase in memory footprint for higher quality and speed compared to dense models, enabling fast, private, latency-sensitive applications on modern phones, tablets, and laptops.
Highlights
- LFM2-8B-A1B is the best on-device MoE in terms of both quality (comparable to 3-4B dense models) and speed (faster than Qwen3-1.7B).
- Code and knowledge capabilities are significantly improved compared to LFM2-2.6B.
- Quantized variants fit comfortably on high-end phones, tablets, and laptops.
LFM2 MoE Architecture
Most MoE research focuses on cloud models in large-scale batch serving settings. For on-device applications, the key is to optimize latency and energy consumption under strict memory requirements. LFM2-8B-A1B is one of the first models to challenge the common belief that the MoE architecture is not effective at smaller parameter sizes. It trades a slightly larger memory footprint for higher quality while retaining low latency and energy consumption.
The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints. LFM2‑8B-A1B keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path.
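For intuition, here is a minimal PyTorch sketch of an input-gated short causal convolution block of the kind described above. The projection layout, kernel width, and gating function are illustrative assumptions, not the exact LFM2 operator.

```python
import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    """Illustrative input-gated short causal convolution (not the exact LFM2 operator)."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # value stream + input-aware gate
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model, padding=kernel_size - 1)  # depthwise short conv
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        v, g = self.in_proj(x).chunk(2, dim=-1)          # gate g modulates the value stream
        h = (v * torch.sigmoid(g)).transpose(1, 2)       # (batch, d_model, seq_len) for Conv1d
        h = self.conv(h)[..., : x.shape[1]]              # trim the right side to keep causality
        return self.out_proj(h.transpose(1, 2))
```

The short, depthwise kernel keeps the per-token state and compute small, which is what makes this backbone attractive for edge devices.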
Overview:
- LFM2 Backbone: 18 gated short convolution blocks and 6 GQA blocks.
- Size: 8.3B total parameters, 1.5B active parameters
- MoE placement: All layers except the first two include an MoE block; the first two remain dense for training stability.
- Expert granularity: 32 experts per MoE block, with top-4 active experts applied per token. This configuration provides a strong quality boost over lower granularity configs while maintaining fast routing and portable kernels.
- Router: Normalized sigmoid gating with adaptive routing biases for better load balancing and training dynamics (a simplified routing sketch follows below).
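Concretely, a minimal PyTorch sketch of this top-4-of-32 routing is shown below. The expert hidden size, the SiLU activation, and the exact bias-update rule for load balancing are illustrative assumptions rather than LFM2 internals.

```python
import torch
import torch.nn as nn

class SparseMoEBlock(nn.Module):
    """Illustrative top-4-of-32 MoE feed-forward block with normalized sigmoid gating."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 32, top_k: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Non-trained bias, adapted out-of-band for load balancing (update rule not shown).
        self.routing_bias = nn.Parameter(torch.zeros(n_experts), requires_grad=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = torch.sigmoid(self.router(x))                    # one sigmoid score per expert
        _, top_idx = (scores + self.routing_bias).topk(self.top_k, dim=-1)  # bias affects selection only
        weights = scores.gather(-1, top_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)     # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                            # dispatch tokens to chosen experts
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```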
Why does this matter on device? Only a fraction of the network is active per token, so per-token FLOPs and latency track those of a ~1.5B dense model, while the 8.3B total capacity lets experts specialize (reasoning, multilingual, code, long-tail knowledge) and lifts quality.
Memory & deployability. Weight storage scales with total parameters, while compute and state caches scale with the active path. In practice, quantized variants fit comfortably on high-end phones, tablets, and laptops for a wide range of applications.
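A rough back-of-the-envelope illustration of that split (ignoring quantization scales, KV/conv caches, and runtime buffers):

```python
TOTAL_PARAMS = 8.3e9   # determines weight storage
ACTIVE_PARAMS = 1.5e9  # determines per-token compute

# Approximate weight storage at different precisions (bytes per parameter;
# real quantization formats carry extra overhead for scales, ignored here).
for fmt, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = TOTAL_PARAMS * bytes_per_param / 2**30
    print(f"{fmt:>6}: ~{gib:.1f} GiB of weights")

# Per-token matmul FLOPs scale with the active path: roughly 2 FLOPs per
# active parameter per token, i.e. about that of a 1.5B dense model.
print(f"~{2 * ACTIVE_PARAMS / 1e9:.0f} GFLOPs per decoded token")
```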
Inference
The following video shows LFM2-8B-A1B running locally on a MacBook Pro with an Apple M2 Pro chip and Metal enabled (Q4_0 quantization).
LFM2-8B-A1B is compatible with multiple inference frameworks, such as llama.cpp, ExecuTorch, and vLLM (public version coming soon). The following plots show the decode throughput with the llama.cpp backend and Q4_0 quantization on two hardware targets: the Samsung Galaxy S24 Ultra (Qualcomm Snapdragon SoC) and an AMD Ryzen HX370.


LFM2-8B-A1B is significantly faster than models with a similar number of active parameters, like Qwen3-1.7B.
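For a quick local test, the Q4_0 GGUF can be loaded with llama-cpp-python. This is a sketch rather than an official recipe; the repository id and file pattern below are assumptions based on typical GGUF release naming, so check the Hugging Face page for the exact names.

```python
# pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

# Hypothetical GGUF location and quant file pattern.
llm = Llama.from_pretrained(
    repo_id="LiquidAI/LFM2-8B-A1B-GGUF",  # assumed repo id
    filename="*Q4_0.gguf",                # assumed file pattern
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a sparse MoE layer does."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```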
Evaluations
The base version of LFM2-8B-A1B was pre-trained on ~12T tokens drawn from a corpus comprising approximately 55% English, 25% multilingual, and 20% code data sourced from the web and licensed materials. We are continuing to train the base model.
We evaluated LFM2-8B-A1B on a comprehensive set of 16 popular benchmarks, including knowledge (5-shot MMLU, 5-shot MMLU-Pro, 0-shot GPQA), instruction following (IFEval, IFBench, Multi-IF), math (5-shot GSM8K, 5-shot GSMPlus, 0-shot MATH500, 0-shot Math Lvl 5), and multilingual tasks (5-shot MGSM, 5-shot MMMLU). All results were obtained using an internal evaluation library.
Compared to similar-sized models, LFM2-8B-A1B displays strong performance in instruction following and math while also running significantly faster. Compared to LFM2-2.6B, it has more knowledge capacity thanks to its larger total parameter count, which translates into an MMLU-Pro score that is 11.46 points higher.
LFM2-8B-A1B has also been trained on more code during pre- and post-training compared to LFM2-2.6B. This provides more competitive coding skills, as shown by its LiveCodeBench (LCB) v6, v5, and HumanEval+ scores. We also evaluate on EQ-Bench Creative Writing v3, an LLM-judged benchmark measuring writing quality in short-form story writing. LFM2-8B-A1B's writing ability is competitive with that of models with several times more active parameters.
Liquid Preference Alignment
We apply direct alignment methods to balance strong model performance with rapid iteration.
Our preference dataset emphasizes in-distribution coverage. It starts with about one million conversations drawn from both open-source and proprietary instruction-and-preference data. For each prompt, we sample five responses from the SFT checkpoint and use an LLM-based jury to rank them, selecting the top response as “chosen” and the lowest as “rejected,” with ties favoring on-policy outputs. For targeted subsets (e.g., instruction following), the chosen responses are further refined using Contrastive Learning from AI Revisions (CLAIR). We apply a mix of quantitative and qualitative filters to ensure high data quality and remove undesirable behaviors.
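A simplified sketch of the pairing step described above; the Candidate structure, the jury scores, and the tie-breaking details are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    score: float     # aggregate jury ranking score, higher is better
    on_policy: bool  # sampled from the SFT checkpoint being aligned

def build_preference_pair(prompt: str, candidates: list[Candidate]) -> dict:
    """Pick the top-ranked response as 'chosen' and the lowest as 'rejected',
    breaking ties in favor of on-policy samples."""
    ranked = sorted(candidates, key=lambda c: (c.score, c.on_policy), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    return {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
```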
Training is then performed with a family of length-normalized alignment objectives that share a generalized loss function.
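As an illustration of this family, here is a SimPO-style length-normalized preference objective. This is an assumed representative form for exposition, not the exact generalized loss used for LFM2-8B-A1B.

```latex
% Assumed SimPO-style member of the length-normalized family (illustrative only):
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
        \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
      - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
      - \gamma
    \right)
\right]
```

Here (x, y_w, y_l) is a prompt with its chosen and rejected responses, |y| is the response length in tokens, β scales the log-likelihood margin, and γ is a target margin.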
The final model is produced by a task-arithmetic merge, combining the strengths of checkpoints trained with each objective.
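Assuming the standard formulation of task arithmetic (the merge coefficients below are illustrative, not published values):

```latex
% Standard task-arithmetic merge (assumed form):
\theta_{\text{merged}} = \theta_{\text{base}} + \sum_{k} \lambda_k \left( \theta_k - \theta_{\text{base}} \right)
```

where θ_base is the shared starting checkpoint, θ_k is the checkpoint trained with the k-th alignment objective, and λ_k are merge coefficients.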
Pushing the limits of MoE inference on CPU
As our CTO, Mathias Lechner, describes in his article on Short Convolutions, our design philosophy at Liquid comprises two aspects: (1) strong LLM capability and (2) edge-device efficiency.
During the development of LFM2-8B-A1B, it was paramount for us to validate that the model would deliver blazing speed on our customers' devices. For our internal development, we therefore profiled different architectures using an XNNPACK-based inference stack, which gave us the flexibility to control every aspect of execution, identify bottlenecks, and write optimized kernels.
This led to the realization that a naive MoE CPU implementation underutilizes the hardware because it implicitly relies on execution patterns optimized for GPUs. We therefore wrote a CPU-optimized kernel for LFM2 MoE that squeezes more FLOPs out of the cores. The results above demonstrate the large performance gain of our MoE model over similar-sized competitors and show the potential of sparse architectures on CPU.
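To illustrate the kind of restructuring involved, here is a conceptual NumPy sketch of single-token MoE decode on CPU that computes only the selected experts instead of building GPU-style gather/scatter dispatch tensors. It illustrates the idea only; it is not the actual XNNPACK kernel, and the shapes and activation are assumptions.

```python
import numpy as np

def moe_ffn_decode(x, router_w, expert_w1, expert_w2, top_k=4):
    """Single-token MoE FFN for CPU decode: evaluate only the chosen experts,
    streaming each expert's weights through the cores once.
    x: (d_model,), router_w: (n_experts, d_model),
    expert_w1: (n_experts, d_ff, d_model), expert_w2: (n_experts, d_model, d_ff)."""
    scores = 1.0 / (1.0 + np.exp(-router_w @ x))   # sigmoid routing scores
    top = np.argsort(scores)[-top_k:]              # indices of the selected experts
    weights = scores[top] / scores[top].sum()      # normalize over selected experts
    out = np.zeros_like(x)
    for w, e in zip(weights, top):
        h = np.maximum(expert_w1[e] @ x, 0.0)      # expert FFN (ReLU here for simplicity)
        out += w * (expert_w2[e] @ h)
    return out
```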

In addition to integrating LFM2-8B-A1B into llama.cpp and ExecuTorch to validate inference efficiency on CPU-only devices, we’ve also integrated the model into vLLM to deploy on GPU in both single-request and online batched settings.
Our 8B LFM2 MoE model not only outperforms comparably sized models on CPU, as seen in the plots above, but also excels against those same models on GPU (1x H100). All models above were benchmarked with the FlashInfer attention backend, with full CUDA-graph compilation during decode and piecewise CUDA graphs during prefill.
Try LFM2-8B-A1B now on upstream vLLM!
Build with LFM2
LFM2-8B-A1B is available today on Hugging Face. We provide a Colab notebook to fine-tune it with TRL and GGUF quants to run it with llama.cpp. You can test it now on Liquid Playground.
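As a starting point, here is a minimal Transformers sketch. The checkpoint id and generation settings are assumptions; check the Hugging Face model page for the exact repository name and recommended parameters.

```python
# pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-8B-A1B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a haiku about sparse experts."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```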