Today, we release LFM2.5-VL-450M, an improved version of LFM2-VL-450M with grounding capabilities, better instruction following, and function calling support. The result is a compact model that turns image streams into structured, actionable outputs in real time, even on edge hardware.
LFM2.5-VL-450M is available on Hugging Face, LEAP, and our Playground. Check out our docs on how to run and fine-tune it locally.
What’s new
Compared to LFM2-VL-450M, which we released a few months ago, we scaled the pre-training of LFM2.5-VL-450M from 10T to 28T tokens, followed by post-training focused on improving multimodal behavior in production settings. In particular, we used preference optimization and reinforcement learning to improve grounding, instruction following, and overall reliability across vision-language tasks.
Bounding box prediction: 0 → 81.28 on RefCOCO-M
We added object detection, allowing the model to identify objects in an image and locate them with bounding boxes.
Improved multilingual image understanding: MMMB 54.29 → 68.09, covering Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Spanish
LFM2.5-VL-450M handles prompts in eight languages with higher accuracy, extending visual reasoning to global deployments without separate localization models.
Better instruction following: MM-IFEval 32.93 → 45.00
The model is more responsive to explicit constraints and user instructions, improving steerability across both text and vision inputs.
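Grounded outputs like the bounding boxes above are only useful if downstream code can consume them. The sketch below shows one way to turn box predictions into pixel coordinates; the `<box>x1,y1,x2,y2</box>` tag format and the 0–1000 normalized grid are illustrative assumptions, not the model's documented output format (see the docs for the actual one).

```python
import re

def parse_boxes(text, img_w, img_h, scale=1000):
    """Extract bounding boxes from model output and map
    normalized coordinates into pixel space.

    Assumes boxes appear as <box>x1,y1,x2,y2</box> with coordinates
    normalized to a 0-1000 grid -- a hypothetical format used here
    for illustration only.
    """
    boxes = []
    for m in re.finditer(r"<box>(\d+),(\d+),(\d+),(\d+)</box>", text):
        x1, y1, x2, y2 = (int(v) for v in m.groups())
        boxes.append((
            x1 * img_w / scale, y1 * img_h / scale,
            x2 * img_w / scale, y2 * img_h / scale,
        ))
    return boxes

output = "Forklift detected: <box>100,200,500,800</box>"
print(parse_boxes(output, img_w=512, img_h=512))
# [(51.2, 102.4, 256.0, 409.6)]
```

A parser like this is the glue between the model's grounded text output and whatever tracker or overlay renders the detections.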
Benchmarks
We evaluated LFM2.5-VL-450M across benchmarks covering core vision understanding, object detection, and language reasoning. LFM2.5-VL-450M improves over LFM2-VL-450M across both vision and language benchmarks, while also adding support for bounding box prediction (measured on RefCOCO-M) and text function calling (measured on BFCLv4).
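Function calling means the model can emit a structured tool invocation rather than free text. The consumer side of that loop might look like the sketch below; the JSON payload shape and the `set_conveyor_speed` tool are hypothetical placeholders, not the model's actual wire format, which is specified in the docs.

```python
import json

# Hypothetical tool-call payload for illustration; the real format
# the model emits is defined in the LFM2.5-VL-450M docs.
raw = '{"name": "set_conveyor_speed", "arguments": {"zone": "A3", "speed_mps": 0.5}}'

# Registry mapping tool names to callables on the host system.
TOOLS = {
    "set_conveyor_speed": lambda zone, speed_mps: f"zone {zone} -> {speed_mps} m/s",
}

call = json.loads(raw)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # zone A3 -> 0.5 m/s
```

Dispatching through an explicit registry like `TOOLS` keeps the model from invoking anything the host has not deliberately exposed.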
Performance: Made for the Edge
In real-world deployments, VLMs process live camera and image input, and must reason and act within a tight latency budget. LFM2.5-VL-450M (Q4_0) runs within budget across the whole range, from embedded AI modules (Jetson Orin) to mini-PC APUs (Ryzen AI Max+ 395) to flagship phone SoCs (Snapdragon 8 Elite).
On Jetson Orin, the model reasons over a 512×512 image in under 250ms, fast enough to process every frame in a 4 FPS video stream with full vision-language understanding, not just detection. On consumer mobile silicon, it stays under one second for smaller resolutions, keeping interactive experiences responsive.
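The frame-rate arithmetic above is easy to make explicit: a per-frame latency of 250 ms sustains at most 1000 / 250 = 4 frames per second. A minimal budget check, with the numbers from the Jetson Orin figures above:

```python
def max_fps(latency_ms):
    """Highest frame rate a given per-frame latency can sustain."""
    return 1000.0 / latency_ms

def fits_budget(latency_ms, stream_fps):
    """True if per-frame latency keeps up with a stream's frame rate."""
    return latency_ms <= 1000.0 / stream_fps

# 250 ms per 512x512 frame (Q4_0 on Jetson Orin) matches a 4 FPS stream.
print(max_fps(250))            # 4.0
print(fits_budget(250, 4))     # True
print(fits_budget(250, 10))    # False: a 10 FPS stream needs <= 100 ms/frame
```

The same check applies to any device in the table: measure per-frame latency once, then compare it against the stream's frame interval.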
Real-World Use Cases
LFM2.5-VL-450M is especially well suited to real-world deployments where low latency, compact structured outputs, and efficient semantic reasoning matter most. These qualities make it a strong fit for early adoption in environments with tight compute, power, or throughput constraints, and in settings where offline operation or on-device processing is important for privacy.
Industrial Automation — Edge and Constrained Environments
In compute-constrained environments such as passenger vehicles, agricultural machinery, and warehouses, perception models are often limited to bounding-box outputs. LFM2.5-VL-450M goes further, providing grounded scene understanding in a single pass so systems can reason semantically about the scene, not just detect objects in it. That means richer outputs for settings like warehouse aisles, including worker actions, forklift movement, and inventory flow, while still fitting existing edge hardware like a Jetson Orin.

Wearables and Always-On Monitoring — On-Device and Privacy-Sensitive
Wearables and other always-on monitoring systems are a natural fit for compact VLMs because they operate under strict power, latency, and privacy constraints. Devices such as smart glasses, body-worn assistants, dashcams, and security or industrial monitors cannot afford large perception stacks or constant cloud streaming. In these settings, an efficient VLM can produce compact semantic outputs locally, turning raw video into useful structured understanding while keeping compute demands low and preserving privacy.
Retail and E-Commerce — High-Throughput Visual Processing
Retail and e-commerce platforms operate in extremely high-throughput environments, where millions of product and shelf images must be processed under tight latency and cost constraints. Tasks like catalog ingestion, visual search, product matching, and shelf compliance require more than object detection, but richer visual understanding is often too expensive to deploy at this scale. LFM2.5-VL-450M makes structured visual reasoning practical for these workloads, delivering semantic outputs with the speed and efficiency needed for real production systems.
Get Started
The LFM2.5-VL-450M model is available today on Hugging Face, LEAP, and our Playground. Adapt it to your domain with LEAP Fine-Tune.