Today, we’re releasing two new multilingual retrieval models: LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M. Both are 350M-parameter models and the first bidirectional members of the LFM family, building on our LFM2.5-350M-Base from March. They are built for fast and reliable multilingual and cross-lingual search across 11 languages, with a footprint small enough to run almost anywhere.
They are especially well-suited for short-context search: product catalogs, FAQ knowledge bases, support docs, and other collections that need to be searched quickly, cost-effectively, and reliably across languages.
The two models suit different needs:
- LFM2.5-Embedding-350M turns each document into a single vector. Pick it when you want the fastest search and the smallest, cheapest index.
- LFM2.5-ColBERT-350M produces one vector per token rather than a single vector per document. This lets it match queries word-by-word for higher accuracy and better generalization, at the cost of a larger index. Pick it when accuracy matters more than storage.
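To make the trade-off between the two models concrete, here is a minimal sketch of the two scoring styles in plain Python. It is an illustration, not the released implementation: the function names are hypothetical, and the vectors are assumed to come from the respective model. The dense score compares one query vector with one document vector, while the late-interaction score sums, for each query token, its best match among the document tokens.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_score(query_vec, doc_vec):
    """Single-vector scoring: cosine similarity between two vectors."""
    return dot(normalize(query_vec), normalize(doc_vec))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction scoring: for every query token, take its best
    cosine match among the document tokens, then sum over query tokens."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)
```

The dense index stores one vector per document, whereas the late-interaction index stores one vector per document token, which is where the larger index comes from.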


Architecture Updates
Both models are built from LFM2.5-350M-Base, a mid-trained general-purpose checkpoint. We apply a small set of bidirectional patches to the LFM2 architecture, adapting it from a causal decoder to a bidirectional encoder.
In the causal setup, each token can only use information from itself and previous tokens, which is ideal for left-to-right generation but less natural for retrieval. We replace the causal attention mask with a bidirectional one (figure below, left side), so every token can attend to both left and right context. We also make the LFM2 short convolutions non-causal (figure below, right side), so they mix local information symmetrically around each token rather than only from the past. This preserves the efficiency of the LFM2 backbone while producing the full-context representations retrieval tasks need.
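The two changes can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the LFM2 code: the convolution here is a plain 1-D filter over a scalar sequence with zero padding, the kernel size of 3 in the usage note is arbitrary, and both function names are hypothetical.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def short_conv(x, kernel, causal):
    """Same-length 1-D convolution with zero padding.

    causal=True:  output i only sees x[i-k+1 .. i] (the past).
    causal=False: the window is centered on i (use an odd kernel size).
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    padded = [0.0] * left + list(x) + [0.0] * (k - 1 - left)
    return [sum(kernel[t] * padded[i + t] for t in range(k))
            for i in range(len(x))]
```

For example, with `x = [1, 2, 3, 4]` and `kernel = [1, 1, 1]`, the causal variant sums each token with its two predecessors, while the non-causal variant sums each token with one neighbor on either side.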

From this shared bidirectional encoder, the two models differ only in how they represent text. LFM2.5-Embedding-350M uses CLS-style pooling to produce a single dense embedding, while LFM2.5-ColBERT-350M keeps compact per-token embeddings for MaxSim late interaction.
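The difference between the two output heads can be sketched as follows. This is a toy illustration with hypothetical function names, and it makes two assumptions that the text above does not confirm: that CLS-style pooling means taking the hidden state at the first position, and that the "compact" per-token embeddings come from a linear projection to a smaller dimension.

```python
def cls_pool(hidden_states):
    """Single-vector representation: keep only the first position's state
    (assumed meaning of CLS-style pooling)."""
    return list(hidden_states[0])

def project(vector, weight):
    """Apply a linear map; weight has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def token_embeddings(hidden_states, weight):
    """Multi-vector representation: one projected vector per token
    (the projection is an assumption made for this sketch)."""
    return [project(h, weight) for h in hidden_states]
```

Both functions consume the same encoder output, a list of per-token hidden states, which is why the two models can share one backbone.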

Compared with our earlier LFM2-ColBERT-350M, this release uses the newer LFM2.5 checkpoint, expands language coverage, and adds explicit multilingual and cross-lingual retrieval training. It also introduces a companion bi-encoder built on the same backbone and recipe.
Training and Data
Both models follow the same three-stage training recipe: (1) large-scale contrastive pretraining in English, (2) multilingual and cross-lingual distillation from a strong teacher (across all 11 supported languages), and (3) final fine-tuning on hard-mined negatives. The staged structure was inspired by LightOn’s LateOn and DenseOn release, which also separates broad contrastive pretraining from later specialization stages.
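As an illustration of the contrastive objective in stage (1), here is a minimal sketch of an InfoNCE-style loss with in-batch negatives. This is an assumption: the post does not name the exact loss, and the function name and temperature value are placeholders. Each query's paired document is the positive, and every other document in the batch serves as a negative.

```python
import math

def info_nce(sim, temperature=0.05):
    """Mean cross-entropy over a batch of query-document similarities.

    sim[i][j] is the similarity between query i and document j;
    the diagonal entries sim[i][i] are the positive pairs.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

Hard-mined negatives, as in stage (3), would fit the same shape: extra columns in `sim` holding documents that score highly for a query without being relevant to it.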
LFM2.5-Embedding-350M receives slightly more cross-lingual data than LFM2.5-ColBERT-350M, since cross-lingual retrieval emerges more naturally in the late-interaction setup and benefits less from additional supervision.
The training data combines curated internal data with open-source English retrieval datasets. We use LLM-based translation of queries and documents to expand the multilingual and cross-lingual pairs used during the second and third training phases.
Benchmarks
We report fine-grained benchmark results across all 11 supported languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish. Our evaluation focuses on two capabilities: multilingual retrieval with NanoBEIR, and cross-lingual open-domain QA with MKQA-11. Together, they test whether the models can retrieve relevant documents within a language and across language boundaries.
Overall, both LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M show best-in-class multilingual and cross-lingual performance. Their results remain consistently competitive across all 11 supported languages, highlighting the robustness of the retrieval quality beyond English.
Unlike LightOn’s work, we find that NanoBEIR English provides a sufficient evaluation signal. Across the models we evaluated, NanoBEIR English and the more expensive full BEIR remain highly correlated, with NanoBEIR scoring a near-constant ~15% higher. We therefore use NanoBEIR as a practical proxy for full BEIR when iterating across training runs.
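A proxy check of this kind can be sketched with two small helpers, one for the correlation and one for the average gap. The function names are hypothetical, the inputs are paired per-model scores that you supply yourself, and reading the gap as a relative difference is an assumption of this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_relative_gap(proxy_scores, full_scores):
    """Average of (proxy - full) / full across models."""
    gaps = [(p - f) / f for p, f in zip(proxy_scores, full_scores)]
    return sum(gaps) / len(gaps)
```

A correlation near 1 together with a stable gap is what justifies ranking training runs on the cheaper benchmark.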
Inference
We evaluate end-to-end latency in the retrieval regimes that matter in practice: query embedding with cached documents, query embedding plus MaxSim, and query embedding plus document embedding plus MaxSim when documents are not cached.
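The three regimes can be timed with a small harness like the one below. It is a sketch, not our benchmark code: `encode_query`, `encode_docs`, and `score` are hypothetical callables that you would back with a real model and scoring function, and the best-of-N timing policy is an arbitrary choice.

```python
import time

def measure(fn, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def latency_by_regime(encode_query, encode_docs, score, query, docs, repeats=5):
    """Time the three retrieval regimes for one query and a document list."""
    cached = encode_docs(docs)  # computed once, outside the timed regions

    def query_only():
        encode_query(query)

    def query_plus_scoring():
        q = encode_query(query)
        return [score(q, d) for d in cached]

    def full_pipeline():
        q = encode_query(query)
        return [score(q, d) for d in encode_docs(docs)]

    return {
        "query_embedding": measure(query_only, repeats),
        "query_embedding+scoring": measure(query_plus_scoring, repeats),
        "query+document_embedding+scoring": measure(full_pipeline, repeats),
    }
```

Passing a MaxSim function as `score` covers the late-interaction case, and a dot product covers the single-vector case.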
For portable deployment, we release LFM2.5-ColBERT-350M-GGUF and LFM2.5-Embedding-350M-GGUF for llama.cpp, so the models can run nearly anywhere (CPUs, laptops, and edge devices) at near-zero cost and with compelling latency.
For large-scale, production-grade enterprise deployments, we have also built an internal GPU stack that delivers extremely low-latency serving under high inbound load.

Training Your Own
While these models perform strongly out of the box, we encourage you to fine-tune either model on your own data for domain-specific retrieval. We especially recommend this for LFM2.5-Embedding-350M, for which our Hugging Face model card provides simple fine-tuning snippets with sentence-transformers.
Get Started
The LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M models are available today on Hugging Face. For teams looking to deploy retrieval at enterprise scale, contact us to learn more.