Today, we release LFM2-ColBERT-350M, a late interaction retriever with excellent multilingual performance. It allows you to store documents in one language (for example, a product description in English) and retrieve them in many languages (e.g., with a query in Spanish, German, or French) with high accuracy. Thanks to the efficient LFM2 backbone, it also benefits from extremely fast inference speed, on par with models that are 2.3 times smaller.
Highlights
- LFM2-ColBERT-350M offers best-in-class accuracy across different languages.
- Inference speed is on par with models 2.3 times smaller, thanks to the efficient LFM2 backbone.
- You can use it today as a drop-in replacement in your current RAG pipelines to improve performance.
Why Late Interaction?

Embedding models can be divided into three families:
- Bi-encoders like BERT encode queries and documents independently, then compute a similarity score. These are fast and scalable, but compressing each text into a single vector discards fine-grained token-level interactions that carry critical signals like subtle term matches.
- Rerankers (or cross-encoders) encode queries and documents jointly, enabling full attention-based interaction between query and document tokens. Accuracy is high, but computational cost is prohibitive at scale.
- Late interaction retrievers encode queries and documents independently at the token level for higher accuracy. At query time, they compare the token embeddings of the query and the document (e.g., via MaxSim) and aggregate the scores.
Late interaction retrievers are particularly interesting because they preserve much of the expressivity of cross-attention while retaining the efficiency of pre-computation. In practice, they're used to both retrieve documents at scale (like bi-encoders) and rank them at the same time (like rerankers).
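To make the scoring concrete, here is a minimal PyTorch sketch of MaxSim. The tensor shapes and the `maxsim_score` helper are illustrative, not the model's actual interface:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late interaction (MaxSim) scoring.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token...
    max_per_query_token = sim.max(dim=1).values
    # ...and aggregate by summing over query tokens.
    return max_per_query_token.sum()

# Toy example: 4 query tokens, 12 document tokens, 128-dim embeddings.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
print(maxsim_score(q, d))
```

Because document token embeddings depend only on the document, they can be pre-computed and indexed offline; only the query is encoded at search time.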
Evaluations
We compared LFM2-ColBERT-350M against the current best late interaction retriever in the sub-500M parameter category: GTE-ModernColBERT-v1 (150M parameters).
We extended the NanoBEIR benchmark to include Japanese and Korean languages. We open-sourced this dataset on Hugging Face at LiquidAI/nanobeir-multilingual-extended for reproducibility. On this NanoBEIR benchmark, LFM2-ColBERT-350M displays significantly stronger multilingual capabilities (especially in German, Arabic, Korean, and Japanese) while maintaining English performance.
Even more interestingly, LFM2-ColBERT-350M is an excellent cross-lingual retriever. This means that it is capable of retrieving documents based on queries from other languages. This is ideal for client-facing applications, like in e-commerce, where a description might be in English but the query is in another language.
This works especially well for English, French, Spanish, Italian, Portuguese, and German, as measured by NDCG@10 scores on NanoBEIR.
In comparison, GTE-ModernColBERT-v1 consistently gets lower scores when documents and queries are not in the same language.
This makes retrieval far more reliable and lets a single, unified retriever replace architectures that chain multiple language-specific models.
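For reference, NDCG@10 rewards rankings that place relevant documents near the top of the first ten results. Below is a simplified NumPy sketch, assuming graded relevance labels are available for the returned list (full benchmark harnesses compute the ideal ranking over all judged documents):

```python
import numpy as np

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for a single ranked list.

    relevances: graded relevance of the retrieved documents,
                in the order the retriever returned them.
    """
    rel = np.asarray(relevances, dtype=float)[:k]
    # Logarithmic position discount: lower ranks count for less.
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    # Ideal DCG: the same relevances sorted best-first.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# A ranking that puts the only relevant document at position 3:
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```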
Inference speed
Despite being more than twice as big, LFM2-ColBERT-350M demonstrates throughput performance on par with GTE-ModernColBERT-v1 for query and document encoding across various batch sizes.
We profiled inference speed as follows:
- Query encoding was evaluated using realistic query patterns from datasets like MS MARCO and Natural Questions.
- Document encoding was measured on realistic documents with varying lengths and domains.
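To give a sense of the setup, here is a minimal timing sketch. The `encode` function below is a stand-in for the model's actual encoding call, and the warm-up and run counts are illustrative:

```python
import time
import torch

# Hypothetical encoder stand-in: any function mapping a batch of strings
# to token embeddings. Swap in your actual model's encode call.
def encode(batch: list[str]) -> torch.Tensor:
    return torch.randn(len(batch), 32, 128)  # (batch, tokens, dim)

def throughput(texts: list[str], batch_size: int,
               warmup: int = 3, runs: int = 10) -> float:
    """Texts encoded per second at a given batch size."""
    batch = texts[:batch_size]
    for _ in range(warmup):  # warm-up to exclude one-time costs
        encode(batch)
    start = time.perf_counter()
    for _ in range(runs):
        encode(batch)
    elapsed = time.perf_counter() - start
    return runs * batch_size / elapsed

queries = ["what is late interaction retrieval"] * 64
for bs in (1, 8, 32, 64):
    print(f"batch={bs:3d}  {throughput(queries, bs):8.1f} texts/s")
```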
This fast inference is possible thanks to the efficient LFM2 backbone that combines short-range, input-aware gated convolutions with grouped-query attention.
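As an illustration of the idea (not the exact LFM2 implementation), a double-gated short depthwise convolution can be sketched as follows; the dimensions, kernel size, and projection layout are assumptions:

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Sketch of a short-range, input-aware gated convolution block."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)  # produces two gates plus the value
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,               # depthwise: one filter per channel
                              padding=kernel_size - 1)  # left padding for causality
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        b, c, v = self.in_proj(x).chunk(3, dim=-1)
        v = b * v                                         # input-aware gate
        # Short depthwise convolution over the sequence dimension,
        # trimmed back to the original length to stay causal.
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        v = c * v                                         # output gate
        return self.out_proj(v)

x = torch.randn(2, 16, 64)
print(GatedShortConv(64)(x).shape)  # torch.Size([2, 16, 64])
```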
Build with LFM2
LFM2-ColBERT-350M is available today on Hugging Face, complete with an interactive demo. If you are interested in custom solutions with edge deployment, please contact our sales team.
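As a starting point, here is a minimal cross-lingual reranking sketch with the PyLate library, which supports ColBERT-style late interaction models. The exact arguments may differ; the model card on Hugging Face has the reference usage:

```python
from pylate import models, rank

model = models.ColBERT(model_name_or_path="LiquidAI/LFM2-ColBERT-350M")

queries = ["¿Cuánto dura la batería?"]                    # Spanish query
documents = [["The battery lasts up to 10 hours.",       # English documents
              "The device ships with a USB-C charger."]]

# Encode queries and documents separately (token-level embeddings).
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Score each query against its candidate documents via late interaction.
reranked = rank.rerank(
    documents_ids=[["doc_0", "doc_1"]],
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```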