Research
By Philipp Nazari and T. Konstantin Rusch
Linear attention offers a computationally efficient yet expressive alternative to softmax attention, maintaining a recurrent state that functions as a linear associative memory. However, recent empirical results indicate that the associative memory of trained linear attention models often exhibits a low-rank structure, suggesting that these models underexploit their capacity in practice. To illuminate this phenomenon, we provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise and can leave query gradients poorly conditioned.
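As a rough illustration of the terms used above, the sketch below builds the recurrent state of unnormalized linear attention as a sum of rank-one updates and reads it out with a query. The function names are hypothetical, and the entropy-based definition of effective rank is an assumption chosen for illustration; the paper may define effective rank differently.

```python
import numpy as np

def linear_attention_state(keys, values):
    """Accumulate S = sum_t v_t k_t^T, the state of unnormalized linear attention."""
    state = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        state += np.outer(v, k)  # one rank-one update per token
    return state

def retrieve(state, query):
    """Read the associative memory with a query: o = S q."""
    return state @ query

def effective_rank(matrix, eps=1e-12):
    """Assumed definition: exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

# With orthonormal keys, querying with a stored key returns its value exactly.
keys = np.eye(4)
values = np.arange(12.0).reshape(4, 3)
S = linear_attention_state(keys, values)
print(retrieve(S, keys[2]))   # equals values[2]
print(effective_rank(S))      # at most min(S.shape)
```

When keys are not orthogonal, the readout mixes several stored values, which is where the conditioning of the state starts to matter.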
By Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, and 2 others
Mixture-of-Experts (MoE) architectures scale computation via sparse expert activations, yet they remain memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint, but existing quantization methods estimate expert importance and assign bits based on calibration data. For frontier MoE LLMs, however, the original training data (and thus its true training distribution) is proprietary and inaccessible. Thus, any calibration set is at best a surrogate and may yield a biased or incomplete view of expert utilization, leading to suboptimal bit allocation. To address these problems, we propose AlphaQ, a novel calibration-free bit-allocation method for MoE quantization.
By Francesco M. Ruscio and T. Konstantin Rusch
Flow Matching typically relies on white noise sources, a choice often misaligned with the power spectra of natural data, which tend to decay with frequency. To address this, we introduce a variant of Flow Matching based on an operator-modulated interpolant. This formulation induces a time-varying spectral bias that transitions from the source spectrum to a frequency-decaying bias as the path approaches the data. We validate our method on unconditional image generation tasks, including the scientific Galaxy10 dataset. Empirically, we show that our method is particularly effective when paired with adaptive ODE solvers, where it improves or preserves sample quality while substantially reducing sampling cost compared to standard baselines.
By Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, and 1 other
Large language models excel at reasoning but lack key aspects of introspection, including the ability to anticipate their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this ability, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods such as Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness.
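To make the cost argument concrete, here is a minimal sketch contrasting fixed-budget Best-of-N with a variant that stops once a score threshold is reached. The `generate` and `score` callables and the threshold rule are hypothetical placeholders for illustration only; this is not the method proposed in the paper.

```python
def best_of_n(generate, score, n):
    """Fixed budget: always draw n samples, then return the best one and the cost."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score), n

def best_of_n_early_stop(generate, score, n, threshold):
    """Illustrative variant: stop as soon as a sample's score reaches the threshold."""
    best, best_score = None, float("-inf")
    for i in range(n):
        candidate = generate(i)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
        if best_score >= threshold:
            return best, i + 1  # number of samples actually drawn
    return best, n

# Toy stand-ins: sample i is the integer i, and the ideal answer is 2.
generate = lambda i: i
score = lambda x: -abs(x - 2)

print(best_of_n(generate, score, 8))                # (2, 8)
print(best_of_n_early_stop(generate, score, 8, 0))  # (2, 3)
```

Both calls return the same answer, but the fixed-budget version pays for all eight samples while the early-stopping version pays for three. A usable stopping rule depends on having a trustworthy confidence signal, which is the gap the abstract points to.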
By Mihir Bafna, Bowen Jing, and Bonnie Berger
Many methods have been developed to predict static protein structures; however, understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, their high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures.
By Samuel J Paech, Allen G Roush, Judah Goldfeder, and Ravid Shwartz-Ziv
Repetitive lexical patterns in LLM output, termed "slop," degrade writing quality through over-use and make AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) the Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) an automated pipeline that profiles model-specific slop against human baselines and generates training data; and (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates in logit-space on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace.
By Makram Chahine, Philipp Nazari, Daniela Rus, and T. Konstantin Rusch
State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for measuring the energy of each state and for balanced truncation of the original system down to a smaller representation with performance guarantees.
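For readers unfamiliar with Hankel singular values, the sketch below computes them for a small stable discrete-time linear system x_{t+1} = A x_t + B u_t, y_t = C x_t. The function names are hypothetical, and the truncated series used for the Gramians is a simplification chosen to keep the example dependency-light; it is not taken from the paper.

```python
import numpy as np

def gramian(A, B, n_terms=500):
    """Approximate sum_{k>=0} A^k B B^T (A^T)^k; assumes all |eig(A)| < 1."""
    G = np.zeros((A.shape[0], A.shape[0]))
    term = B @ B.T
    for _ in range(n_terms):
        G += term
        term = A @ term @ A.T
    return G

def hankel_singular_values(A, B, C, n_terms=500):
    """Square roots of the eigenvalues of P Q, sorted in descending order."""
    P = gramian(A, B, n_terms)       # controllability Gramian
    Q = gramian(A.T, C.T, n_terms)   # observability Gramian
    eigvals = np.linalg.eigvals(P @ Q)
    return np.sort(np.sqrt(np.clip(eigvals.real, 0.0, None)))[::-1]

# Two-state system whose second state is only weakly driven and weakly observed.
A = np.diag([0.9, 0.5])
B = np.array([[1.0], [0.01]])
C = np.array([[1.0, 0.01]])
print(hankel_singular_values(A, B, C))
```

A state with a small Hankel singular value contributes little to the input-output map, which is why balanced truncation discards such states first.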
By Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, and 28 others
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models.
By Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, and 1 other
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains).
By Tim Seyde, Rohin Manvi, Maxime Labonne, and Liquid AI Team
We share insights based on our exploration of reasoning recipes for small language models, with the example of LFM-1.3B. We demonstrate how a general chat model, without any math-specific pre-training, can acquire concise reasoning abilities through a combination of extensive fine-tuning and short-horizon reinforcement learning. By sharing the findings obtained with our model, pre-training mix, and architecture, we aim to broaden the understanding of how to effectively instill reasoning abilities into small language models suitable for edge deployment.
By Keshigeyan Chandrasegaran and Michael Poli
Designing model architectures is a core part of building modern AI systems, alongside data, algorithms, compute, and benchmarks. Model architecture defines a learnable function and involves key choices—such as which operators to use (e.g., attention, convolution) and how to configure them (e.g., model depth, width). Despite its critical role, insight into architectures—what works and what doesn’t—is difficult to obtain, due to the prohibitive cost of training models from scratch, especially in today’s foundation model era. As a result, exploring new architectures remains a major challenge, particularly for generative models.
By Rom Parnichkun, Neehal Tumma, Stefano Massaroli, Michael Poli, Armin Thomas, and 1 other
Even with the same state/cache size, models can differ significantly in how well they utilize memory, which affects recall, compression, and trainability. We introduce Effective State-Size (ESS), a proxy metric for memory utilization.
By Armin Thomas, Stefano Massaroli, Michael Poli, and Liquid Edge Team
Today, we introduce a Liquid architecture called Hyena Edge, a convolution-based multi-hybrid model that not only matches but outperforms strong Transformer-based baselines in computational efficiency and model quality on edge hardware, benchmarked on the Samsung S24 Ultra smartphone. To design Hyena Edge, we use our recently proposed end-to-end automated model design framework.
Today, we report advances in automated neural network architecture design and customization. We developed algorithms for the synthesis of tailored architectures (STAR), based on evolutionary algorithms applied to a numerical representation for model architectures derived from a new design theory. STAR automates the process of architecture discovery and optimization, turning it into an end-to-end process. With these methods, we have been able to tailor architectures to custom tasks, metrics, and hardware. We used STAR to synthesize hundreds of different designs that outperform strong Transformer and hybrid architectures in quality, with smaller caches and fewer parameters.
We invented liquid neural networks, a class of brain-inspired systems that can stay adaptable and robust to changes even after training [R. Hasani, PhD Thesis] [Lechner et al. Nature MI, 2020] (2016-2020). We then analytically and experimentally showed that they are universal approximators [Hasani et al. AAAI, 2021], expressive continuous-time machine learning systems for sequential data [Hasani et al. AAAI, 2021] [Hasani et al. Nature MI, 2022], parameter efficient in learning new skills [Lechner et al. Nature MI, 2020], causal and interpretable [Vorbach et al. NeurIPS, 2021] [Chahine et al. Science Robotics 2023], and that, when linearized, they can efficiently model very long-term dependencies in sequential data [Hasani et al. ICLR 2023].