Even at the same state (or cache) size, models can differ significantly in how well they utilize memory, which impacts recall, compression, and trainability.
We introduce Effective State-Size (ESS), a proxy metric for memory utilization.
At their core, many deep learning sequence models, including attention, SSMs, and gated convolutions, can be expressed as y = T(u)u, where T(u) is an input-dependent operator matrix mapping the input sequence u to the output y.
By extending classic signal processing results, we show that any equivalent recurrence must materialize a state whose size is at least the rank of the submatrices of T(u) that map past inputs to future outputs. We define this rank as the ESS and interpret it as a measure of the model’s memory utilization.
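
As a rough illustration, here is a minimal sketch of a per-timestep ESS computation, assuming T(u) has been materialized as a causal (lower-triangular) matrix and using a tolerance-based numerical rank as a simple stand-in for the effective-rank measures used in the paper; the helper `effective_state_size` and the toy operator are illustrative, not from the paper's code.

```python
import numpy as np

def effective_state_size(T: np.ndarray, tol: float = 1e-6) -> np.ndarray:
    """Per-timestep ESS of a causal (lower-triangular) operator matrix T.

    ESS at time t is taken here as the numerical rank of T[t:, :t], the
    block of T that maps inputs before time t to outputs from time t on.
    Any recurrence realizing the same operator must carry a state of at
    least this size across the boundary at time t.
    """
    L = T.shape[0]
    ess = np.zeros(L, dtype=int)
    for t in range(1, L):
        ess[t] = np.linalg.matrix_rank(T[t:, :t], tol=tol)
    return ess

# Toy example: a generic causal operator on a length-16 sequence.
rng = np.random.default_rng(0)
T = np.tril(rng.standard_normal((16, 16)))
print(effective_state_size(T))  # generically grows as min(t, L - t)
```
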
Our analysis of ESS reveals several key insights:
- State compression: Sequence models with high ESS (high memory utilization) are harder to distill into smaller-state students.
- Initialization and featurization: ESS can inform the design of initialization and featurization schemes that improve recall performance.
- State modulation: ESS can be tracked at each time step, revealing how models adjust memory usage in response to context, e.g., around EOS tokens (see the sketch after this list). We find that LLMs that modulate ESS more effectively tend to perform better on recall-intensive tasks.
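
To make the per-token tracking concrete, here is a toy sketch under the assumption that, for single-head causal softmax attention (ignoring value and output projections), T(u) is simply the masked attention matrix; `attention_operator` and `ess_per_token` are hypothetical helpers for illustration only.

```python
import numpy as np

def attention_operator(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Causal softmax attention matrix A, so that y = A v, i.e. T(u) = A."""
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((L, L), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def ess_per_token(T: np.ndarray, tol: float = 1e-6) -> np.ndarray:
    """ESS at each time t: numerical rank of the past-to-future block T[t:, :t]."""
    L = T.shape[0]
    return np.array([np.linalg.matrix_rank(T[t:, :t], tol=tol) if t > 0 else 0
                     for t in range(L)])

# Track how ESS evolves over a toy sequence; q and k stand in for features of u.
rng = np.random.default_rng(0)
L, d = 32, 8
q, k = rng.standard_normal((L, d)), rng.standard_normal((L, d))
print(ess_per_token(attention_operator(q, k)))
```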

This work was accepted at ICML 2025.
For all the details, see the paper: “Quantifying Memory Utilization with Effective State-Size”.