In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. Subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference that is akin to, but distinct from, GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed in the ICL linear regression setup. Our experiments confirm this result and further show that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce
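For concreteness, the following is a minimal NumPy sketch (not the authors' code) of the one-step GD baseline the abstract refers to: a single gradient step on the in-context least-squares loss, taken either from zero initialization or from a non-zero initial guess given by the prior mean. The variable names, dimensions, and learning rate are illustrative assumptions.

```python
# Hypothetical sketch: in-context linear regression where the predictor takes
# ONE gradient-descent step on the context examples, starting from an initial
# guess w0 (zero, or the non-zero Gaussian prior mean, as in the abstract).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20                                  # input dimension, context size

w_prior_mean = np.ones(d)                     # non-zero prior mean (assumed)
w_true = w_prior_mean + rng.normal(size=d)    # task weights drawn around the prior mean

X = rng.normal(size=(n, d))                   # in-context inputs
y = X @ w_true                                # in-context targets (noiseless for simplicity)
x_query = rng.normal(size=d)                  # query input

def one_step_gd_prediction(X, y, x_query, w0, lr):
    """Predict the query label after a single GD step on the in-context
    least-squares loss, starting from the initial guess w0."""
    grad = X.T @ (X @ w0 - y) / len(y)        # gradient of 0.5 * mean squared error
    w1 = w0 - lr * grad
    return x_query @ w1

# Zero initialization (the classical LSA <-> GD correspondence) vs. starting
# from the prior mean (the "initial guess" added to the embedding matrix).
print(one_step_gd_prediction(X, y, x_query, np.zeros(d), lr=1.0))
print(one_step_gd_prediction(X, y, x_query, w_prior_mean, lr=1.0))
print(x_query @ w_true)                       # ground-truth query label for comparison
```

When the prior mean is non-zero, the two starting points yield different one-step predictions, which is the kind of gap between one-step GD and multi-head LSA that the abstract discusses.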
The initialization determines whether in-context learning is gradient descent
Transactions on Machine Learning Research, December 2025
Type: Journal
Date: 2025-12-07
Department: Data Science
Eurecom Ref: 8536
Copyright: © EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Transactions on Machine Learning Research, December 2025 and is available at:
See also: PERMALINK: https://www.eurecom.fr/publication/8536