Embedding distance as a reward signal can replace verifiers for LLM reasoning

Benechehab, Abdelhakim; El Hili, Youssef Attia; Thomas, Albert; Paolo, Giuseppe; Filippone, Maurizio
ICLR 2026, 14th International Conference on Learning Representations, Workshop LLM Reasoning, 23-27 April 2026, Rio de Janeiro, Brazil

Reinforcement Learning (RL) has emerged as a powerful paradigm for adapting Large Language Models (LLMs), offering advantages over Supervised Fine-Tuning (SFT) including reduced catastrophic forgetting and improved generalization. However, these benefits require explicit reward signals, often obtained from human preferences or verifiable outcomes, which are unavailable in many settings. We address this gap by introducing a framework that derives reward functions directly from supervised data, enabling RL-based training without additional annotation. Our approach formulates the reward as a weighted distance between the embeddings of the label and the generated answer. Experiments on fine-tuning LLMs for a reasoning task demonstrate that our learned rewards match the performance of an oracle RL baseline with access to ground-truth rewards.
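A minimal sketch of the reward formulation described above, assuming a generic sentence encoder; the encoder choice, the weighting scheme, and the helper names below are illustrative placeholders rather than the paper's exact setup:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Placeholder encoder; the abstract does not specify which embedding model is used.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def embedding_reward(answer, label, weights=None):
        # Embed the generated answer and the ground-truth label.
        e_ans, e_lab = encoder.encode([answer, label])
        if weights is None:
            weights = np.ones_like(e_ans)  # unweighted distance as a default
        # Reward = negative weighted Euclidean distance between the two
        # embeddings: closer embeddings yield a higher reward.
        return -float(np.sqrt(np.sum(weights * (e_ans - e_lab) ** 2)))

    # Example: the reward grows as the generated answer approaches the label.
    print(embedding_reward("The answer is 42.", "42"))

Such a reward can then be plugged into a standard RL fine-tuning loop in place of a verifier or human-preference signal.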


Type:
Conference
City:
Rio de Janeiro
Date:
2026-04-23
Department:
Data Science
Eurecom Ref:
8667
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in ICLR 2026, 14th International Conference on Learning Representations, Workshop LLM Reasoning, 23-27 April 2026, Rio de Janeiro, Brazil and is available at:

PERMALINK : https://www.eurecom.fr/publication/8667