Pareto Q-Learning with Reward Machines

By Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières

arXiv · AI/CL/LG · Jun 17, 2026

PQLRM combines Pareto Q-learning with reward machines for sample-efficient multi-objective reinforcement learning under non-Markovian rewards.

Categories: Research

Excerpt

We present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for tasks whose reward structure is specified by a set of reward machines (RMs). PQLRM combines Pareto Q-Learning (PQL), which maintains sets of vector-valued Q-estimates to approximate the Pareto front, with enhancements from Q-Learning with Reward Machines (QRM), which exploits the factored automaton structure of the reward signal. This yields a multi-policy algorithm that remains sample-efficient under non-Markovian, RM-encoded rewards. Experimental trials show that PQLRM converges faster than a naive PQL baseline applied to the cross-product MDP and can synthesize Pareto-optimal policies that QRM cannot.
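
To make the two ingredients named in the abstract concrete, here is a minimal Python sketch of (a) a reward machine that turns a stream of event labels into rewards and (b) the Pareto filter that keeps only non-dominated reward vectors. It is an illustration under stated assumptions, not the authors' implementation: the names `RewardMachine`, `joint_step`, `dominates`, and `pareto_front`, the string labels, and the rule that unlisted transitions self-loop with zero reward are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors: List[Vector]) -> List[Vector]:
    """Keep only the non-dominated vectors, collapsing duplicates and preserving order."""
    unique = list(dict.fromkeys(vectors))
    return [v for v in unique if not any(dominates(w, v) for w in unique)]


@dataclass
class RewardMachine:
    """Hypothetical reward machine: a finite automaton over event labels."""
    initial: str
    # (rm_state, label) -> (next_rm_state, scalar reward)
    delta: Dict[Tuple[str, str], Tuple[str, float]]

    def step(self, state: str, label: str) -> Tuple[str, float]:
        # Assumption: unlisted (state, label) pairs self-loop with zero reward.
        return self.delta.get((state, label), (state, 0.0))


def joint_step(
    rms: Sequence[RewardMachine], states: Sequence[str], label: str
) -> Tuple[Tuple[str, ...], Vector]:
    """Advance every RM on the same label; one RM per objective yields a reward vector."""
    results = [rm.step(u, label) for rm, u in zip(rms, states)]
    return tuple(n for n, _ in results), tuple(r for _, r in results)


# Two objectives, each rewarded the first time its own event is observed.
rm_a = RewardMachine("u0", {("u0", "a"): ("u1", 1.0)})
rm_b = RewardMachine("u0", {("u0", "b"): ("u1", 1.0)})

states, reward = joint_step([rm_a, rm_b], ("u0", "u0"), "a")
front = pareto_front([(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.0, 0.0), (1.0, 0.0)])
```

In this sketch the reward for label `a` depends on the RM state rather than on the environment state alone, which is how an RM encodes a non-Markovian reward. The tuple of RM states returned by `joint_step` is the automaton part of the cross-product state that the abstract's naive PQL baseline would learn over directly.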

Read at source: https://arxiv.org/abs/2606.19134v1