In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner's and the expert's features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting that behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.
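To make the factorization concrete, here is a minimal sketch of the identity it relies on (notation ours): if the reward is linear in base features, $r(s) = \langle \phi(s), w \rangle$, then the return of a policy $\pi$ factorizes as

$$ J(\pi) \;=\; \mathbb{E}_\pi\Big[\sum_{t \ge 0} \gamma^t r(s_t)\Big] \;=\; \Big\langle \mathbb{E}_\pi\Big[\sum_{t \ge 0} \gamma^t \phi(s_t)\Big],\, w \Big\rangle \;=\; \langle \psi^\pi, w \rangle, $$

so by Cauchy-Schwarz, $|J(\pi^E) - J(\pi)| \le \|w\|\,\|\psi^{\pi^E} - \psi^\pi\|$ for every reward in this class. Driving the feature gap $\|\psi^{\pi^E} - \psi^\pi\|$ to zero therefore bounds the imitation gap without ever recovering $w$.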
Adversarial IRL faces well-documented challenges: it requires tricks to stabilize learning, and it repeatedly solves a costly RL problem in the inner loop of a bilevel optimization. The task becomes even more challenging without access to expert actions in the demonstrations. We present a simpler approach to imitation learning, which we call Successor Feature Matching (SFM). SFM minimizes the gap between the learner's and the expert's expected features through direct policy optimization, via a reduction to an RL problem. Furthermore, with state-only base features used to estimate the successor features, we show that SFM can learn an imitation policy without action labels in the demonstrations. A brief pseudocode sketch of the learning procedure is provided below.
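The sketch below shows one SFM training iteration in PyTorch-style code, assuming a TD3-style deterministic actor. The network definitions, hyperparameters, and names (phi, sf_net, sfm_update) are our own illustrative choices rather than the authors' implementation; in particular, the actor update shown here descends a squared feature-matching gap through the learned successor-feature network, which is one plausible instantiation of the feature-matching policy gradient.

```python
# Illustrative sketch of one Successor Feature Matching (SFM) update on top of a
# TD3-style actor. Synthetic data is used so the snippet runs as-is.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, FEAT_DIM, GAMMA = 17, 6, 32, 0.99

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out))

phi = mlp(STATE_DIM, FEAT_DIM)                  # state-only base features phi(s), held fixed here for simplicity
sf_net = mlp(STATE_DIM + ACTION_DIM, FEAT_DIM)  # successor features psi(s, a)
actor = nn.Sequential(mlp(STATE_DIM, ACTION_DIM), nn.Tanh())

sf_opt = torch.optim.Adam(sf_net.parameters(), lr=3e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def sfm_update(batch, expert_states):
    """One SFM step: TD-learn psi, then update the actor on the feature gap."""
    s, a, s_next = batch

    # (1) Successor-feature TD update: psi(s, a) ~ phi(s) + gamma * psi(s', pi(s'))
    with torch.no_grad():
        target = phi(s) + GAMMA * sf_net(torch.cat([s_next, actor(s_next)], -1))
    sf_loss = ((sf_net(torch.cat([s, a], -1)) - target) ** 2).mean()
    sf_opt.zero_grad(); sf_loss.backward(); sf_opt.step()

    # (2) Expert successor features from state-only demonstrations:
    #     discounted empirical sum of base features along the expert trajectory.
    with torch.no_grad():
        discounts = GAMMA ** torch.arange(expert_states.shape[0], dtype=torch.float32)
        psi_expert = (discounts.unsqueeze(-1) * phi(expert_states)).sum(0)

    # (3) Actor update: shrink the gap between learner and expert features.
    psi_pi = sf_net(torch.cat([s, actor(s)], -1)).mean(0)
    actor_loss = (psi_pi - psi_expert).pow(2).sum()   # squared feature-matching gap
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return sf_loss.item(), actor_loss.item()

# Synthetic batch and expert trajectory, just to exercise the call signature.
batch = (torch.randn(64, STATE_DIM), torch.randn(64, ACTION_DIM), torch.randn(64, STATE_DIM))
expert_states = torch.randn(200, STATE_DIM)
print(sfm_update(batch, expert_states))
```

Note that no reward model or discriminator appears anywhere in the loop: the only learned components are the successor-feature network and the policy, which is why the procedure composes directly with off-the-shelf actor-critic optimizers.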
We conduct experiments on 10 environments from the DMControl Suite and compare against Behavior Cloning (BC), a supervised offline method; IQ-Learn, a non-adversarial IRL method that uses expert action labels; and our implementations of the adversarial state-only baselines MM and GAIfO. The architectures of SFM and the state-only baselines are based on the TD7 algorithm. Our method outperforms all baselines when trained with a single expert demonstration.
We also experiment with a simpler RL optimizer, in which the architectures of SFM and the state-only baselines are based on the TD3 algorithm. Remarkably, the performance of SFM (TD3) is similar, demonstrating that our non-adversarial method can learn with other off-the-shelf RL algorithms. However, the adversarial baselines did not perform as well when built on TD3.