TY - JOUR
T1 - StARformer: Transformer With State-Action-Reward Representations for Robot Learning
T2 - IEEE Transactions on Pattern Analysis and Machine Intelligence
AU - Shang, Jinghuan
AU - Li, Xiang
AU - Kahatapitiya, Kumara
AU - Lee, Yu-Cheol
AU - Ryoo, Michael S.
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/11/1
Y1 - 2023/11/1
AB - Reinforcement Learning (RL) can be considered a sequence modeling task, in which an agent uses a sequence of past state-action-reward experiences to predict a sequence of future actions. In this work, we propose the State-Action-Reward Transformer (StARformer), a Transformer architecture for robot learning with image inputs, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. StARformer first extracts StAR-representations by self-attending over image-state patches, action tokens, and reward tokens within a short temporal window. These StAR-representations are then combined with pure image-state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experimental results show that StARformer outperforms the state-of-the-art Transformer-based method on the image-based Atari and DeepMind Control Suite benchmarks, in both offline-RL and imitation-learning settings. We find that models benefit from our combination of patch-wise and convolutional image embeddings. StARformer also handles longer input sequences better than the baseline method. Finally, we demonstrate how StARformer can be successfully applied to a real-world robot imitation-learning setting via a human-following task.
KW - imitation learning
KW - reinforcement learning
KW - robot learning
KW - Transformer
UR - https://www.scopus.com/pages/publications/85137939771
U2 - 10.1109/TPAMI.2022.3204708
DO - 10.1109/TPAMI.2022.3204708
M3 - Article
C2 - 36067106
AN - SCOPUS:85137939771
SN - 0162-8828
VL - 45
SP - 12862
EP - 12877
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 11
ER -