抄録
Recent Human-object interaction detection (HOID) methods highly require prior
knowledge from VLMs to enhance the interaction recognition capabilities. The
training strategies and model architectures for connecting the knowledge from
VLMs to the HOI instance representations from the object detector are
challenging, and the whole framework is complex for further development or
application. On the other hand, the inherent reasoning abilities of MLLMs on
human-object interaction detection are under-explored. Inspired by the recent
success of training MLLMs with reinforcement learning (RL) methods, we propose
HOI-R1 and first explore the potential of the language model on the HOID task
without any additional detection modules. We introduce an HOI reasoning process
and HOID reward functions to solve the HOID task by pure text. The results on
the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline
with great generalization ability. The source code is available at
https://github.com/cjw2021/HOI-R1.