Abstract
In this paper, we propose a method for constructing a gaze direction estimation model based on the Vision Transformer (ViT) as a fundamental technology for input interfaces driven by eye movement. Machine-learning-based eye movement measurement methods enable highly accurate estimation but demand substantial computational resources. We therefore consider a calibration-free, lightweight gaze direction classification model designed with implementation in gaze input interfaces in mind. The proposed method constructs the classification model by fine-tuning a large-scale pre-trained ViT model on a dataset we constructed. The training dataset was built by extracting the face region as a still image from each frame of video captured by a webcam and then cropping the area around the eyes. We also conducted an experiment to evaluate the performance of the gaze direction estimation model constructed with the proposed method. The accuracy and macro-averaged F1 score of the proposed method were approximately 19.0 points higher than those of the conventional method, confirming an overall improvement in classification performance under calibration-free conditions.
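To make the fine-tuning step concrete, the following is a minimal sketch of adapting a pre-trained ViT to gaze direction classification, assuming a PyTorch/torchvision environment and eye-region crops already extracted from webcam frames; the class count, dataset path, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (not the authors' implementation): fine-tune a pre-trained ViT
# for gaze direction classification, assuming PyTorch + torchvision and an
# ImageFolder-style dataset of eye-region crops organized by gaze direction.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_DIRECTIONS = 9  # assumed number of gaze-direction classes (illustrative)

# ViT expects fixed-size inputs; resize the eye-region crops accordingly.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical dataset layout: one subfolder per gaze direction label.
train_set = datasets.ImageFolder("data/eye_crops/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Load a large-scale pre-trained ViT and replace its classification head
# with one sized for the gaze-direction classes.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_DIRECTIONS)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # illustrative epoch count
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

In practice, the eye-region crops fed to such a model would be obtained by detecting the face region in each webcam frame and cropping around the eyes, as described above.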