Revisiting Data Normalization for Appearance-Based Gaze Estimation


Appearance-based gaze estimation is promising for unconstrained real-world settings, but the large variability in head pose and user-camera distance poses significant challenges for training generic gaze estimators. Data normalization was proposed to cancel out this geometric variability by mapping input images and gaze labels to a normalized space. Although used successfully in prior works, the role and importance of data normalization remains unclear. To fill this gap, we study data normalization for the first time using principled evaluations on both simulated and real data. We propose a modification to the current data normalization formulation by removing the scaling factor and show that our new formulation performs significantly better (between 9.5% and 32.7%) in the different evaluation settings. Using images synthesized from a 3D face model, we demonstrate the benefit of data normalization for the efficiency of the model training. Experiments on real-world images confirm the advantages of data normalization in terms of gaze estimation performance.

Data Normalization

Data normalization was originally proposed by Sugano et al. [1]. The normalization scheme aims at canceling variations in the eye image appearance as much as possible. The key idea is to standardize the translation and rotation between the camera and face coordinate systems via camera rotation and scaling.

The process starts from an arbitrary pose of the target face. The pose is defined as a rotation and translation of the head coordinate system with respect to the camera coordinate system, and the right-handed head coordinate system is defined according to the triangle connecting three midpoints of the eyes and mouth. The x-axis is defined as the line connecting midpoints of the two eyes from right eye to left eye, and the y-axis is defined as perpendicular to the x-axis inside the triangle plane from the eye to the mouth. The z-axis is perpendicular to the triangle and pointing backwards from the face.
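The head coordinate system described above can be sketched in a few lines of NumPy. The function name, the midpoint arguments, and the use of NumPy are our own illustration; the paper defines the axes purely geometrically from the eye/mouth triangle.

```python
import numpy as np

def head_coordinate_system(mid_right_eye, mid_left_eye, mid_mouth):
    """Right-handed head coordinate axes from the eye/mouth midpoint triangle.

    All inputs are 3D points in camera coordinates. Returns a 3x3 matrix
    whose columns are the head x-, y-, and z-axes.
    """
    # x-axis: from the right-eye midpoint to the left-eye midpoint.
    x = mid_left_eye - mid_right_eye
    x = x / np.linalg.norm(x)
    # z-axis: perpendicular to the triangle plane, pointing backwards
    # from the face (x crossed with an in-plane vector towards the mouth).
    z = np.cross(x, mid_mouth - mid_right_eye)
    z = z / np.linalg.norm(z)
    # y-axis: completes the right-handed system, pointing from eyes to mouth.
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)
```

For a frontal face (eyes level, mouth below), this yields axes aligned with the usual camera convention: x to the subject's left, y downwards towards the mouth, z away from the camera.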

To simplify the notation of eye image normalization, we use the midpoint of the right eye as the origin of the head coordinate system, and we denote the translation and rotation from the camera coordinate system to the head coordinate system as er and Rr.

Given this initial condition, the normalization process transforms the input image so that the normalized image meets three conditions. First, the normalized camera looks at the origin of the head coordinate system and the center of the eye is located at the center of the normalized image. Second, the x-axes of the head and camera coordinate systems are on the same plane, i.e., the x-axis of the head coordinate system appears as a horizontal line in the normalized image. Third, the normalized camera is located at a fixed distance dn from the eye center and the eye always has the same size in the normalized image.

The rotation matrix R achieves the first and second conditions, and the scaling matrix S meets the third condition. The overall transformation matrix is therefore defined as M = SR. The details can be found in our paper.
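Under the three conditions above, R and S can be sketched as follows. This is a minimal NumPy sketch under our own conventions: the normalized camera's z-axis points at the eye center, its x-axis is made orthogonal to the head x-axis, and the default distance dn is an assumed value, not one prescribed by the paper.

```python
import numpy as np

def normalizing_transform(e_r, R_r, d_n=0.6):
    """Sketch of the normalizing transformation M = SR.

    e_r : eye center in camera coordinates (3-vector)
    R_r : 3x3 head rotation; columns are the head axes in camera coordinates
    d_n : fixed distance from the normalized camera to the eye (assumed value)
    """
    d = np.linalg.norm(e_r)
    z_c = e_r / d                      # condition 1: camera looks at the eye
    h_x = R_r[:, 0]                    # head x-axis in camera coordinates
    y_c = np.cross(z_c, h_x)          # condition 2: align with the head x-axis
    y_c = y_c / np.linalg.norm(y_c)
    x_c = np.cross(y_c, z_c)
    R = np.stack([x_c, y_c, z_c])     # rows are the normalized camera axes
    S = np.diag([1.0, 1.0, d_n / d])  # condition 3: scale eye distance to d_n
    return S @ R
```

For a frontal head already at distance dn, M reduces to the identity; for a head twice as far away, M brings the eye center back to distance dn along the camera axis.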

In the extreme case where the input is a 3D face mesh, the transformation matrix M can be applied directly to the input mesh, which then appears in the normalized space with restricted head pose variation. Since the transformation M is defined as a rotation and scaling, we can apply a perspective image warping to achieve the same effect if the input is a 2D face image.
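For the 2D case, the perspective warp is commonly composed from M and the camera projection matrices as W = Cn · M · Cr⁻¹, where Cr is the original camera matrix and Cn the camera matrix of the normalized camera. The sketch below uses hypothetical intrinsics for both cameras; the resulting 3x3 matrix could then be applied with, e.g., cv2.warpPerspective.

```python
import numpy as np

def warp_matrix(M, C_r, C_n):
    """3x3 perspective warp from the original to the normalized image plane."""
    return C_n @ M @ np.linalg.inv(C_r)

# Hypothetical intrinsics: a 1280x720 input camera and a 224x224 normalized
# camera. The warp would be applied as cv2.warpPerspective(image, W, (224, 224)).
C_r = np.array([[960.0, 0.0, 640.0],
                [0.0, 960.0, 360.0],
                [0.0, 0.0, 1.0]])
C_n = np.array([[960.0, 0.0, 112.0],
                [0.0, 960.0, 112.0],
                [0.0, 0.0, 1.0]])
W = warp_matrix(np.eye(3), C_r, C_n)
```

With M set to the identity, W simply re-centers the image: the principal point of the original camera maps to the principal point of the normalized camera.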

Modified Data Normalization

Assuming 3D data, Sugano et al. [1] originally proposed to apply the same transformation matrix to the gaze vector as gn = Mgr. However, while in 3D space the same transformation can be applied consistently to the original gaze vector gr, this assumption is not precise enough when dealing with 2D images. In contrast to the original 2D data normalization method, we propose to only rotate the original gaze vector to obtain the normalized gaze vector gn = Rgr.
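The modified label normalization is a one-liner: rotate the gaze direction with R only, and renormalize to unit length since gaze is a direction, not a position. A minimal sketch (function name is ours):

```python
import numpy as np

def normalize_gaze(g_r, R):
    """Modified normalization: rotate the gaze vector only, without scaling."""
    g_n = R @ g_r
    return g_n / np.linalg.norm(g_n)  # keep the gaze direction at unit length
```

Because no scaling is applied, the angle between two gaze directions is preserved exactly, which is what removes the distortion introduced by the original gn = Mgr formulation.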

Source Code

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The code can be downloaded from here.

Contact: Xucong Zhang, Campus E1.4 room 609, E-mail:

If you use this code in scientific publication, please cite the following paper:

  • Xucong Zhang; Yusuke Sugano; Andreas Bulling. Revisiting Data Normalization for Appearance-Based Gaze Estimation. Proc. International Symposium on Eye Tracking Research and Applications (ETRA), pp. 12:1-12:9, 2018.
  • PDF
  • @inproceedings{zhang18_etra,
    title = {Revisiting Data Normalization for Appearance-Based Gaze Estimation},
    author = {Xucong Zhang and Yusuke Sugano and Andreas Bulling},
    year = {2018},
    date = {2018-03-28},
    booktitle = {Proc. International Symposium on Eye Tracking Research and Applications (ETRA)},
    pages = {12:1-12:9},
    tppubtype = {inproceedings}
    }


[1] Y. Sugano, Y. Matsushita, and Y. Sato. Learning-by-synthesis for appearance-based 3d gaze estimation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1821–1828. IEEE, 2014.