We propose a multimodal CNN-based model for appearance-based gaze estimation. The basic structure of this CNN model consists of two convolutional layers and one fully connected layer, with a linear regression layer on top. We feed head pose information into the model by concatenating the head pose angle vector with the output of the fully connected layer. The architecture of our CNN-based model is shown in the following figure.
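The head pose injection described above can be sketched in a few lines of plain Python. This is only an illustration of the concatenation step followed by a linear regression, not the released Caffe model: the function name `concat_and_regress`, the toy feature size, and all weight values are hypothetical.

```python
def concat_and_regress(fc_features, head_pose, weights, bias):
    """Concatenate FC features with head pose angles, then apply a linear layer."""
    # Joint feature vector: FC output followed by the head pose angles.
    joint = list(fc_features) + list(head_pose)
    if any(len(row) != len(joint) for row in weights):
        raise ValueError("each weight row must match the joint feature length")
    # One output per weight row: dot(row, joint) + bias.
    return [
        sum(w * v for w, v in zip(row, joint)) + b
        for row, b in zip(weights, bias)
    ]


# Toy sizes (assumed): 4 FC features plus 2 head pose angles -> 2 gaze angles.
fc_features = [0.5, -1.0, 0.25, 2.0]
head_pose = [0.1, -0.2]
weights = [
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 1.0],
]
bias = [0.0, 0.1]
gaze = concat_and_regress(fc_features, head_pose, weights, bias)
```

Because the head pose enters only at the regression stage, the convolutional layers see the eye image alone, and the final layer combines both modalities.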
We train this model with Caffe. You can download the configuration file here. The main parameter, the learning rate, is set to 0.1.
Please note that you also need to modify the "accuracy" and "euclidean distance" layers of Caffe. You can download the modified layers here.
Q & A
1. How do you convert a .mat file to an .h5 file?
Please find the example Matlab script here.
2. How do you convert a 3D direction vector to 2D angles?
We refer to the paper  for the data normalization.
Briefly, the 3D gaze direction (x, y, z) can be converted to a 2D representation (theta, phi) as follows:
- theta = asin(-y)
- phi = atan2(-x, -z)
The components are negated so that the direction looking straight at the camera becomes (0,0).
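The gaze conversion above can be written as a short Python function. This is a minimal sketch: the function name `gaze_vector_to_angles` is hypothetical, and normalizing the input to unit length before applying the formulas is an assumption added here so that `asin` always receives a valid argument.

```python
import math


def gaze_vector_to_angles(x, y, z):
    """Convert a 3D gaze direction to (theta, phi) in radians."""
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("gaze direction must be non-zero")
    # Assumption: normalize first so that -y stays inside [-1, 1].
    x, y, z = x / norm, y / norm, z / norm
    theta = math.asin(-y)      # vertical angle
    phi = math.atan2(-x, -z)   # horizontal angle
    return theta, phi


# A gaze vector pointing along -z (towards the camera) maps to (0, 0).
theta, phi = gaze_vector_to_angles(0.0, 0.0, -1.0)
```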
In contrast, the 3D head rotation vector (x, y, z) can be converted to (theta, phi) as follows:
- M = Rodrigues((x,y,z))
- Zv = (the third column of M), with components Zv[0], Zv[1], Zv[2]
- theta = asin(Zv[1])
- phi = atan2(Zv[0], Zv[2])
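The head rotation conversion can be sketched in pure Python as follows. Two assumptions are made here: theta is taken from the second component of Zv and phi from the first and third components (by analogy with the gaze formulas, without the negation), and the third column of M is computed directly from the Rodrigues rotation formula instead of calling a library routine such as OpenCV's `cv2.Rodrigues`. The function names are hypothetical.

```python
import math


def rotation_z_axis(rx, ry, rz):
    """Third column of the rotation matrix for the axis-angle vector (rx, ry, rz)."""
    angle = math.sqrt(rx * rx + ry * ry + rz * rz)
    if angle < 1e-12:
        return (0.0, 0.0, 1.0)  # no rotation: third column of the identity
    kx, ky, kz = rx / angle, ry / angle, rz / angle
    s, c = math.sin(angle), math.cos(angle)
    # Third column of R = c*I + s*[k]x + (1 - c)*k*k^T.
    return (
        s * ky + (1.0 - c) * kx * kz,
        -s * kx + (1.0 - c) * ky * kz,
        c + (1.0 - c) * kz * kz,
    )


def head_rotation_to_angles(rx, ry, rz):
    """Convert a 3D head rotation vector to (theta, phi) in radians."""
    zv = rotation_z_axis(rx, ry, rz)
    # Clamp guards against rounding pushing the value just outside [-1, 1].
    theta = math.asin(max(-1.0, min(1.0, zv[1])))
    phi = math.atan2(zv[0], zv[2])
    return theta, phi


# A pure rotation about the y axis changes only phi.
theta, phi = head_rotation_to_angles(0.0, 0.3, 0.0)
```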
3. Why do I get "nan" during training?
There are usually two causes of "nan".
First, if you get "nan" all the time, the values are probably exceeding the range of the float type. This can happen with inappropriate layer initialization. Since Caffe is continually updated, the layer initialization in my configuration may no longer work. In that case, please adjust the layer initialization parameters yourself, such as the "std" value.
Second, if you get "nan" only from time to time, it can be caused by a calculation exception. I modified the accuracy layer and the euclidean loss layer to report the "angle difference", which calls the function "acos". It sometimes outputs "nan" because its argument falls outside the range [-1, 1]. However, this value is only for display, so it does not affect the training.
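If the occasional "nan" in the reported angle difference is distracting, one common remedy is to clamp the cosine before calling "acos". The sketch below shows the idea in Python; it is not the code of the modified Caffe layers, and the function name `angle_difference_deg` is hypothetical. Note that Python's `math.acos` raises a `ValueError` for out-of-range input instead of returning "nan" as the C++ function does, but the clamp applies in the same way.

```python
import math


def angle_difference_deg(u, v):
    """Angle in degrees between two 3D vectors, robust to rounding error."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    # Rounding can push the cosine slightly outside [-1, 1]; clamp it.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Orthogonal vectors are 90 degrees apart.
right_angle = angle_difference_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```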