Discovery of Everyday Human Activities From Long-term Visual Behaviour Using Topic Models

Abstract

Human visual behaviour has significant potential for activity recognition and computational behaviour analysis, but previous works focused on supervised methods and recognition of predefined activity classes based on short-term eye movement recordings. We propose a fully unsupervised method to discover users’ everyday activities from their long-term visual behaviour. Our method combines a bag-of-words representation of visual behaviour that encodes saccades, fixations, and blinks with a latent Dirichlet allocation (LDA) topic model. We further propose different methods to encode saccades for their use in the topic model. We evaluate our method on a novel long-term gaze dataset that contains full-day recordings of natural visual behaviour of 10 participants (more than 80 hours in total). We also provide annotations for eight sample activity classes (outdoor, social interaction, focused work, travel, reading, computer work, watching media, eating) and periods with no specific activity. We show the ability of our method to discover these activities with performance competitive with that of previously published supervised methods.

Dataset

We were able to record a dataset of more than 80 hours of eye tracking data (see Table 1 for an overview and Figure 1 for sample images). The dataset comprises 7.8 hours of outdoor activities, 14.3 hours of social interaction, 31.3 hours of focused work, 8.3 hours of travel, 39.5 hours of reading, 28.7 hours of computer work, 18.3 hours of watching media, 7 hours of eating, and 11.4 hours of other (special) activities. Note that annotations are not mutually exclusive, i.e. these durations should be seen independently and sum up to more than the actual dataset size. Most of our participants were students and wore the eye tracker through one day of their normal university life. This is reflected in the overall predominant activities, namely focused work, reading, and computer work. Otherwise, as can also be seen from the table, our dataset contains significant variability with respect to participants’ daily routines and consequently the number, type, and distribution of activities that they performed. For example, while P1 wore the eye tracker during a normal working day at the university, P7 and P9 recorded at a weekend and stayed at home all day mainly reading and working on the computer (P7) or watching movies (P9) with little or no social interactions.

Figure 1: Sample scene images for each activity class annotated in our dataset showing the considerable variability in terms of place and time of recording. The red dot indicates the gaze location in that particular image.

Figure 2: Recording setup consisting of a laptop with an additional external hard drive and battery pack, as well as a PUPIL head-mounted eye tracker (a). Recording hardware worn by a participant (b).

Apparatus

The recording system consisted of a Lenovo Thinkpad X220 laptop, an additional 1TB hard drive and battery pack, as well as an external USB hub. Gaze data was collected using a PUPIL head-mounted eye tracker connected to the laptop via USB (see Figure 2).. The eye tracker features two cameras: one eye camera with a resolution of 640x360 pixels recording a video of the right eye from close proximity, as well as an egocentric (scene) camera with a resolution of 1280x720 pixels. Both cameras record at 30 Hz. The battery lifetime of the system was four hours. We implemented custom recording software with a particular focus on ease of use as well as the ability to easily restart a recording if needed.

Procedure

We recruited 10 participants (three female) aged between 17 and 25 years through university mailing lists and adverts in university buildings. Most participants were bachelor’s and master’s students in computer science and chemistry. None of them had previous experience with eye tracking. After arriving in the lab, participants were first introduced to the purpose and goals of the study and could familiarise themselves with the recording system. In particular, we showed them how to start and stop the recording software, how to run the calibration procedure, and how to restart the recording. We then asked them to take the system home and wear it continuously for a full day from morning to evening. We asked participants to plug in and recharge the laptop during prolonged stationary activities, such as at their work desk. We did not impose any other restrictions on these recordings, such as which day of the week to record or which activities to perform, etc.

Ground Truth Annotation

For evaluation purposes, the full dataset was annotated posthoc from the scene videos by a paid human annotator with a set of nine non-mutually-exclusive ground truth activity labels (see Table 1). Specifically, we included labels for whether the participant was inside or outside (outdoor), took part in social interaction, did focused work, travelled (such as by walking or driving), read, worked on thecomputer, watched media (such as a movie) or ate. We further included a label for special events, such as tying shoes or packing a backpack. This selection of labels was inspired by previous works and includes a subset of activities from [1,2,4]. Figure 3 provides an example of a ground truth labelling of participant 6 over a full day.

Content

The dataset consists of 20 files. Ten files contain the long-term eye movement data of the ten recorded participants of this study. The other ten files describe the corresponding ground truth annotations.

Download

Here you can see an example for each of the two types of files.

Please download the full dataset here(958 MB (1,004,685,360 bytes)).

Long-term eye movement recording of participant 1:

Eye Frame Number	Eye Frame Time	Normal -ised GazeX	Normal -ised GazeY	Normal -ised PupilX	Normal -ised PupilY	Confidence Value	Ellipse Major	Ellipse Minor	Ellipse Angle
₁₂₃	_25591.408231	_0.543449	_0.567080	_0.632791	_0.446285	_0.916569	_62.348351	_52.137829	_125.349686
₁₂₄	_25591.439907	_0.539582	_0.571380	_0.634194	_0.447448	_0.915689	_62.362423	_52.047989	_124.526283
₁₂₅	_25591.477712	_0.533799	_0.579573	_0.636220	_0.449711	_1.000000	_62.647682	_51.846684	_125.090401
₁₂₆	_25591.527972	_0.528315	_0.586214	_0.638157	_0.451481	_1.000000	_62.424488	_51.700821	_125.746971
₁₂₇	_25591.569404	_0.523166	_0.591113	_0.640005	_0.452711	_1.000000	_62.404976	_51.296490	_125.991196
₁₂₈	_25591.627690	_0.519758	_0.593308	_0.641257	_0.453192	_0.902514	_62.444271	_51.495365	_126.859177

Ground truth annotation of participant 1:

Scene Frame Number

Scene Frame Time

Day Time

Activity Label

Activity Description

₆₅₃₉₂₇

_48450.92

_13:27:30

_eating/

_social

_inter-

_action

_having

_lunch/

_talking

_{to a}

_colleague

₀

₁

₀

₁

₀

_...

₈₀₀₆₆₀

_53586.55

_14:53:06

_reading/

_focused

_work

_reading

_paper

₀

₁

₀

₁

₀

Method

So far, however, it remains unclear how much information about daily routines is contained in long-term human visual behaviour, how this information can be extracted, encoded, and modelled efficiently, and how it can be used for unsupervised discovery of human activities. The goal of this work is to shed some light on these questions. We collected a new long-term gaze dataset that contains natural visual behaviour of 10 participants (more than 80 hours in total). The data was collected with a state-of-the-art head-mounted eye tracker that participants wore continuously for a full day of their normal life. We annotated the dataset with eight sample activity classes (outdoor, social interaction, focused work, travel, reading, computer work, watching media, and eating) and periods with no specific activity. The dataset and annotations are publicly available online. We further present an approach for unsupervised activity discovery that combines a bag-of-words visual behaviour representation with a latent Dirichlet allocation (LDA) topic model (see Figure 4). In contrast to previous works, our method is fully unsupervised, i.e. does not require manual annotation of visual behaviour. It also does not only extract information from saccade sequences but learns a more holistic model of visual behaviour from saccades, fixations, and blinks.

The specific contributions of this work are three-fold. First, we present a novel ground truth annotated long-term gaze dataset of natural human visual behaviour continuously recorded using a head-mounted video-based eye tracker in the daily life of 10 participants. Second, we propose an unsupervised method for eye-based discovery of everyday activities that combines a bag-of-words visual behaviour representation with a topic model. To this end we also propose different approaches to efficiently encode saccades, fixations, and blinks for topic modelling. Third, we present an extensive performance evaluation that shows the ability of our method to discover daily activities with performance competitive with that of previously published supervised methods for selected activities.

Figure 4: Input to our method are eye movements detected in the eye video. These movements are first encoded into a string sequence from which a bag-of-words representation is generated. The representation is used to learn a latent Dirichlet allocation (LDA) topic model. Output of the model are the topic activations that can be associated with different activities.

References

[1] Bulling, A., Ward, J. A., Gellersen, H., and Tröster, G. Eye Movement Analysis for Activity Recognition Using Electrooculography. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 4 (2011), 741–753.

[2] Bulling, A., Weichel, C., and Gellersen, H. EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour. In Proc. CHI 2013 (2013), 305–308.

[3] Kassner, M., Patera, W., and Bulling, A. Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction. In Adj. Proc. UbiComp (2014), 1151–1160.

[4] Shiga, Y., Toyama, T., Utsumi, Y., Kise, K., and Dengel, A. Daily Activity Recognition Combining Gaze Motion and Visual Features. In Adj. Proc. UbiComp (2014), 1103–1111.