Visual attention is highly fragmented during mobile interactions, but the erratic nature of attention shifts currently limits attentive user interfaces to adapting after the fact, i.e. after shifts have already happened. We instead study attention forecasting – the challenging task of predicting users’ gaze behaviour (overt visual attention) in the near future. We present a novel long-term dataset of everyday mobile phone interactions, continuously recorded from 20 participants engaged in common activities on a university campus over 4.5 hours each (more than 90 hours in total). We propose a proof-of-concept method that uses device-integrated sensors and body-worn cameras to encode rich information on device usage and users’ visual scene. We demonstrate that our method can forecast bidirectional attention shifts and predict whether the primary attentional focus is on the handheld mobile device. We study the impact of different feature sets on performance and discuss the significant potential but also remaining challenges of forecasting user attention during mobile interactions.
Given the lack of a suitable dataset for algorithm development and evaluation, we conducted our own data collection. Our goal was to record natural attentive behaviour during everyday interactions with a mobile phone. The authors of [Oulasvirta et al. 2005] leveraged the – at the time – long page loading times during mobile web search to analyse shifts of attention. We followed a similar approach but adapted the recording procedure in several important ways to increase the naturalness of participants’ behaviour and, in turn, the realism of the prediction task. First, as page loading times have significantly decreased over the last 10 years, we instead opted to engage participants in chat sessions during which they had to perform web search tasks as in [Oulasvirta et al. 2005] and then had to wait for the next chat message.
To counter side effects due to learning and anticipation, we varied the waiting time between chat messages and search tasks. Second, we did not perform a fully scripted recording, i.e. participants were not asked to follow a fixed route or perform particular activities in certain locations in the city, they were not accompanied by an experimenter, and the recording was not limited to about one hour. Instead, we observed participants passively over several hours while they interacted with the mobile phone during their normal activities on a university campus. For our study we recruited twenty participants (six females), aged between 22 and 31 years, using university mailing lists and study board postings. Participants were students with different backgrounds and subjects. All had normal or corrected-to-normal vision.
The recording system consisted of a PUPIL head-mounted eye tracker [Kassner et al. 2014] with an additional stereo camera, a mobile phone, and a recording laptop carried in a backpack. The eye tracker featured one eye camera with a resolution of 640×480 pixels recording a video of the right eye from close proximity with 30 frames per second, and a scene camera with a resolution of 1280×720 pixels recording at 24 frames per second. The original lens of the scene camera was replaced with a fisheye lens with a 175◦ field of view. The eye tracker was connected to the laptop via USB. In addition, we mounted a DUO3D MLX stereo camera to the eye tracker headset. The stereo camera recorded a depth video with a resolution of 752×480 pixels at 30 frames per second as well as head movements using its integrated accelerometer and gyroscope. Intrinsic parameters of the scene camera were calibrated beforehand using the fisheye distortion model from OpenCV. The extrinsic parameters between the scene camera and the stereo camera were also calibrated. The laptop ran the recording software and stored the timestamped egocentric, stereo, and eye videos. Given the necessity to root the phone to record touch events and application usage, similar to [Oulasvirta et al. 2005] we opted to provide a mobile phone on which all necessary data collection software was pre-installed and validated to run robustly. For participants to “feel at home” on the phone, we encouraged them to install any additional software they desired and to fully customise the phone to their needs prior to the recording. Usage logs confirmed that participants indeed used a wide variety of applications, ranging from chat software, to the browser, mobile games, and maps. To robustly detect the phone in the egocentric video and thus help with the ground-truth annotation, we attached visual markers to all four corners of the phone. We used WhatsApp to converse with the participants and to log accurate timestamps for these conversations [Church and De Oliveira 2013]. Participants were free to save additional numbers from important contacts, but no one transferred their whole WhatsApp account to the study phone. We used the Log Everything logging software to log phone inertial data and touch events [Weber and Mayer 2014], and the Trust Event Logger to log the current active application as well as whether the mobile phone screen was turned on or off.
After arriving in the lab, participants were first informed about the purpose of the study and asked to sign a consent form. We did not reveal which parts of the recording would be analysed later so as not to influence their behaviour. Participants could then familiarise themselves with the recording system and customise the mobile phone, e.g. install their favourite apps, log in to social media platforms, etc. Afterwards, we calibrated the eye tracker using the calibration procedure implemented in the PUPIL software [Kassner et al. 2014]. The calibration involved participants standing still and following a physical marker that was moved in front of them to cover their whole field of view.
To obtain some data from similar places on the university campus, we asked participants to visit three places at least once (a canteen, a library, and a café) and to not stay in any self-chosen place for more than 30 minutes. Participants were further asked to stop the recording after about one and a half hours so we could change the laptop’s battery pack and re-calibrate the eye tracker. Otherwise, participants were free to roam the campus, meet people, eat, or work as they normally would during a day at the university. We encouraged them to log in to Facebook, check emails, play games, and use all pre-installed applications on the phone or install new ones. Participants were also encouraged to use their own laptop, desktop computer, or music player if desired.
As illustrated in this figure, 12 chat blocks (CB) were distributed randomly over the whole recording. Each block consisted of a conversation via WhatsApp during which the experimental assistant asked the participant six random questions (Q1–Q6) out of a pool of 72 questions. Some questions could be answered with a quick online search, such as “How many states are members of the European Union?” or “How long is the Golden Gate Bridge?”. Similar to Oulasvirta et al. we also asked simple demographic questions like “What is the colour of your eyes?” or “What is your profession?” that could be answered without an online search. After each answer (A1–A6), participants had to wait for the next question. This waiting time was varied randomly between 10, 15, 20, 30, and 45 seconds by the experimental assistant. This was to avoid learning effects and to create a similar situation as in [Oulasvirta et al. 2005]. This question-answering procedure was repeated until the sixth answer had been received, thus splitting each chat block into six working time segments (yellow) and five waiting time segments (red). At the end of the recording, participants returned to the lab and completed a questionnaire about demographics and their mobile phone usage behaviour. In total, we recorded 1440 working and 1200 waiting segments over all participants.
Fixations were detected from the raw gaze data using a dispersion-based algorithm with a duration threshold of 150ms and an angular threshold of 1° [Kassner et al. 2014]. The 3D position of the mobile phone in the scene camera was estimated using visual markers. The position of the mobile phone surface was logged if at least two markers were visible in the scene camera. However, we only used the mobile phone detection as an aid for the ground truth annotation.
Classifier training requires precise annotations of when an attention shift takes place and how long an attention span lasts. Findlay and Gilchrist showed that in real-world settings, covert attention rarely deviates from the gaze location [Findlay and Gilchrist 2003]. Thus, we leveraged gaze as a reliable indicator of the user’s current attentional focus. Annotations were performed using videos extracted from the monocular egocentric video for the working/waiting time segments overlaid with gaze data provided by the eye tracker. Three annotators were asked to annotate each chat block with information on participants’ current environment (office, corridor, library, street, canteen, café), whether they were indoors or outdoors, their mode of locomotion (sitting, standing or walking), as well as when their attention shifted from the mobile device to the environment or back.
The dataset consists of a .zip file with three files per participant for each of the three recording blocks (RB). Each recording block file is saved as a .pkl file which can be read with python using pandas. The data scheme of the 213 coulmns is given in this README.txt file.
Please download the full MPIIMobileAttention dataset here (2.4 GB).