Big Data: Scalable Analysis of Very Large Datasets

The technical capabilities for data collection as well as the number of available data sources have increased tremendously in recent years, posing unprecedented challenges for information management. The development of the Web 2.0 and social networks, the ubiquity of mobile devices and sensor networks, as well as advances in gathering scientific data contribute significantly to this development. The resulting flood of data is difficult, if not impossible, to manage using traditional tools for data management such as relational databases. On the one hand, the sheer size of the data requires massively parallel processing using hundreds or thousands of computers. On the other hand, novel methods for data analysis are necessary to extract useful information from the raw data.

At Max-Planck-Institut für Informatik, we develop efficient, highly scalable methods and systems for the statistical analysis of such big datasets. For example, internet companies such as Amazon, Google, or Netflix provide their users with personalized recommendations for products, websites, news, movies, or images. From a user's point of view, such personalized recommendations enable targeted exploration of potentially interesting items; from a provider's point of view, customer satisfaction and loyalty increase. The American movie rental company Netflix, for instance, allows its more than 20 million customers to rate movies using a 5-star system. The ratings obtained in this way are used to individually recommend new, not-yet-seen movies to users. Modern recommender systems are based on an approach called "collaborative filtering," i.e., the behavior of many users and user groups is analyzed in order to create recommendations for each individual user. The key challenges that recommender systems need to solve are (1) the modeling and prediction of user interests, and (2) the creation of interesting recommendations on the basis of these predictions.
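To make the idea of collaborative filtering concrete, the following toy sketch predicts one user's missing rating from the ratings of similar users (user-based neighborhood filtering). This is a minimal illustration, not the production-scale method discussed in the text; the rating matrix and all numbers are made up, with np.nan marking unknown ratings.

```python
import numpy as np

# Rows = users, columns = movies, np.nan = unknown rating.
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])

def predict(R, user, movie):
    """Predict a missing rating as a similarity-weighted average over
    the other users who have rated this movie."""
    target = R[user]
    num, den = 0.0, 0.0
    for u in range(R.shape[0]):
        if u == user or np.isnan(R[u, movie]):
            continue
        # Cosine similarity over the movies both users have rated.
        mask = ~np.isnan(target) & ~np.isnan(R[u])
        if not mask.any():
            continue
        a, b = target[mask], R[u][mask]
        sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        num += sim * R[u, movie]
        den += abs(sim)
    return num / den if den else np.nan

print(predict(R, 0, 2))  # predicted rating of movie 2 for user 0
```

In practice, such neighborhood methods are one of several building blocks; the matrix-completion approach described below scales better to very large datasets.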

A successful technique for prediction, which is also used by Netflix in production, models the available movie ratings in the form of an incomplete matrix and subsequently tries to "complete" this matrix. Every row of the matrix corresponds to a user, every column to a movie, and every entry to a rating of the respective movie by the respective user. Figure 1 shows such an incomplete matrix; here black dots correspond to unknown ratings. We have developed parallel algorithms that can complete matrices with millions of users, millions of movies, and billions of entries within a few minutes. Every entry of the output matrix corresponds to a predicted rating. Figure 2 shows a completed matrix computed using our methods.

One option for creating recommendations for each user is simply to recommend the movies with the highest predicted ratings. In practice, however, it is important that recommendations are diverse, i.e., that each user's recommendations consist of sufficiently different movies (e.g., different genres or different directors). The availability of licenses or physical media is also important: if a user accepts a recommendation, then this recommendation should be delivered as fast as possible. Our techniques allow efficient, high-quality personalized recommendations that respect constraints such as the ones above.
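Diversity-aware selection can be sketched as a greedy re-ranking that trades off a movie's predicted rating against its similarity to movies already chosen, in the spirit of maximal marginal relevance. This is an illustrative sketch, not the constrained optimization used in the actual system; the movie names, genre vectors, and trade-off weight are assumptions.

```python
import numpy as np

movies = ["A", "B", "C", "D", "E"]
predicted = np.array([4.8, 4.7, 4.6, 3.9, 3.8])  # predicted ratings
# Genre features; rows align with `movies`.
genres = np.array([
    [1, 0, 0],  # A: action
    [1, 0, 0],  # B: action (nearly identical to A)
    [0, 1, 0],  # C: comedy
    [0, 0, 1],  # D: drama
    [1, 1, 0],  # E: action/comedy
], dtype=float)

def diverse_top_k(predicted, genres, k, lam=2.0):
    """Greedily select k movies, scoring each candidate as its predicted
    rating minus lam times its maximum cosine similarity to any
    already-selected movie."""
    norm = genres / np.linalg.norm(genres, axis=1, keepdims=True)
    sim = norm @ norm.T
    selected, candidates = [], set(range(len(predicted)))
    while len(selected) < k and candidates:
        best = max(candidates, key=lambda i: predicted[i]
                   - (lam * max(sim[i, j] for j in selected) if selected else 0.0))
        selected.append(best)
        candidates.remove(best)
    return selected

picks = diverse_top_k(predicted, genres, k=3)
print([movies[i] for i in picks])  # B is skipped despite its high rating
```

Although movie B has the second-highest predicted rating, it is nearly identical to A and is passed over in favor of the comedy C and the drama D, yielding a more varied list.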

Apart from personalized recommendations, our group works on methods for interactive exploration of large document collections, for the analysis and automated extraction of knowledge from natural language text, and for pattern mining and logical reasoning on the resulting knowledge bases.

About the author:

Contact: rgemulla (at)