Scalable Analysis of Very Large Datasets

Rainer Gemulla & Christina Teflioudi

Scalable Analysis of Very Large Datasets

The technical capabilities for data collection as well as the number of available data sources have increased tremendously in recent years, imposing new, unprecedented challenges to information management. The development of the Web 2.0 and social networks, the ubiquity of mobile devices and sensor networks, and the advances in gathering scientific data have contributed signifi cantly to this development. The resulting fl ood of data is difficult, if not impossible, to manage using traditional tools for data management such as relational databases. On the one hand, the sheer size of the data requires massively-parallel processing, using hundreds or thousands of computers. On the other hand, novel methods for data analysis are necessary to extract useful information from the raw data.At the Max Planck Institute for Informatics, we develop efficient, highly scalable methods and systems for the statistical analysis of such big datasets. Internet companies such as as Amazon, Google, or Netfl ix, for example, analyze data about users and their behavior to provide personalized recommendations for products, websites, news, movies, or images. From a user’s point a view, such personalized recommendations enable the targeted exploration of potentially interesting items; from a provider’s point of view, customer satisfaction and loyalty increase. The American movie rental company Netflix, for instance, allows its more than 20 million customers to rate movies using a 5-star system. The so obtained ratings are used to individually recommend new, not-yet-seen movies to users. Modern recommender systems are based on an approach called “collaborative filtering”, i.e., the behavior of many users and user groups is analyzed in order to create recommendations for each individual user. The key challenges that recommender systems need to solve are (1) the modeling and prediction of user interests, and (2) the creation of interesting recommendations on the basis of these predictions.

Figure 1: Known incomplete matrix

A successful technique for prediction, which is also used by Netfl ix in production, models the available movie ratings in the form of an incomplete matrix and subsequently tries to “complete” this matrix. Every row of the matrix corresponds to a user, every column to a movie, and every entry to a rating of the respective movie by the respective user. Figure 1 visualizes such an incomplete matrix; here, black dots correspond to unknown values. We have developed parallel algorithms that can complete incomplete matrices with millions of users, millions of movies, and billions of entries within a couple of minutes. Every entry of the output matrix corresponds to a predicted rating. Figure 2 shows a completed matrix computed using our methods.

One option to create recommendations for each user is to recommend the movies with the highest predicted ratings. This is a non-trivial task, however, because the completed matrix is usually too large to be constructed and processed in reasonable time. We have developed algorithms that effi ciently extract only good recommendations for each user, without computing all predicted ratings. Our methods are orders of magnitudes faster than the naive approach of fully computing the completed matrix.

Figure 2: Automatically completed matrix

In practice, we may not want to simply recommend the movies with the highest predicted rating. For example, it is important that recommendations are diverse, i.e., that each use’s recommendations consist of suffi ciently different movies (e.g., different genres or different directors). The availability of licenses or physical media is also crucial: if a user accepts a recommendation, then this recommendation should be delivered as fast as possible. Our techniques allow effi cient, high-quality personalized recommendations that respect constraints such as those mentioned above.

Apart from personalized recommendations, our group also works on methods for the interactive exploration of large document collections, for the analysis and automated extraction of knowledge from natural language text, and for pattern mining on sequential data.

Christina Teflioudi

DEPT. 5 Databases and Information Systems
Phone
+49 681 9325-5029
Email chtefl io mpi-inf.mpg.de

Rainer Gemulla

University of Mannheim, Mannheim, Germany
Email rgemulla uni-mannheim.de