Data Mining and Matrices

Advanced lecture, 6 ECTS credits, summer semester 2017

Organization

Staff

Dr. Pauli Miettinen

Sanjar Karaev
Saskia Metzler

Time & Location

  • The lectures will take place on Mondays between 14:00 and 16:00 in room 029, building E1.5 (MPI-SWS) starting 24 April
  • The tutorial groups will take place on Wednesdays between 10:00 and 14:00 in room 024, building E1.4 (MPI-INF) starting 3 May

Course contents

A matrix of temperature readings and a heat map of the Earth

Graphs, relations, and sets of measurements over scalar variables form a significant part of the data types modern data analysis considers. All these different data types can be expressed as matrices, and matrix decomposition methods – originally developed for applications in linear algebra – are nowadays standard tools in any data analyst's toolbox. 
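As a small illustration of how a graph can be expressed as a matrix, the sketch below builds the adjacency matrix of a toy undirected graph. The edge list, the vertex names, and the helper `adjacency_matrix` are hypothetical and chosen only for this example. The course itself uses R; Python is used here purely for illustration.

```python
# Hypothetical toy graph: undirected edges between four named vertices.
edges = [("a", "b"), ("a", "c"), ("c", "d")]


def adjacency_matrix(edges):
    """Return (vertices, matrix); matrix[i][j] is 1 if vertices i and j share an edge."""
    # Fix an ordering of the vertices so that each one gets a row/column index.
    vertices = sorted({v for edge in edges for v in edge})
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # undirected graph: keep the matrix symmetric
    return vertices, matrix


vertices, A = adjacency_matrix(edges)
for name, row in zip(vertices, A):
    print(name, row)
```

Relations and sets of measurements can be laid out in the same way, with rows and columns indexing the related objects, or the observations and the variables.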

The term ‘matrix decomposition’ covers a multitude of different techniques that are applicable to all stages of the knowledge discovery process, from pre-processing to visualization and analysis of the results. The main applications, nonetheless, are in data mining, where various decomposition methods are used to find regularities and patterns in a wide variety of data types.

While the matrix decomposition methods are predominantly based on linear algebra, the heterogeneous data they are applied to has required data analysts to develop matrix decomposition methods that are based on approaches not common in (or even applicable to) conventional linear algebra. In this course, we will cover matrix decomposition methods that stem from standard linear algebra, as well as methods that deviate from it. We will concentrate on established methods, but at the end, we will also cover more recent advancements.

The tentative list of contents for the course is:

  • Singular value decomposition (SVD) and principal component analysis (PCA)
  • Nonnegative matrix factorisation (NMF)
  • Column and column-row (CX and CUR) decompositions
  • Independent component analysis (ICA)
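For orientation, the listed decompositions are commonly written as follows for a data matrix $A$. This is a sketch in standard textbook notation; the symbols are assumptions and the lectures may use different ones.

```latex
\begin{align*}
  \text{SVD:} \quad & A = U \Sigma V^{\top},
      && U, V \text{ with orthonormal columns},\ \Sigma \text{ diagonal and nonnegative} \\
  \text{NMF:} \quad & A \approx W H,
      && W \ge 0,\ H \ge 0 \text{ entrywise} \\
  \text{CX:} \quad & A \approx C X,
      && C \text{ consists of columns of } A \\
  \text{CUR:} \quad & A \approx C U R,
      && C \text{ consists of columns and } R \text{ of rows of } A \\
  \text{ICA:} \quad & A = M S,
      && \text{the rows of } S \text{ are statistically independent sources, } M \text{ is a mixing matrix}
\end{align*}
```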

Learning objectives

Goals

The course aims to teach, on the one hand, the necessary theory behind the decomposition methods, and on the other hand, the practical applications of the decompositions to real-world data analysis tasks. After the course, the students should know how the covered decompositions work, when they work, and why they work (or not). The students should also be able to apply the methods to real-world data analysis problems, and to analyse and present their findings.

To achieve its goals, the course consists of fortnightly homework assignments that develop the theoretical apparatus and understanding, and of three hands-on data analysis assignments, where the goal is to learn to apply the decompositions to real-world problems.

Homework assignments

The homework assignments are handed out every second week, and the corresponding tutorial is held a week later. Your presence is required for most homework tutorial sessions (approx. 4–6 times during the course). During the tutorial sessions, the students present their solutions to the homework assignments, see other students' solutions, and discuss the solutions with the tutor.

Hands-on data analysis assignments

During the course, three hands-on data analysis assignments are handed out. The students will have approximately three weeks to complete the assignments. The assignments will involve implementing some data analysis methods, pre-processing the data to a suitable format, running the analysis, analysing the results, and writing a short report on the findings. The analysis will be done with the R statistical software. The tutorial sessions in every second week concentrate on the hands-on assignments, allowing the students to discuss with the tutor any problems they might have with the assignment, and to get feedback on their earlier assignments. At the beginning of the course, there will be one tutorial session covering the basics of the R language.

Prerequisites

The students are expected to know basic linear algebra (e.g. from the course Linear Algebra I or similar knowledge) and basic data analysis (e.g. from the courses Information Retrieval and Data Mining, Machine Learning, Elements of Statistical Analysis, Topics in Algorithmic Data Analysis, or similar knowledge). Students are strongly recommended to refresh their linear algebra knowledge before the course starts.

No prior knowledge of the R language or software is required.