Data Mining and Matrices

Advanced lecture, 6 ECTS credits, summer semester 2017

Organization

Staff

Dr. Pauli Miettinen

Sanjar Karaev
Saskia Metzler

Time & Location

  • The lectures will take place on Mondays between 14:00 and 16:00 in room 029, building E1.5 (MPI-SWS) starting 24 April
  • The tutorial groups will take place on Wednesdays between 10:00 and 12:00 in room 024, building E1.4 (MPI-INF) starting 3 May
  • The final exam will take place on 24 July between 14:00 and 16:00 in lecture hall 001, building E1.3. More information below.

News

  • Grades are in HISPOS
  • Re-exam will take place on 27 September between 10:00 and 12:00 in hall 002, building E1.3
  • Re-exam inspection will take place in the Rotunda of D5 (E1.4, 4th floor) on Friday, 29 Sept, between 14:00 and 14:30

Lecture slides

  1. Introduction [PDF]
  2. R-Tutorial [PDF, R-script]
  3. SVD & PCA [PDF]
  4. Preprocessing & computing the SVD [PDF]
  5. Optimization [PDF]
  6. Introduction to NMF [PDF]
  7. Variations and applications of NMF [PDF]
  8. CX and CUR decompositions [PDF]
  9. Introduction to ICA [PDF]
  10. Algorithms for ICA [PDF]
  11. Spectral clustering [PDF]
  12. Finding planted patterns [PDF]
  13. Wrap-up [PDF]

Problem sheets

The sample solutions to the problems can be found here. You will need a username and password if you are connecting from outside the university network.

  1. Prerequisites [PDF], tutorial on 10 May 
  2. SVD [PDF], tutorial on 24 May (NB: from 16:15 in room 021, building E1.4)
  3. Optimization [PDF], tutorial on 7 June
  4. NMF [PDF], tutorial on 21 June
  5. CX, CUR, and ICA [PDF], tutorial on 5 July
  6. ICA and spectral methods [PDF], tutorial on 19 July

Hands-On Analysis Assignments

Access to the data is restricted to the university network. Please email us if you need the password for accessing it from outside (we will also mention it in the next tutorial/lecture).

  1. SVD and preprocessing [PDF|ZIP], due 28 May
  2. NMF [PDF|ZIP], due 18 June
  3. CX and ICA [PDF|ZIP], due 9 July

The slides from the tutorials with general feedback are in the password-protected area.

Exam

The final exam will be held on 24 July 2017 between 14:00 and 16:00 in lecture hall 001, building E1.3. Please note that the times are sharp!

You must bring writing equipment and your Student ID card with you. In addition, you can bring one (1) A4-sized sheet of text ("cheat sheet") that must have your name written on it. You are not allowed to use any electronic devices, or any notes or material other than the aforementioned cheat sheet. You can find more information regarding the exam and the cheat sheet in these slides.

The exam inspection will take place on 26 July 2017 between 10:15 and 11:45 in room 024, building E1.4. This is your only chance for exam inspection, so please be present.

If you want to take the re-exam, you must inform the lecturer before 1 August. The time and place of the re-exam will be announced later. Note that bonus points do not apply for the re-exam.

Course contents

[Figure: a matrix of temperature readings and a heat map of the Earth]

Graphs, relations, and sets of measurements over scalar variables form a significant part of the data types modern data analysis considers. All these different data types can be expressed as matrices, and matrix decomposition methods – originally developed for applications in linear algebra – are nowadays standard tools in any data analyst's toolbox. 
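As a small illustration of how such data can be written as matrices, the following sketch builds the adjacency matrix of an undirected graph and the binary matrix of a relation. It uses plain Python for illustration only (the course itself works in R), and the function names `adjacency_matrix` and `relation_matrix` as well as the example data are made up for this sketch.

```python
def adjacency_matrix(n_nodes, edges):
    """Return the n_nodes x n_nodes 0/1 adjacency matrix of an undirected graph.

    `edges` is a list of (u, v) pairs with node ids in 0..n_nodes-1.
    """
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1  # undirected graph: the matrix is symmetric
    return A


def relation_matrix(pairs, rows, cols):
    """Return the 0/1 matrix of a binary relation between `rows` and `cols`.

    Entry (i, j) is 1 exactly when (rows[i], cols[j]) is in `pairs`.
    """
    row_index = {r: i for i, r in enumerate(rows)}
    col_index = {c: j for j, c in enumerate(cols)}
    M = [[0] * len(cols) for _ in rows]
    for r, c in pairs:
        M[row_index[r]][col_index[c]] = 1
    return M


# Hypothetical example data: a path graph on four nodes and a
# "student attends course" relation.
path_graph = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
attends = relation_matrix(
    [("ann", "ml"), ("bob", "ml"), ("bob", "ir")],
    rows=["ann", "bob"],
    cols=["ml", "ir"],
)
```

A set of measurements over scalar variables is already a matrix in this sense: one row per observation and one column per variable.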

The term ‘matrix decomposition’ covers a multitude of different techniques that are applicable to all stages of the knowledge discovery process, from pre-processing to visualization and analysis of the results. The main applications, nonetheless, are in data mining, where various decomposition methods are used to find regularities and patterns from a wide variety of data types. 

While the matrix decomposition methods are predominantly based on linear algebra, the heterogeneous data they are applied to has required data analysts to develop matrix decomposition methods that are based on approaches not common in (or even applicable to) conventional linear algebra. In this course, we will cover matrix decomposition methods that stem from standard linear algebra, as well as methods that deviate from it. We will concentrate on established methods, but at the end, we will also cover more recent advancements.

The tentative list of contents for the course is:

  • Singular value decomposition (SVD) and principal component analysis (PCA)
  • Nonnegative matrix factorisation (NMF)
  • Column and column-row (CX and CUR) decompositions
  • Independent component analysis (ICA)

Learning objectives

Goals

The course aims to teach, on the one hand, the necessary theory behind the decomposition methods, and on the other hand, the practical applications of the decompositions to real-world data analysis tasks. After the course, the students should know how the covered decompositions work, when they work, and why they work (or not). The students should also be able to apply the methods to real-world data analysis problems, and analyse and present their findings.

To achieve its goals, the course consists of fortnightly homework assignments that develop the theoretical apparatus and understanding, and of three hands-on data analysis assignments, where the goal is to learn to apply the decompositions to real-world problems.

Homework assignments

The homework assignments are handed out every second week, and the corresponding tutorial is held a week after. Your presence is required for most homework tutorial sessions (approx. 4–6 times during the course). During the tutorial sessions, the students present their solutions to the homework assignments, see other students' solutions, and discuss the solutions with the tutor. 

Hands-on data analysis assignments

During the course, three hands-on data analysis assignments are handed out. The students will have approximately three weeks to complete each assignment. The assignments will involve implementing some data analysis methods, pre-processing the data to a suitable format, running the analysis, analysing the results, and writing a short report on the findings. The analysis will be done with the R statistical software. The tutorial sessions in every second week concentrate on the hands-on assignments, allowing the students to discuss any problems they might have with the assignment with the tutor, and to get feedback on their earlier assignments. At the beginning of the course, there will be one tutorial session covering the basics of the R language.

Prerequisites

The students are expected to know basic linear algebra (e.g. from the course Linear Algebra I or similar knowledge) and basic data analysis (e.g. from the courses Information Retrieval and Data Mining, Machine Learning, Elements of Statistical Analysis, Topics in Algorithmic Data Analysis, or similar knowledge). Students are strongly recommended to refresh their linear algebra knowledge before the course starts.

No prior knowledge of the R language or software is required.