Binary Factorizations in Data Mining

Block seminar, 7 ECTS credits, winter semester 2016–17

Basic information

Type: block seminar
Lecturer:Dr. Pauli Miettinen
Credits: 7 ECTS credits
Registration: The registration opens on Monday, 17 October, at 9 am CET. Information on how to register is posted below when the registration starts.
Kick-off meeting: The kick-off meeting will take place on Thursday, 27 October, 14:15–16:00 in room 024 at the MPI-INF building (E 1.4). Attending the kick-off meeting is mandatory.
Seminar days: The seminar will take place on 25 and 26 January.
Location: On the 25th, we will be in room 016 (building E1.3) and on the 26th, we will be in room 630 (building E.5)
Prerequisites: The seminar has not strict prerequisites, but the attendants are expected to have knowledge on data analysis or machine learning, for example, from courses Information Retrieval and Data Mining, Machine Learning, Elements of Statistical Learning, or equivalent. Basic knowledge of linear algebra and computational complexity will be helpful.

News

The seminar rooms are 016 (E1.3) and 630 (E1.5) for the 25th and 26th of January, respectively.
The seminar days are 25 and 26 January; the schedule has been updated to include all of the DLs.
The kick-off slides are available.

Registration

To register to the seminar, send email to the lecturer containing at least three (3) papers you would like to present, in the order of preference. The email should also contain student's full name and matriculation number. The places to the seminar are handed based on the availability of the listed papers in first-come-first-serve basis, but in case of conflicts, a priority is given to the students with stronger background information. All students who have been accepted to the course must attend the kick-off meeting or their spot will be given away.

Content

Example decomposition of a binary matrix into two binary component matrices

Binary matrices can be used to represent many types of data: graphs, binary relations, market basket data, set systems, and presence/absence data, to name a few examples. The common methods employed to analyze these matrices can be divided into two broad categories: either the matrix interpretation of the data is emphasized, and methods deriving from standard linear algebra (e.g. spectral methods, SVD, or NMF) are employed, or the combinatorial structure (e.g. graph) of the data is emphasized, and corresponding combinatorial methods are used.

Binary factorizations merge these two approaches: the binary matrix is decomposed into (almost) binary factor matrices. On one hand, the factorization can be seen as a variant of the standard matrix factorization, with similar interpretation of the factor matrices as some kinds of bases of smaller-dimensional subspace, while on the other hand, the constraint to (almost) binary factor matrices changes the methods to more combinatorial-like, and preserves the interpretation of the original combinatorial structure.

This dual nature has made binary factorizations increasingly popular data mining methods in the recent years. In this seminar we will study the recent literature on binary factorizations in data analysis: we will cover important methods and successful applications in data mining, machine learning, and bioinformatics.

Format and prerequisites

This is a block seminar. The seminar will take place over two full days in Mid-January 2017, with mandatory participation in both days. In addition, there is a mandatory kick-off meeting at the begin of the semester. To pass the seminar, the participants must give a presentation and hand in a short report on their topic. The reports and preliminary versions of the slides have to be handed in to the lecturer during the semester, and the reports will be distributed to the other attendants. The grading is based on the reports, the presentation, the student's knowledge on the subject (as evidenced in the discussion after the presentation), and his/her activity in the discussions.

The students taking this seminar are expected to know data analysis (data mining or machine learning), for example, by having taken courses such as Information Retrieval and Data Mining, Machine Learning, or Elements of Statistical Learning. Some knowledge of linear algebra and computational complexity is also recommended.

Schedule

Date	Time	Topic	Location
27 October	14:15–16:00	Kick-off meeting	Room 024, building E1.4 (MPI-INF)
27 November	23:59 CET	Written report first draft DL
9 December	15:59 CET	Slides first draft DL
17 January	23:59 CET	Written report hand-in DL
25 January		Seminar	Room 016, building E1.3
26 January		Seminar	Room 630, building E1.5

Papers

Below is the list of papers. Papers 1–6 present different methods for doing Boolean matrix factorization (or a part thereof), papers 7–12 present different variations, and papers 13 and 14 present applications of binary factorizations to bioinformatics and machine learning, respectively.

When registering, each student must list at least three papers they would want to present, in the order of preference.

The PDFs of the papers are behind a username and password. Please contact the lecturer to obtain these.

A. Ene et al., 2008. Fast Exact and Heuristic Methods for Role Minimization Problems. In ACM SACMAT '08, pp. 1–10. [PDF]
P. Miettinen et al., 2008. The Discrete Basis Problem. IEEE Trans. Knowl. Data En. 20(10), pp. 1348–1362. [PDF]
B.-H. Shen, S. Ji & J. Ye, 2009. Mining Discrete Patterns via Binary Matrix Factorization. In ACM KDD '09, pp. 757–765. [PDF]
C. Lucchese, S. Orlando & R. Perego, 2013. A Unifying Framework for Mining Approximate Top-k Binary Patterns. IEEE Trans. Knowl. Data En. 26(12), pp. 2900–2913. [PDF]
S. Karaev, P. Miettinen & J. Vreeken, 2015. Getting to Know the Unknown Unknowns: Destructive-Noise Resistant Boolean Matrix Factorization. In SIAM SDM '15, pp. 325–333. [PDF]
R. Belohlavek & M. Trnecka, 2015. From-below approximations in Boolean matrix factorization: Geometry and new algorithms. J. Comput. Syst. Sci. 81(8), pp. 1678–1697. [PDF]
A. I. Schein, L. K. Saul & L. H. Ungar, 2003. A Generalized Linear Model for Principal Component Analysis of Binary Data. In AISTATS '03. [PDF]
Z.-Y. Zhang et al., 2010. Binary Matrix factorization for analyzing gene expression data. Data Min. Knowl. Disc. 20, pp. 28–52. [PDF]
M. Slawski, M. Hein & P. Lutsik, 2013. Matrix factorization with Binary Components. In NIPS '13. [PDF]
R. Belohlavek & M. Krmelova, 2013. Beyond Boolean Matrix Decompositions: Toward Factor Analysis and Dimensionality Reduction of Ordinal Data. In ICDM '13, pp. 961–966. [PDF]
S. Maurus & C. Plant, 2014. Ternary Matrix Factorization. In ICDM '14, pp. 400–409. [PDF]
S. Ravanbakhsh, B. Póczos & R. Greiner, 2016. Boolean Matrix Factorization and Noisy Completion via Message Passing. In ICML '16. [PDF]
Y. Xiang et al., 2012. Merging network patterns: a general framework to summarize biomedical network data. Netw. Model. Anal. Health Inform. Bioinforma. 1(3), pp. 103–116. [PDF]
G. Van den Broeck & A. Darwiche, 2013. On the Complexity and Approximation of Binary Evidence in Lifted Inference. In NIPS '13. [PDF]