Elementary knowledge of data structures (sparse matrices, hash tables, graphs), statistics (binomial distribution) and combinatorics (permutations, combinations).
Traditional data mining techniques focus mainly on classification, regression and clustering problems. However, recent developments in ICT have led to the emergence of new kinds of massive data sets and related data mining problems. Consequently, the field of data mining has rapidly expanded to cover new areas of research, such as:
• processing huge (tera- or petabyte-scale) data sets,
• fast search for similar objects (documents, images, songs, routes, etc.) in collections of millions or billions of such objects,
• clustering of massive data sets,
• real-time analysis of data streams (internet traffic, sensor data, electronic transactions),
• recommending items to visitors of internet shops,
• analyzing big (network) graphs, such as web sites, social networks, collaboration networks, etc.
During the course we will focus on these areas. We will start by introducing a powerful framework for processing massive data sets on distributed computers: Hadoop and MapReduce. Then a new, very general similarity search technique, Locality Sensitive Hashing, will be discussed, together with its applications to plagiarism detection, searching databases of fingerprints, finding clients with similar buying behavior, etc. Next, several algorithms for real-time mining of data streams will be introduced: Bloom filters, random sampling, counting, and estimating moments. Finally, some state-of-the-art recommendation systems will be discussed in depth. The practical part of the course will consist of several programming assignments (in Python) and writing reports.
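To give a flavour of the stream-mining algorithms listed above, here is a minimal sketch of a Bloom filter in Python. It is an illustration only, not course material: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the use of salted SHA-256 digests as hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: set membership with possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes are hypothetical defaults chosen for the demo.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed (an assumption;
        # any family of independent hash functions would do).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True if all bits are set: the item was probably seen.
        # False means the item was definitely never added.
        return all(self.bits[pos] for pos in self._positions(item))


# Example stream of (made-up) keys.
stream = [f"user-{i}" for i in range(100)]
seen = BloomFilter()
for key in stream:
    seen.add(key)
```

A call such as `seen.might_contain("user-7")` returns `True` for every key that was added. For a key that was never added it usually returns `False`, but it can return `True` with a small probability that depends on the filter size, the number of hash functions and the number of stored items.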
After completing the course, the students should:
• have a general knowledge of the recent developments in the field of Data Mining
• have detailed knowledge of selected techniques and their applications
• gain some hands-on experience with several algorithms for mining complex data sets
• be able to apply the acquired knowledge and skills to new problems
• gain some experience with mining big data sets on a cluster computer
The most recent timetable can be found on the students' website.
Mode of instruction
- Computer Lab
- Practical assignments
- Self-evaluated homework
The final mark is composed of:
(1) written exam (40%)
(2) practical assignment (60%)
A. Rajaraman, J. Leskovec, J. Ullman, Mining of Massive Datasets
You have to sign up for classes and examinations (including resits) in uSis. Check this link for more information and activity codes.
Lecturer: Dr. Wojtek Kowalczyk