Any course on Data Mining or Machine Learning (at Bachelor's level); programming experience in C/C++ or Python (in a Linux environment).
In recent years we have witnessed a rapidly growing gap between the amount of collected data and the data processing capabilities of conventional computers. This is not surprising: according to Moore's Law, the processing power of an "average computer" doubles every 18 months, while, according to Lyman and Varian from Berkeley, the amount of stored data doubles every 12 months. In addition to this growing gap, there is an increasing need to analyze data more quickly, more precisely, and more "intelligently". Beyond the traditional data mining tasks of classification, regression, and clustering, new challenges have emerged that require completely new algorithms:
- analysis of big networks: the web, social networks (Facebook, Twitter), traffic and financial networks
- recommender systems: Amazon, Netflix
- digital forensics: analysis of data related to cybercrime
- search advertising on the web and dynamic auctions
- scientific data mining (bioinformatics, astronomy, physics)
- analysis of sensor data
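To give a flavor of the recommender-systems topic listed above, here is a minimal item-based collaborative-filtering sketch. The toy ratings matrix and the function names are illustrative only, not taken from the seminar materials:

```python
# Minimal item-based collaborative filtering on a toy ratings matrix.
# The data and all names here are illustrative examples.
import math

# rows = users, columns = items; 0 means "not rated"
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
]

def cosine(a, b):
    """Cosine similarity of two item columns, ignoring unrated (0) entries."""
    dot = sum(x * y for x, y in zip(a, b) if x and y)
    na = math.sqrt(sum(x * x for x in a if x))
    nb = math.sqrt(sum(y * y for y in b if y))
    return dot / (na * nb) if na and nb else 0.0

def predict(user, item):
    """Predict a rating as a similarity-weighted average of the user's other ratings."""
    col = [row[item] for row in ratings]
    num = den = 0.0
    for j, r in enumerate(ratings[user]):
        if r and j != item:
            s = cosine(col, [row[j] for row in ratings])
            num += s * r
            den += s
    return num / den if den else 0.0

print(round(predict(0, 2), 2))  # predicted rating of item 2 for user 0
```

Real recommender systems (as in the Netflix Prize setting) work on millions of users and items, which is precisely why they need the distributed platforms discussed below.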
To cope with this overwhelming data flow, several hardware and software platforms for distributed data mining, together with specialized data mining algorithms, have been developed. This year we will focus on three topics: Recommender Systems, Plagiarism Detection, and the evaluation of several existing systems for Distributed Data Mining (Hadoop and MapReduce, Mahout, GraphLab).
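As a flavor of the MapReduce programming model underlying Hadoop, here is a single-machine sketch of the classic word-count job. The map/shuffle/reduce phases mirror the model conceptually; this is not Hadoop's actual Java API:

```python
# Single-machine sketch of the MapReduce word-count job.
from collections import defaultdict

def map_phase(doc_id, text):
    # emit a (word, 1) pair for every word in the document
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    # group all values by key, as the framework does between the two phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(word, counts):
    # combine all counts for one word
    yield word, sum(counts)

docs = {1: "big data big networks", 2: "big clusters"}
mapped = [pair for d, t in docs.items() for pair in map_phase(d, t)]
result = dict(kv for k, vs in shuffle(mapped) for kv in reduce_phase(k, vs))
print(result)  # {'big': 3, 'data': 1, 'networks': 1, 'clusters': 1}
```

In a real Hadoop cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network; the programmer still only writes the two small functions.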
During the seminar, students should:
- gain some insight into the most recent developments around mining “big data”,
- gain some hands-on experience with distributed data mining by experimenting with our cluster computer,
- gain detailed knowledge of some state-of-the-art techniques used in distributed data mining,
- identify some promising research directions.
The seminar will consist of weekly meetings during which selected papers will be presented and discussed. Additionally, students (organized in small teams) will experiment with the Hadoop, MapReduce, Mahout, and GraphLab systems, applying them to benchmark problems. The results of these experiments must be documented in reports.
The most recent timetable can be found on the LIACS website.
Mode of instruction
- Weekly presentations and discussions
- An experimental research project
The grade will be based on 3 components:
- presentations (30%)
- software developed during the seminar (30%)
- final report (40%)
- A. Rajaraman, J. Leskovec, and J. Ullman, Mining of Massive Datasets
- Additional articles will be distributed during the first meeting.
You have to sign up for classes and examinations (including resits) in uSis. Check this link for more information and activity codes.
There is limited capacity for students from outside the master's programme in Computer Science. Please contact the study advisor.
Study coordinator Computer Science, Riet Derogee