It is preferable, but not required, that students taking this course have the following background:
• Data mining (Arno Knobbe)
• Statistics (Matthijs van Leeuwen)
• Willingness to learn to program in Python (we advise students from programmes other than Computer Science to take an online Python course, e.g. https://www.codecademy.com/learn/python or https://www.coursera.org/specializations/python, before entering this course)
The course language is English.
Data Science places data mining, machine learning and statistics in context, both experimentally and socially. To deploy data mining techniques correctly, you must be able to translate a (broadly formulated) question from a customer or a co-worker into an experimental set-up, choose the right methods, and process the data into the form those methods require. After performing your experiments, you should be able not only to evaluate the results but also to interpret them and translate them back to the original question (e.g. by visualization). Socially, data science is of great importance because the media simplify many data-driven results and statistical studies, often making mistakes in the process. As a result, a lot of nonsense comes our way, and it is up to you, the data scientists of the future, to recognize, explain and correct that nonsense. This course is a combination of lectures and practical sessions, in which you take a hands-on approach to solving real-world data science problems.
• You know and can explain the benefits and challenges of big data.
• You know and can explain the following experimental principles in your own words: bias, overfitting, cross-validation, sparseness, dimensionality reduction, class imbalance.
• You know and can explain the use and importance of measuring the quality and reliability of human-labeled data.
• You can give the definitions of the most important evaluation measures: Accuracy, Mean Squared Error, Precision, Recall, F1 and Mean Average Precision.
• You can recognize, explain and correct statistical nonsense in the media.
• After completing the course, you can independently take the steps to set up and execute an experiment within data science, given a (broadly formulated) question:
• Task definition: You can create a clear definition of a task based on a general description of a task, consisting of (a) the research question, (b) whether the task is supervised or unsupervised, (c) whether it is a classification, regression or ranking task (or something else), (d) what the data are and (e) what the labels are;
• Data collection: If the data required to answer the question are not given, you can define what data you need and how to collect them. If you need explicit labels, you can set up a data annotation task for human raters;
• Data exploration: You can collect and visualize statistics about the data. You can calculate and interpret the inter-annotator agreement for annotated data.
• Pre-processing (and feature extraction): You can write a Python script that reads and processes the data, converting raw data into a format that can be processed by supervised and/or unsupervised machine learning models.
• Model learning: You can apply unsupervised and supervised models to your data. You can choose the right model for the given type of data. You can perform a feature analysis. You can generate output for unseen data.
• Evaluation: You can correctly set up your model evaluation with a train/test split and, if necessary, cross-validation. You can evaluate your output against human data. You know which evaluation measures you should use given the type of data and model. You can do a sensible error analysis.
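As an illustration of the evaluation measures and the inter-annotator agreement named in the objectives above, the following is a minimal plain-Python sketch. It is not course material: all function names are hypothetical, Cohen's kappa is chosen here as one common agreement measure for two annotators (the course may use another), and average precision is normalized by the number of relevant items that appear in the ranking, which assumes every relevant item was retrieved.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(ranked_relevance):
    """AP for one ranked list of 0/1 relevance judgements.

    Assumption: all relevant items occur somewhere in the list.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    if expected == 1.0:  # both annotators always used the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

On this toy data, 6 of the 8 labels agree, and each annotator uses both labels equally often, so the agreement expected by chance is 0.5. Kappa therefore ends up well below the raw agreement, which is the reason to report it alongside plain accuracy for human-labeled data.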
Mode of instruction
• 14 lectures of 2 x 45 minutes
• 1st 45 minutes: lecture
• 2nd 45 minutes: practical session (working on a data science problem in small groups)
Assessment method, including grading
The assessment of the course consists of a written exam (60% of the course grade) and practical assignments in small groups (40% of the course grade). The practical assignments comprise four small tasks (5% each) and one more substantial report (20%). To complete the course, both the grade for the written exam and the average grade for the practical assignments must be 5.5 or higher. If one of the tasks is not submitted, the grade for that task is 0.
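The weighting above can be worked through with a short Python sketch. It is an illustration only, not an official grade calculator: the function name is hypothetical, the "average grade for the practical assignments" is assumed to be the average weighted by each component's share (5% per task, 20% for the report), and no rounding is applied because the text does not specify any.

```python
EXAM_WEIGHT = 60    # percent of the course grade
TASK_WEIGHT = 5     # percent, for each of the four small tasks
REPORT_WEIGHT = 20  # percent, for the more substantial report
PASS_MARK = 5.5


def course_grade(exam, tasks, report):
    """Return (final_grade, practical_average, passed).

    `tasks` holds four grades; use None for a task that was not submitted.
    """
    if len(tasks) != 4:
        raise ValueError("expected exactly four task grades")

    # A task that is not submitted is graded 0.
    task_grades = [0.0 if g is None else g for g in tasks]

    practical_points = TASK_WEIGHT * sum(task_grades) + REPORT_WEIGHT * report
    practical_weight = TASK_WEIGHT * len(task_grades) + REPORT_WEIGHT  # 40

    # Assumption: the practical average is weighted by component share.
    practical_average = practical_points / practical_weight
    final_grade = (EXAM_WEIGHT * exam + practical_points) / 100

    # Both the exam and the practical average must reach the pass mark.
    passed = exam >= PASS_MARK and practical_average >= PASS_MARK
    return final_grade, practical_average, passed


# Invented example: one small task was not submitted.
print(course_grade(exam=7.0, tasks=[8.0, 7.0, None, 6.0], report=7.5))
```

In this invented example the missing task counts as 0, yet the practical average stays above 5.5 because the report carries half of the practical weight.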
To be announced during the course
Signing up for classes and exams
There is limited space for students who are not enrolled in the BSc programme of Computer Science or the Minor Data Science. Please contact the study coordinator/study adviser.
Please also register for this course in Blackboard.