# Statistical Learning

Course
2023-2024

## Admission requirements

1. Familiarity with least squares linear regression
2. Ability to program in R (preferred) or in Python
3. Basic knowledge of university-level probability theory, calculus, and linear algebra

## Description

Statistical learning refers to a vast set of tools for understanding data. These techniques are used in a wide range of industries and research fields. They have, for example, been used for: product and movie recommendations, predicting disease status, identifying fraudulent bank transactions, and identifying genes associated with specific diseases. This course provides a basis for understanding statistical learning techniques and teaches the skills to apply and evaluate them.

The course will cover both supervised and unsupervised learning methods.
Supervised statistical learning involves building a model for predicting an outcome (response, dependent) variable based on one or more input (predictor) variables. The supervised learning methods discussed will include classical and state-of-the-art classification methods: regularized regression (Ridge, Lasso), naive Bayes, linear and quadratic discriminant analysis, decision trees, support vector machines, generalized additive models, random forests and gradient boosting. We explain the interrelations between these methods and analyze their behaviour. We will also discuss model selection, where we consider both classical and state-of-the-art methods, including cross-validation.
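
As a rough illustration of how cross-validation can guide model selection for a regularized method, the following Python sketch picks the ridge penalty for a single predictor in a model without an intercept. The simulated data, the penalty grid, and the function names (`ridge_slope`, `cv_mse`, `select_lambda`) are assumptions made for this example and are not taken from the course material.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for one predictor and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_mse(xs, ys, lam, k=5):
    """Estimate the mean squared prediction error by k-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Every k-th observation, starting at index `fold`, is held out.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        beta = ridge_slope(train_x, train_y, lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in test_idx)
    return total / n


def select_lambda(xs, ys, grid, k=5):
    """Return the penalty in `grid` with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_mse(xs, ys, lam, k))


# Simulated data (hypothetical): true slope 2, standard normal noise.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(50)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = select_lambda(xs, ys, grid)
beta_hat = ridge_slope(xs, ys, best_lam)
print("selected penalty:", best_lam, "estimated slope:", beta_hat)
```

A penalty of 0 reproduces the least squares slope, and larger penalties shrink the estimate towards zero. In practice one would typically rely on an established R or Python package rather than hand-written code like this.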

In unsupervised statistical learning, there are only input variables and no supervising outcome (dependent) variable; nevertheless, we can learn relationships and structures from such data. We will consider methods for clustering (i.e., the classic k-means and hierarchical clustering) and dimension reduction methods such as principal component analysis (PCA).
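
To give an idea of what learning structure without an outcome variable can look like, here is a minimal Python sketch of k-means clustering using Lloyd's algorithm (alternating between assigning points to the nearest centroid and recomputing centroids). The toy points, the random initialization from the data, and the function names are assumptions made for this example only.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct observations chosen at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Assignments can no longer change.
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The result of k-means depends on the initial centroids, so real analyses usually run the algorithm from several random starts and keep the solution with the smallest within-cluster sum of squares.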

## Course Objectives

After the course, the student can:

• Explain the key concepts and techniques of supervised and unsupervised learning methods.

• Reason about the relative strengths and weaknesses of different statistical learning methods and their resulting suitability for real-world data problems.

• Select appropriate models and performance metrics for a given statistical learning task.

• Create an experiment to select the optimal model parameters.

• Apply the chosen model to the dataset and evaluate its performance using appropriate metrics.

• Evaluate the importance of features and their relationships to the outcome by interpreting the model.

## Timetable

You will find the timetables for all courses and degree programmes of Leiden University in the tool MyTimetable. Any teaching activities that you have successfully registered for in MyStudyMap will automatically be displayed in MyTimetable. Any timetables that you add manually will be saved and automatically displayed the next time you sign in.
MyTimetable allows you to integrate your timetable with calendar apps such as Outlook, Google Calendar, and Apple Calendar on your smartphone. Any timetable changes will be synced automatically with your calendar. If you wish, you can also receive an email notification of the change. You can turn notifications on in 'Settings' (after login).
For more information, watch the video or go to the help page in MyTimetable. Please note: Joint Degree students Leiden/Delft have to merge their two different timetables into one. This video explains how to do this.

## Mode of Instruction

Lectures and computer practicals. We will use Brightspace to share all course material.

## Assessment Method

The final grade is based on three components, each with a weight of 1/3:
1. a written structured assignment (individual, halfway through the course)
2. a written structured assignment (individual, at the end of the course)
3. an oral presentation on the analysis of a data set of the students' own choice (in groups, at the end of the course)

Students receive feedback on the assignments and the oral presentation during the lecture.

## Reading list

1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: with applications in R. New York: Springer. A free copy and tutorials are available online.
2. Beaujean, A. A. (2014). Latent variable modeling using R. A step by step guide. New York: Routledge.

1. Berk, R. A. (2008). Statistical learning from a regression perspective. Springer. (a PDF is available via Leiden University Library)
2. Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer. (a PDF is available via Leiden University Library)
3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd edition). Springer. (available for free at https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
4. Bishop, C. M. (2006). Pattern recognition and machine learning (1st edition). Springer.
5. Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.

## Registration

It is the responsibility of every student to register for courses with the new enrollment tool MyStudyMap. There are two registration periods per year: registration for the fall semester opens in July and registration for the spring semester opens in December. Please see this page for more information.

Please note that it is compulsory to both preregister and confirm your participation for every exam and retake. Not being registered for a course means that you are not allowed to participate in the final exam of the course. Confirming your exam participation is possible until ten days before the exam.
Extensive FAQs on MyStudyMap can be found here.

## Contact

j.d.karch@fsw.leidenuniv.nl