Admission requirements
Assumed prior knowledge
A Bachelor in AI or Computer Science is recommended for this course, as well as experience with programming in Python.
Description
Text mining is one of the core application areas of Natural Language Processing (NLP). Key text mining tasks are text classification, information extraction, and sentiment analysis. NLP is a fast-developing field that is grounded in fundamental models for text representation. It has attracted much attention from researchers in other fields and from the general public, especially recently with the increasing power of large language models (LLMs). This course gives an overview of the field from both a theoretical angle (underlying models) and a practical angle (applications, challenges with data). In addition to attending the lectures, students work on practical assignments.
Outline:
Fundamentals:
week 1. Introduction
week 2. Text processing
week 3. Vector Semantics
Models and methods:
week 4. Text categorization
week 5. Data collection and annotation
week 6. Transformer models and transfer learning
week 7. Generative large language models
Applications:
week 8. Information Extraction
week 9. Topic Modelling & Text summarization
week 10. Sentiment analysis & stance detection
week 11. Industrial Text Mining (guest lecture)
week 12. Conclusions
Course objectives
After successful completion of this course, students are able to:
explain the theoretical underpinnings and implementation of text representation models, both word based and embeddings based
explain, both at the conceptual and the technical level, natural language processing (NLP) methods for sequence processing, in particular transformer models
build models for a text mining task using machine learning algorithms and text data
evaluate and report on the given data and on the developed models and methods
reason, from a theoretical perspective, which models are applicable in which situations
discuss which real-world challenges prevent the application of certain techniques, such as domain specificity, language variation, noise in the data, and trustworthiness of models
explain and apply evaluation metrics and data quality metrics
Timetable
The most recent timetable can be found at the Computer Science (MSc) student website.
You will find the timetables for all courses and degree programmes of Leiden University in the tool MyTimetable (login). Any teaching activities that you have successfully registered for in MyStudyMap will automatically be displayed in MyTimetable. Any timetables that you add manually will be saved and automatically displayed the next time you sign in.
MyTimetable allows you to integrate your timetable with your calendar apps such as Outlook, Google Calendar, Apple Calendar and other calendar apps on your smartphone. Any timetable changes will be automatically synced with your calendar. If you wish, you can also receive an email notification of the change. You can turn notifications on in ‘Settings’ (after login).
For more information, watch the video or go to the 'help-page' in MyTimetable. Please note: Joint Degree students Leiden/Delft have to merge their two different timetables into one. This video explains how to do this.
Mode of instruction
Lectures, literature, practical tutorials, assignments.
Assessment method
a written individual exam, closed book – 50%
practical assignments (in groups) – 50%
- two assignments during the course – 10% each
- one more substantial assignment at the end of the course – 30%
The grade for the written exam must be 5.5 or higher in order to complete the course. The exam has a regular written re-sit opportunity. The weighted average grade for the practical assignments must also be 5.5 or higher in order to complete the course. If one of the assignments is not submitted, the grade for that assignment is 0. Each assignment has a re-sit opportunity (a later submission); the maximum grade for a re-sit assignment is 6.
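As an illustration of the arithmetic above, the following minimal Python sketch combines the stated weights (exam 50%, assignments 10%, 10% and 30%) and checks both 5.5 thresholds. The function and constant names are hypothetical, grades are left unrounded, and the official final grade, including any rounding and the result recorded when a threshold is not met, is determined by the lecturer, not by this sketch.

```python
from typing import Optional, Sequence, Tuple

# Weights in percent, taken from the assessment description.
EXAM_WEIGHT = 50
ASSIGNMENT_WEIGHTS = (10, 10, 30)  # two smaller assignments, one larger
PASS_MARK = 5.5                    # minimum for exam and for practical average
RESIT_CAP = 6.0                    # maximum grade for a re-sit assignment


def assignment_grade(grade: Optional[float], is_resit: bool = False) -> float:
    """Return the grade that counts: 0 if not submitted, capped at 6 for a re-sit."""
    if grade is None:
        return 0.0
    return min(grade, RESIT_CAP) if is_resit else grade


def practical_average(grades: Sequence[float]) -> float:
    """Weighted average of the three assignment grades."""
    total = sum(w * g for w, g in zip(ASSIGNMENT_WEIGHTS, grades))
    return total / sum(ASSIGNMENT_WEIGHTS)


def course_result(exam: float, grades: Sequence[float]) -> Tuple[float, bool]:
    """Return (unrounded weighted grade, whether both thresholds are met)."""
    practical = practical_average(grades)
    passed = exam >= PASS_MARK and practical >= PASS_MARK
    weighted = (EXAM_WEIGHT * exam + sum(ASSIGNMENT_WEIGHTS) * practical) / 100
    return weighted, passed


if __name__ == "__main__":
    # Exam 7.0; assignments 8.0, 6.0 and 7.0.
    print(course_result(7.0, [8.0, 6.0, 7.0]))
```

Note that a high weighted grade does not compensate for a failed component: with an exam grade of 5.0, the `passed` flag is false regardless of the assignment grades.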
Group work is an integral part of the course. You will be expected to complete the assignments together with a teammate.
The teacher will inform the students how the inspection and follow-up discussion of the exams will take place.
Reading list
The literature will be distributed on Brightspace. The majority of the chapters come from this book (free, online): Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed), February 2024 https://web.stanford.edu/~jurafsky/slp3/
Registration
From the academic year 2022-2023 onwards, every student has to register for courses with the new enrollment tool MyStudyMap. There are two registration periods per year: registration for the fall semester opens in July and registration for the spring semester opens in December. Please see this page for more information.
Please note that it is compulsory to both preregister and confirm your participation for every exam and retake. Not being registered for a course means that you are not allowed to participate in the final exam of the course. Confirming your exam participation is possible until ten days before the exam.
Extensive FAQs on MyStudyMap can be found here.
Contact
Lecturer: Prof. dr. S. Verberne
Website: Course website
Remarks
Due to limited capacity, external students can only register after consultation with the programme coordinator/study adviser (mastercs@liacs.leidenuniv.nl).