Studiegids


Text Mining

Course
2024-2025

Admission requirements

Assumed prior knowledge

A Bachelor in AI or Computer Science is recommended for this course, as well as experience with programming in Python.

Description

Text mining is one of the core application areas of Natural Language Processing (NLP). Key text mining tasks are text classification, information extraction, and sentiment analysis. NLP is a fast-developing field that is grounded in fundamental models for text representation. It has attracted much attention from researchers in other fields and from the general public, especially in recent years with the increasing power of large language models (LLMs). This course gives an overview of the field from both a theoretical angle (underlying models) and a practical angle (applications, challenges with data). In addition to the lectures, students work on practical assignments.

Outline:

Fundamentals:
week 1. Introduction
week 2. Text processing
week 3. Vector Semantics

Models and methods:
week 4. Text categorization
week 5. Data collection and annotation
week 6. Transformer models and transfer learning
week 7. Generative large language models

Applications:
week 8. Information Extraction
week 9. Topic modelling & text summarization
week 10. Sentiment analysis & stance detection
week 11. Industrial Text Mining (guest lecture)

week 12. Conclusions

Course objectives

After successful completion of this course, students are able to:

  • explain the theoretical underpinnings and implementation of text representation models, both word-based and embedding-based

  • explain, both at the conceptual and the technical level, natural language processing (NLP) methods for sequence processing, in particular transformer models

  • build models for a text mining task using machine learning algorithms and text data

  • evaluate and report on the given data and on the developed models and methods

  • reason, from a theoretical perspective, which models are applicable in which situations

  • discuss which real-world challenges prevent the application of certain techniques, such as domain specificity, language variation, noise in the data, and trustworthiness of models

  • explain and apply evaluation metrics and data quality metrics

Timetable

The most recent timetable can be found at the Computer Science (MSc) student website.

You will find the timetables for all courses and degree programmes of Leiden University in the tool MyTimetable (login). Any teaching activities that you have successfully registered for in MyStudyMap will automatically be displayed in MyTimetable. Any timetables that you add manually will be saved and automatically displayed the next time you sign in.

MyTimetable allows you to integrate your timetable with calendar apps on your smartphone, such as Outlook, Google Calendar, and Apple Calendar. Any timetable changes will be automatically synced with your calendar. If you wish, you can also receive an email notification of the change. You can turn notifications on in ‘Settings’ (after login).

For more information, watch the video or go to the 'help-page' in MyTimetable. Please note: Joint Degree students Leiden/Delft have to merge their two different timetables into one. This video explains how to do this.

Mode of instruction

Lectures, literature, practical tutorials, assignments.

Assessment method

  • a written individual exam, closed book – 50%

  • practical assignments (in groups) – 50%

    • two assignments during the course – 10% each
    • one more substantial assignment at the end of the course – 30%

The grade for the written exam should be 5.5 or higher in order to complete the course. The exam has a regular written re-sit opportunity. The weighted average grade for the practical assignments should also be 5.5 or higher in order to complete the course. If one of the assignments is not submitted, the grade for that assignment is 0. Each assignment has a re-sit opportunity (a later submission); the maximum grade for a re-sit assignment is 6.
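The grading rules above can be illustrated with a small Python sketch. It is not an official grade calculator: the function and constant names are hypothetical, and it assumes that the overall grade is the plain weighted sum of the exam and assignment grades, without rounding, which the rules above do not state explicitly.

```python
from typing import Optional, Sequence, Tuple

# Weights in percent, taken from the assessment list above.
EXAM_WEIGHT = 50
ASSIGNMENT_WEIGHTS = (10, 10, 30)  # two smaller assignments, one final assignment
PASS_THRESHOLD = 5.5
RESIT_CAP = 6.0  # maximum grade for a re-submitted assignment


def assignment_grade(grade: Optional[float], resit: bool = False) -> float:
    """Apply the per-assignment rules: not submitted counts as 0, a re-sit is capped at 6."""
    if grade is None:
        return 0.0
    return min(grade, RESIT_CAP) if resit else grade


def course_result(exam: float, assignments: Sequence[float]) -> Tuple[float, float, bool]:
    """Return (practical average, overall grade, passed) for already-adjusted grades."""
    if len(assignments) != len(ASSIGNMENT_WEIGHTS):
        raise ValueError("expected one grade per assignment")
    weighted = sum(w * g for w, g in zip(ASSIGNMENT_WEIGHTS, assignments))
    practical_avg = weighted / sum(ASSIGNMENT_WEIGHTS)
    # Assumption: the overall grade is the plain weighted sum, without rounding.
    overall = (EXAM_WEIGHT * exam + weighted) / 100
    # Both the exam and the practical average must reach 5.5.
    passed = exam >= PASS_THRESHOLD and practical_avg >= PASS_THRESHOLD
    return practical_avg, overall, passed


# Hypothetical student: exam 6.0, two submitted assignments, final assignment missing.
grades = [assignment_grade(7.0), assignment_grade(8.0), assignment_grade(None)]
print(course_result(6.0, grades))
```

Because the final assignment carries 30 of the 50 assignment points, leaving it unsubmitted pulls the practical average well below 5.5 in this example, even though the exam grade itself is sufficient.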

Group work is an integral part of the course. You will be expected to complete the assignments together with a teammate.

The teacher will inform the students how the exam inspection and the follow-up discussion will take place.

Reading list

The literature will be distributed on Brightspace. The majority of the chapters come from this book (free, online): Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed), February 2024 https://web.stanford.edu/~jurafsky/slp3/

Registration

From the academic year 2022-2023 on, every student has to register for courses with the new enrollment tool MyStudyMap. There are two registration periods per year: registration for the fall semester opens in July and registration for the spring semester opens in December. Please see this page for more information.

Please note that it is compulsory to both preregister and confirm your participation for every exam and retake. Not being registered for a course means that you are not allowed to participate in the final exam of the course. Confirming your exam participation is possible until ten days before the exam.

Extensive FAQs on MyStudyMap can be found here.

Contact

Lecturer: Prof. dr. S. Verberne
Website: Course website

Remarks

Due to limited capacity, external students can only register after consultation with the programme coordinator/study adviser (mastercs@liacs.leidenuniv.nl).