Study guide

Text Mining

Course
2024-2025

Admission requirements

Assumed prior knowledge

A Bachelor's degree in AI or Computer Science is recommended for this course, as is experience with programming in Python.

Description

Text mining is one of the core application areas of Natural Language Processing (NLP). Key text mining tasks are text classification, information extraction, and sentiment analysis. NLP is a fast-developing field that is grounded in fundamental models for text representation. It has attracted much attention from researchers in other fields and from the general public, especially recently with the increasing power of large language models (LLMs). This course gives an overview of the field from both a theoretical angle (underlying models) and a practical angle (applications, challenges with data). In addition to attending the lectures, students work on practical assignments.

Outline:

Fundamentals:
week 1. Introduction
week 2. Text processing
week 3. Vector Semantics

Models and methods:
week 4. Text categorization
week 5. Data collection and annotation
week 6. Transformer models and transfer learning
week 7. Generative large language models

Applications:
week 8. Information Extraction
week 9. Topic Modelling & Text summarization
week 10. Sentiment analysis & stance detection
week 11. Industrial Text Mining (guest lecture)

week 12. Conclusions

Course objectives

After successful completion of this course, students are able to:

  • explain the theoretical underpinnings and implementation of text representation models, both word-based and embedding-based

  • explain, both at the conceptual and the technical level, natural language processing (NLP) methods for sequence processing, in particular transformer models

  • build models for a text mining task using machine learning algorithms and text data

  • evaluate and report on the given data and on the developed models and methods

  • reason, from a theoretical perspective, about which models are applicable in which situations

  • discuss real-world challenges that prevent the application of certain techniques, such as domain specificity, language variation, noise in the data, and trustworthiness of models

  • explain and apply evaluation metrics and data quality metrics

Timetable

The most recent timetable can be found at the Computer Science (MSc) student website.

In MyTimetable, you can find all course and programme schedules, allowing you to create your personal timetable. Activities for which you have enrolled via MyStudyMap will automatically appear in your timetable.

Additionally, you can easily link MyTimetable to a calendar app on your phone, and schedule changes will be automatically updated in your calendar. You can also choose to receive email notifications about schedule changes. You can enable notifications in Settings after logging in.

Questions? Watch the video, read the instructions, or contact the ISSC helpdesk.

Note: Joint Degree students from Leiden/Delft need to combine information from both the Leiden and Delft MyTimetables to see a complete schedule. This video explains how to do it.

Mode of instruction

Lectures, literature, practical tutorials, assignments.

Assessment method

  • a written individual exam, closed book – 50%

  • practical assignments (in groups) – 50%

    • two assignments during the course – 10% each
    • one more substantial assignment at the end of the course – 30%

To complete the course, the grade for the written exam must be 5.5 or higher, and the weighted average grade for the practical assignments must also be 5.5 or higher. The exam has a regular written re-sit opportunity. If an assignment is not submitted, the grade for that assignment is 0. Each assignment has a re-sit opportunity (a later submission); the maximum grade for a re-sit assignment is 6.
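The grading rules above can be illustrated with a minimal Python sketch. It is not an official grade calculator: the function and constant names are hypothetical, the final grade is assumed to be the plain weighted sum of the listed components, and no rounding is applied because the course description does not specify any.

```python
from typing import Optional, Sequence, Tuple

# Weights in percent, taken from the assessment list above.
EXAM_WEIGHT = 50
ASSIGNMENT_WEIGHTS = (10, 10, 30)  # two smaller assignments, one larger final one
PASS_MARK = 5.5
RESIT_CAP = 6.0


def assignment_grade(grade: Optional[float], is_resit: bool = False) -> float:
    """Return the grade that counts for one assignment.

    None means the assignment was not submitted, which counts as 0.
    A re-sit (later submission) is capped at 6.
    """
    if grade is None:
        return 0.0
    return min(grade, RESIT_CAP) if is_resit else grade


def assignments_average(grades: Sequence[float]) -> float:
    """Weighted average of the three assignment grades."""
    weighted = sum(w * g for w, g in zip(ASSIGNMENT_WEIGHTS, grades))
    return weighted / sum(ASSIGNMENT_WEIGHTS)


def course_result(exam: float, grades: Sequence[float]) -> Tuple[float, bool]:
    """Return (final grade, passed).

    Assumption: the final grade is the weighted sum of all components.
    Passing requires both the exam and the assignment average to be >= 5.5.
    """
    average = assignments_average(grades)
    total_weight = EXAM_WEIGHT + sum(ASSIGNMENT_WEIGHTS)
    final = (EXAM_WEIGHT * exam + sum(ASSIGNMENT_WEIGHTS) * average) / total_weight
    passed = exam >= PASS_MARK and average >= PASS_MARK
    return final, passed


if __name__ == "__main__":
    grades = [assignment_grade(7.0), assignment_grade(8.0), assignment_grade(6.0)]
    final, passed = course_result(exam=6.0, grades=grades)
    print(f"final grade: {final:.2f}, passed: {passed}")
```

Under these assumptions, a strong assignment average does not compensate for an exam grade below 5.5, and vice versa, because both thresholds are checked separately.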

Group work is an integral part of the course. You will be expected to complete the assignments together with a teammate.

The teacher will inform the students how the inspection and follow-up discussion of the exams will take place.

Reading list

The literature will be distributed on Brightspace. The majority of the chapters come from this book, which is freely available online: Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed.), February 2024, https://web.stanford.edu/~jurafsky/slp3/

Registration

As a student, you are responsible for enrolling on time through MyStudyMap.

In this short video, you can see step-by-step how to enrol for courses in MyStudyMap.
Extensive information about the operation of MyStudyMap can be found here.

There are two enrolment periods per year:

  • Enrolment for the fall opens in July

  • Enrolment for the spring opens in December

See this page for more information about deadlines and enrolling for courses and exams.

Note:

  • It is mandatory to enrol for all activities of a course that you are going to follow.

  • Your enrolment is only complete when you submit your course planning in the ‘Ready for enrolment’ tab by clicking ‘Send’.

  • Not being enrolled for an exam/resit means that you are not allowed to participate in the exam/resit.

Contact

Lecturer: Prof. dr. S. Verberne
Website: Course website

Remarks

Due to limited capacity, external students can only register after consultation with the programme coordinator/study adviser: mastercs@liacs.leidenuniv.nl.

Software
Starting from the 2024/2025 academic year, the Faculty of Science will use the software distribution platform Academic Software. Through this platform, you can access the software needed for specific courses in your studies. For some software, your laptop must meet certain system requirements, which will be specified with the software. It is important to install the software before the start of the course. More information about the laptop requirements can be found on the student website.