Information Retrieval and Text Analytics, 2019-2020 - Studiegids

Admission Requirements

Recommended prior knowledge

Elementary knowledge of machine learning, probability theory (Bayes’ theorem, probability calculus), linear algebra (vector spaces), and data structures (hash tables) is recommended.

Description

Search engines, the internet, and cheap, powerful hardware have drastically changed the way humans deal with information. Whereas thirty years ago librarians were still classifying books and articles using subject codes, nowadays search technology has become pervasive on desktop computers and mobile devices. This course covers both the theory and practice of the field of Information Retrieval and Text Analytics, restricted to textual content (the courses 4343AUDIO and 4343MMIRL focus on audiovisual content). This course can be taken in combination with 4343TXTMN in the fall semester.
The course covers the following aspects:
1. How can we formalize search for information and how can we evaluate search systems?
2. Which document features (e.g. term statistics) could be used to associate a ‘meaning’ to a text?
3. How can we extend the notion of relevance by looking at context and learning from interaction?
4. How can these elements be combined to classify a text or to perform relevance ranking in order to build a search engine?
5. Which data structures and techniques are essential for computational efficiency?
6. Advanced topics such as personalization, recommender systems, learning to rank, and responsible information retrieval.

Course Objectives

By the end of the course, the student should have a thorough understanding of:

- the foundations of information retrieval models
- the pros and cons of various query processing techniques
- efficient data structures and complexity of search and indexing algorithms
- technologies and relevance models for web search
- evaluation methods for IR systems
- text clustering and categorization applications
- language models and topic models
- reviewing a scientific information retrieval publication

In addition, the student should have some practical experience with text processing and/or information retrieval experiments.

Timetable

The most recent timetable can be found at the students' website.

Mode of instruction

Online lectures (2h / week), supported by discussion boards on Brightspace.
Homework (weekly): becoming better acquainted with the new lecture material through small exercises.
Group assignments:
- Applying lecture concepts in a real-world practical data science challenge (presentation and report)
- Critical review of a recent IR research paper (presentation and report)

Course load

Total hours of study: 168 hrs.
Lectures: 26 hrs.
Practical work: 40 hrs.
Weekly assignments: 39 hrs.
Examination: 14 hrs.
Reading course materials: 33 hrs.
Critical review: 16 hrs.

Assessment method

The course grade will be computed as follows:

Homework (weekly exercises) – 10%
Practical assignment – 20%
Critical review – 10%
Final online written exam (closed book) – 60%
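
As an illustration of the weighting above, the following minimal Python sketch computes a weighted course grade. The function name `course_grade`, the component keys, and the example component grades are hypothetical. The sketch assumes that all components are graded on the same scale, and it does not model rounding or minimum-grade rules, which this guide does not specify.

```python
# Weights taken from the assessment list above; they sum to 1.0.
WEIGHTS = {
    "homework": 0.10,
    "practical": 0.20,
    "review": 0.10,
    "exam": 0.60,
}


def course_grade(grades):
    """Return the weighted average of the component grades.

    `grades` maps each key in WEIGHTS to a grade on a common scale.
    Rounding and pass/fail thresholds are not modeled here.
    """
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing component grades: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


# Hypothetical component grades, assumed to be on a 1-10 scale.
example = {"homework": 8.0, "practical": 7.0, "review": 9.0, "exam": 6.5}
print(round(course_grade(example), 2))
```

Because the exam carries 60% of the weight, a one-point change in the exam grade shifts the result by 0.6 points, six times as much as the same change in the homework or critical review grade.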

Reading list

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval, 2008, Cambridge University Press. Online version available from the authors.
Additional reading assignments may be added as the course progresses, and will be made available through Brightspace.

Registration

You have to sign up for courses and exams (including retakes) in uSis. Check this link for information about how to register for courses.
Please also register for the course in Brightspace.

Contact information

Lecturers:
Prof. dr. ir. Wessel Kraaij
Skype for Business / MS Teams: kraaijw@vuw.leidenuniv.nl
Dr. Cor Veenman
Skype for Business / MS Teams: veenmancj@vuw.leidenuniv.nl