Prospectus

Text Mining

Course
2020-2021

Admission requirements

Assumed prior knowledge

A Bachelor's degree in Computer Science is recommended for this course, as well as experience with programming in Python.

Description

Text mining, also known as 'knowledge discovery from text', is a research and development field that has gained increasing attention over the last decade, attracting researchers from data science, natural language processing, and machine learning. Key applications include text categorization, information extraction, social media mining, and automatic summarization. This course gives an overview of the field from both a theoretical angle (underlying models) and a practical angle (applications, challenges with data). In addition to attending the lectures, students work on practical assignments.

Outline:
week 1. introduction
week 2. text processing
week 3. vector semantics
week 4. text categorization
week 5. data collection and annotation
week 6. neural NLP and transfer learning
week 7. information extraction
week 8. text summarization
week 9. opinion mining and sentiment analysis
week 10. biomedical text mining
week 11. authorship attribution
week 12. industrial text mining (guest lecture)
week 13. conclusions

Course objectives

After successful completion of this course, students have an understanding, both at the conceptual and the technical level, of the application of natural language processing (NLP) in the text mining area. Students can build models for a text mining task using machine learning algorithms and language data, and they can evaluate and report on the developed models and modules. Also, students understand, from a theoretical perspective, which tools are applicable in which situations, and which real-world challenges prevent the application of certain techniques (such as language variation and noise due to document processing errors).

Timetable

The most recent timetable can be found on the students' website.

Mode of instruction

Lectures.

Course load

Total hours of study: 168 hrs.

lectures: 26 hrs
literature reading: 26 hrs
studying for exam: 30 hrs
examination: 6 hrs
practical exercises: 40 hrs
assignments: 40 hrs

Assessment method

  • a written exam (50% of course grade)

  • practical assignments (50% of course grade)

    • three assignments (10% each) during the course
    • one more substantial assignment (20%) at the end of the course

To complete the course, the grade for the written exam must be 5.5 or higher, and the average grade for the practical assignments must also be 5.5 or higher. If one of the tasks is not submitted, the grade for that task is 0.
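
The weighting and pass conditions above can be illustrated with a minimal Python sketch. It is not an official grade calculator. The function name `course_grade` and its parameters are hypothetical, and it assumes that the "average grade for the practical assignments" is the weighted average (10%, 10%, 10%, 20%) and that grades are on a 10-point scale; the course page does not state either explicitly.

```python
def course_grade(exam, small_assignments, final_assignment):
    """Return (course grade, completed) under the stated weighting.

    exam: grade for the written exam
    small_assignments: list of three grades; None means not submitted
    final_assignment: grade for the larger assignment; None means not submitted
    """
    if len(small_assignments) != 3:
        raise ValueError("expected exactly three small assignments")

    # A task that is not submitted is graded 0.
    small = [0.0 if g is None else g for g in small_assignments]
    final = 0.0 if final_assignment is None else final_assignment

    # Assumption: the practical average is weighted, 10% for each small
    # assignment and 20% for the final one, normalised by their 50% total.
    practical_avg = (0.10 * sum(small) + 0.20 * final) / 0.50

    # The exam and the practical part each count for 50% of the course grade.
    grade = 0.50 * exam + 0.50 * practical_avg

    # Both parts must reach 5.5 separately.
    completed = exam >= 5.5 and practical_avg >= 5.5
    return grade, completed


if __name__ == "__main__":
    # One small assignment was not submitted and counts as 0.
    print(course_grade(7.0, [8.0, 6.0, None], 7.0))
    # High assignment grades do not compensate for an exam below 5.5.
    print(course_grade(5.0, [9.0, 9.0, 9.0], 9.0))
```

Under these assumptions, an exam grade of 5.0 combined with a 9.0 on every assignment gives a weighted course grade of 7.0, yet the course is not completed, because the exam grade is below 5.5.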

The teacher will inform the students how the exam inspection and the follow-up discussion will take place.

Reading list

  • 6 chapters from: Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft, October 2019). https://web.stanford.edu/~jurafsky/slp3/

  • 4 chapters from: ChengXiang Zhai and Sean Massung. Text Data Management and Analysis. A Practical Introduction to Information Retrieval and Text Mining (2016).

  • Additional literature.

Registration

  • You have to sign up for courses and exams (including retakes) in uSis. Check this link for information about how to register for courses.

  • Due to limited capacity, external students can only register after consultation with the programme coordinator/study adviser (mastercs@liacs.leidenuniv.nl).

Contact

Lecturer: dr. S. Verberne
Website: Course website

Remarks