Project

Corpus Annotation from Scratch

31 completions
~29 hours
Rating: 4.3

Build an annotated corpus from scratch, extracting morphological features and selected types of named entities with spaCy. Then run a statistical analysis of the corpus using the pandas library.

Provided by

JetBrains Academy

About

Any serious NLP experiment requires data processing. In most cases, you use ready-made data, but sometimes you need to compile a corpus yourself. Depending on the task, you may also need certain information about your text: part-of-speech tags, named entities, statistical characteristics, and so on. In this project, you will learn how to convert raw text into a corpus for further research.

Training project

This project allows you to practice and strengthen your coding skills, helping you get ready for more advanced tasks ahead.

What you'll learn

Once you choose a project, we'll provide you with a study plan that includes all the necessary topics from your course to get it built. Here’s what awaits you:
Remove unnecessary titles, links, and punctuation marks, and implement morphological annotation of the text using spaCy.
Find selected types of named entities in the text and save the corpus to a pandas DataFrame.
Compute the required statistics: the number of occurrences of a given POS tag or named entity type in the corpus, and the correlation between a particular POS tag and a named entity type.
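
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's reference solution: the helper `clean_text`, the column names `token`, `pos`, and `ent_type`, and the sample rows are all assumptions made for the example. The rows are written by hand as stand-ins for the values that spaCy's `token.pos_` and `token.ent_type_` attributes would supply in a real pipeline, so the snippet runs without a language model installed.

```python
import re
import string

import pandas as pd


def clean_text(raw: str) -> str:
    """Hypothetical helper: drop links and punctuation, collapse whitespace."""
    no_links = re.sub(r"https?://\S+", " ", raw)
    no_punct = no_links.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


# Hand-written (token, POS tag, entity type) rows; illustrative data only.
# In the real project these values would come from a spaCy pipeline.
rows = [
    ("Alice", "PROPN", "PERSON"),
    ("visited", "VERB", ""),
    ("Paris", "PROPN", "GPE"),
    ("in", "ADP", ""),
    ("May", "PROPN", "DATE"),
    ("and", "CCONJ", ""),
    ("met", "VERB", ""),
    ("Bob", "PROPN", "PERSON"),
]
corpus = pd.DataFrame(rows, columns=["token", "pos", "ent_type"])

# Count how often one POS tag and one entity type occur in the corpus.
propn_count = int((corpus["pos"] == "PROPN").sum())
person_count = int((corpus["ent_type"] == "PERSON").sum())

# Encode both features as 0/1 indicators; the Pearson correlation of two
# binary indicators is the phi coefficient.
is_propn = (corpus["pos"] == "PROPN").astype(int)
is_person = (corpus["ent_type"] == "PERSON").astype(int)
correlation = is_propn.corr(is_person)

print(clean_text("Visit https://example.com now, Alice!"))
print(propn_count, person_count, round(correlation, 3))
```

Encoding each tag as a 0/1 column is only one way to express "correlation between a POS tag and a named entity"; the project itself may define the statistic differently.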

Reviews

Serhii Nykyforov
4 months ago
Amazing project 🫶🏻 The first three steps are fast and easy; the fourth might be a little more challenging.
Shashank Gupta
4 months ago
This was a vibrant and rewarding project where I applied my NLP knowledge using the spaCy and NLTK libraries to extract tokens, lemmas, stems, part-of-speech tags, and named entities from large text files. I then used pandas to analyze and extract valuable insights from the processed NLP data.
Brian Smith
5 months ago
This was an interesting project. I did find the first 3 stages much easier than the last. Less of a curve and more of a cliff of sorts.

4.3

Learners who completed this project within the course rated it as follows:
Usefulness: 4.6
Fun: 4.3
Clarity: 4.1