Project
Corpus Annotation from Scratch
31 completions
~ 29 hours
4.3
Build an annotated corpus from scratch, extracting morphological features and certain types of named entities with spaCy. Carry out a statistical study of the data using the pandas library.
Provided by
JetBrains Academy
About
Any serious NLP experiment requires data processing. In most cases, you use ready-made data, but sometimes you need to compile a corpus yourself. Depending on the task, you may also need certain information about your text: part-of-speech tags, named entities, statistical characteristics, and so on. In this project, you will learn how to convert raw text into a corpus for further research.
Training project
This project allows you to practice and strengthen your coding skills, helping you get ready for more advanced tasks ahead.
What you'll learn
Once you choose a project, we'll provide you with a study plan that includes all the necessary topics from your course to get it built. Here’s what awaits you:
Load the data and tokenize the text.
Remove unnecessary titles, links, and punctuation marks, then annotate the text morphologically using spaCy.
Find certain types of named entities in the text and save the corpus to a pandas DataFrame.
Compute the required statistics: the number of occurrences of a given POS tag or named entity in the corpus, and the correlation between a particular POS tag and a named entity.
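As a rough illustration of the last step, here is a minimal sketch of how such statistics might be computed with pandas. The rows are hand-written sample annotations, not project data, and the column names (`token`, `lemma`, `pos`, `ent_type`) are assumptions made for this example; in the project they would presumably be filled from spaCy token attributes such as `token.pos_` and `token.ent_type_`. Treating "correlation" as the Pearson correlation of two 0/1 indicator columns is also an assumption, since the project description does not say which measure it expects.

```python
import pandas as pd

# Hypothetical, hand-written annotations standing in for spaCy output.
rows = [
    ("Alice", "Alice", "PROPN", "PERSON"),
    ("visited", "visit", "VERB", ""),
    ("Paris", "Paris", "PROPN", "GPE"),
    ("and", "and", "CCONJ", ""),
    ("met", "meet", "VERB", ""),
    ("Bob", "Bob", "PROPN", "PERSON"),
    ("yesterday", "yesterday", "NOUN", "DATE"),
]
corpus = pd.DataFrame(rows, columns=["token", "lemma", "pos", "ent_type"])

# Number of tokens carrying a given POS tag or entity type.
propn_count = int((corpus["pos"] == "PROPN").sum())
person_count = int((corpus["ent_type"] == "PERSON").sum())

# Pearson correlation between "token is a proper noun" and
# "token belongs to any named entity", both encoded as 0/1 indicators.
is_propn = (corpus["pos"] == "PROPN").astype(int)
is_entity = (corpus["ent_type"] != "").astype(int)
correlation = is_propn.corr(is_entity)

print(propn_count, person_count, round(correlation, 2))
```

Encoding each condition as a 0/1 column keeps the calculation to a single `Series.corr` call; a different measure could be swapped in if the project calls for one.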
Reviews
Serhii Nykyforov, 4 months ago
Amazing Project 🫶🏻 The first three steps are fast and easy; the fourth might be a little more challenging.
This was a vibrant and rewarding project where I applied my NLP knowledge using the spaCy and NLTK libraries to extract tokens, lemmas, stems, part-of-speech tags, and named entities from large text files. I then used pandas to analyze and extract valuable insights from the processed NLP data.