Tokenizing and POS-tagging a Japanese text


To complete this task, write and run the code in an IDE on your computer, then copy the output to the website.

Let's work with Japanese. Download the Japanese models in Stanza and build a pipeline with the following code:

import stanza

# Download the Japanese models
stanza.download('ja')
# Build a pipeline with the processors used in this topic
nlp = stanza.Pipeline(lang='ja', processors='tokenize,pos,lemma,depparse,ner')
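
Once the models are downloaded, you apply the pipeline by calling it on a string; it returns a Document whose sentences contain Word objects with attributes such as text, upos, and lemma. A quick check (the sample string here is just an illustration, not part of the task):

doc = nlp('猫が好きです。')
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.lemma)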

You have this text in Japanese:

海が近くにある大学への進学を機に、叔父が経営するダイビングショップ「グランブルー」に居候することになった北原伊織。

It is just one sentence, so you don't need sentence tokenization. Carry out word tokenization and find a POS tag for each token. The output should look like this:

[('token', 'POS'), ('token', 'POS')]
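
Here is a minimal sketch of one way to build that list with the pipeline defined above: run nlp on the text and collect the text and universal POS tag (upos) of every word.

text = '海が近くにある大学への進学を機に、叔父が経営するダイビングショップ「グランブルー」に居候することになった北原伊織。'
doc = nlp(text)
# Collect a (token, POS) pair for every word in every sentence
tagged = [(word.text, word.upos) for sentence in doc.sentences for word in sentence.words]
print(tagged)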

Use Stanza 1.4.0. You can install it with this line:

!pip install stanza==1.4.0