
Text clustering

How many clusters?


In this task, you will implement DBSCAN text clustering on the PubMed Multi Label Text Classification dataset from Kaggle. The dataset's Kaggle page describes each column in detail.

Open the dataset:

import pandas as pd

file = '/content/pubmed-multilabel-text-classification/PubMed Multi Label Text Classification Dataset Processed.csv'

df = pd.read_csv(file)  # this path is valid for Colab

Note that the /content/ prefix in the path applies only if you use Google Colaboratory. Adjust the path if you work on a platform other than Colab. The abstracts are stored in df['abstractText'].

When splitting the data into train and test sets, use test_size=0.15 and random_state=42. Implement TF-IDF vectorization with max_df=0.9 and min_df=1, and don't forget to add English stopwords. Then implement DBSCAN text clustering with eps=1.2.
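The steps above could be wired together roughly as in the sketch below. It is not the reference solution: the helper name count_dbscan_clusters and the toy corpus are made up for illustration, and it assumes that clustering is run on the training split and that every DBSCAN setting other than eps keeps its scikit-learn default. On the real data you would pass df['abstractText'] instead of the toy list.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def count_dbscan_clusters(texts, eps=1.2, test_size=0.15, random_state=42):
    """Hypothetical helper: split, vectorize, cluster, and count clusters."""
    # Split with the parameters given in the task.
    train_texts, _test_texts = train_test_split(
        texts, test_size=test_size, random_state=random_state
    )

    # TF-IDF with the task's parameters plus English stopwords.
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=1, stop_words="english")
    X_train = vectorizer.fit_transform(train_texts)

    # Assumption: cluster the training split; only eps is set explicitly.
    labels = DBSCAN(eps=eps).fit_predict(X_train)

    # DBSCAN marks noise points with the label -1, which is not a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return n_clusters, labels


# Toy corpus (illustrative only): two groups of identical documents.
toy_texts = ["heart disease treatment"] * 10 + ["galaxy star telescope"] * 10
n_clusters, labels = count_dbscan_clusters(toy_texts)
print(n_clusters)
```

Subtracting one when -1 is present matters because counting the distinct labels directly would treat the noise points as an extra cluster.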

How many clusters have you got? Type the number below.

Enter a number

___
