In this task, you will implement DBSCAN text clustering on the PubMed Multi Label Text Classification dataset from Kaggle. The dataset's Kaggle page describes each column in detail.
Open the dataset:
import pandas as pd
file = '/content/pubmed-multilabel-text-classification/PubMed Multi Label Text Classification Dataset Processed.csv'
df = pd.read_csv(file)  # this path is valid on Colab only
Note that the /content/ prefix in the path applies only to Google Colaboratory. Adjust the path if you use a platform other than Colab. The article abstracts are stored in df['abstractText'].
Split the data into train and test sets with test_size=0.15 and random_state=42. Apply TF-IDF vectorization with max_df=0.9 and min_df=1, and remove English stop words. Then run DBSCAN text clustering with eps=1.2.
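The steps above can be sketched roughly as follows. This is a minimal sketch, not a reference solution. The helper name count_dbscan_clusters is hypothetical. The sketch assumes that clustering is run on the training split (the task does not say which split to cluster), that min_samples and the distance metric stay at scikit-learn's defaults, and that noise points (label -1) do not count as a cluster.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def count_dbscan_clusters(texts, eps=1.2, test_size=0.15, random_state=42):
    """Hypothetical helper: split, vectorize, cluster, and count clusters."""
    # Hold out 15% of the documents. Only the training part is clustered here,
    # which is an assumption: the task does not say which split to cluster.
    train_texts, _test_texts = train_test_split(
        list(texts), test_size=test_size, random_state=random_state
    )

    # TF-IDF with the parameters from the task and scikit-learn's
    # built-in English stop-word list.
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=1, stop_words="english")
    X_train = vectorizer.fit_transform(train_texts)

    # DBSCAN with eps from the task; min_samples and metric are left at
    # their scikit-learn defaults (an assumption).
    labels = DBSCAN(eps=eps).fit_predict(X_train)

    # Label -1 marks noise points, so it is excluded from the cluster count.
    return len(set(labels)) - (1 if -1 in labels else 0)
```

With the DataFrame loaded earlier, calling count_dbscan_clusters(df['abstractText']) would return the cluster count under these assumptions. Keeping the whole pipeline in one function makes it easy to rerun with a different eps and see how the count changes.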
How many clusters did you get? Type the number below.