Text clustering

Wine reviews

In this task, you will implement text clustering on the wine reviews dataset from Kaggle. The dataset's Kaggle page gives a detailed description of each column.

The dataset contains two CSV files. Open this one:

import pandas as pd


df = pd.read_csv('/content/wine-reviews/winemag-data_first150k.csv')

Note that the /content/ prefix in the path applies only to Google Colaboratory. Adjust the path if you use a platform other than Colab. Wine reviews are stored in df['description'].

Implement TF-IDF vectorization with the following parameters: max_df=0.95, min_df=5. Don't forget to remove English stopwords. When splitting the data into train and test sets, use test_size=0.33 and random_state=42. Then implement KMeans text clustering with the following settings: n_clusters=7, max_iter=100, n_init=2, random_state=42.
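The steps above can be sketched roughly as follows. This is only one possible reading of the task: the function name fit_review_clusters is hypothetical, and the order of operations (vectorize all reviews first, then split, then fit KMeans on the train part) is an assumption, since the task does not spell it out.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def fit_review_clusters(texts, n_clusters=7, min_df=5, max_df=0.95):
    """Hypothetical helper: TF-IDF -> train/test split -> KMeans on the train part."""
    # TF-IDF with English stopwords removed and the document-frequency limits from the task
    vectorizer = TfidfVectorizer(stop_words="english", max_df=max_df, min_df=min_df)
    X = vectorizer.fit_transform(texts)

    # Assumption: the split is applied to the already vectorized reviews
    X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

    # KMeans settings taken from the task description
    kmeans = KMeans(n_clusters=n_clusters, max_iter=100, n_init=2, random_state=42)
    kmeans.fit(X_train)
    return vectorizer, kmeans, X_test
```

With the real data, the call might look like fit_review_clusters(df['description']); the defaults match the parameters listed in the task.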

The file below contains one review from this dataset. Your task is to find that review in the dataset (its row number) and enter the number of the cluster it belongs to. For example, if the review falls into cluster 4, the correct answer is 4.
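Looking up the review and its cluster could be sketched as below. Both helper names (find_review_rows, predict_cluster) are hypothetical, and the sketch assumes you already have a fitted TfidfVectorizer and a fitted KMeans model, plus the review text read from the provided file into a string.

```python
import pandas as pd


def find_review_rows(df, review_text):
    """Return the index labels of rows whose description matches the review text."""
    # Strip surrounding whitespace so a trailing newline from the file does not break the match
    matches = df["description"].str.strip() == review_text.strip()
    return df.index[matches].tolist()


def predict_cluster(vectorizer, kmeans, review_text):
    """Return the cluster label that the fitted model assigns to one review."""
    # The vectorizer expects an iterable of documents, hence the one-element list
    features = vectorizer.transform([review_text.strip()])
    return int(kmeans.predict(features)[0])
```

Predicting from the text directly avoids having to track whether the review's row ended up in the train or the test part of the split.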

Write code in your IDE to process the text file and display the results below

Time limit: 5 minutes
