You are already familiar with transformers and their tasks in theory. Now let's get down to practice and try using them! In this topic, you will learn how to load a transformer with the Hugging Face library and how to train it on your own data.
Load a model
A huge number of models are collected on the Hugging Face Hub. After you have registered there and authenticated your environment (for example, via huggingface-cli login), you need to install the necessary libraries to use transformers.
!pip install transformers
!pip install datasets
Once you have chosen a suitable model, open its model card and click on "Use in Transformers". You will see instructions on how to load this model in Python. For example, for the BART transformer, the code looks like this:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
Basically, there are many classes for different tasks:
AutoModelForSequenceClassification
AutoModelForSeq2SeqLM
and so on.
AutoModel, in turn, is the generic class that loads the bare model without a task-specific head. Since our task is text summarization, a sequence-to-sequence task, we will use AutoModelForSeq2SeqLM. The tokenizer should generally come from the same checkpoint as the model. Now you have loaded Facebook's BART model.
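Once the model and tokenizer are loaded, you can already try them out directly, without a pipeline. Below is a minimal sketch; the sample text and the max_length value are just illustrative assumptions:
text = "On March 1, 1932, Puyi was installed by the Japanese as the ruler of Manchukuo."
inputs = tokenizer(text, return_tensors="pt", truncation=True)      # tokenize the input text
summary_ids = model.generate(**inputs, max_length=60)               # generate a summary (illustrative length)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))   # decode the token ids back to text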
Simple usage of transformers
The simplest way to use transformers in Python is through a pipeline. Create a summarizer with the following code:
from transformers import pipeline
summarizer = pipeline("summarization")
If you want to implement sentiment analysis, for instance, just change the task string from "summarization" to "sentiment-analysis". Now you can summarize any text you want:
text = ['On March 1, 1932, Puyi was installed by the Japanese as the ruler of Manchukuo, considered by most historians as a puppet state of Imperial Japan, under the reign title Datong.']
summarizer(text)
## [{'summary_text': ' On March 1, 1932, Puyi was installed by the Japanese as the ruler of Manchukuo, considered by most historians as a puppet state of Imperial Japan, under the reign title Datong . The Japanese installed him as a ruler of the puppet state, under his reign of Datong, in 1932 .'}]
In the output, you get a summary_text with two sentences: the first echoes the original, and the second, after the period, is a compressed restatement. If you want to use a specific model, define it in the pipeline:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
This way, the pipeline uses the specific pre-trained model we want instead of the default one.
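You can also pass generation parameters directly in the call; the max_length and min_length values here are just illustrative assumptions:
summarizer(text, max_length=60, min_length=10, do_sample=False)   # constrain the length of the generated summary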
Fine-tuning transformers
To begin, log in to the Hugging Face Hub, load your dataset, and load the model you want. This time we will use the original T5 model:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
And the same tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-base")
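Before tokenizing anything, you need a dataset loaded with the datasets library. Here is a minimal sketch assuming the CNN/DailyMail summarization dataset; any dataset with text and summary columns will do:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")   # assumed dataset; replace with your own data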
Now you need to tokenize the dataset you have.
tokenized_datasets = dataset.map(preprocess_function, batched=True)
The preprocess_function is a custom function that tokenizes the dataset. It may change a lot depending on your data and task.
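As an illustration, here is one possible preprocess_function for T5 summarization. The column names article and highlights, the "summarize: " prefix, and the length limits are assumptions to adapt to your data (the text_target argument requires a recent version of transformers):
prefix = "summarize: "   # T5 expects a task prefix in the input

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]               # assumed input column
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["highlights"],               # assumed target column
                       max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs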
Now you must define the arguments. Again, they may vary:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "fine_tuned_t5_model",   ## the name for our future model
    evaluation_strategy="epoch",
    learning_rate=2e-3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    push_to_hub=True,
    predict_with_generate=True,
    fp16=True
)
Now, create a data collator to batch your data for the model:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
And now, finally, define the trainer. Note the compute_metrics argument: it is a function that depends on which metrics you want to use. For tasks like text summarization, ROUGE and SAMSA are commonly used; a sketch of such a function is shown right after the trainer definition below.
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
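For reference, here is a minimal sketch of a ROUGE-based compute_metrics function using the evaluate library (it assumes the evaluate and rouge_score packages are installed; recent versions of the ROUGE metric return plain floats):
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks ignored positions in the labels; replace it before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return {key: round(value * 100, 4) for key, value in scores.items()}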
Now, your model is ready for training:
trainer.train()
This process can take a long time: from one hour to several. Use a GPU to accelerate it.
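The Trainer uses a GPU automatically when one is available; you can check it quickly with PyTorch:
import torch

print(torch.cuda.is_available())   # True means training will run on the GPU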
It's vital to save your model to the Hugging Face Hub. Otherwise, you will lose all the results!
trainer.push_to_hub()
Conclusion
In this topic, you have learned:
- how to download the transformer model from Hugging Face,
- how to use this model, and
- how to train it on your own data.