
T5 transformers


As artificial intelligence and machine learning rapidly advance, understanding the underlying principles and techniques becomes crucial. This topic delves into the rich realm of transformer models, focusing predominantly on the T5 transformer.

This topic simplifies transformers and provides an accessible gateway into machine learning and AI. We will introduce the Encoder-Decoder Theory, guide you through T5's functions and applications, and explain how to work with it in the Transformers library. Lastly, we will cover fine-tuning T5 for optimal performance in various tasks.

Encoder-decoder theory

There are three types of transformers: encoder, decoder, and encoder-decoder. The encoder and decoder are, together with the attention mechanism, the essential modules of the Transformer architecture. The encoder does what its name says: it encodes the given text into vector representations; the decoder, in turn, decodes such representations back into text. Encoder Transformers are those that contain only an encoder without a decoder; Decoder Transformers are those that have only a decoder without an encoder. Encoder Transformers are suitable for text classification (for example, sentiment classification) and token classification (NER), since they encode the text and then classify it based on the resulting embeddings. The most prominent example of an Encoder Transformer is BERT. Like CBOW or one-hot encoding, BERT turns text into vectors, and a classifier is then trained on top of these BERT embeddings. Decoder Transformers are suitable for text generation tasks (the GPT family).
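To make the distinction concrete, here is a minimal sketch using Hugging Face pipelines: an encoder-only model (BERT) filling in a masked token, and a decoder-only model (GPT-2) generating a continuation. The prompts are only illustrative examples.

from transformers import pipeline

# Encoder-only (BERT): builds embeddings of the whole input, suited to
# classification-style tasks such as masked-token prediction
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transformers are a [MASK] architecture.")[0]["token_str"])

# Decoder-only (GPT-2): generates text token by token
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=10)[0]["generated_text"])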

Encoder-Decoder Transformers contain both an encoder module and a decoder module. This allows the model to take a text as input and produce a new text as output. BART, a classic Encoder-Decoder Transformer, combines a BERT-like encoder with a GPT-like decoder. T5 is a member of the encoder-decoder family.

The image shows the structure of an Encoder-Decoder Transformer. The input text is fed into the encoder modules; the encoder's representations are then passed to the decoders, which produce the translation.

When do we use T5?

This model type is ideal for text2text-generation tasks, although other types of transformers can also handle them. Text2text-generation tasks include:

  • Summarization
  • Simplification
  • Paraphrasing
  • Machine Translation
  • Style transfer

For machine translation and paraphrasing, T5 and BART are commonly used. PEGASUS and PRIMERA are more specialized for summarization.
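T5 selects the task through a textual prefix in the input, so one checkpoint can handle several text2text tasks. Here is a short, illustrative sketch with the public t5-small checkpoint, using the standard prefixes from the original T5 setup; the example sentences are made up for demonstration.

from transformers import pipeline

# One encoder-decoder checkpoint, several tasks chosen by the input prefix
t5 = pipeline("text2text-generation", model="t5-small")

# Machine translation
print(t5("translate English to German: The weather is nice today.")[0]["generated_text"])

# Summarization
article = ("Japanese television dramas are broadcast daily and cover many genres, "
           "including romance, comedy, detective stories, and horror. Most series are "
           "built around a single theme, and special episodes are sometimes produced "
           "after a drama ends in response to viewer requests.")
print(t5("summarize: " + article)[0]["generated_text"])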

Work with T5 in the Transformers library

Let's try a seq2seq model for machine translation; the code below works the same way for T5 checkpoints and other encoder-decoder models. For this, import the pipeline together with the tokenizer and model classes:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ko-en")

translator = pipeline("translation", model=model, tokenizer=tokenizer)

We have the following Korean text about Japanese television dramas:

text = ''' 
일본의 텔레비전 드라마(약칭 일드)는 일본의 방송국에서 방송되는 드라마로 매일 방송된다. 일본 드라마의 장르는 로맨스, 코미디, 형사물, 호러물을 비롯한 수많은 장르별로 각양각색이다. 대개 장르별로 주제를 갖고 있는 드라마가 많고, 1회나 1-2회로 끝나는 단편 드라마, 혹은 드라마의 종영 이후에, 시청자들의 계속적인 요청에 따라 만들어지는 특별판도 있다.
'''

We can translate it as follows:

translator(text)

# Output:
# [{'translation_text': 'Japanese television dramas are broadcast daily on Japanese television
#   stations, and the genres of Japanese dramas are featured in romances, comedys, medias, whistles,
#   and numerous other genres, many of which have genres, some of which end a one or two circuits, or
#   after the end of the drama, are made up of a series of requests of viewers.'}]

The output may contain mistakes such as n-gram repetitions. In this case, pass generation arguments such as max_length or repetition_penalty to the pipeline call (they are forwarded to the model's generate() method), not to the tokenizer.
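For example, here is a hedged sketch of such a call, reusing the translator and text defined above; the exact values are only starting points to tune:

result = translator(
    text,
    max_length=200,          # cap the length of the generated translation
    repetition_penalty=1.3,  # discourage the model from repeating itself
    no_repeat_ngram_size=3,  # optionally forbid repeating any 3-gram outright
)
print(result[0]['translation_text'])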

Fine-tuning T5

To improve a model's performance, you can fine-tune it for a specific domain or language. For example, you can customize a Korean-English translator to work with medical texts. Similarly, you can fine-tune Google's T5 model to work with French or low-resource languages. Fine-tuning can also enhance a model's sophistication and ability to handle complex tasks.

Fine-tuning is also a key component in transfer learning, where a pre-trained model is adjusted for a new, related task. Transfer learning can be much more efficient regarding computational resources and time than training a model from scratch. It's important to note that fine-tuning should be done carefully to avoid overfitting.

For fine-tuning, install several libraries: accelerate for faster training, datasets to access the dataset on which we will train our new model, and huggingface_hub to log in to Hugging Face (where we can store our future model).

!pip install accelerate -U
!pip install datasets huggingface_hub

To fine-tune, you need to log in to Hugging Face and have a dataset. Data scientists may create their own datasets to meet specific requirements, but if a suitable one already exists on platforms like Kaggle, they can use it. We don't need a task-specific dataset here; the goal is simply to learn how to fine-tune a model on a common dataset. Let's fine-tune our new model on a simple Korean-English parallel corpus. Load this dataset and study its structure.
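A minimal sketch of these two steps, assuming the corpus is stored in local CSV files with 'ko' and 'en' columns (the file names here are hypothetical placeholders):

from huggingface_hub import notebook_login
from datasets import load_dataset

notebook_login()  # paste your Hugging Face access token when prompted

dataset = load_dataset(
    "csv",
    data_files={"train": "korean_english_train.csv",
                "validation": "korean_english_valid.csv"},
)
print(dataset)               # splits and column names
print(dataset["train"][0])   # a single {'ko': ..., 'en': ...} example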

To tokenize this dataset, define the preprocess_function:

max_input_length = 100
max_target_length = 100


def preprocess_function(examples):
    inputs = [ex for ex in examples['ko']]  # Original Korean text
    targets = [ex for ex in examples['en']]  # English translation
    model_inputs = tokenizer(inputs,
                             max_length=max_input_length,
                             truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

max_input_length and max_target_length correspond to the maximum possible length, in tokens, of the Korean input text and of its English translation (the target). So, before defining these two variables, study your dataset. They may also take different values: for a summarization task, for example, max_target_length should be much smaller than max_input_length.
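One way to study it is to tokenize a sample and look at the resulting lengths. A quick sketch, assuming the dataset loaded above has at least 1,000 training examples:

# Inspect token lengths on a sample before choosing the maximum lengths
sample = dataset["train"].select(range(1000))
ko_lengths = [len(tokenizer(ex["ko"])["input_ids"]) for ex in sample]
en_lengths = [len(tokenizer(ex["en"])["input_ids"]) for ex in sample]
print("longest Korean input:", max(ko_lengths))
print("longest English target:", max(en_lengths))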

These two variables are important when we move on to truncation and padding. In the code above, truncation=True, and padding is omitted. Truncation keeps only a certain number of tokens from a text, typically cutting from the end or the beginning. If a text is longer than the model's maximum token limit, truncation ensures that the input does not exceed the model's capacity.

Padding, on the other hand, adds special tokens (such as [PAD]) to a text to bring it up to a particular length. This ensures that all inputs to the model have a consistent shape or size, as models usually expect fixed-length inputs for efficient processing. Together, truncation and padding adapt the input data to the model's expectations, allowing for practical training and better performance on the desired task.
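A small illustration of both mechanisms at once (the two Korean sentences are made up for the example):

batch = tokenizer(
    ["아주 짧은 문장.",
     "이 문장은 다른 문장보다 훨씬 더 길어서 여기서는 잘리게 됩니다."],
    max_length=8,
    truncation=True,          # the long sentence is cut down to 8 tokens
    padding="max_length",     # the short one is padded up to 8 tokens
)
for ids in batch["input_ids"]:
    print(len(ids), ids)      # both sequences now contain exactly 8 token ids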

Tokenization itself may take some time (to make it a little faster, you can tokenize in batches):

tokenized_datasets = dataset.map(preprocess_function, batched=True)

When tokenization is finished, there will be three more columns in your dataset: input_ids, attention_mask, and labels. These are the columns the trainer relies on during training.
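You can double-check this by printing the column names (the original columns depend on your dataset; 'ko' and 'en' are assumed here):

print(tokenized_datasets["train"].column_names)
# e.g. ['ko', 'en', 'input_ids', 'attention_mask', 'labels']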

Now, define training arguments:

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

batch_size = 1

args = Seq2SeqTrainingArguments(
    "my_own_KO2EN_translator",
    evaluation_strategy="epoch",
    learning_rate=3e-6,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=1,
    push_to_hub=True,  # push checkpoints to the Hugging Face Hub during training
    predict_with_generate=True
)

Let's talk about each argument here:

  • The first positional argument is the name of your new model (whatever you like).
  • The evaluation strategy lets you track the model's performance over time; the options are no, steps, and epoch.
  • The learning rate is the initial learning rate for the AdamW optimizer. The default value is 5e-5.
  • The per-device batch sizes define how many examples each device processes per training or evaluation step.
  • Weight decay is a regularization technique used in machine learning to prevent overfitting. It penalizes large weights in the model during training. In transformer models, weight decay matters because these models typically have a huge number of parameters; due to their complex structure and capacity, they tend to overfit. Weight decay regularizes the model by adding a small penalty term to the loss function during optimization, which encourages smaller, more generalizable weights.
  • save_total_limit keeps only the given number of most recent checkpoints on disk.
  • The number of training epochs is the duration of the training. Since we only want to demonstrate the process, we set it to 1; in practice, it may be 10, 100, or even a thousand.
  • push_to_hub lets you publish your model on Hugging Face as a separate repository (private or public).
  • predict_with_generate makes the trainer use generate() during evaluation, so predictions are actual generated sequences rather than raw logits.
Before training, define a data collator, which forms batches and dynamically pads inputs and labels to the longest sequence in each batch:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Now we can define the trainer itself:

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,  ## base model
    args,  # the arguments we have defined above
    train_dataset=tokenized_datasets['train'],  # training set
    eval_dataset=tokenized_datasets['validation'],  # validation set
    data_collator=data_collator,
    tokenizer=tokenizer)

Sometimes, we also add the compute_metrics argument when we want a task-specific metric for evaluation; in our case, that would most likely be BLEU. We will omit it in this topic, since you will learn to add a metric in one of our projects ("Automatic Polish Name Transliteration").

Now, we can train the model.
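A single call starts the fine-tuning loop:

trainer.train()

Once training finishes, you can also call trainer.push_to_hub() to upload the final model and its model card to your repository on the Hub.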

Voilà, the training has begun!

Conclusion

T5 transformers have become indispensable in natural language processing, enabling machine translation, summarization, and style transfer through their encoder-decoder architecture. The Korean text translation is a good example of how encoder-decoder models like T5 can tackle machine translation. By fine-tuning pre-trained models such as the Korean-English one, the system becomes more versatile, adaptable, and precise, making it applicable to many use cases beyond translation. T5 transformers remain a promising area of exploration within the Transformers library, with exciting advancements expected in the future.
