Fine Tuning GPT-2 for Magic the Gathering Flavour Text Generation

Richard Bownes · Published in The Startup · Sep 6, 2020

A template for fine-tuning your own GPT-2 model.

Photo by Wayne Low on Unsplash

GPT-3 has dominated the NLP news cycle recently with its borderline magical performance in text generation, but for everyone without $1,000,000,000 of Azure compute credits there are still plenty of ways to experiment with language models on your own. Hugging Face is a company focused on open source NLP tooling, and its libraries provide one of the easiest ways of accessing pre-trained models and tokenizers for NLP experiments. In this article, I will share a method for fine tuning the 117M parameter GPT-2 model with a corpus of Magic the Gathering card flavour texts to create a flavour text generator. This will all be captured in a Colab notebook so you can copy and edit it to create generators for your own tasks!

Starting Point

Generative language models require billions of data points and millions of dollars in compute power to train successfully from scratch. For example, GPT-3 cost an estimated $4.6 million and 355 years of compute time to train. However, fine tuning many of these models for custom tasks is within easy reach of anyone with access to even a single GPU. For this project we will be using Colab, which comes with many common data science packages pre-installed, including PyTorch, and provides free access to GPU resources.

First, we will install the Hugging Face transformers library, which will also fetch the excellent (and fast) tokenizers library. Although Hugging Face provides a resource for text datasets in its nlp library, I will be sourcing my own data for this project. If you don’t have a dataset or application in mind, the nlp library would be an excellent starting place for easy data acquisition.

This will install the Hugging Face transformers library and the tokenizer dependencies.
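In a Colab notebook the install is a single cell; a minimal sketch (the exact pinned versions the author used aren’t given):

```python
# Installs transformers, which pulls in the tokenizers package as a dependency
!pip install transformers
```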

The Hugging Face libraries will give us access to the GPT-2 model as well as its pretrained weights and biases, a configuration class, and a tokenizer to convert each word in our text dataset into a numerical representation to feed into the model for training. Tokenization is important because the models can’t work with text directly, so the text needs to be encoded into something more manageable. Below is an example of tokenization on some sample text to give a small representative example of what encoding provides.

In this example, “Sample Text” is encoded by the tokenizer to the vector [36674, 8255].
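A minimal sketch of that encoding step with the base GPT-2 tokenizer (the [36674, 8255] ids are the article’s example):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode a string to token ids and decode it back again
print(tokenizer.encode("Sample Text"))   # the article's example: [36674, 8255]
print(tokenizer.decode([36674, 8255]))   # "Sample Text"
```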

The Data

Now it’s time to grab our data. For this project I’ll be using Magic the Gathering card flavour text from the Scryfall API, which returns easily parsable JSON objects of card data. I extracted only English flavour text to avoid introducing new tokens for non-English words, as the GPT-2 model was originally trained on English-only data. After parsing, I was left with an iterable list of 29,222 MtG flavour texts, a preview of which is below.

Here are a few samples of the training corpus.
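The article doesn’t show the exact scraping code, but one way to pull such a corpus is through Scryfall’s bulk-data endpoint; this sketch assumes the oracle_cards bulk file and the flavor_text and lang fields of Scryfall’s card schema:

```python
import requests

# Look up the download URI for Scryfall's "oracle_cards" bulk file
bulk_index = requests.get("https://api.scryfall.com/bulk-data").json()
oracle_uri = next(
    item["download_uri"] for item in bulk_index["data"]
    if item["type"] == "oracle_cards"
)

# Download every card object and keep only English flavour texts
cards = requests.get(oracle_uri).json()
flavour_texts = [
    card["flavor_text"]
    for card in cards
    if card.get("lang") == "en" and card.get("flavor_text")
]

print(len(flavour_texts), flavour_texts[:3])
```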

The Loader

Now that we have our text data, we need to create a structured dataset and dataloader to feed it into the model appropriately. For this step, we will use built-in PyTorch classes to define the dataset and dataloader, which will feed the neural network. The dataloader combines a dataset with a sampler and provides single- or multi-process iterators over the dataset (see the official documentation for further information). There are a lot of details here, but the important points are:

  1. The dataset object will create a list of samples, each of which is a tuple of tensors.
  2. The first tensor is the encoded flavour text, wrapped in a start-of-text token and an end-of-text token and padded up to a maximum sequence length (if the string is shorter than that maximum).
  3. The second tensor is an attention mask, a list of 1s and 0s that tells the model which tokens are important (the 1s) and which should be ignored (the 0s).

The code for creating this dataset object is below and has been generalized to fit any tokenizer and data list. Each sequence is also padded up to a maximum length, which can be specified. The longest string in my corpus was 98 tokens, so my tensors are only padded to a maximum length of 98. The maximum sequence length GPT-2 can handle is 1,024 tokens, so keep in mind that the padding length you specify will make a difference to the training speed of the model and the batch size you are able to allocate.
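A sketch of what such a dataset class could look like, assuming a GPT2Tokenizer created with the <|startoftext|>, <|endoftext|> and <|pad|> special tokens (the ids 50257, 50256 and 50258 described in the output caption below); the author’s exact implementation may differ:

```python
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

# Tokenizer with the start/end/padding special tokens described below
tokenizer = GPT2Tokenizer.from_pretrained(
    "gpt2",
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|pad|>",
)

class MTGDataset(Dataset):
    """Encodes a list of strings into (input_ids, attention_mask) tensor pairs."""

    def __init__(self, txt_list, tokenizer, max_length=98):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each string in start/end tokens and pad out to max_length
            encodings = tokenizer(
                "<|startoftext|>" + txt + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
            )
            self.input_ids.append(torch.tensor(encodings["input_ids"]))
            self.attn_masks.append(torch.tensor(encodings["attention_mask"]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
```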

Code to define a custom dataset in PyTorch.
Example output for the above dataset. In the first tensor, 50257 is the ‘start of sentence’ token, the subsequent tokens are the encoded words in the string, and 50256 represents the ‘end of sentence’ token. Combined, these tokens tell the model where the sentence starts and stops. The repeating token 50258 is the ‘padding’ token; it is assigned 0s in the attention mask tensor so that the model gives it no weight.

We now need to split the dataset into a training and validation set before creating the dataloaders. The code below shows an example of doing this to the MTGDataset we have created from the dataset template code, using the GPT2Tokenizer we instantiated, and dividing the data into 80:20 training/validation sets. It is important to note that different samplers are employed for the training and validation dataloaders: we want random sampling for the training data, but that isn’t required for validation, so those samples are evaluated sequentially.

The dataset was split using an 80:20 random sample of the MTGDataset.
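A sketch of the split and the two dataloaders, reusing the MTGDataset and tokenizer from above (the flavour_texts list and the batch size of 32 are assumptions carried over from earlier):

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, random_split

dataset = MTGDataset(flavour_texts, tokenizer, max_length=98)

# 80:20 train/validation split
train_size = int(0.8 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

batch_size = 32
train_dataloader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),     # random sampling for training
    batch_size=batch_size,
)
validation_dataloader = DataLoader(
    val_dataset,
    sampler=SequentialSampler(val_dataset),   # sequential sampling for validation
    batch_size=batch_size,
)
```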

The Model

Before training, we need to instantiate a few more things. First, we load and set the parameters of the GPT-2 model. Next, we create an instance of the GPT-2 language model itself and configure it with the parameters we just set. Lastly, to speed up training we should run everything on the available GPU, which means instructing PyTorch to move the model onto the cuda device.

Setting the model configuration, instantiating the model and running the model on the GPU.
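A sketch of those three steps, assuming the tokenizer with added special tokens from the dataset section (the call to resize_token_embeddings makes room for those extra tokens):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Load the 117M-parameter configuration and pretrained weights
configuration = GPT2Config.from_pretrained("gpt2", output_hidden_states=False)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# Grow the embedding matrix to cover the added special tokens
model.resize_token_embeddings(len(tokenizer))

# Move the model onto the GPU
device = torch.device("cuda")
model.to(device)
```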

At this point, we need to investigate what type of instance we have connected to in Colab. We can check this by running !nvidia-smi, which will display the device information, including the GPU model (P100, K80, T4, etc.) as well as the amount of VRAM available to the device. This information is crucial because it will inform our choice of batch size. Setting the batch size to the maximum you can fit into memory is generally good practice. On a T4 or K80 we can set the batch size to 32 on this particular data; otherwise the batch size must be set smaller or the data will fail to load to the GPU and training won’t start. More VRAM will enable larger batch sizes, which will make training faster.

Now we can set the epoch number (the number of training cycles) and create the optimizer we will use for training. We will be using the Hugging Face implementation of AdamW, though other optimizers would also work. fast.ai has a wonderful blog post explaining the AdamW optimizer, including a brief history and the recent tweaks that led to its current form.

At this stage, we could fine tune other hyperparameters, such as the learning rate, the beta and epsilon values of the optimizer, or vary the batch size or epoch number. If we are otherwise happy with the defaults, we can establish the training loop and begin!
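A sketch of that setup; the epoch count, learning rate, epsilon and warmup values here are illustrative defaults rather than the author’s tuned settings:

```python
from transformers import AdamW, get_linear_schedule_with_warmup

epochs = 5            # number of full passes over the training data
learning_rate = 5e-4
epsilon = 1e-8

optimizer = AdamW(model.parameters(), lr=learning_rate, eps=epsilon)

# Optionally decay the learning rate linearly over the course of training
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=total_steps
)
```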

Training

The code for the training loop is below. For anyone unfamiliar with neural network training, I’ll try to provide an accessible description of the basic workflow this code encapsulates:

  1. A training batch is loaded to the GPU and the network makes predictions against its labels.
  2. The performance of the model is assessed and the loss, how far the predictions are from the truth, is found.
  3. The derivative of the loss is calculated and the optimizer moves down the gradient towards a minimum.
  4. The changes that reflect this step are back-propagated through the model, the weights are updated at each layer, and the next batch is processed.

This process repeats for every training batch, and ideally the model will equilibrate at a position of minimized global loss. To see whether the model generalizes well to data it hasn’t seen, it is tested on the validation data. After this point, the model has been fine tuned on our new dataset and we can examine the overall model performance and test the outputs to see how well this worked!

This is the basic training loop the model will iterate through. Other deep learning frameworks abstract this to a model.fit() method but in PyTorch we need to define our own training loops.
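A sketch of a plain PyTorch loop along these lines; it assumes the model, optimizer, scheduler and dataloaders defined above, and records the per-epoch losses used in the Evaluation section below:

```python
train_losses, val_losses = [], []

for epoch in range(epochs):
    # ---- training ----
    model.train()
    total_train_loss = 0
    for batch in train_dataloader:
        input_ids, attn_masks = [t.to(device) for t in batch]

        model.zero_grad()
        # For causal language modelling the labels are the input ids themselves
        outputs = model(input_ids, attention_mask=attn_masks, labels=input_ids)
        loss = outputs[0]
        total_train_loss += loss.item()

        loss.backward()      # backpropagate the loss
        optimizer.step()     # update the weights
        scheduler.step()     # advance the learning-rate schedule

    # ---- validation ----
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in validation_dataloader:
            input_ids, attn_masks = [t.to(device) for t in batch]
            outputs = model(input_ids, attention_mask=attn_masks, labels=input_ids)
            total_val_loss += outputs[0].item()

    train_losses.append(total_train_loss / len(train_dataloader))
    val_losses.append(total_val_loss / len(validation_dataloader))
    print(f"epoch {epoch + 1}: train loss {train_losses[-1]:.3f}, "
          f"val loss {val_losses[-1]:.3f}")
```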

Training Made Easy

Shortly after this article was first published, Julien Chaumond pointed me to the new Trainer class in transformers, which makes this training loop significantly more concise and offers several other benefits as well. The Trainer even makes some of the dataloader instantiation obsolete: you only need to provide dataset objects and it will automatically create the loaders, using random samplers for training and sequential samplers for validation, precisely as we configured manually. It will even prompt you to log in to a service like Weights and Biases to log your model training, and can configure your model to train across multiple devices, including TPUs. Unless you really want to specify every detail of the training cycle manually, I would highly recommend using this method.

transformers.Trainer equivalent of the PyTorch training loop established earlier.
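A sketch of the Trainer-based equivalent; the argument values mirror the illustrative settings above, and the small collate function simply turns the MTGDataset tuples into the input_ids/attention_mask/labels batches the model expects:

```python
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",       # where checkpoints are written
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
)

def collate(batch):
    # Stack (input_ids, attention_mask) tuples into language-modelling batches
    input_ids = torch.stack([item[0] for item in batch])
    attention_masks = torch.stack([item[1] for item in batch])
    return {"input_ids": input_ids, "attention_mask": attention_masks, "labels": input_ids}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # the Trainer builds its own dataloaders
    eval_dataset=val_dataset,
    data_collator=collate,
)

trainer.train()
```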

Evaluation

First, we will examine the shape of the training and validation loss curves, shown below. This is a good outcome given that many of the model parameters are defaults: the training loss doesn’t dip far below the validation loss, which would have indicated possible over-fitting.

Training and validation loss values.
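If you kept the per-epoch losses from the training loop sketch above, a plot like the one described here can be produced with a few lines of matplotlib:

```python
import matplotlib.pyplot as plt

# Per-epoch averages collected in the training loop sketch
plt.plot(train_losses, label="training loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```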

Now for the fun part! We will evaluate the model outputs from a human perspective. Below are five examples of the model outputs. I’m pretty pleased with these! They read like the flavour texts I put into the model, and a quick check shows they aren’t duplicates of any existing cards from the corpus. In places it even attributes quotes to real entities from MtG, in the correct location within the flavour text, so the model definitely appears to understand the structure of our data. Overall, I would say that the training has gone really well and that the fine tuning has produced a new model that successfully generates novel MtG flavour texts!

Some example MtG flavour text outputs.
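A sketch of how such samples can be drawn from the fine-tuned model, seeding generation with the start-of-text token and using top-k/top-p sampling (the sampling parameters here are illustrative):

```python
model.eval()

# Seed generation with the start-of-text token and sample five candidates
prompt = torch.tensor(tokenizer.encode("<|startoftext|>")).unsqueeze(0).to(device)
sample_outputs = model.generate(
    prompt,
    do_sample=True,        # sample from the distribution rather than greedy-decode
    top_k=50,
    top_p=0.95,
    max_length=98,
    num_return_sequences=5,
    pad_token_id=tokenizer.pad_token_id,
)

for i, sample in enumerate(sample_outputs):
    print(f"{i + 1}: {tokenizer.decode(sample, skip_special_tokens=True)}")
```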

Future Changes

There are some obvious changes that we could make to this workflow that might improve the model. Some hyperparameter tuning could create more ‘accurate’ outputs, although that would be a difficult metric to calculate in this instance. Different language models, or a larger version of GPT-2 than the 117M parameter model, could have been fine tuned as well. It is also possible my scraping removed too many data points, or that the API I chose didn’t contain every possible flavour text and a more exhaustive search would return a richer dataset.

If replicating this workflow isn’t for you, but you want to use this generator to create something of your own, this model is now hosted by Hugging Face; its home page on the model hub includes usage instructions.

By following the instructions on that page, or using the code chunk below, you can load the Magic-The-Generating model straight from the transformers library into your local Python environment.
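A sketch of loading the hosted model; the model identifier here is a placeholder, so substitute the exact id shown on the model’s Hugging Face page:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder id: replace with the exact identifier from the model's page
model_id = "<username>/Magic-The-Generating"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```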

Acknowledgements

I think it’s really important to give credit where credit is due within the open source community. So if you found this blog entertaining or informative, please take a moment to visit these excellent resources for more language modelling content, tutorials and projects:
