Language models power thousands of products and reach millions of people through language generation, such as predictive text, and through classification. For instance, many companies use NLP in customer service interactions to route users to solutions more quickly, and GitHub Copilot generates code from comments. Maybe most subtly, but most impactfully, BERT powers almost every English-language query on Google Search, fundamentally improving how you find things on the web. In this article I will use a HuggingFace BERT model to create searchable embeddings from the text metadata of the MovieLens dataset, and then build a personalised movie recommendation search engine.
This article covers a few important technologies, which I outline here. For each one I will discuss the library and its API, walk through the implementation, and give some background on how the algorithm works and the pros and cons of choosing it. I’ll go over each in turn and bring them all together later in the article. Hopefully this will also serve as a useful refresher on these topics. Before we go any further, the tools I will be using and discussing are:
- LABSE — Huggingface
- SCANN — Tensorflow
- Neural Collaborative Filtering — Keras
LABSE is a Language-Agnostic BERT Sentence Embedding model developed by Google AI; the arXiv paper for those interested from an academic perspective can be found here. BERT, itself an acronym for Bidirectional Encoder Representations from Transformers, represented a substantial improvement in the performance of Transformer models at the time; for a complete reading, here is the paper. Most earlier language models were trained sequentially, predicting the next word by considering only the words that came before it. BERT was instead given an entire sentence with some words masked out, and learned to predict them from the context on both sides, weighing every word against all the other words at the same time. As history shows, this was an important improvement that captured more of the nuance of language: BERT achieved SOTA on basically all of the benchmarks at the time of its release. This is doubly impressive because, instead of training on manicured, labelled data, it was pre-trained on unlabelled text from the BooksCorpus (800M words) and English Wikipedia (2,500M words). The “Language Agnostic” part is a modern cherry on top: semantically similar sentences, such as “Cute Puppy” and “Chiot Mignon”, are embedded close together in vector space (see the diagram below) no matter the language of origin, something that will be incredibly handy when dealing with foreign films that may have non-English titles.
In practice this is going to look something like the code chunk below. Huggingface now also provides the SentenceTransformers library, which is full of useful models like LABSE behind the same low-code API. As you can see from the example, the salient point is that the model takes our two example sentences, “This is an Example sentence” and “Each sentence is converted”, and translates each of them into a 768-dimensional vector, which you can think of as an address that describes that sentence in high-dimensional space. The useful part of these embeddings is that, now that we have represented sentences as numbers, we can treat them like data: group them, transform them, even find the similarities between them.
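Separately from the model call itself, it may help to see what "finding the similarities" between two embeddings can look like. The following is a minimal sketch of cosine similarity, one common way to compare sentence vectors. The three-value vectors are made-up stand-ins for real 768-value LaBSE outputs, and the function and variable names are my own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real LaBSE vectors have 768 values).
cute_puppy = [0.9, 0.1, 0.2]
chiot_mignon = [0.8, 0.2, 0.1]
tax_return = [0.1, 0.9, 0.3]

print(cosine_similarity(cute_puppy, chiot_mignon))  # close to 1
print(cosine_similarity(cute_puppy, tax_return))    # noticeably lower
```

A score near 1 means the two vectors point in almost the same direction, which is what we hope to see for a sentence and its translation.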
Scalable Nearest Neighbors (SCANN) is an approximate nearest neighbors (ANN) implementation from Google that reports SOTA performance in queries per second at any given recall value; see the diagram below. In this use case you can think of recall as accuracy: the fraction of the true nearest neighbors that the approximate search actually returns. There is a very good write-up of ANN approaches here that says a little more about these metrics.
From a high level, SCANN enables efficient vector similarity search, which is exactly the functionality we want for our newly encoded sentence embedding vectors. An ANN method like SCANN becomes important when data starts to scale. Comparing two things is easy, and comparing ten things with ten things is trivial. But, to give an example from my real-world ML work, when you try to compare millions of users to hundreds of thousands of items, an exhaustive search has to make on the order of 1x10¹¹ comparisons (10⁶ users × 10⁵ items). That is infeasible unless you throw a lot of compute at it, and it really should be handled more gracefully. An approximate nearest neighbors algorithm is one such more graceful solution: instead of looking at all possible neighbors, it organises the data in a tree or hash table and returns only the “closest” neighbors to a query, while trying to approximate the true distances you would have seen in the exact solution. Eyal Trebelsi has a great write-up on ANN methods for further details, and the official GitHub repository, Google AI blog post and arXiv paper provide all of the extra context needed to understand this library deeply. We will use SCANN to quickly compare query strings against the 62,000 movies that we embed with LABSE and index in a SCANN hash table space, returning a starting list of movie recommendations quickly and efficiently.
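To put numbers on the scaling argument, here is a small sketch of the exhaustive approach that an ANN method tries to avoid. It counts the comparisons for the sizes mentioned above and runs an exact top-k search by dot product over a toy item set. The function names and the two-value vectors are illustrative assumptions, not part of any library.

```python
import heapq

def exhaustive_comparisons(num_queries, num_items):
    """Number of pairwise comparisons an exact search needs."""
    return num_queries * num_items

def brute_force_top_k(query, items, k):
    """Exact nearest neighbours by dot product: scores every single item."""
    scored = ((sum(q * x for q, x in zip(query, vec)), name)
              for name, vec in items.items())
    return [name for _, name in heapq.nlargest(k, scored)]

# Illustrative sizes only: one million users, one hundred thousand items.
print(exhaustive_comparisons(1_000_000, 100_000))  # 100000000000

# Toy item vectors; every query has to touch all of them.
items = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(brute_force_top_k([1.0, 0.1], items, k=2))  # ['a', 'b']
```

The cost of `brute_force_top_k` grows linearly with the number of items for every query, which is the part an approximate index cuts down.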
Neural collaborative filtering
Collaborative filtering (CF) is a mainstay of recommender systems that seeks to model user-item and user-user interactions in order to identify the best possible recommendation. You’ve seen this every time you’ve gone on Amazon and been graced with a “People who bought this, also bought that” rail on your page. The method was made most famous by Netflix and the million-dollar Netflix Prize, but it existed before the streaming giant. The most common form of CF is matrix factorisation: it learns a latent vector for every user and every item from the user-item interaction matrix, and scores a user-item pair with the inner product of their two vectors. Neural collaborative filtering (NCF) replaces that inner product with a neural network, which tries to generalise the matrix factorisation operation; the original paper is a great read on the math and principles behind this. Additionally, Abhishek Sharma has a great read on NCFs that goes into detail about the technique and also provides code examples in Python. Here we will build a simple NCF model in Keras to rerank and personalise a list of movies recommended from an initial search query, based on the user’s movie watching history.
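The difference between the two scoring functions can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the embeddings are two values long, the network has a single hidden layer, and the weights are hand-picked rather than learned as they would be in a real model.

```python
def mf_score(user_vec, item_vec):
    """Matrix factorisation: predicted affinity is the inner product."""
    return sum(u * i for u, i in zip(user_vec, item_vec))

def relu(x):
    return max(0.0, x)

def ncf_score(user_vec, item_vec, hidden_weights, output_weights):
    """NCF-style scoring: concatenate both embeddings, pass them through
    one hidden ReLU layer, then combine with a linear output unit."""
    features = user_vec + item_vec  # list concatenation
    hidden = [relu(sum(w * f for w, f in zip(row, features)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

# Hypothetical embeddings and hand-picked weights, for illustration only.
user = [0.5, 1.0]
item = [1.0, 0.5]
hidden_weights = [[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, -1.0]]
output_weights = [1.0, 2.0]

print(mf_score(user, item))                                   # 1.0
print(ncf_score(user, item, hidden_weights, output_weights))  # 2.5
```

The inner product can only combine matching dimensions of the two vectors, while the network is free to learn other interactions between them.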
Join it all up
Now that I have covered the individual components of the model, I want to quickly go over how they interact. We will use LABSE to create embeddings from the titles and genres of the 62,000 films in the MovieLens dataset. These will be indexed in a SCANN hash space and queried with new strings like “Scary Zombie Movies”. That string will be transformed by the same LABSE model, and the output vector will be the search query that returns a list of candidate movie recommendations from the SCANN model. That list of movies will then be reranked by scoring the user’s collaborative filtering embedding against the item embeddings of the candidate films, and filtered to remove films the user has already seen, to provide, hopefully, a novel and interesting selection of unseen movies related to their search and personalised to their tastes.
Language Model Sentence Embeddings
I’m going to walk through the code now to set up each section of the model. The full code can be found on GitHub, but I will reference it heavily here with code snippets and a brief explanation of what each chunk is doing.
The first step of this project was to set up the language model and tokenizer, along with the data to be embedded. This required very few steps, as Huggingface’s API is brilliantly low-code and the MovieLens data is neatly packaged up. All that is needed is to point the AutoTokenizer/Model at the correct LM, which in this case is “sentence-transformers/LaBSE”. If there are any other sentence embedding models you would like to try, it’s as easy as changing that string to the corresponding model hosted on the HF model repository (https://huggingface.co/models).
The data is pretty simple, but it still needs a touch of cleaning. The parentheses around the year in the movie title are unimportant; they won’t encode any meaningful information, so we can remove them. Likewise the pipes in the genres are not encoding anything useful. They might even hurt at this stage, as movies with more genres might end up looking more similar to other films with many genres if we keep these characters. They can go too. To be safe, we will also replace any string that denotes no genres with an empty string. The list comprehensions below take care of all of this cleaning for us.
Now that the data is cleaned up and packaged as a list containing the movie title and genres for each film, we can create the embedding for each entry. I used a Google Cloud hosted Jupyter notebook instance with a T4 GPU, rather than a CPU, to accelerate this, with the PyTorch backend for the LM. Loading the model onto the GPU is simple, and loading the tokenized inputs onto it is just as straightforward. This really, really speeds up inference, and if you have access to GPUs through Google Cloud or Colab, or you own a sufficiently powerful card at home, I would recommend taking this approach. After converting each string to an embedding, I saved them all as a CSV for later use.
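As a rough illustration of that cleaning step, here is a minimal sketch using plain list comprehensions. It assumes titles shaped like "Toy Story (1995)", pipe-separated genres, and "(no genres listed)" as the marker for missing genres; the second film is a made-up row, and the variable names are mine rather than those of the original code.

```python
# Two example rows in the MovieLens style; the second one is invented.
movies = [
    ("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    ("Some Obscure Film (2001)", "(no genres listed)"),
]

NO_GENRES = "(no genres listed)"  # assumed marker for a missing genre list

# Drop the parentheses around the year but keep the year itself.
titles = [title.replace("(", "").replace(")", "") for title, _ in movies]

# Blank out the "no genres" marker, otherwise turn pipes into spaces.
genres = ["" if g == NO_GENRES else g.replace("|", " ") for _, g in movies]

# One string per film, ready to be tokenized and embedded.
texts = [f"{t} {g}".strip() for t, g in zip(titles, genres)]
print(texts)
```

The "no genres" check has to run before any parentheses are stripped from the genre string, otherwise the marker would no longer match.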
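The bookkeeping around that step (feeding the texts through in chunks and writing the vectors out as CSV) can be sketched without any model at all. Everything here is an assumption for illustration: `fake_encode` is a placeholder that stands in for the real LaBSE call, the batch size is arbitrary, and an in-memory buffer replaces the file on disk.

```python
import csv
import io

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode(batch):
    """Placeholder for the real model call; returns a 3-value vector per text."""
    return [[float(len(text)), float(text.count(" ")), 1.0] for text in batch]

texts = ["Toy Story 1995", "Jumanji 1995", "Heat 1995"]
embeddings = []
for batch in batched(texts, batch_size=2):
    embeddings.extend(fake_encode(batch))

# StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
csv.writer(buffer).writerows(embeddings)

# Read the rows back to confirm the round trip preserves the vectors.
buffer.seek(0)
loaded = [[float(value) for value in row] for row in csv.reader(buffer)]
print(loaded == embeddings)
```

With a real model, each vector would have 768 values, so one CSV row per film keeps the row order aligned with the cleaned movie list.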
Approximate Nearest Neighbor Candidate Generation
The embeddings from the previous step are now converted into item tensors, and then an instance of SCANN is created with declared parameters for the number of leaves, the number of leaves to search, and K. The leaves are partitions of the embedding space, so for the value 1000 the entire hash space is partitioned into 1000 pieces, each containing the items whose asymmetric hashes are most similar to each other. The leaves to search is then the number of these partitions to look through for each query. Changing these values can have a pretty large impact on the indexing time for the model as well as the inference time, as SCANN will have to search through more or fewer partitions. Lastly, K is the number of nearest neighbors of the search query to find and return. The code snippet below shows the creation of the item index, then a test run on the string “Horror films with zombies”: encoding that string with the LaBSE model, querying our item index with the newly embedded string, and finally the search results the model returns.
Neural Collaborative Filtering Reranking
This example of neural collaborative filtering is heavily inspired by the Keras documentation. Please check out the official docs page (here) for further details on this package and the implementation of NCF.
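To show how the three parameters interact, here is a toy partitioned search in plain Python. It is not how SCANN is implemented: the real library learns its partitions and scores items with quantised hashes, whereas this sketch uses hand-picked centroids and exact dot products. The names `build_leaves`, `search`, `leaves_to_search` and `k` are my own.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_leaves(items, centroids):
    """Assign every item to the leaf whose centroid scores it highest."""
    leaves = {i: [] for i in range(len(centroids))}
    for name, vec in items.items():
        best = max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))
        leaves[best].append(name)
    return leaves

def search(query, items, centroids, leaves, leaves_to_search, k):
    """Score only the items in the most promising leaves, return the top k."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: dot(query, centroids[i]), reverse=True)
    candidates = [n for leaf in ranked[:leaves_to_search] for n in leaves[leaf]]
    return heapq.nlargest(k, candidates, key=lambda n: dot(query, items[n]))

# Hypothetical two-value embeddings and two hand-picked leaf centroids.
items = {
    "zombie_a": [0.9, 0.1], "zombie_b": [0.8, 0.3],
    "romance_a": [0.1, 0.9], "romance_b": [0.2, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]
leaves = build_leaves(items, centroids)

result = search([1.0, 0.2], items, centroids, leaves, leaves_to_search=1, k=2)
print(result)  # ['zombie_a', 'zombie_b']
```

Searching one leaf scores two items instead of four. Raising `leaves_to_search` trades that speed for a better chance of finding neighbours that landed in a different partition.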
Just like with the NLP embeddings the first step here is to prep the data. We are deduplicating and creating unique lists of the user and movie ids, then creating mapped indexes of these values as dictionaries. This will help us go back and forth from model index to movie/user ID and vice versa. We will also use these dicts to map these values onto our starting data, then normalize the scores, and split into 90/10 train-test splits.
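The preparation steps above can be sketched on a handful of made-up ratings. The rows, the variable names and the fixed shuffle seed are assumptions for illustration; the real data has far more rows and the original code may differ in detail.

```python
import random

# (userId, movieId, rating): made-up rows in the MovieLens style.
ratings = [
    (10, 500, 4.0), (10, 731, 2.5), (42, 500, 5.0),
    (42, 900, 0.5), (77, 731, 3.0), (77, 900, 4.5),
    (10, 900, 1.0), (42, 731, 3.5), (77, 500, 2.0), (10, 120, 5.0),
]

# Unique ids, then dictionaries mapping raw ids to contiguous model indexes.
user_ids = sorted({u for u, _, _ in ratings})
movie_ids = sorted({m for _, m, _ in ratings})
user2idx = {u: i for i, u in enumerate(user_ids)}
movie2idx = {m: i for i, m in enumerate(movie_ids)}
idx2movie = {i: m for m, i in movie2idx.items()}  # for mapping back later

# Min-max normalise the ratings into the range [0, 1].
lo = min(r for _, _, r in ratings)
hi = max(r for _, _, r in ratings)
rows = [(user2idx[u], movie2idx[m], (r - lo) / (hi - lo)) for u, m, r in ratings]

# Shuffle with a fixed seed, then hold out the last 10% of rows.
random.Random(0).shuffle(rows)
cut = int(0.9 * len(rows))
train, val = rows[:cut], rows[cut:]
print(len(train), len(val))  # 9 1
```

Keeping the reverse dictionary around is what lets the model's index outputs be turned back into movie IDs when the recommendations are printed.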
The architecture of this RecommenderNet is almost the “default”; the largest difference is my choice of an embedding size of 128 instead of 64. This is primarily because we are training on 10x the data of the Keras NCF tutorial, and I reasoned that a larger embedding space might help create more granular embeddings. The important thing to note here is that the NCF model is trying to predict whether a user will like an item or not. Its output is passed through a sigmoid, so it is a value between 0 (didn’t like) and 1 (liked), which lets us treat this as a binary classification problem.
The model was then trained on the aforementioned T4 GPU for 5 epochs with a batch size of 4096; total training time was about 40 minutes. This would be considerably faster on more performant GPUs, and substantially slower on a CPU. There is room for hyperparameter tuning at this stage: for example, the learning rate and LR decay parameters, the number of epochs and the batch size could all be adjusted. There is almost certainly a better set of parameters than this, but I was happy with the result.
Just to check the output, I grabbed the CF recommendations for a random user in this dataset, and for a user with an awesome (in my opinion) set of favorite movies, the recommendations were appropriate. It’s at least sensible from a high level: similar genres are present, along with movies geared to an older audience and slightly more mature themes. The model looks like it has created reasonable user embeddings and has correctly understood something about the user activity in its training data.
I’ve copied out the entire class I created for combining all of these elements. I hope it is useful to see it in its entirety; it can also be found on the GitHub page I linked above. Here I’ll mostly summarise what it is doing. The init for the class loads all of the relevant data (the movies, the ratings, the embeddings we created earlier and a snapshot of the NCF model) as well as the tokenizer and model for LaBSE, then parameterizes and creates an instance of SCANN with our item tensors. There are then several helper functions that do a lot of the grunt work to retrieve user histories and convert the SCANN and Keras outputs from indexes to item lists. In the end it is all wrapped up in a beautified print function that outputs only the personalised search recommendations for a given user and a given search query. An example of the finished product is below.
The tables below show two things: the pure NCF recommendations for a user, which give us an idea of their preferences, and then their personalised search results for the query we had earlier, “Horror films with zombies”. So for user 42, the query is encoded and the item tensor embeddings are searched for the items that look most like it. That list of items is then passed, together with the user ID, to the NCF model as candidates, which reorders the search results based on personal preference. Lastly, items user 42 has seen before are removed from the list, and the result is printed below.
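For intuition, here is a plain-Python sketch of the forward pass and loss for a single user-movie pair. I am assuming the model follows the pattern of the Keras tutorial it is based on (dot product of the two embeddings, plus a bias per user and per movie, then a sigmoid, trained with binary cross-entropy). The numbers are invented, and real embeddings would be 128 values long.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(user_vec, movie_vec, user_bias, movie_bias):
    """Dot product of the two embeddings plus per-user and per-movie
    biases, squashed into the (0, 1) range by a sigmoid."""
    raw = sum(u * m for u, m in zip(user_vec, movie_vec)) + user_bias + movie_bias
    return sigmoid(raw)

def binary_cross_entropy(target, prediction):
    """Loss for one example; target is a rating normalised to [0, 1]."""
    return -(target * math.log(prediction)
             + (1.0 - target) * math.log(1.0 - prediction))

# Made-up values chosen so that the raw score is exactly zero.
score = predict([1.0, 2.0], [0.5, -0.25], user_bias=0.25, movie_bias=-0.25)
print(score)  # 0.5

# Loss if this user actually gave the film the top normalised rating.
print(binary_cross_entropy(1.0, score))
```

Because the targets are normalised ratings rather than hard 0/1 labels, the output is best read as a degree of predicted preference.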
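The final rerank-and-filter step can be sketched as a single function. This is a simplified stand-in for the class described above, not a copy of it: the film names, the seen set and the scores are hypothetical, and `score_fn` plays the role of the trained NCF model scoring a candidate for one user.

```python
def personalised_search(candidates, seen, score_fn, top_n):
    """Drop films the user has already rated, then order the rest by the
    collaborative filtering score, highest first."""
    unseen = [movie for movie in candidates if movie not in seen]
    return sorted(unseen, key=score_fn, reverse=True)[:top_n]

# Hypothetical stand-ins: in the article the candidates come from SCANN
# and the scores from the NCF model for this particular user.
candidates = ["Zombie Film A", "Zombie Film B", "Zombie Film C", "Zombie Film D"]
seen_by_user_42 = {"Zombie Film B"}
ncf_scores = {"Zombie Film A": 0.41, "Zombie Film B": 0.93,
              "Zombie Film C": 0.87, "Zombie Film D": 0.65}

results = personalised_search(candidates, seen_by_user_42, ncf_scores.get, top_n=3)
print(results)  # ['Zombie Film C', 'Zombie Film D', 'Zombie Film A']
```

Filtering before or after the sort gives the same list; filtering first simply avoids scoring films that will be thrown away.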
This was a fun exercise in combining different models and technologies to produce something a bit more interesting than your standard MovieLens recommender. It’s not groundbreaking by any means; each of these technologies, or an analogue of it, is used in products you see every day. Google Search, for example, combines multiple models to optimize and refine your search terms on much more than just your own user history to provide “personalised search” experiences to every user. To name just a few, it also considers synonyms of the words in your query, your location (to provide locally or regionally interesting results), when you are searching, and which result is the “freshest”. Given the success of the search giant, it is clearly doing something right. If you don’t believe me, ask Jeeves. It was a fun engineering activity to piece this together, though, and it provides a nice example of thinking about more than just item similarity in recommenders. ML has the ability to scale personalisation to billions of people, so why settle for something generic?