In this post, we will implement a simple recommender system using BERT4Rec, a BERT-based model for sequential recommendation. The model comes from the paper BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer by Fei Sun, et al. It is a BERT model with a few modifications that make it suitable for sequential recommendation. We train it on the MovieLens 1M dataset.
Before we get into the details of BERT4Rec, we need to understand what an autoregressive model is. An autoregressive model generates the next token in a sequence based on the tokens that came before it. For example, given the sequence [I, like, to, watch, movies], the model predicts the token that follows "movies" using everything it has seen so far. A sequence can contain words, numbers, item ids, or anything else.
Many language models, sequential recommenders, and time-series forecasting models are autoregressive. BERT4Rec departs from this left-to-right setup: as a bidirectional model, it can use the items on both sides of a position in the sequence.
A simplified illustration of how autoregressive models work
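To make the idea concrete, here is a minimal sketch of an autoregressive loop. It is a toy, not a neural model: it only counts which token most often follows the previous one, whereas a real model conditions on the whole prefix. The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def fit_bigram_counts(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, prefix, steps):
    """Autoregressive loop: predict one token, append it, then repeat."""
    out = list(prefix)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in the toy corpus
        # Greedy choice: take the most frequent follower.
        out.append(followers.most_common(1)[0][0])
    return out


# Toy corpus (illustrative only).
corpus = [
    ["I", "like", "to", "watch", "movies"],
    ["I", "like", "to", "read", "books"],
    ["we", "like", "to", "watch", "movies"],
]
counts = fit_bigram_counts(corpus)
generated = generate(counts, ["I"], steps=4)
print(generated)
```

Each generated token is fed back in as input for the next step, which is the defining property of autoregressive generation.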
BERT4Rec is a variant of BERT (Bidirectional Encoder Representations from Transformers) designed specifically for recommendation systems. The original BERT is a general-purpose language model that can be fine-tuned for a variety of natural language processing tasks, such as text classification and question answering.
BERT4Rec is different from the original BERT model in a few key ways:
Overall, BERT4Rec is designed to be more effective at modeling and predicting user-item interactions than the original BERT model, which makes it better suited for use in recommendation systems.
A key difference from the classic BERT is that BERT4Rec’s vocabulary consists of item ids rather than words. A user’s history of movies, for example [“Harry Potter”, “Silence of the Lambs”, …], is represented as a sequence of ids such as [4, 8, 15, 32, 100]. There are two separate embedding layers, one for items and one for positions in the sequence.
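As a rough sketch of how titles can be turned into the id sequences the model consumes, the snippet below builds an item vocabulary and encodes one user's history. Reserving id 0 for padding, left-padding to a fixed length, and the helper names are all assumptions made for this example, not details taken from the paper.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding


def build_item_vocab(titles):
    """Assign each distinct item an integer id, starting at 1."""
    vocab = {}
    for title in titles:
        if title not in vocab:
            vocab[title] = len(vocab) + 1
    return vocab


def encode(history, vocab, max_len):
    """Map titles to ids, keep the most recent max_len, left-pad with PAD_ID."""
    ids = [vocab[title] for title in history][-max_len:]
    return [PAD_ID] * (max_len - len(ids)) + ids


# Hypothetical four-movie catalog.
catalog = ["Harry Potter", "Silence of the Lambs", "Toy Story", "Heat"]
vocab = build_item_vocab(catalog)
encoded = encode(["Toy Story", "Harry Potter", "Heat"], vocab, max_len=5)
print(encoded)
```

The item embedding layer then looks up one vector per id, exactly as a word embedding layer would in a language model.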
BERT4Rec architecture. Image credit: Fei Sun, et al.
The dataset we will be using is MovieLens 1M, which contains about 1 million ratings from roughly 6,000 users on roughly 4,000 movies. It is available as a zip file from the MovieLens website. We will use the ratings.dat file from the archive. Each line of ratings.dat holds four fields separated by "::": UserID, MovieID, Rating, and Timestamp.
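A minimal sketch of reading that file, assuming each line has the form UserID::MovieID::Rating::Timestamp as documented for MovieLens 1M. The `Rating` tuple and `parse_rating_line` helper are names invented for this example, and the sample rows are inlined so the snippet runs without the download.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int  # seconds since the Unix epoch


def parse_rating_line(line):
    """Split one '::'-separated line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Illustrative rows in the ratings.dat format.
sample_lines = [
    "1::1193::5::978300760\n",
    "1::661::3::978302109\n",
    "2::1357::5::978298709\n",
]
ratings = [parse_rating_line(line) for line in sample_lines]
print(ratings[0])
```

For the real file, the same function can be applied to each line of `open("ratings.dat")`; the path is whatever location the archive was extracted to.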
The dataset needs to be preprocessed and converted into a format that is suitable for training. The preprocessing steps are as follows:
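A typical preprocessing pipeline for sequential recommendation, which may differ in detail from the one used here, groups interactions by user, orders each user's items by timestamp, and holds out the most recent item for evaluation. The sketch below shows that pattern on toy data; the function names and the `min_len` threshold are assumptions.

```python
from collections import defaultdict


def build_user_sequences(interactions):
    """interactions: iterable of (user_id, item_id, timestamp) tuples."""
    by_user = defaultdict(list)
    for user_id, item_id, timestamp in interactions:
        by_user[user_id].append((timestamp, item_id))
    # Sort each user's events chronologically and keep only the item ids.
    return {
        user: [item for _, item in sorted(events)]
        for user, events in by_user.items()
    }


def leave_last_out(sequences, min_len=2):
    """Hold out each user's most recent item; skip users with short histories."""
    train, held_out = {}, {}
    for user, seq in sequences.items():
        if len(seq) < min_len:
            continue
        train[user] = seq[:-1]
        held_out[user] = seq[-1]
    return train, held_out


# Toy interactions: (user_id, item_id, timestamp).
interactions = [
    (1, 10, 300), (1, 11, 100), (1, 12, 200),
    (2, 11, 50), (2, 13, 60),
    (3, 10, 5),
]
sequences = build_user_sequences(interactions)
train, held_out = leave_last_out(sequences)
print(train, held_out)
```

Ordering by timestamp matters because the model learns from the order of items, not just from which items co-occur.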
We can then create a DataLoader using the PyTorch library to load the data in batches.
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_data,
    batch_size=batch_size,
    num_workers=num_workers,
    shuffle=True,
)
val_loader = DataLoader(
    val_data,
    batch_size=batch_size,
    num_workers=num_workers,
    shuffle=False,
)
The model architecture is designed as in the original paper. Here are some of the specifications of the model:
len(train_data) 162541
len(val_data) 162541
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                | Type               | Params
-----------------------------------------------------------
0 | item_embeddings     | Embedding          | 7.6 M
1 | input_pos_embedding | Embedding          | 65.5 K
2 | encoder             | TransformerEncoder | 3.6 M
3 | linear_out          | Linear             | 7.6 M
4 | do                  | Dropout            | 0
-----------------------------------------------------------
18.8 M    Trainable params
0         Non-trainable params
18.8 M    Total params
75.197    Total estimated model params size (MB)
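The size estimate in the summary follows from the parameter count. As a quick check, assuming 32-bit floats (4 bytes per parameter), 18.8 M parameters come to about 75 MB. The embedding example is only one combination that matches the reported 65.5 K; the actual sequence length and hidden size are not stated here, so 512 and 128 are hypothetical values.

```python
def embedding_params(num_rows, dim):
    """An embedding table stores one vector of size dim per row."""
    return num_rows * dim


def size_mb(num_params, bytes_per_param=4):
    """Approximate size in MB, assuming float32 parameters by default."""
    return num_params * bytes_per_param / 1e6


total_params = 18.8e6  # "18.8 M Total params" from the summary
estimated_mb = round(size_mb(total_params), 1)
print(estimated_mb)

# Hypothetical shape: 512 positions x 128 dimensions gives 65,536 parameters,
# which would be displayed as 65.5 K.
pos_params = embedding_params(512, 128)
print(pos_params)
```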
To train a model we prepare a config.yaml file with all the necessary configurations including dataset, model architecture, hyperparameters and more.
The config file below trains a 4-layer BERT4Rec model on the MovieLens 100K dataset (dataset: ml-100k) for 10 epochs. RecBole, the framework used here, can send metrics and training logs to Weights & Biases. If you prefer to use TensorBoard, make sure to comment log_wandb out.
# log config
log_wandb: true

# Hyperparameter config
n_layers: 4
n_heads: 4

# dataset config
data_path: ./training_data/
# Attention: The following fields change from dataset to dataset
dataset: ml-100k
field_separator: "\t"
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: ~
TIME_FIELD: timestamp
show_progress: true
load_col:
  inter: [user_id, item_id, timestamp]
  user: [user_id]
  item: [item_id]

# model config
epochs: 10
train_batch_size: 3500
eval_batch_size: 3500
learning_rate: 0.01
user_inter_num_interval: "[0,inf)"
item_inter_num_interval: "[0,inf)"
filter_inter_by_user_or_item: false
neg_sampling:
  uniform: 1
eval_args:
  split: {'RS': [4, 1, 1]}
  group_by: None
  order: TO
  mode: uni50
metrics: ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision', 'MAP']
topk: 12
valid_metric: MAP@12
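With the config saved as config.yaml, training can be launched through RecBole's quick-start entry point. The sketch below assumes RecBole is installed and that the file sits in the working directory; `build_run_kwargs` and `train` are helper names invented here, and the RecBole import is deferred so the snippet can be loaded without the library.

```python
def build_run_kwargs(config_path="config.yaml"):
    """Collect the arguments passed to RecBole's run_recbole."""
    return {
        "model": "BERT4Rec",
        "dataset": "ml-100k",  # should match the `dataset` key in the config
        "config_file_list": [config_path],
    }


def train(config_path="config.yaml"):
    """Train and evaluate; returns RecBole's result dictionary."""
    # Imported lazily so this module loads even without RecBole installed.
    from recbole.quick_start import run_recbole

    return run_recbole(**build_run_kwargs(config_path))


kwargs = build_run_kwargs()
print(kwargs)
# result = train()  # uncomment to start training (requires RecBole and the data)
```

The model name is passed as an argument rather than read from the YAML above, which is why the config itself never mentions BERT4Rec.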
RecBole provides a variety of metrics for evaluating the model. The ones to compute are listed under metrics in the config.yaml file shown earlier. They are calculated for each epoch, and the best model is saved according to the valid_metric defined in the config file.
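To give a feel for what these numbers measure, here is a small self-contained sketch of three of them for a single user's ranked list. These are the textbook definitions with binary relevance, written for illustration; they are not RecBole's implementation, and the function names and example data are assumptions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the hits, normalized by the best possible ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = [42, 7, 13, 99, 5]  # hypothetical model ranking, best first
relevant = {13, 5}           # hypothetical held-out items for this user
print(recall_at_k(ranked, relevant, 3))
print(mrr_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 5))
```

Reported scores are averages of such per-user values, and the @12 in MAP@12 corresponds to the topk setting in the config.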
After running many different experiments, we can bring all the collected metrics together in the Weights & Biases dashboard to find the best configurations.
Weights & Biases dashboard results
The Weights & Biases dashboard lets us quickly compare the results of different models and configurations in a human-readable format. We can also compare the results of different models on different datasets.
To use the model for inference, we load the saved model and pass it the user's item sequence. The model returns a score for each candidate item rather than an explicit rating, and the highest-scoring items become the recommendations.
The following steps are generally followed to use the model in inference:
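As a rough sketch of the final stage of inference, the snippet below turns per-item scores into a top-k recommendation list while skipping items the user has already interacted with. The scores are made up; in practice they would come from the model's output layer. The function name and the decision to filter seen items are assumptions for this example.

```python
import heapq


def recommend_top_k(scores, seen_items, k):
    """scores: dict mapping item id to the model's score for one user."""
    candidates = (
        (score, item) for item, score in scores.items() if item not in seen_items
    )
    # nlargest returns the k best (score, item) pairs, highest score first.
    best = heapq.nlargest(k, candidates)
    return [item for _, item in best]


# Hypothetical scores for one user over five items.
scores = {4: 0.10, 8: 0.75, 15: 0.40, 32: 0.90, 100: 0.55}
seen = {32}  # already in the user's history, so not recommended again
top = recommend_top_k(scores, seen, k=3)
print(top)
```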
In this article we have seen how to use the RecBole framework to train a BERT4Rec model. We have also seen how to use the Weights & Biases dashboard to compare the results of different models and configurations. RecBole is a great tool for training and evaluating recommender system models. It provides a lot of flexibility and makes it easy to experiment with different models and configurations.
BERT4Rec is designed for sequential recommendation, so it needs an interaction history to work with. It is not suitable for cold-start problems, where the user’s history is not available. New items pose a similar problem: when items are added to the catalog, the model needs to be retrained to include them, which can be computationally expensive.
Furthermore, in our experiments, other sequential models such as GRU4Rec outperformed BERT4Rec in terms of both overall performance and training time.
Evaluation of many different recommender systems models using recbole on the ml-100k dataset:
Nevertheless, BERT4Rec is one of the first approaches to use the Transformer architecture for sequential recommendation. It is a great starting point for anyone who wants to learn more about the Transformer architecture and how it can be applied to sequential recommendation.