Recommender System
November 17, 2017
For details and code, please visit my Jupyter Notebook.
This is my capstone project for the Data Science Immersive program at GA. The program is a 12-week full-time course with five projects in total. It covers a very wide range of Data Science topics, but unfortunately for me, Recommender Systems were only covered for about half a day, in week 11! Therefore, the entire project was pretty much a self-learning journey, from scratch.
Objective
The aim of this project is to dive into the world of Recommender Systems and explore various methods for Collaborative Filtering. The focus is on understanding the math and algorithms behind them, then applying them to generate recommendations. Although I chose book reviews, it does not actually matter what the rated item is: Collaborative Filtering works by finding similar rating patterns among users/items and predicting ratings from the existing ratings themselves, so it is considered item-agnostic.
Data
Data: Amazon Product Review Data, 1996 - 2014
File: gzip-compressed file containing JSON records
The dataset consists of over 142 million reviews, of which 22 million are book reviews.
However, I will only be using about 100 thousand book reviews due to time/memory constraints,
by setting a minimum of 40 reviews for each book (item) and each user.
Reference (source of data):
R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016
Clean and Prepare Data
The gzipped data is loaded with the following functions. It takes a while to process, so it is
recommended to pickle the dataframes after loading.
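The notebook's loading functions aren't reproduced here, but a minimal sketch of how the gzipped, line-delimited JSON could be parsed into a DataFrame might look like this (the file name is an assumption):

```python
import gzip
import json
import pandas as pd

def parse(path):
    """Yield one review dict per line of the gzipped JSON file."""
    with gzip.open(path, 'rt') as g:
        for line in g:
            yield json.loads(line)

def get_df(path):
    """Collect all parsed reviews into a pandas DataFrame."""
    return pd.DataFrame(list(parse(path)))

df = get_df('reviews_Books.json.gz')   # file name is an assumption
df.to_pickle('book_reviews.pkl')       # pickle so the slow parse is only paid once
```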
Initially, I attempted the computation with 8 million book reviews, which took a serious toll on my
system; it couldn't even finish computing! Thereafter, I reduced the size bit by bit, but
unfortunately the computation time was still high. Upon further consideration, since my purpose
for the project was only to gain mastery of Collaborative Filtering methods, I shrank the data
to about 100 thousand reviews by setting a minimum of 40 reviews for each user/item.
From the downsized dataframe, we have 1490 unique users, and 1186 unique items/books. Some
exploratory work was done to view users/items with the highest number of reviews. I also
created a dataframe to reference books with their original ID, name, and a new sequential
ID obtained by using LabelEncoder. This dataframe is mostly used for intuitively understanding
the recommendation results later.
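A minimal sketch of how that reference table could be built is below. The column names ('reviewerID', 'asin', 'title') are assumptions based on the Amazon review schema, and the book title would come from the separate product-metadata file:

```python
from sklearn.preprocessing import LabelEncoder

# New sequential IDs for users and items
user_enc, item_enc = LabelEncoder(), LabelEncoder()
df['user_seq'] = user_enc.fit_transform(df['reviewerID'])
df['item_seq'] = item_enc.fit_transform(df['asin'])

# Lookup table: original item ID, book title (merged in from the metadata), new sequential ID
book_lookup = df[['asin', 'title', 'item_seq']].drop_duplicates()
```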
Before proceeding to split the data into train and test sets, I calculated the sparsity to
get a better understanding of what we are dealing with. The sparsity of the 100k reviews
data set is 93.85%.
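As a quick sketch, the sparsity can be computed directly from the counts (assuming the sequential-ID columns created above and an 'overall' rating column):

```python
# Sparsity: the share of the user-item matrix that has no rating.
n_users = df['user_seq'].nunique()
n_items = df['item_seq'].nunique()
n_ratings = len(df)

sparsity = 1 - n_ratings / (n_users * n_items)
print(f'{n_users} users x {n_items} items -> sparsity: {sparsity:.2%}')
```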
Train-Test Split
In most traditional Machine Learning applications, train-test splits can be performed by simply
separating a random selection of rows in the data. However, in Collaborative Filtering, that
would not work, because we need every user to appear in both the train and test sets. If we randomly
remove rows, some users could disappear from one of the sets entirely.
Therefore, what we need to do is mask/remove some of every user's ratings to form the test set.
There are a few ways to do this. In this case, we split each user's ratings into two disjoint parts:
if a user has rated 40 items, the train set keeps 25 of those ratings and the test set gets the
other 15, with no overlap between the two.
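A minimal sketch of this per-user masking is below, assuming the ratings have first been pivoted into a dense (users x items) numpy matrix with zeros marking missing ratings; the 37.5% test fraction mirrors the 25/15 split mentioned above:

```python
import numpy as np

# Dense ratings matrix built from the sequential IDs (0 = not rated)
ratings = np.zeros((n_users, n_items))
for row in df.itertuples():
    ratings[row.user_seq, row.item_seq] = row.overall

def train_test_split_ratings(ratings, test_frac=0.375, seed=0):
    """Mask a fraction of each user's ratings: masked entries form the test set,
    the rest stay in the train set, and the two never overlap."""
    rng = np.random.RandomState(seed)
    train, test = ratings.copy(), np.zeros_like(ratings)
    for u in range(ratings.shape[0]):
        rated = ratings[u].nonzero()[0]
        test_idx = rng.choice(rated, size=int(len(rated) * test_frac), replace=False)
        train[u, test_idx] = 0.0
        test[u, test_idx] = ratings[u, test_idx]
    return train, test

train, test = train_test_split_ratings(ratings)
assert np.all(train * test == 0)   # disjoint by construction
```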
Matrix Factorization using ALS and SGD
For matrix factorization, we have two sets of latent vectors to solve for: the user vectors and the item vectors. In Alternating Least Squares (ALS), we hold one set constant and solve for the other in closed form. Then we swap: the newly solved set is held constant while we solve for the set we previously fixed. That swapping is the "alternating" part, and it is repeated until convergence.
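As a rough illustration rather than the notebook's exact code, one ALS half-step can be written as a batched ridge-regression solve. For simplicity this sketch treats unobserved entries as zeros; a full implementation would solve each row using only its observed ratings. The values of k and lam are assumptions:

```python
import numpy as np

def als_step(R, fixed_vecs, lam):
    """Solve one side of the factorization in closed form while the other is fixed.
    R has the rows being solved for on its rows; pass R.T when solving for items."""
    k = fixed_vecs.shape[1]
    A = fixed_vecs.T @ fixed_vecs + lam * np.eye(k)   # (k, k)
    B = fixed_vecs.T @ R.T                            # (k, n_rows)
    return np.linalg.solve(A, B).T                    # (n_rows, k)

n_users, n_items = train.shape
k, lam = 50, 0.01                                     # latent factors / regularization (assumed)
user_vecs = np.random.normal(scale=1.0 / k, size=(n_users, k))
item_vecs = np.random.normal(scale=1.0 / k, size=(n_items, k))

for _ in range(20):                                   # alternate until convergence
    user_vecs = als_step(train, item_vecs, lam)       # items fixed, solve users
    item_vecs = als_step(train.T, user_vecs, lam)     # users fixed, solve items
```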
That was ALS. Now, let's move on to Stochastic Gradient Descent (SGD). In SGD, we also take
the derivative of the loss function, but now with respect to each parameter in the model, and
the parameters are updated one sample (one observed rating) at a time.
In our SGD model, we will also include biases for each user and item, as well as a global bias. For example,
some users might have a high average rating, or certain items might have an average rating below the
global average. We attempt to account for this by adding biases.
Now that we have our equations, let's proceed to the computation.
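As a hedged sketch (not the notebook's exact code), one SGD pass over the observed ratings, with biases included, might look like this; the names P, Q, b_u, b_i, and mu are my own:

```python
import numpy as np

n_users, n_items = train.shape
k = 50
P = np.random.normal(scale=1.0 / k, size=(n_users, k))   # user latent factors
Q = np.random.normal(scale=1.0 / k, size=(n_items, k))   # item latent factors
b_u = np.zeros(n_users)                                   # user biases
b_i = np.zeros(n_items)                                   # item biases
mu = train[train.nonzero()].mean()                        # global mean rating

def sgd_epoch(train, P, Q, b_u, b_i, mu, lr=0.001, reg=0.001):
    """One SGD pass: for every observed rating, nudge the corresponding factors
    and biases along the negative gradient of the regularized squared error."""
    users, items = train.nonzero()
    for u, i in zip(users, items):
        pred = mu + b_u[u] + b_i[i] + P[u] @ Q[i]
        e = train[u, i] - pred
        b_u[u] += lr * (e - reg * b_u[u])
        b_i[i] += lr * (e - reg * b_i[i])
        p_u = P[u].copy()                       # keep the pre-update user vector
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * p_u - reg * Q[i])
    return P, Q, b_u, b_i
```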
We'll run this with 50 latent factors, a learning rate of 0.001, and no regularization. Let's see how this performs in the plot below.
Now that the model is working well, let's take it a little further by searching for the best hyperparameters. In our case this amounts to a manual grid search over the following (a sketch follows the list):
- learning rate
- latent factors
- regularization (keeping it same for user and item)
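Sketched below is what such a manual search could look like. The candidate values are illustrative, and `ExplicitMF` (with `fit`/`rmse` methods) is a stand-in for the notebook's actual matrix factorization class:

```python
import itertools

learning_rates  = [0.01, 0.005, 0.001]
latent_factors  = [10, 20, 50]
regularizations = [0.0, 0.001, 0.01]

best_params, best_rmse = None, float('inf')
for lr, k, reg in itertools.product(learning_rates, latent_factors, regularizations):
    model = ExplicitMF(train, n_factors=k, learning_rate=lr, reg=reg)  # hypothetical class
    model.fit(n_iter=200)
    rmse = model.rmse(test)                                            # hypothetical method
    if rmse < best_rmse:
        best_params, best_rmse = (lr, k, reg), rmse

print('best hyperparameters:', best_params, 'RMSE:', best_rmse)
```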
Ultimately, this search only improved our result by 0.0003.
Best hyperparameters:
- learning rate: 0.001
- latent factors: 50
- regularization: 0.001
- iterations: 200
It’s time to run our model to generate predictions and recommendations.
The first impression I had when I ran it was that I knew nothing about the recommended books.
I had no intuition at all, and had to look them up. Interestingly, they do belong to a similar
category. Most of them were Literature/Fiction and Romance. Let’s look at the recommendations
for two users; user1 has the most ratings, and user2, the least.
User1
The books shown here for User1 mostly fall under Romance/Women’s books. We can also compare our model’s
predicted scores with the user’s actual ratings in the second table.
User2
Coincidentally, the books that user2 has rated belong to Romance as well. Our model’s recommendations
did well here too, having mostly books in the Romance category.
Conclusion
I performed Matrix Factorization optimized via Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD).
The RMSE results were great: an RMSE of 0.648 means that, on average, our predictions are off from
the actual ratings by about 0.648. The model's predictions were intuitive as well.
However, I believe that the 'Romance' recommendations were probably not a coincidence. If we
recall from above, the ratings dataset was shrunk from 8 million reviews to a mere 100k because
of limitations in time and resources. Therefore, in our data, every user and item would have at
least 40 reviews each. This will likely 'converge' our data to a few major categories. Regardless,
I remain confident in the model's ability to predict and recommend!
Moving On
Immediate next steps will probably be to swap in the MovieLens 100k benchmark dataset to see how
the model performs. Thereafter, I will have to explore AWS and Google Cloud so that I can run
and evaluate the full 8-million-review dataset. This is too
exciting to let go!
Reflection
This is by no means the end. I have learned a lot about python classes, matrix factorization,
optimization, documentation, and more, through the implementation of this method. The process
was certainly a tribulation, having to do everything from scratch and without support because
the course doesn't cover this much, and my instructor was not very acquainted with RecSys. In
addition, I came to realize that RecSys is in a world almost entirely different from traditional
machine learning. It feels like a race to create better algorithms all the time. That said, my
appetite for RecSys remains, and having read countless research papers, articles, and updates on the field,
it only continues to grow, because I believe in its power to effect real change. Although I am
aware that I have just barely scraped the surface and there is still a long way to go, I am
ready to take it on.
TensorFlow Matrix Factorization and Surprise Package
Restricted Boltzmann Machine with TensorFlow
Restricted Boltzmann Machines (RBMs) are shallow neural networks consisting of only two layers.
The 'restricted' part means that nodes within the same layer are not connected to each other.
RBMs were invented in the 1980s, but they only rose to prominence after Geoffrey Hinton and
collaborators devised fast learning algorithms for them in the mid-2000s. RBMs also featured
heavily in the Netflix Prize, an open competition to find the best collaborative filtering
algorithm for predicting ratings, and have been described as the seed that kickstarted Deep Learning.
Some common applications for RBM include dimensionality reduction, feature extraction, feature
learning, and collaborative filtering.
Let's take a closer look at RBM. RBMs learn patterns and features in data by reconstructing
the input. For instance, let's say the input (also known as the visible layer) is an image. The
pixels are processed in the input, and the learning process consists of several forward and
backward passes between the visible and hidden layer, where the RBM tries to reconstruct the
data. The weights of the neural net are adjusted in such a way that the RBM can find the
relationships among input features that are relevant. After training, the RBM will be able to
reconstruct the input, which means that it tries to create an image similar to the input.
As the image above shows, each hidden node receives input from all nodes in the visible layer
(4 in this case), multiplied by their respective weights. The result is then added to a bias (which
ensures at least some activation occurs) and passed through the activation function, producing one
output for each hidden node.
During the reconstruction phase, the activations of the hidden layer become the input in a backward pass. They are multiplied by the same weights, and are then added to another bias (the visible layer bias) and the output is considered a reconstruction. Therefore, the reconstruction error is the error between the reconstruction (r in the image) and the original input. The error is backpropagated against the weights in an iterative learning process until a minimum error is reached.
Formatting data for RBM input
A dataframe cannot be fed directly into the RBM model in TensorFlow, so we have to turn it into a list
of lists. The outer list contains 1490 inner lists, one per user. Each user’s list holds 1186 values,
one per item (book): the user’s rating, or zero if the user did not rate that item.
First, we use pandas’ groupby to group the dataframe by users. Then, we iterate and append rating
values to a temporary list, which then gets appended to the main list (trX).
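A minimal sketch of that conversion, assuming the sequential-ID columns from earlier and scaling the ratings to [0, 1] for the sigmoid units used later:

```python
# Build the list of lists: one inner list of item ratings per user.
n_items = df['item_seq'].nunique()

trX = []
for _, group in df.groupby('user_seq'):
    user_ratings = [0.0] * n_items                        # 0 = not rated
    for row in group.itertuples():
        user_ratings[int(row.item_seq)] = row.overall / 5.0   # scale 1-5 ratings to [0, 1]
    trX.append(user_ratings)
```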
We begin by creating the layers, both visible and hidden. In the code, the biases are created first. The number of nodes in the hidden layer can be chosen freely, whereas the visible layer has a number of nodes equal to the number of unique items in our data. This is because each input example is one user's list of ratings over all items.
Next, we create the weight matrix, which is multiplied with the visible layer and fed into the hidden layer. That is why its shape has rows = number of visible nodes and columns = number of hidden nodes.
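The notebook's code cells aren't shown here; a minimal TensorFlow 1.x sketch of these definitions (the variable names are my own, and the hidden-layer size of 20 is an assumption) might look like:

```python
import tensorflow as tf   # TensorFlow 1.x style, matching the 2017-era notebook

n_visible = 1186   # one visible node per unique book
n_hidden = 20      # free to choose; 20 is an assumption

# Biases for the visible and hidden layers (placeholders, so their current
# values can be fed back in every epoch)
vb = tf.placeholder(tf.float32, [n_visible])
hb = tf.placeholder(tf.float32, [n_hidden])

# Weight matrix: rows = visible nodes, columns = hidden nodes
W = tf.placeholder(tf.float32, [n_visible, n_hidden])
```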
The second box is where we create the input and reconstruction layers. We use a sigmoid activation function because we have formatted the input to be continuous between 0 and 1. Next, we apply tf.sign to the hidden probabilities and pass the result through a ReLU, which effectively samples the binary hidden states. The reconstruction phase is simply the reverse of the above.
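Continuing the same names, a hedged sketch of the forward and reconstruction passes:

```python
# Forward pass: a batch of users' ratings -> hidden probabilities -> sampled hidden states
v0 = tf.placeholder(tf.float32, [None, n_visible])
h0_prob = tf.nn.sigmoid(tf.matmul(v0, W) + hb)
h0 = tf.nn.relu(tf.sign(h0_prob - tf.random_uniform(tf.shape(h0_prob))))

# Reconstruction: hidden states -> visible layer, using the same weights transposed
v1_prob = tf.nn.sigmoid(tf.matmul(h0, tf.transpose(W)) + vb)
v1 = tf.nn.relu(tf.sign(v1_prob - tf.random_uniform(tf.shape(v1_prob))))
h1 = tf.nn.sigmoid(tf.matmul(v1, W) + hb)
```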
First, we set up our Contrastive Divergence equations to update the weights, and set a learning rate. Next, the updating of the weights and biases is carried out.
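A sketch of the CD-1 update rules, with the learning rate value as an assumption:

```python
alpha = 0.1   # learning rate (an assumption)

# Contrastive Divergence (CD-1): difference between the "positive" and "negative"
# associations, averaged over the batch
w_pos_grad = tf.matmul(tf.transpose(v0), h0)
w_neg_grad = tf.matmul(tf.transpose(v1), h1)
CD = (w_pos_grad - w_neg_grad) / tf.to_float(tf.shape(v0)[0])

update_w  = W + alpha * CD
update_vb = vb + alpha * tf.reduce_mean(v0 - v1, 0)
update_hb = hb + alpha * tf.reduce_mean(h0 - h1, 0)
```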
The error function is then defined: it is simply the mean squared error of the output minus the input. This gives us the reconstruction error, which tells us how different the output is from the input. We aim to reduce this MSE, since we want the machine to learn to reconstruct the input and come up with weights and biases that allow for this. Lastly, we create the initial values for the placeholders. Finally, we run the model with the code below!
The first and last calls to time.time() show us how long the entire training took. Each epoch refers to one full pass over the entire dataset; in this example, we pass over the dataset 10 times. The ‘update’ equations are run to update the current weights and biases, which are then assigned as the ‘previous’ values and fed into the next epoch. We only specify the ‘update’ ops to run, because everything linked upstream of them will be run as well. These iterations keep updating the weights and biases while reducing the reconstruction error.
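Putting it together, a sketch of the error node, the initial "previous" values, and the training loop described above:

```python
import time
import numpy as np

# Reconstruction error: mean squared difference between the input and its reconstruction
err = tf.reduce_mean(tf.square(v0 - v1))

# Starting ("previous") values for the weights and biases, fed in via the placeholders
prv_w  = np.zeros([n_visible, n_hidden], np.float32)
prv_vb = np.zeros([n_visible], np.float32)
prv_hb = np.zeros([n_hidden], np.float32)

epochs, errors = 10, []
start = time.time()
with tf.Session() as sess:
    for epoch in range(epochs):
        # Running the update ops also runs everything upstream of them
        cur_w, cur_vb, cur_hb = sess.run(
            [update_w, update_vb, update_hb],
            feed_dict={v0: trX, W: prv_w, vb: prv_vb, hb: prv_hb})
        prv_w, prv_vb, prv_hb = cur_w, cur_vb, cur_hb     # current becomes previous
        errors.append(sess.run(err, feed_dict={v0: trX, W: prv_w, vb: prv_vb, hb: prv_hb}))
        print('epoch %d, reconstruction error %.4f' % (epoch, errors[-1]))
print('training took %.1f seconds' % (time.time() - start))
```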
The figure only shows the reconstruction error for the first 25 epochs. The error decreases rather quickly and the improvement slows once it reaches about 0.09. Thereafter, training ran for the full 1000 epochs, which further reduced the error to about 0.081, where it seemed to stagnate (view in the Jupyter Notebook).
Now that our training has reached stability, we check its performance by generating recommendations. The first table above shows the machine’s recommendations, and the second table shows the user’s rated books. For both tables, we can see the predicted ratings of our machine from the column “Recommendation Score”.
Please feel free to look up these books to find out what they are! I have done so and discovered that the user rated highly books about Romance. The books that our model recommended belonged to Romance as well, which also coincides with our matrix factorisation model. That’s good to know.
Evaluation
Thus far, we have run three models, which gave us fairly similar results:
- Matrix Factorisation optimised by ALS/SGD
- Multiple Surprise package models
- Restricted Boltzmann Machine
Now, we can say that we are confident in building an accurate Recommender System based on explicit feedback from users. This is only one part of an entire Recommender System, though. What we have done is Collaborative Filtering; a complete RecSys would include much more, such as Content-Based Filtering as well. Content-Based Filtering recommends items based on item features, and there are several ways to do that; Wide and Deep neural networks are a great way to model this.
In addition, current industry practice values implicit feedback more, though it is almost impossible to get a public dataset of it. Implicit feedback consists of signals such as whether a user clicked on something, whether a user watched something, and how long they watched it. Moreover, more systems are moving toward a 'binary' rating (thumbs up or thumbs down) instead of a Likert scale (1 to 5). People are less inclined to rate on a Likert scale because of the time and effort required to decide; a thumbs up or thumbs down is much easier and faster. Of course, companies can afford to do this because they hold implicit data about the user, which is more valuable.
Now doesn't it feel like we have barely scratched the surface? While there is indeed a lot more to dive into, we have built a solid foundation in understanding how matrix factorisation works and how to code a neural network like a Restricted Boltzmann Machine. We also established a deeper understanding of the math behind these algorithms from the countless research papers and journal articles we read. These tools are definitely transferable: as long as we can define the math and code our equations into Python classes or neural networks (as we did in this project), we will be able to solve our next Recommender System problem.
Future Work
I'm excited to try Content-Based methods, including Approximate Nearest Neighbours and, especially, Wide and Deep networks.
Moving on, I would like to try Deep Matrix Factorisation using TensorFlow and/or Apache MXNet. Recent work has shown that deep learning lets us learn more sophisticated models that can take in more complex inputs (compared to traditional matrix factorisation), potentially yielding additional insight into the task. As Andrew Ng put it, A.I. is the new electricity, so more of my work will go into exploring RecSys through deep learning. On a different note, graph solutions like Neo4j are worth exploring too.