So far, we’ve established that data is really important, but the black box remains. To understand how exactly the recommender makes sense of data, learns from it, and predicts something new, we said we need to understand a bit of data science and AI. We will narrow that down even further to Machine Learning.
Machine Learning is the inner magic of recommenders. It uses a whole lot of math to make sense of numbers and information. Now, fear not, we will not go in depth into the math behind it (that’s usually a graduate school major); luckily, there are applications of Machine Learning that do not require any math, just a bit of coding savviness. The more advanced recommenders use a mix of deep learning and machine learning for accurate recommendations, but for now, we will focus on Machine Learning.
Among many methods, a super popular technique for building recommenders is cosine similarity. I promised we would stay as far away from math as we can, but cosine similarity needs just a little bit of mathematical understanding, as it’s a way to measure how similar two vectors are in a multi-dimensional space.
Let's simplify.
In our case, we have a list of movies. Each movie can be represented as a vector in a multi-dimensional space, where each dimension represents a different feature of the movie, such as its genre or its ratings.
Now, we want a recommender system to suggest movies that a user might like based on their past interactions, i.e., how highly they rated a movie. The system uses cosine similarity to measure how similar their preferences are to the movies being recommended.
To compute cosine similarity, the system calculates the cosine of the angle between two movie vectors. The resulting value ranges from -1 to 1. A value of 1 means the two vectors point in exactly the same direction, 0 means they are orthogonal, or unrelated, and -1 means they are diametrically opposed.
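To make this concrete, here is a minimal sketch with NumPy. The two vectors and their feature scores are made up purely for illustration; the formula is simply the dot product divided by the product of the vector lengths:

```python
import numpy as np

# Two hypothetical movie vectors; each number is a made-up feature score
movie_a = np.array([5.0, 3.0, 0.0])
movie_b = np.array([4.0, 2.0, 1.0])

# Cosine similarity: dot product divided by the product of the vector lengths
similarity = np.dot(movie_a, movie_b) / (np.linalg.norm(movie_a) * np.linalg.norm(movie_b))
print(round(similarity, 3))  # close to 1: these two movies point in a similar direction
```

Notice the score depends only on the direction of the vectors, not their length, which is exactly why it works well for comparing preference patterns.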
So, if the system finds movies that have a high cosine similarity to the user’s past interactions, it can recommend those movies, as they are similar to what the user has enjoyed before. Similarly, cosine similarity can also be used to find users with similar preferences and recommend items they have enjoyed in the past.
Back to our Google Colab Notebook:
We will apply cosine similarity to the movie preferences, but first, we need to restructure our data to better represent the interactions between each user and each movie. For that, we will build a matrix, also known as a pivot table, where each row corresponds to a user and each column corresponds to an item (in this case, a movie).
In this matrix, each cell represents a user's rating for a particular movie, which is an example of explicit feedback from users to items. This type of matrix is commonly used in recommendation systems to model user-item interactions and generate personalized recommendations.
So, how do we do that?
It’s really simple with pandas. We will use its pivot_table() function.
matrix = data.pivot_table(index='userId', columns='title', values='rating')
As usual, we take a look at the data and observe any changes.
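If you’d like to see the reshaping on a small scale first, here is a toy example with made-up users, titles, and ratings in the same long format as our dataset:

```python
import pandas as pd

# Made-up ratings, one row per (user, movie, rating) interaction
data = pd.DataFrame({
    'userId': [1, 1, 2, 2],
    'title': ['Heat', 'Toy Story', 'Jumanji', 'Toy Story'],
    'rating': [3.5, 4.0, 2.0, 5.0],
})

# Rows become users, columns become movies, cells hold the ratings
matrix = data.pivot_table(index='userId', columns='title', values='rating')
print(matrix)
```

User 1 never rated Jumanji, so that cell comes out as NaN, which is exactly the kind of gap we need to handle before computing similarities.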
You might be concerned to see so many NaN values. These are missing values; specifically, movies that a given user hasn’t rated. If we tried to compute cosine similarity with all these NaN values, we would face difficulties getting proper scores.
So, here’s a workaround: we will replace all missing values with 2.5, which is the closest value to represent neutrality in a rating (2.5/5 is the middle).
matrix = matrix.fillna(2.5)
And once that’s done, let’s calculate similarity. For that, we will need to import a new function, namely cosine_similarity from sklearn.
from sklearn.metrics.pairwise import cosine_similarity
Now let’s use the cosine_similarity function to get a matrix of similarities between movies. Note that we transpose our matrix (matrix.T) so that each row represents a movie rather than a user, since we want to compare movies to movies.
similarity = cosine_similarity(matrix.T)
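To see what this step produces, here is a self-contained sketch using a tiny made-up user-item matrix (already filled with the neutral 2.5). Wrapping the result in a DataFrame keeps the movie titles attached, so we can read off a movie’s most similar neighbors:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Tiny made-up user-item matrix: rows are userIds, columns are movies
matrix = pd.DataFrame(
    {'Heat':      [4.0, 2.5, 1.0],
     'Jumanji':   [2.5, 2.0, 1.5],
     'Toy Story': [4.5, 2.5, 1.0]},
    index=[1, 2, 3])

# Transposing makes each row a movie, so we compare movies to movies
similarity = cosine_similarity(matrix.T)

# Attach the titles back so the scores are readable
similarity_df = pd.DataFrame(similarity, index=matrix.columns, columns=matrix.columns)
print(similarity_df['Toy Story'].sort_values(ascending=False))
```

Every movie has a similarity of 1 with itself; right below that sit its closest neighbors, which is the raw material our recommender will work with.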
Good job! We’re almost there! The last bit involves building the algorithm behind recommending.
Let's Continue