
Introduction to recommender systems


At their core, recommender systems predict the preferences or ratings that a user would give to an item (such as a movie, book, or product) based on their past behavior and preferences, as well as the preferences of other users with similar tastes. Recommender systems are widespread in modern products: music apps, marketplaces, content aggregators, and so on.

This topic covers the basics of recommender systems, the main approaches, and the evaluation metrics for the task.

Collaborative filtering

Collaborative filtering makes recommendations based on user similarity (think of it as analyzing user behavior). The assumption is that if users have agreed on the quality of certain items in the past, they are likely to agree on the quality of other items in the future.

Let’s say user A is described by a list of features; for simplicity, suppose these are album ratings. Some users rate a lot of albums, while others rate only a few, so the data can be represented as a user-item matrix with missing values.

Collaborative filtering finds similar users based on their existing ratings; from those similar users, one can predict the missing rating for a specific album and thus generate an album recommendation for a user.

This is done through matrix factorization, which decomposes the user-item interaction matrix into two lower-rank matrices: one representing the user latent factors and the other representing the item latent factors. To be more specific, suppose there is a user-item rating matrix R, where an entry r_{ij} represents the rating given by user i to item j. The goal is to find two matrices U and V such that their product approximates the original ratings matrix:

R \approx U \cdot V^T

where

  • UU is an m \times k matrix, where m is the number of users and k is the number of latent factors; each row in U represents the k-dimensional latent factor vector for a user.

  • VV is an n \times k matrix, where n is the number of items; each row in V represents the k-dimensional latent factor vector for an item.

The latent factors in U and V capture the underlying preferences and characteristics of users and items, respectively.

The dot product of the corresponding user and item latent factor vectors, u_i \cdot v_j^T, gives an approximation of the rating r_{ij}. To find the optimal values of U and V, the following optimization problem is typically solved using techniques like Alternating Least Squares (ALS) or stochastic gradient descent (SGD):

\text{minimize}_{U,V} \; \| R - U V^T \|^2 + \lambda \left( \|U\|^2 + \|V\|^2 \right)

The first term in the objective function ensures that the product of U and V^T stays close to the original rating matrix R, while the second term, weighted by the regularization parameter \lambda, is a regularization term that penalizes large factor values to prevent overfitting. Once the optimal U and V matrices are learned, they can be used to make predictions for unseen user-item pairs. For a user i and an item j whose rating is not observed, the predicted rating r_{ij} can be computed as the dot product of the corresponding latent factor vectors:

r_{ij} \approx u_i \cdot v_j^T
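
To make this more concrete, here is a minimal sketch of matrix factorization trained with SGD on a toy rating matrix, using only NumPy. The ratings, the latent dimension k, the learning rate, and the regularization weight are all made up for illustration, and zeros mark missing ratings.

```python
import numpy as np

# Toy user-item rating matrix; 0 marks a missing rating (illustrative data).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

m, n = R.shape          # m users, n items
k = 2                   # number of latent factors
lr, lam = 0.01, 0.02    # learning rate and regularization weight (illustrative values)
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(m, k))  # user latent factors
V = rng.normal(scale=0.1, size=(n, k))  # item latent factors

observed = [(i, j) for i in range(m) for j in range(n) if R[i, j] > 0]

for epoch in range(2000):
    for i, j in observed:
        u_i = U[i].copy()
        err = R[i, j] - u_i @ V[j]                  # prediction error for one observed rating
        U[i] += lr * (err * V[j] - lam * u_i)       # gradient step on the user vector
        V[j] += lr * (err * u_i - lam * V[j])       # gradient step on the item vector

R_hat = U @ V.T
print(np.round(R_hat, 2))  # approximated ratings, including the previously missing entries
```

The filled-in entries of the reconstructed matrix are the predicted ratings for user-item pairs that had no observed rating.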

With this collaborative filtering setup, you might start to wonder: if existing interaction data is required to make a recommendation, what should be done when a new album is added or a new user signs up for the service? The answer is that we can't generate a recommendation, since there is no information to work from. This is known as the cold start problem, and it is typically tackled with the second type of recommender system: content-based recommendations.

Content-based recommendations

Content-based filtering is based on embeddings and the similarity between vectors in a feature space (typically 128- or 256-dimensional, and rarely more than 1024 dimensions). Embeddings provide a way to encode the important information about users and items.

First, you represent the items in the feature space, and then you represent the users in the same feature space. Since the features live in a latent space, they can be hard to interpret, but to make things more concrete, let's say a user has a vector that contains scores for the genres they listen to and related metadata (the Rate Your Music site is an example of such metadata). For instance, 'ambient' might be assigned a score of 1.76 and 'drone' a score of 0.95, based on the user's history of listening to those genres or explicitly rating items in them. At the same time, the item (an album in this case) also has a score at each index: 0.45 for 'ambient', 1.23 for 'drone', and so on.

Then, you choose a similarity metric (such as cosine similarity, Euclidean distance, or the dot product); vectors that are closer under this metric are considered more similar. Taking the dot product as an example, let a user's embedding be x and an item's embedding be y; their dot product is

x \cdot y = \sum_{i=1}^{d} x_i y_i

A higher value indicates that the user is more likely to enjoy the item, because it means the two vectors overlap in the latent space. When predictions are made, a hyperparameter k determines how many predictions (or recommendations) should be returned. Thus, in content-based recommendation, one might fetch the top ten (k = 10) items by this score.
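
As a rough sketch of this idea, the snippet below scores a handful of albums for one user by the dot product of their genre-score vectors and keeps the top-k. The genre dimensions, scores, album names, and the choice of k are invented for illustration.

```python
import numpy as np

# Hypothetical genre dimensions: [ambient, drone, techno, jazz]
user = np.array([1.76, 0.95, 0.10, 0.00])   # user embedding (illustrative scores)

albums = {
    "Album A": np.array([0.45, 1.23, 0.00, 0.10]),
    "Album B": np.array([0.05, 0.00, 1.50, 0.20]),
    "Album C": np.array([1.10, 0.80, 0.00, 0.00]),
    "Album D": np.array([0.00, 0.10, 0.30, 1.40]),
}

k = 3  # how many recommendations to return

# Score each album by its dot product with the user embedding ...
scores = {name: float(user @ vec) for name, vec in albums.items()}

# ... and keep the k highest-scoring items.
top_k = sorted(scores, key=scores.get, reverse=True)[:k]
print(top_k)
```

In practice, the embeddings would be learned rather than hand-assigned, and cosine similarity is often used instead of the raw dot product when vector magnitudes vary widely.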

Content-based recommenders also rely on metadata about the user or the item, which mitigates the cold start problem. Most systems in practice are hybrid recommenders that combine collaborative filtering and content-based recommendations.

Evaluation metrics

Recommender system evaluation metrics can be split into three categories: predictive quality metrics, ranking quality metrics, and behavioral metrics. Before the evaluation takes place, you have to decide on a cut-off point k (so that only the top k predictions are considered), after which the predictions are ignored.

Predictive metrics show how well the system performs in terms of the relevance of the predicted items. If the relevance scores are graded, the usual regression metrics can be used (e.g., MAE or RMSE). If the relevance labels are binary, one would use classification metrics, namely precision and recall at k, or the F-score. Precision at k shows the proportion of relevant items among the top k items:

\text{Precision at } k = \frac{\text{\# of relevant items in top } k}{k}

If a user has only a small number of relevant items, say three, then the maximum achievable precision at a recommendation list size of ten is limited to 30%. This cap makes it hard to compare or average precision across all users in the dataset, since users with fewer relevant items will inherently have lower maximum precision values. Precision becomes more useful when users have many relevant items but only a short list can be shown. For instance, if a user is expected to have hundreds of potential matches and the goal is to recommend the top five, precision can effectively measure the quality of this short recommendation set.

Recall at k shows the share of all relevant items that appear among the top k recommendations:

\text{Recall at } k = \frac{\text{\# of relevant items in top } k}{\text{total \# of relevant items}}

Recall measures the coverage of the recommendations (how many of the relevant items the system captured in the top k). The F-score combines the relevance focus of precision with the coverage of recall when both are important.
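
A minimal sketch of precision and recall at k (plus the F-score) for binary relevance might look as follows; the ranked recommendation list and the set of relevant items are made up for illustration.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute Precision@k and Recall@k for one user with binary relevance labels."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative example: ranked recommendations and the items the user actually found relevant.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "f"}

p, r = precision_recall_at_k(recommended, relevant, k=5)
f1 = 2 * p * r / (p + r) if (p + r) else 0.0
print(p, r, f1)   # 0.4, ~0.67, 0.5
```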

The predictive metrics can't evaluate ranking quality (how good the order of the returned items is), and this calls for ranking metrics, which include Mean Average Precision at k (MAP@k), Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and the hit rate. A short summary of each is given below:

  • Mean Average Precision at k (MAP@k): evaluates the ranking quality by averaging the precision over the top-k recommended items, giving higher scores when relevant items are ranked higher;

  • Mean Reciprocal Rank (MRR): evaluates the ability to rank the most relevant item at the top of the recommendation list by calculating the reciprocal of the rank of the first relevant item;

  • Normalized Discounted Cumulative Gain (NDCG): evaluates the usefulness of a recommendation list based on the graded relevance of the recommended items, assigning higher weights to more relevant items appearing at the top;

  • Hit rate: measures the proportion of users or sessions for which the recommender system successfully recommended at least one relevant item, indicating its overall ability to make successful recommendations.
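
As a quick illustration of two of these, here is a small sketch computing Mean Reciprocal Rank and hit rate over a few users; the recommendation lists and relevance sets are invented for illustration.

```python
def reciprocal_rank(recommended, relevant):
    """Reciprocal of the rank of the first relevant item (0 if none appears)."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Illustrative per-user recommendation lists and relevant-item sets.
users = [
    (["a", "b", "c"], {"b"}),       # first relevant item at rank 2
    (["d", "e", "f"], {"d", "f"}),  # first relevant item at rank 1
    (["g", "h", "i"], set()),       # no relevant items recommended
]

rr = [reciprocal_rank(recs, rel) for recs, rel in users]
mrr = sum(rr) / len(rr)                          # Mean Reciprocal Rank
hit_rate = sum(r > 0 for r in rr) / len(users)   # share of users with at least one hit
print(mrr, hit_rate)                             # 0.5, ~0.67
```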

The last category is behavioral metrics, which typically include diversity, novelty, serendipity, and popularity bias. A comparison of the behavioral metrics is given below:

  • Diversity: evaluates the ability to suggest a varied set of items, ensuring that the recommendations do not become too narrow or repetitive;

  • Novelty: evaluates the ability to suggest items that are new to the user, promoting the discovery of potentially valuable content;

  • Serendipity: evaluates the ability to provide unexpected recommendations that are still relevant to the user, introducing them to items they might not have otherwise encountered;

  • Popularity bias: evaluates the tendency to disproportionately suggest popular items, potentially limiting the exposure to less well-known but relevant items.
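
Behavioral metrics can be computed in several ways; one common choice for diversity is intra-list diversity, the average pairwise dissimilarity (here, one minus cosine similarity) of the recommended items' embeddings. The sketch below uses made-up item embeddings for a single recommendation list.

```python
import numpy as np
from itertools import combinations

def intra_list_diversity(item_vectors):
    """Average pairwise (1 - cosine similarity) over a recommendation list."""
    dissims = []
    for a, b in combinations(item_vectors, 2):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        dissims.append(1.0 - cos)
    return float(np.mean(dissims))

# Illustrative embeddings for a recommended list of three items.
recommended = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.9, 0.1, 0.0]),
    np.array([0.0, 0.0, 1.0]),
]
print(round(intra_list_diversity(recommended), 3))  # higher means a more varied list
```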

Conclusion

As a result, you are now familiar with the following:

  • There are three main types of recommender systems — collaborative filtering, content-based recommenders, and hybrid systems;

  • Collaborative filtering makes recommendations based on finding similarities between users' preferences and behaviors;

  • Content-based filtering recommends items to users based on the similarities between the attributes of those items and the user's previously expressed preferences;

  • Recommender systems can be evaluated along three main dimensions: predictive quality metrics (such as Precision at k), ranking quality metrics (such as Mean Average Precision at k), and behavioral metrics (such as novelty).
