Suppose we have features such as \(x_1\) = romance and \(x_2\) = action that measure the content of the movies.
If we introduce the intercept term feature \(x_0 = 1\), each movie is represented by a feature vector \(x = [x_0, x_1, x_2]\).
For each user j, we learn a parameter vector \(\theta^{(j)} \in \mathbb{R}^3\). Then we predict that user j would rate movie i as \((\theta^{(j)})^T x^{(i)}\).
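As a quick illustration, here is how such a prediction could be computed in NumPy (the feature values and parameter values below are made up for the example):

```python
import numpy as np

# Hypothetical feature vector for one movie: [x0 (intercept), x1 (romance), x2 (action)]
x_i = np.array([1.0, 0.9, 0.1])    # a strongly romantic movie

# Hypothetical learned parameters for a user who likes romance
theta_j = np.array([0.0, 5.0, 0.0])

# Predicted rating: (theta^(j))^T x^(i)
predicted_rating = theta_j @ x_i
print(predicted_rating)            # 4.5
```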
We use \(m^{(j)}\) to denote the number of movies rated by user j.
To learn \(\theta^{(j)}\), we minimize the cost function: \( \frac{1}{2m^{(j)}} \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2 m^{(j)}} \sum\limits_{k=1}^n (\theta_k^{(j)})^2 \)
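A minimal sketch of this cost in NumPy, assuming a ratings matrix `Y` of shape \(n_m \times n_u\) and a 0/1 indicator matrix `R` of the same shape (these names, and the function name, are just conventions chosen for this example):

```python
import numpy as np

def cost_single_user(theta_j, X, Y, R, j, lam):
    """Regularized squared-error cost for user j.

    theta_j : (n+1,) parameter vector for user j (theta_j[0] pairs with x0)
    X       : (n_m, n+1) matrix whose rows are the movie feature vectors x^(i)
    Y       : (n_m, n_u) matrix of ratings y^(i,j)
    R       : (n_m, n_u) indicator matrix, R[i, j] = 1 if user j rated movie i
    lam     : regularization strength lambda
    """
    rated = R[:, j] == 1                       # movies i with r(i,j) = 1
    m_j = rated.sum()                          # m^(j): movies rated by user j
    errors = X[rated] @ theta_j - Y[rated, j]  # (theta^(j))^T x^(i) - y^(i,j)
    # The intercept parameter theta_0 is not regularized, matching the
    # sum over k = 1..n in the formula above.
    return (errors @ errors) / (2 * m_j) + lam / (2 * m_j) * (theta_j[1:] @ theta_j[1:])
```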
In this cost function, we can drop the \(m^{(j)}\) factor, since it is just a constant and does not change the minimizing \(\theta^{(j)}\).
We can also learn the parameters for all users at the same time by minimizing:
\[ \frac{1}{2} \sum\limits_{j=1}^{n_u} \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2 } \sum\limits_{j=1}^{n_u} \sum\limits_{k=1}^n (\theta_k^{(j)})^2 \]
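In vectorized form this joint cost takes only a few lines, assuming the same `Y` and `R` as before plus a `Theta` matrix whose rows are the \(\theta^{(j)}\) vectors (again, these names are my own choices for the sketch):

```python
import numpy as np

def cost_all_users(Theta, X, Y, R, lam):
    """Joint cost over all users, with the 1/m^(j) factors dropped as above.

    Theta : (n_u, n+1) matrix whose rows are the theta^(j) vectors
    X     : (n_m, n+1) matrix whose rows are the x^(i) feature vectors
    """
    errors = (X @ Theta.T - Y) * R                  # zero out unrated (i, j) pairs
    data_term = (errors ** 2).sum() / 2
    reg_term = lam / 2 * (Theta[:, 1:] ** 2).sum()  # skip the intercept column
    return data_term + reg_term
```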
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(i)}\), we minimize with respect to \(x^{(i)}\): \(\frac{1}{2} \sum\limits_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum\limits_{k=1}^n (x_k^{(i)})^2\)
To learn all of \(x^{(1)},\dots,x^{(n_m)}\) jointly, we minimize with respect to \(x^{(1)},\dots,x^{(n_m)}\):
\[\frac{1}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{k=1}^n (x_k^{(i)})^2 \]
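By symmetry, a sketch of this cost looks almost identical to the per-user version, just regularizing the features instead of the parameters (same naming assumptions as before):

```python
import numpy as np

def cost_all_movies(X, Theta, Y, R, lam):
    """Joint cost over all movies, minimized with respect to the features X."""
    errors = (X @ Theta.T - Y) * R              # only rated (i, j) pairs contribute
    data_term = (errors ** 2).sum() / 2
    reg_term = lam / 2 * (X[:, 1:] ** 2).sum()  # regularize x_1..x_n, not x_0
    return data_term + reg_term
```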
Initially we have neither the \(\theta\) vectors nor the features, so one approach is to start by randomly guessing the values of the \(\theta\) vectors, then estimate the \(x\) features, then the \(\theta\) vectors again, and so on. Alternatively, instead of alternating, we can minimize over the features and the parameters simultaneously using the combined cost function:
\[ J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}) = \frac{1}{2} \sum\limits_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 \] \[+ \frac{\lambda}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{k=1}^n (x_k^{(i)})^2 + \frac{\lambda}{2 } \sum\limits_{j=1}^{n_u} \sum\limits_{k=1}^n (\theta_k^{(j)})^2 \]
Note that when using this algorithm, we normally omit the \(x_0 = 1\) intercept term: since we are learning the features themselves, the algorithm can learn a constant feature on its own if one is really needed. As a result, \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\).
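Putting this together, here is a sketch of the combined cost \(J\) and its gradients, now with no intercept column so every parameter is regularized (the function name and array shapes are my own choices, under the same `Y`/`R` assumptions as before):

```python
import numpy as np

def collab_cost_and_grads(X, Theta, Y, R, lam):
    """Combined cost J(x^(1..n_m), theta^(1..n_u)) and its gradients.

    X     : (n_m, n) movie features (no intercept term)
    Theta : (n_u, n) user parameters (no intercept term)
    Y     : (n_m, n_u) ratings; R : (n_m, n_u) 0/1 indicator of rated entries
    """
    errors = (X @ Theta.T - Y) * R             # prediction errors on rated pairs only
    J = (errors ** 2).sum() / 2 \
        + lam / 2 * ((X ** 2).sum() + (Theta ** 2).sum())
    X_grad = errors @ Theta + lam * X          # dJ/dx^(i), one row per movie i
    Theta_grad = errors.T @ X + lam * Theta    # dJ/dtheta^(j), one row per user j
    return J, X_grad, Theta_grad
```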
Before starting this algorithm, we initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values; the randomness breaks symmetry, so the learned features end up different from one another.
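A minimal usage sketch of this initialization followed by batch gradient descent, reusing `collab_cost_and_grads` from the sketch above (the dimensions, learning rate, and iteration count are arbitrary choices for the example, and `Y`/`R` are synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 3                        # movies, users, features (example sizes)

# Synthetic data: Y holds ratings, R marks which entries of Y are observed
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)
R = (rng.random((n_m, n_u)) < 0.7).astype(float)

# Initialize features and parameters to small random values (symmetry breaking)
X = rng.normal(scale=0.1, size=(n_m, n))
Theta = rng.normal(scale=0.1, size=(n_u, n))

alpha, lam = 0.01, 1.0
for _ in range(500):
    J, X_grad, Theta_grad = collab_cost_and_grads(X, Theta, Y, R, lam)
    X -= alpha * X_grad                      # update X and Theta simultaneously,
    Theta -= alpha * Theta_grad              # using gradients from the same step

predictions = X @ Theta.T                    # predicted ratings for every (i, j)
```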