Suppose we have features such as \(x_1\) = romance and \(x_2\) = action that measure the content of the movies.
If we introduce the intercept term feature \(x_0 = 1\), each movie is represented by a feature vector \(x = [x_0, x_1, x_2]\).
For each user j, we learn a parameter vector \(\theta^{(j)} \in \mathbb{R}^3\). Then we predict that user j would rate movie i as \((\theta^{(j)})^T x^{(i)}\).
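As a quick illustration, here is how such a prediction could be computed in NumPy (the feature values and parameter values below are made up for the example):

```python
import numpy as np

# Hypothetical feature vector for one movie: [x0 (intercept), x1 (romance), x2 (action)]
x_i = np.array([1.0, 0.9, 0.1])    # a strongly romantic movie

# Hypothetical learned parameters for a user who likes romance
theta_j = np.array([0.0, 5.0, 0.0])

# Predicted rating: (theta^(j))^T x^(i)
predicted_rating = theta_j @ x_i
print(predicted_rating)            # 4.5
```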
We use \(m^{(j)}\) to denote the number of movies rated by user j.
To learn \(\theta^{(j)}\), we minimize the cost function: \( \frac{1}{2m^{(j)}} \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2 m^{(j)}} \sum\limits_{k=1}^n (\theta_k^{(j)})^2 \)
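A minimal sketch of this cost in NumPy, assuming a ratings matrix `Y` of shape \(n_m \times n_u\) and a 0/1 indicator matrix `R` of the same shape (these names, and the function name, are just conventions chosen for this example):

```python
import numpy as np

def cost_single_user(theta_j, X, Y, R, j, lam):
    """Regularized squared-error cost for user j.

    theta_j : (n+1,) parameter vector for user j (theta_j[0] pairs with x0)
    X       : (n_m, n+1) matrix whose rows are the movie feature vectors x^(i)
    Y       : (n_m, n_u) matrix of ratings y^(i,j)
    R       : (n_m, n_u) indicator matrix, R[i, j] = 1 if user j rated movie i
    lam     : regularization strength lambda
    """
    rated = R[:, j] == 1                       # movies i with r(i,j) = 1
    m_j = rated.sum()                          # m^(j): movies rated by user j
    errors = X[rated] @ theta_j - Y[rated, j]  # (theta^(j))^T x^(i) - y^(i,j)
    # The intercept parameter theta_0 is not regularized, matching the
    # sum over k = 1..n in the formula above.
    return (errors @ errors) / (2 * m_j) + lam / (2 * m_j) * (theta_j[1:] @ theta_j[1:])
```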
In this cost function, we can drop the \(m^{(j)}\) factor, since it is just a constant and does not change the minimizing \(\theta^{(j)}\).
We can also learn the parameters for all users at the same time by minimizing:
\[ \frac{1}{2} \sum\limits_{j=1}^{n_u} \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2 } \sum\limits_{j=1}^{n_u} \sum\limits_{k=1}^n (\theta_k^{(j)})^2 \]
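In vectorized form this joint cost takes only a few lines, assuming the same `Y` and `R` as before plus a `Theta` matrix whose rows are the \(\theta^{(j)}\) vectors (again, these names are my own choices for the sketch):

```python
import numpy as np

def cost_all_users(Theta, X, Y, R, lam):
    """Joint cost over all users, with the 1/m^(j) factors dropped as above.

    Theta : (n_u, n+1) matrix whose rows are the theta^(j) vectors
    X     : (n_m, n+1) matrix whose rows are the x^(i) feature vectors
    """
    errors = (X @ Theta.T - Y) * R                  # zero out unrated (i, j) pairs
    data_term = (errors ** 2).sum() / 2
    reg_term = lam / 2 * (Theta[:, 1:] ** 2).sum()  # skip the intercept column
    return data_term + reg_term
```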
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(i)}\), we minimize with respect to \(x^{(i)}\): \(\frac{1}{2} \sum\limits_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum\limits_{k=1}^n (x_k^{(i)})^2\)
To learn all of \(x^{(1)},\dots,x^{(n_m)}\) jointly, we minimize with respect to \(x^{(1)},\dots,x^{(n_m)}\):
\[\frac{1}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{k=1}^n (x_k^{(i)})^2 \]
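By symmetry, a sketch of this cost looks almost identical to the per-user version, just regularizing the features instead of the parameters (same naming assumptions as before):

```python
import numpy as np

def cost_all_movies(X, Theta, Y, R, lam):
    """Joint cost over all movies, minimized with respect to the features X."""
    errors = (X @ Theta.T - Y) * R              # only rated (i, j) pairs contribute
    data_term = (errors ** 2).sum() / 2
    reg_term = lam / 2 * (X[:, 1:] ** 2).sum()  # regularize x_1..x_n, not x_0
    return data_term + reg_term
```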
Initially we have neither the \(\theta\) vectors nor the features, so one approach is to start by randomly guessing the values of the \(\theta\) vectors, then estimate the \(x\) features, then the \(\theta\) vectors again, and so on. Alternatively, instead of alternating, we can minimize over the features and the parameters simultaneously using the combined cost function:
\[ J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}) = \frac{1}{2} \sum\limits_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 \] \[+ \frac{\lambda}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{k=1}^n (x_k^{(i)})^2 + \frac{\lambda}{2 } \sum\limits_{j=1}^{n_u} \sum\limits_{k=1}^n (\theta_k^{(j)})^2 \]
Note that when using this algorithm, we normally omit the \(x_0 = 1\) intercept term: since we are learning the features themselves, the algorithm can learn a constant feature on its own if one is really needed. As a result, \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\).
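Putting this together, here is a sketch of the combined cost \(J\) and its gradients, now with no intercept column so every parameter is regularized (the function name and array shapes are my own choices, under the same `Y`/`R` assumptions as before):

```python
import numpy as np

def collab_cost_and_grads(X, Theta, Y, R, lam):
    """Combined cost J(x^(1..n_m), theta^(1..n_u)) and its gradients.

    X     : (n_m, n) movie features (no intercept term)
    Theta : (n_u, n) user parameters (no intercept term)
    Y     : (n_m, n_u) ratings; R : (n_m, n_u) 0/1 indicator of rated entries
    """
    errors = (X @ Theta.T - Y) * R             # prediction errors on rated pairs only
    J = (errors ** 2).sum() / 2 \
        + lam / 2 * ((X ** 2).sum() + (Theta ** 2).sum())
    X_grad = errors @ Theta + lam * X          # dJ/dx^(i), one row per movie i
    Theta_grad = errors.T @ X + lam * Theta    # dJ/dtheta^(j), one row per user j
    return J, X_grad, Theta_grad
```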
Before starting this algorithm, we initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values; the randomness breaks symmetry, so the learned features end up different from one another.
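A minimal usage sketch of this initialization followed by batch gradient descent, reusing `collab_cost_and_grads` from the sketch above (the dimensions, learning rate, and iteration count are arbitrary choices for the example, and `Y`/`R` are synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 3                        # movies, users, features (example sizes)

# Synthetic data: Y holds ratings, R marks which entries of Y are observed
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)
R = (rng.random((n_m, n_u)) < 0.7).astype(float)

# Initialize features and parameters to small random values (symmetry breaking)
X = rng.normal(scale=0.1, size=(n_m, n))
Theta = rng.normal(scale=0.1, size=(n_u, n))

alpha, lam = 0.01, 1.0
for _ in range(500):
    J, X_grad, Theta_grad = collab_cost_and_grads(X, Theta, Y, R, lam)
    X -= alpha * X_grad                      # update X and Theta simultaneously,
    Theta -= alpha * Theta_grad              # using gradients from the same step

predictions = X @ Theta.T                    # predicted ratings for every (i, j)
```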