So I have been working at R-company for a little while now, and I have been exposed to various existing solutions in the company that recommend products users might find relevant. One of them was a Collaborative Filtering (CF) based solution, which really caught my attention, as my perception of CF had been a single matrix that could ‘rule it all’. Well, I was a goddamn greenhorn. The CF-centric predictive system is actually based on a combination of several CF-trained similarity matrices.

Firstly, a bit of CF. The predicted rating, or score, is defined as

\begin{aligned} & \hat{r}_{ui} = x_u{y_i}^\intercal = \sum_{k} x_{uk}y_{ki} \\ & \text{s.t.} && k \in \{1, \dots, \#\text{Latent Variables}\} \\ \end{aligned}

and the loss function $\textit{L}$ can be the Residual Sum of Squares (RSS), but I was using RSS with L2-regularization and bias terms. The predicted rating hence became

\begin{aligned} & \hat{r}_{ui} = x_u{y_i}^\intercal + \mu + b_u + b_i \\ \end{aligned}

and the loss function became

\begin{aligned} & \textit{L} = \sum_{u,i}(r_{ui} - \hat{r}_{ui})^2 + \lambda_{xf} \sum_u \lVert x_u \rVert ^ 2 + \lambda_{yf} \sum_i \lVert y_i \rVert ^ 2 + \lambda_{xb} \sum_u \lVert b_u \rVert ^ 2 + \lambda_{yb} \sum_i \lVert b_i \rVert ^ 2\\ \end{aligned}
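To make the pieces concrete, here is a minimal NumPy sketch of the biased prediction and the regularized loss on toy data (all sizes, the seed, and the single shared lambda are arbitrary choices for illustration; the real model uses separate lambdas per term):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3  # toy sizes, chosen arbitrarily

X = rng.normal(scale=0.1, size=(n_users, k))   # user factors x_u
Y = rng.normal(scale=0.1, size=(n_items, k))   # item factors y_i
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)  # toy ratings
mu = R.mean()                                  # global mean

def predict(u, i):
    """r_hat_ui = x_u . y_i + mu + b_u + b_i"""
    return X[u] @ Y[i] + mu + b_u[u] + b_i[i]

def loss(lam=0.1):
    """RSS plus L2 penalties on factors and biases (one lambda for brevity)."""
    R_hat = X @ Y.T + mu + b_u[:, None] + b_i[None, :]
    rss = ((R - R_hat) ** 2).sum()
    reg = lam * ((X ** 2).sum() + (Y ** 2).sum()
                 + (b_u ** 2).sum() + (b_i ** 2).sum())
    return rss + reg
```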

Based on the training data, matrix factorization was carried out, or, in layman's terms, learning the identity of two matrices whose product forms the current matrix. A few ways to go about this include Lower-Upper decomposition (LU), Singular Value Decomposition (SVD), Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD). Probably the most straightforward and scalable one is SGD, as the update is a one-liner and no matrix inversion is involved. SGD is just the non-batch version of Gradient Descent (GD), where non-batch means updating on one training record at a time instead of the entire training data, while ALS alternates between holding one factor matrix fixed and solving a regularized least-squares problem for the other. Taking the gradient of the loss with respect to $x_u$:

\begin{aligned} & \frac{\partial \textit{L}}{\partial x_u} & & = -2 \sum_{i}(r_{ui} - \hat{r}_{ui})y_i + 2 \lambda_{xf} x_u\\ &&& = -2(r_u - x_u\text{Y}^\intercal)\text{Y} + 2 \lambda_{xf}x_u\\ \end{aligned}

For ALS, the update comes from setting this gradient to zero:

\begin{aligned} & \frac{\partial \textit{L}}{\partial x_u} & & = 0\\ & r_u\text{Y} & & = x_u\text{Y}^\intercal\text{Y} + \lambda_{xf}x_u \\ &&& = x_u(\text{Y}^\intercal\text{Y} + \lambda_{xf}I_{\#\text{Latent Variables}}) \\ & x_u & & = r_u\text{Y} (\text{Y}^\intercal\text{Y} + \lambda_{xf}I_{\#\text{Latent Variables}})^{-1} \end{aligned}
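As a sanity check, the closed-form ALS step for all user factors at once (with the item factors held fixed) can be sketched in NumPy; `np.linalg.solve` is used instead of forming the explicit inverse. Toy sizes, the seed, and lambda are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 6, 8, 3
R = rng.random((n_users, n_items))            # toy dense rating matrix
Y = rng.normal(scale=0.1, size=(n_items, k))  # item factors, held fixed
lam = 0.1                                     # lambda_xf

# Closed-form ALS update for every user row at once:
#   x_u = r_u Y (Y^T Y + lambda I)^{-1}
A = Y.T @ Y + lam * np.eye(k)                 # k x k, symmetric
X = np.linalg.solve(A, (R @ Y).T).T           # solve A X^T = (R Y)^T
```

At the solution the gradient from the previous section vanishes, which is easy to verify numerically.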

The same applies to $b_u$, $y_i$ and $b_i$. Whereas for SGD

\begin{aligned} & x_u \gets x_u - \eta\frac{\partial \textit{L}}{\partial x_u} \end{aligned}

where $\eta$ is the learning rate. The same applies to $b_u$, $y_i$ and $b_i$ as well. For both ALS and SGD, the updates should alternate between the user-side and item-side parameters at every other step.
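A minimal SGD training loop over a toy dense rating matrix might look like the following (the learning rate, lambda, epoch count, seed, and a single shared lambda are all arbitrary simplifications for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, k = 5, 7, 3
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
X = rng.normal(scale=0.1, size=(n_users, k))
Y = rng.normal(scale=0.1, size=(n_items, k))
mu = R.mean()
b_u = np.zeros(n_users)
b_i = np.zeros(n_items)
eta, lam = 0.01, 0.1

def rmse():
    R_hat = X @ Y.T + mu + b_u[:, None] + b_i[None, :]
    return np.sqrt(((R - R_hat) ** 2).mean())

rmse_before = rmse()
for epoch in range(50):
    for u in range(n_users):
        for i in range(n_items):
            # one-record error, then a gradient step on each parameter;
            # user and item sides are updated in turn
            err = R[u, i] - (X[u] @ Y[i] + mu + b_u[u] + b_i[i])
            X[u] += eta * (err * Y[i] - lam * X[u])
            Y[i] += eta * (err * X[u] - lam * Y[i])
            b_u[u] += eta * (err - lam * b_u[u])
            b_i[i] += eta * (err - lam * b_i[i])
rmse_after = rmse()
```

In practice one would iterate only over the observed (user, item) pairs rather than a dense matrix.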

Moving on to similarity matrices (trained by CF) as features. The following six matrices have been trained for blending:

1. Browsing – Browsing
2. Browsing – Purchase
3. Browsing – Genre
4. Purchase – Purchase
5. Purchase – Genre
6. Genre – Genre

The values derived from these matrices serve as features, which eventually get combined through models like Logistic Regression.
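As an illustration of that blending step, here is a from-scratch logistic regression over six stand-in similarity features. Everything here is synthetic: the features, the label-generating weights, and the hyperparameters are invented for the sketch, not taken from the real system (where labels would be observed purchases or clicks):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# stand-ins for the six similarity-derived features per (user, item) pair
features = rng.random((n, 6))
# synthetic labels from an arbitrary linear rule plus noise
true_w = np.array([1.5, 0.5, 0.2, 2.0, 0.3, 0.1])
labels = (features @ true_w + rng.normal(0, 0.3, n) > 2.3).astype(float)

# plain batch gradient descent on the logistic loss
w = np.zeros(6)
b = 0.0
eta = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid
    w -= eta * (features.T @ (p - labels)) / n
    b -= eta * (p - labels).mean()

# blended relevance scores in (0, 1)
scores = 1.0 / (1.0 + np.exp(-(features @ w + b)))
```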

Similarity matrices can be built by extracting features like frequency, standard deviation and so on. The metric used could be Euclidean distance, correlation, etc.
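A quick sketch of building an item-item similarity matrix from raw counts, using cosine similarity and Pearson correlation (the browsing-count data here is synthetic, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
# toy browsing-count matrix: rows = items, columns = users
counts = rng.poisson(1.0, size=(5, 20)).astype(float)

# cosine similarity between item rows
norms = np.linalg.norm(counts, axis=1, keepdims=True)
norms[norms == 0] = 1.0                      # guard against empty rows
unit = counts / norms
cosine_sim = unit @ unit.T                   # 5 x 5 item-item similarity

# Pearson correlation between item rows, as an alternative metric
corr_sim = np.corrcoef(counts)
```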

to be continued…