So I have been working at R-company for a little while now, and I have been exposed to several of the company’s existing solutions for recommending products that users might find relevant. One of them is a Collaborative Filtering (CF) based solution, which really caught my attention because I had always pictured CF as a single matrix that could ‘rule them all’. Well, I was a god damn greenhorn. The CF-centric predictive system is actually based on a ‘combination’ of several CF-trained similarity matrices.
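Firstly, a bit of CF. In the usual matrix factorization notation, the predicted rating or score of user $u$ for item $i$ is defined as

$$\hat{r}_{ui} = p_u^\top q_i$$

where $p_u$ and $q_i$ are the latent user and item vectors. The loss function $L$ can be the Residual Sum of Squares (RSS), but I was using RSS with L2-regularization and bias terms. The predicted rating hence became

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i$$

and the loss function became

$$L = \sum_{(u,i) \in K} \left[ (r_{ui} - \hat{r}_{ui})^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right) \right]$$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $K$ is the set of observed ratings and $\lambda$ is the regularization strength.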
Based on the training data, matrix factorization is carried out, or in layman’s terms, we learn two matrices whose product approximates the current rating matrix. A few ways to go about this include Lower-Upper (LU) decomposition, Singular Value Decomposition (SVD), Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD). Probably the most straightforward and scalable one is SGD, as the update is a one-liner and there is no matrix inversion involved. SGD is just the non-batch version of Gradient Descent (GD), where non-batch means updating the solution one record at a time instead of using the entire training data. ALS instead holds one of the two matrices fixed, solves a regularized least-squares problem for the other, and then swaps their roles.
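For a single observed rating $r_{ui}$, with prediction error $e_{ui} = r_{ui} - \hat{r}_{ui}$, user vector $p_u$, item vector $q_i$ and regularization strength $\lambda$, the gradient is computed as

$$\frac{\partial L_{ui}}{\partial p_u} = -2 e_{ui} q_i + 2 \lambda p_u$$

For ALS, summing this gradient over the set $I_u$ of items rated by user $u$ and setting it to zero, with the item vectors held fixed, gives the update

$$p_u = \Big( \sum_{i \in I_u} q_i q_i^\top + \lambda |I_u| \mathbf{I} \Big)^{-1} \sum_{i \in I_u} (r_{ui} - \mu - b_u - b_i) \, q_i$$

where $\mu$ is the global mean rating and $b_u$, $b_i$ are the user and item biases. The same applies to $q_i$ and to the biases. Whereas for SGD

$$p_u \leftarrow p_u + \eta \, (e_{ui} q_i - \lambda p_u)$$

where $\eta$ is the learning rate (the factor of 2 is absorbed into it). The same applies to $q_i$, $b_u$ and $b_i$ as well. The ALS updates alternate between the user vectors and the item vectors at every other step, whereas SGD commonly updates both from the same sampled rating.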
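To make the SGD update concrete, here is a minimal pure-Python sketch. It assumes the biased prediction (global mean plus user bias plus item bias plus the dot product of the latent vectors) and L2 regularization; the function names, hyperparameter values and toy ratings are all made up for illustration and are not the company's implementation.

```python
import random

def train_sgd(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=200, seed=0):
    """Fit biased matrix factorization with SGD.

    ratings: list of (user, item, rating) triples with 0-based ids.
    Returns (mu, bu, bi, P, Q).
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    # Small random latent vectors; biases start at zero.
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    bu = [0.0] * n_users
    bi = [0.0] * n_items
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)  # visit one record at a time, in random order
        for u, i, r in data:
            pred = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))
            err = r - pred
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(model, ratings):
    sq = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (sq / len(ratings)) ** 0.5

# Made-up toy ratings (user, item, rating), for illustration only.
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
model = train_sgd(toy, n_users=3, n_items=3)
print(round(rmse(model, toy), 3))
```

Note that no matrix inversion appears anywhere, which is the scalability argument made above; an ALS version would instead solve a small linear system per user and per item.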
Moving on to similarity matrices (trained by CF) as features. The following 6 matrices have been trained for blending.
- Browsing – Browsing
- Browsing – Purchase
- Browsing – Genre
- Purchase – Purchase
- Purchase – Genre
- Genre – Genre
The values derived from these matrices serve as features, which eventually get combined through models like Logistic Regression.
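As a rough sketch of the blending step, the snippet below fits a tiny logistic regression by batch gradient descent on six similarity scores per (user, product) candidate. The feature values, the binary label (say, whether the product was eventually purchased) and all function names are assumptions for illustration; in practice a library implementation would likely be used.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            diff = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += diff
            for j in range(d):
                grad_w[j] += diff * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def blend_score(w, b, features):
    # Blended relevance score in (0, 1) for one candidate.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, features)))

# Hypothetical rows: one similarity score from each of the six matrices
# (browsing-browsing, browsing-purchase, browsing-genre,
#  purchase-purchase, purchase-genre, genre-genre).
X = [
    [0.9, 0.8, 0.7, 0.9, 0.6, 0.8],
    [0.8, 0.7, 0.9, 0.6, 0.7, 0.9],
    [0.1, 0.2, 0.1, 0.0, 0.3, 0.2],
    [0.2, 0.1, 0.3, 0.1, 0.2, 0.1],
]
y = [1, 1, 0, 0]  # made-up labels: 1 = relevant, 0 = not relevant

w, b = fit_logistic(X, y)
print([round(blend_score(w, b, row), 3) for row in X])
```

The learned weights then indicate how much each similarity matrix contributes to the final score.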
Similarity matrices can be built by extracting features like frequency, standard deviation and so on. The metric used could be Euclidean distance, correlation, etc.
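For illustration, here is a minimal sketch of building an item-item similarity matrix from per-item feature vectors with either metric. The vectors are hypothetical browsing frequencies per user, and turning Euclidean distance into a similarity via 1 / (1 + d) is just one common convention, not necessarily the one used in the actual system.

```python
import math

def euclidean_similarity(a, b):
    # Map a distance in [0, inf) to a similarity in (0, 1].
    return 1.0 / (1.0 + math.dist(a, b))

def pearson_similarity(a, b):
    # Pearson correlation; returns 0.0 when either vector is constant,
    # since the correlation is undefined in that case.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def similarity_matrix(vectors, metric):
    # Pairwise similarity between every pair of feature vectors.
    n = len(vectors)
    return [[metric(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Hypothetical data: rows are items, columns are browsing counts per user.
browse_counts = [
    [5, 0, 3, 1],
    [4, 1, 3, 0],
    [0, 6, 0, 4],
]

S_corr = similarity_matrix(browse_counts, pearson_similarity)
S_eucl = similarity_matrix(browse_counts, euclidean_similarity)
for row in S_corr:
    print([round(v, 2) for v in row])
```

Correlation ignores the overall scale of the counts, while the Euclidean variant does not, so the choice of metric depends on whether heavy and light browsers should be treated alike.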
to be continued…