[Paper Reading] Evaluation of Session Based Recommendation Algorithms

This blog is a reading note for paper Evaluation of Session-based Recommendation Algorithms. They formulate an experimental protocol for evaluating session-based RS algorithms, as well as supply a concrete performance comparison of these algorithms with variety of datasets and evaluation measures. This is a high-quality paper which shows each evaluating metrics as well as protocols in substantial details.

Introduction

In academia, sequential recommendation problems are typically operationalized as the task of predicting the next user action. Several methods have been applied, such as

sequential pattern mining techniques in the early time
Markov models lately
deep learning approaches

There is no true "standard" datasets or evaluation protocols exist in the recommender system community, which makes researchers difficult to compare the various algorithms.

Review of Session-Based Rec Methods

Challenges for Sequence-Aware Recommenders

The mining process can be computationally demanding
Finding good algorithm parameters can be challenging
Sometimes using frequent item sequences does not lead to better recommendation than using more simple item co-occurrence patterns.

Recent Focus on Sequence Learning

Markov Chain (MC) models
Reinforcement Learning (RL)
Markov Decision Processes (MDP)
Recurrent Neural Networks (RNN)

Details of Methods

Baseline Methods

All baselines implement very simple prediction schemes and have low computational complexity both for training and recommending

Simple Association Rules (AR)

Let a session s be a chronologically ordered tuple of item click events s = (s₁, s₂, ..., s_m) and S_p the set of all past sessions. Given a user's current session s with S_|s| being the last item in s, we can define the score for a recommendable item i as follow, where the indicator function 1_EQ(a, b) is 1 in case a and b refer to the same item and 0 otherwise.

Markov Chains (MC)

Sequential Rules (SR)

Nearest Neighbors

Item-Based KNN (IKNN)

only consider the last element in a given session
then returns those items as recommendations that are most similar to it

Session-based kNN (SKNN)

the SKNN method compares the entire current session with the past sessions.

Vector Multiplication Session-Based kNN (V-SKNN)

we use real-valued vectors to encode the current session
only the very last element of the session obtains a value of "1"
using the dot product as a similarity function between the current weight-encoded session and binary-encoded past session

Sequential Session-based kNN (S-SKNN)

Sequential Filter Session-based kNN (SF-SKNN)

Neural Networks - GRU4REC

Figure 1. Architecture of the GRU4REC

(I am not an expert in Deep Learning. Please refer to the paper for further details.)

Factorization-based Methods

Completing...

Experiment Setup

Evaluation Protocol

Item Selection

predict the immediate next item given the first n elements
consider all subsequent elements for the given session beginning

Training and Test Splits

applying a sliding-window protocol: split the data into several slices of equal size
e.g. one month for training and the subsequent data (the following day) for testing
Then report the average of the performance results for all slices

Performance Measures

Accuracy

For each session, iteratively increment n; measure the hit rate (HR) and the Mean Reciprocal Rank (MRR); finally determine the average HR and MRR

For example:

Proposed Results	Correct Item	Rank (k)	Reciprocal Rank (1/k)
cat, dog, bird	bird	3	1/3
pork, lamb, steak	ham	/	0
Dell, Apple, ThinkPad	Dell	1	1

MRR = (1/3 + 0 + 1) / 3 = 4/9

Use Precision and Recall at defined list lengths
Precision: returning mostly useful stuff
\[ \text{Precision} = \frac{N_{rs}}{N_s} = \frac{TP}{TP+FP} \]
Recall: not missing useful stuff
\[ \text{Recall} = \frac{N_{rs}}{N_r} = \frac{TP}{TP+FN} \]

	Like	Dislike
Recommend	True Positives (TP)	False Negatives (FN)
Not Recommend	False Positives (FP)	True Negative (TN)

Additional Quality Factors

Coverage: how many different items have ever presented in the recommending list. (also referred as aggregate diversity)
Popularity bias: report the average popularity score for the elements of the top-k recommendations of each algorithm
Cold Start: Some methods might only be effective when a large amount of data have been trained. Therefore, we could manually removed parts of the older training data to simulate such situations.
Scalability: Report how many steps the algorithms needed to train the models and to make predictions at runtime as well as the memory costs.

Datasets

Measure datasets from three different domains: e-commerce, music, and news.

Results

E-Commerce Datasets

Accuracy

Hit rate and the MRR:

Factorized Markov Chain approaches (IFSM, FPMC and FOSSIL) indicate lowest accuracy values.
These results indicate that the methods designed based on the assumption of longer-term and richer user profiles are generally not particularly well suited for the specifics of session-based recommendation problems.
The simple pairwise association methods (AR and SR) mostly occupy the middle places in the comparison. These methods also have low computational complexity.
The novel factorization method SMF performs well in REC15 and RSC15-S. Due to the embedding of the current user session, the SMF method outperforms the factorization-based methods.
GRU4REC exhibits a well performance in the hit rate and MRR among most datasets.

Precision and Recall:

The best performance is achieved by the neighborhood-based methods, with V-SKNN working very well across all datasets. Generally, KNN-based methods are successful in predicting the next element and the relevant items in the sessions.

Impact of Different List Lengths:

The ranking of the algorithms is unaffected by the change of the list length.

Cold-Start / Sparsity Effects

Previous experiments revealed that discarding major parts of the older data has no strong impact on the prediction accuracy.

Coverage and Popularity Bias

The factorization-based methods result to the high values in coverage issue.

Conclusion

In this work, researchers compared a number of algorithms. The experimental analyses on various datasets reveal that some simple methods is able to outperform even the most recent methods based on RNN in prediction accuracy. In the meanwhile, the computational demands are comparatively low.

The results indicate several improvements for KNN methods can lead to the dramatic enhancement of performance for different datasets.

Several aspects should be explored. Firstly, the additional information like different types of user actions (e.g., add-to-cart, add-to-wishlist, etc.), item metadata, external context information, trends in community, or long-term user preference data when users are logged in. In addition, researchers should understand in which scenarios certain algorithms are better than others. These insights could further facilitate in building the hybrid recommendation approaches.

ml paper