Abstract
Serendipity in recommender systems (RSs) has attracted increasing attention
as a concept that enhances user satisfaction by presenting unexpected and
useful items. However, evaluating serendipity performance remains challenging
because the ground truth of serendipity is generally unobservable. Existing
offline metrics often rely on ambiguous definitions or are tailored to specific
datasets and RSs, which limits their generalizability. To address this
issue, we propose a universally applicable evaluation framework that leverages
large language models (LLMs), known for their extensive knowledge and reasoning
capabilities, as evaluators. First, to improve the evaluation performance of
the proposed framework, we assessed the serendipity prediction accuracy of LLMs
using four different prompting strategies on a dataset with user-annotated
ground-truth serendipity labels, and found that the chain-of-thought prompt
achieved the highest accuracy. Next, we re-evaluated the serendipity performance of
both serendipity-oriented and general RSs using the proposed framework on three
commonly used real-world datasets, without ground-truth labels. The results
indicated that no serendipity-oriented RS consistently outperformed the others
across all datasets, and that a general RS sometimes achieved higher serendipity
performance than the serendipity-oriented RSs.
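As a rough illustration of the LLM-as-evaluator idea summarized above (a minimal sketch, not the authors' actual implementation), the Python snippet below shows how a chain-of-thought prompt might ask an LLM to judge whether a recommended item is serendipitous given a user's interaction history. The names build_cot_prompt, call_llm, and judge_serendipity are hypothetical, and call_llm is a placeholder for whichever LLM client is actually used.

```python
from typing import List


def build_cot_prompt(history: List[str], recommended_item: str) -> str:
    """Assemble a chain-of-thought style prompt for a serendipity judgment."""
    history_text = "\n".join(f"- {title}" for title in history)
    return (
        "You are evaluating a recommender system.\n"
        f"The user has interacted with the following items:\n{history_text}\n\n"
        f"The system recommended: {recommended_item}\n\n"
        "Think step by step: (1) Is this item relevant to the user's interests? "
        "(2) Is it unexpected given the interaction history? "
        "Then answer on a final line with 'Serendipitous: yes' or 'Serendipitous: no'."
    )


def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (e.g., a chat-completion client)."""
    raise NotImplementedError


def judge_serendipity(history: List[str], recommended_item: str) -> bool:
    """Return True if the LLM labels the recommendation as serendipitous."""
    response = call_llm(build_cot_prompt(history, recommended_item))
    return "serendipitous: yes" in response.lower()
```

In a full evaluation, such binary judgments would be aggregated over many user-item pairs to score an RS's serendipity performance without relying on ground-truth labels.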