ORCA

Dense Passage Retrieval (DPR) models are a family of neural retrievers that use only the encoder of a Transformer architecture such as BERT to encode queries and short text documents into a shared latent vector space. Relevance is captured by the distance metric defined on that shared space. DPR models are typically trained with a ranking or retrieval objective. In this study, we show that for specific retrieval scenarios, such as overview passage retrieval, these models can be improved by adding a clustering objective that leverages query-specific subtopic information.
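To make the shared-vector-space idea concrete, here is a minimal sketch of dense retrieval scoring. It is not the ORCA code: `build_vocab`, `toy_encode`, and `rank_passages` are hypothetical names, and a normalised bag-of-words vector stands in for the BERT encoder so the example runs without any model weights. The part that carries over is the last step, where the query and the passages are embedded in the same space and ranked by a similarity score.

```python
import numpy as np

def build_vocab(texts):
    # Assumption: a fixed vocabulary built from the texts replaces a learned tokenizer.
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def toy_encode(text, vocab):
    # Hypothetical stand-in for a Transformer encoder: L2-normalised term counts.
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def rank_passages(query, passages, vocab):
    # Query and passages share one vector space; the dot product of unit
    # vectors (cosine similarity) serves as the relevance score.
    q = toy_encode(query, vocab)
    p = np.stack([toy_encode(passage, vocab) for passage in passages])
    scores = p @ q
    order = np.argsort(-scores)  # highest score first
    return [(int(i), float(scores[i])) for i in order]

query = "how does photosynthesis use light"
passages = [
    "photosynthesis converts light into chemical energy",
    "the stock market closed higher today",
    "plants use chlorophyll to absorb light for photosynthesis",
]
vocab = build_vocab(passages + [query])
ranking = rank_passages(query, passages, vocab)
for idx, score in ranking:
    print(f"{score:.3f}  {passages[idx]}")
```

In a real DPR setup the encoder would be a fine-tuned Transformer and passage vectors would usually be precomputed and indexed, but the scoring step keeps this shape.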

Many information retrieval queries span a wide range of subtopics. To address this, we introduce Topic-Mono-BERT, a model that integrates neural ranking and query-specific clustering to improve the relevance and coherence of search results. The model rests on the hypothesis that embeddings trained to cluster topically similar content together will also improve ranking accuracy. Combining clustering and ranking in this way yields more relevant and comprehensive search results. Evaluations on two publicly available passage retrieval datasets show a 16% improvement in the identification of relevant overview passages.
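One way to picture a joint ranking and clustering objective is as a weighted sum of two losses. The sketch below is an assumption-laden illustration, not the loss used by Topic-Mono-BERT: the binary cross-entropy ranking term, the pairwise contrastive clustering term, the `margin`, and the `weight` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def ranking_loss(scores, relevance):
    # Binary cross-entropy between sigmoid(scores) and 0/1 relevance labels.
    probs = 1.0 / (1.0 + np.exp(-scores))
    eps = 1e-12
    return float(-np.mean(relevance * np.log(probs + eps)
                          + (1 - relevance) * np.log(1 - probs + eps)))

def clustering_loss(embeddings, subtopics, margin=1.0):
    # Assumed contrastive form: pull passages of the same subtopic together,
    # push passages of different subtopics at least `margin` apart.
    total, pairs = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            dist = float(np.linalg.norm(embeddings[i] - embeddings[j]))
            if subtopics[i] == subtopics[j]:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0

def joint_loss(scores, relevance, embeddings, subtopics, weight=0.5):
    # `weight` (hypothetical) trades off the clustering term against ranking.
    return ranking_loss(scores, relevance) + weight * clustering_loss(embeddings, subtopics)

scores = np.array([2.0, 1.5, -1.0, -2.0])
relevance = np.array([1.0, 1.0, 0.0, 0.0])
subtopics = [0, 0, 1, 1]

# Same ranking scores, two embedding layouts: one grouped by subtopic, one mixed.
grouped = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
mixed = np.array([[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [2.1, 2.0]])

loss_grouped = joint_loss(scores, relevance, grouped, subtopics)
loss_mixed = joint_loss(scores, relevance, mixed, subtopics)
print(loss_grouped, loss_mixed)
```

With identical ranking scores, the layout that groups passages by subtopic gets the lower joint loss, which is the pressure a clustering term adds during training.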

Paper: Topic-Mono-BERT: A Joint Retrieval-Clustering System for Retrieving Overview Passages

GitHub repo: ORCA