In the context of search engines, clustering search results communicates the subtopics relevant to a query much more effectively than the traditional list of hyperlinks. This is crucial for broad queries, which arise when a user has a conscious information need but does not know enough about a subject to form a narrow, specific query, and for which documents from multiple subtopics can be relevant. Typically, search result clustering is envisioned as a post-processing step in which a clustering algorithm is applied only to the vector representations (e.g. tf-idf) of the retrieved documents. However, we hypothesize that the query must be involved more directly in the clustering process in order to generate subtopic clusters that are particularly relevant to the query, i.e. we need query-specific subtopic clustering. This is most apparent for broad queries, where the same set of retrieved documents can admit multiple clusterings depending on the query context. In this work, we develop a query-specific siamese similarity metric (QS3M) that can be used with any distance-based clustering algorithm (e.g. HAC) to obtain query-specific subtopic clustering.
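To make the idea of plugging a query-aware similarity into HAC concrete, here is a minimal, dependency-free sketch. It is not the implementation from the repository: `hac`, `toy_similarity`, and the toy `docs` are hypothetical names invented for illustration, average linkage is an assumed choice, and `toy_similarity` is a hand-written stand-in for a learned metric such as QS3M.

```python
from itertools import combinations


def hac(items, query, similarity, num_clusters):
    """Average-link agglomerative clustering driven by similarity(query, a, b).

    Returns a list of clusters, each a list of indices into `items`.
    """
    clusters = [[i] for i in range(len(items))]
    # Cache all pairwise similarities; the query is an argument of every call.
    sim = {
        (i, j): similarity(query, items[i], items[j])
        for i, j in combinations(range(len(items)), 2)
    }

    def linkage(a, b):
        # Mean similarity over all cross-cluster pairs (average linkage).
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > num_clusters:
        # Merge the two clusters that are currently most similar.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # x < y, so index x is unaffected
    return clusters


def toy_similarity(query, a, b):
    # Hypothetical stand-in for a learned metric: two items are similar
    # only if they agree on the facet named by the query.
    return 1.0 if a[query] == b[query] else 0.0


docs = [
    {"name": "lion", "habitat": "land", "diet": "meat"},
    {"name": "shark", "habitat": "sea", "diet": "meat"},
    {"name": "cow", "habitat": "land", "diet": "plants"},
    {"name": "manatee", "habitat": "sea", "diet": "plants"},
]

by_habitat = hac(docs, "habitat", toy_similarity, num_clusters=2)
by_diet = hac(docs, "diet", toy_similarity, num_clusters=2)
```

The same four documents are grouped differently under the two queries, which is the behavior that a query-agnostic similarity over document vectors alone cannot produce.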
How QS3M works - QS3M produces a similarity score for a triplet (q, pi, pj) consisting of a query and a pair of passages/documents, each represented as an embedding obtained with Sentence-BERT. A siamese neural network projects the query and the pair of passages into a latent embedding space, giving (q', pi', pj'). An MLP layer then computes the similarity score from the concatenation of the interaction vectors (pi'-pj', pi'-q', pj'-q') and the individual passage vectors (pi', pj'). This similarity score governs the clustering mechanism, so the query context directly influences the similarity metric and results in query-specific clusters.
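The data flow described above can be sketched in plain Python as follows. This is only an illustration of the forward pass, not the code from the repository: the class name `QS3MSketch`, the layer sizes, the ReLU activations, the sigmoid output, and the random initialization are all assumptions, and random vectors stand in for Sentence-BERT embeddings. A real implementation would normally use a deep-learning framework with trained weights.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


class QS3MSketch:
    def __init__(self, emb_dim, latent_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # One projection shared by q, pi and pj: this sharing is the siamese part.
        self.proj = init_layer(rng, emb_dim, latent_dim)
        # The MLP sees three interaction vectors plus the two passage vectors.
        self.hidden = init_layer(rng, 5 * latent_dim, hidden_dim)
        self.out = init_layer(rng, hidden_dim, 1)

    def project(self, x):
        return [max(0.0, v) for v in linear(x, *self.proj)]  # ReLU (assumed)

    def features(self, q, pi, pj):
        q_, pi_, pj_ = self.project(q), self.project(pi), self.project(pj)
        # Concatenate (pi'-pj', pi'-q', pj'-q', pi', pj').
        return subtract(pi_, pj_) + subtract(pi_, q_) + subtract(pj_, q_) + pi_ + pj_

    def score(self, q, pi, pj):
        h = [max(0.0, v) for v in linear(self.features(q, pi, pj), *self.hidden)]
        logit = linear(h, *self.out)[0]
        return 1.0 / (1.0 + math.exp(-logit))  # similarity in (0, 1)


# Random stand-ins for Sentence-BERT embeddings of a query and two passages.
rng = random.Random(42)
q, pi, pj = ([rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(3))
model = QS3MSketch(emb_dim=8, latent_dim=4, hidden_dim=6)
similarity = model.score(q, pi, pj)
```

As written, nothing in this sketch forces `score(q, pi, pj)` to equal `score(q, pj, pi)`, so a caller that needs a symmetric measure for clustering would have to handle that, for example by averaging both orders.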
How QS3M is trained - QS3M is trained as a binary classifier. Each training sample is a triplet of a query and a pair of passages, with a binary target label denoting whether the two passages should share a cluster in the context of the query. The model is optimized with a binary classification loss (binary cross-entropy), and negative sampling is used to balance the dataset.
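As a rough illustration of this setup, the sketch below builds labeled triplets from one query's gold clustering and evaluates binary cross-entropy on a batch of predictions. The function names, the passage ids, and the balancing rule (downsampling different-cluster pairs to at most the number of same-cluster pairs) are assumptions made for the example; the paper and repository define the actual sampling procedure.

```python
import math
import random
from itertools import combinations


def build_triplets(query, passage_clusters, rng):
    """Turn one query's gold clustering into balanced training samples.

    passage_clusters maps passage id -> gold cluster label for this query.
    Each sample is (query, pi, pj, label) with label 1 if pi and pj share a cluster.
    """
    positives, negatives = [], []
    for pi, pj in combinations(sorted(passage_clusters), 2):
        same = passage_clusters[pi] == passage_clusters[pj]
        (positives if same else negatives).append((query, pi, pj, int(same)))
    # Assumed balancing rule: keep every positive pair and sample an equal
    # number of negative pairs (or all of them if there are fewer).
    negatives = rng.sample(negatives, min(len(negatives), len(positives)))
    return positives + negatives


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Mean BCE between predicted similarities in (0, 1) and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical gold clustering of five passages for one query.
gold = {"p1": "a", "p2": "a", "p3": "b", "p4": "c", "p5": "c"}
samples = build_triplets("example query", gold, random.Random(0))
labels = [label for _, _, _, label in samples]

# An untrained model that always predicts 0.5 gives a loss of ln(2).
loss = binary_cross_entropy([0.5] * len(labels), labels)
```

Of the ten passage pairs here, only two share a cluster, which shows why some form of balancing is needed before training a binary classifier on such pairs.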
For more details, please refer to our paper and source code.
Paper: Query-specific subtopic clustering
GitHub repo: QS3M
Now, you may be wondering: "Instead of training the similarity metric on triplets sampled from the clustering dataset, can we not directly optimize for a clustering metric?" Our COB project and the related paper explore this very question and propose a novel method to address the problem.