From Embeddings to ANN: Practical Performance on MS MARCO Web Search

In this experiment, we measure the MRR and recall of all the baseline embedding models. The results show that SimANS, which uses ambiguous samples as negative training examples, performs best on the MS MARCO Web Search 100M dataset. The ranking of the baseline models is consistent with the evolution of models in the literature. Nonetheless, compared with the evaluation results on Natural Questions (NQ) [28] and MS MARCO Passage Ranking [35], the performance gap between ANCE and SimANS on MS MARCO Web Search is less significant.
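As a rough illustration of the two quality metrics named above, the following minimal sketch computes MRR@k and recall@k from ranked result lists. The function names, the dictionary-based input layout, and the default cutoff of 10 are assumptions made for this example; they are not taken from the paper's evaluation code.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    Queries with no relevant document in the top k contribute 0.
    """
    total = 0.0
    for qid, doc_ids in ranked.items():
        rel = relevant.get(qid, set())
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked) if ranked else 0.0


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Fraction of relevant documents found in the top k, averaged over
    queries that have at least one relevant document."""
    scores = []
    for qid, doc_ids in ranked.items():
        rel = relevant.get(qid, set())
        if not rel:
            continue  # skip queries without labels
        scores.append(len(rel & set(doc_ids[:k])) / len(rel))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries with hand-written labels.
ranked = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d5", "d9"}}
print(mrr_at_k(ranked, relevant, k=10))
print(recall_at_k(ranked, relevant, k=2))
```

In the toy data, the first relevant document for q1 appears at rank 2 and for q2 at rank 1, so the reciprocal ranks being averaged are 1/2 and 1.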

We also evaluate the system performance of the three baseline embedding models. Since they share the same model architecture and the same number of parameters, their serving time cost is similar. At the peak throughput of 698 QPS, the 50th, 90th, and 99th percentile latencies are 9.896 ms, 10.018 ms, and 11.430 ms, respectively.
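To make the throughput and latency figures concrete, here is a minimal sketch of how QPS and latency percentiles could be derived from per-query timings. It assumes the nearest-rank percentile definition and a single wall-clock duration for the whole run; the helper names and the synthetic timings are hypothetical, and the paper does not state which percentile method it used.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float], wall_clock_s: float) -> Dict[str, float]:
    """Throughput (queries per second) plus P50/P90/P99 latency in ms."""
    return {
        "qps": len(latencies_ms) / wall_clock_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Synthetic timings (1.0 ms .. 100.0 ms), not measurements from the paper.
latencies = [float(i) for i in range(1, 101)]
stats = summarize(latencies, wall_clock_s=0.5)
print(stats)
```

Interpolating percentile definitions, such as the default in many numerical libraries, can give slightly different values on small samples, so the method is worth stating whenever tail latencies are reported.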

4.5 Evaluation of ANN Algorithms

In this experiment, we evaluate ANN performance using vectors generated by the best baseline model. We build both DiskANN and SPANN indices and evaluate their serving performance and result quality. Here we focus only on the gap between ANN and KNN, so we use the brute-force search results as the ground truth for measuring recall. Table 5 summarizes the recall and system performance of the two baselines. The results show that it is difficult to achieve high recall when the number of returned results K is large. One reason is that the distributions of queries and documents are highly skewed and far away from each other (see Figure 6). We also observe this phenomenon in DPR and ANCE embeddings.
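The recall measurement described above, with ANN results scored against brute-force KNN ground truth, can be sketched as follows. This is a minimal illustration, not the paper's evaluation harness: it assumes squared Euclidean distance as the similarity measure, uses tiny hand-written vectors, and stands in a hard-coded ID list for the output of a real index such as DiskANN or SPANN.

```python
from typing import List, Sequence


def brute_force_knn(query: Sequence[float],
                    docs: Sequence[Sequence[float]],
                    k: int) -> List[int]:
    """Exact top-k document indices by squared Euclidean distance
    (the distance metric is an assumption for this sketch)."""
    scored = [
        (sum((q - d) ** 2 for q, d in zip(query, doc)), idx)
        for idx, doc in enumerate(docs)
    ]
    scored.sort()  # ascending distance, ties broken by index
    return [idx for _, idx in scored[:k]]


def ann_recall_at_k(ann_ids: Sequence[Sequence[int]],
                    exact_ids: Sequence[Sequence[int]],
                    k: int) -> float:
    """Average share of the exact top-k neighbors that the ANN search
    also returned in its top k."""
    scores = []
    for approx, exact in zip(ann_ids, exact_ids):
        truth = set(exact[:k])
        if not truth:
            continue  # no ground truth for this query
        scores.append(len(truth & set(approx[:k])) / len(truth))
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus and query; the "ANN" output is a hypothetical stand-in.
docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
query = [0.0, 0.0]
exact = brute_force_knn(query, docs, k=2)
approx = [0, 2]  # pretend an approximate index missed document 1
recall = ann_recall_at_k([approx], [exact], k=2)
print(exact, recall)
```

Because the ground truth here is the exact neighbor set rather than human relevance labels, this metric isolates the loss introduced by the approximate index from the quality of the embedding model.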

Authors:

(1) Qi Chen, Microsoft, Beijing, China;

(2) Xiubo Geng, Microsoft, Beijing, China;

(3) Corby Rosset, Microsoft, Redmond, United States;

(4) Carolyn Buractaon, Microsoft, Redmond, United States;

(5) Jingwen Lu, Microsoft, Redmond, United States;

(6) Tao Shen, University of Technology Sydney, Sydney, Australia and the work was done at Microsoft;

(7) Kun Zhou, Microsoft, Beijing, China;

(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States and the work was done at Microsoft;

(9) Yeyun Gong, Microsoft, Beijing, China;

(10) Paul Bennett, Spotify, New York, United States and the work was done at Microsoft;

(11) Nick Craswell, Microsoft, Redmond, United States;

(12) Xing Xie, Microsoft, Beijing, China;

(13) Fan Yang, Microsoft, Beijing, China;

(14) Bryan Tower, Microsoft, Redmond, United States;

(15) Nikhil Rao, Microsoft, Mountain View, United States;

(16) Anlei Dong, Microsoft, Mountain View, United States;

(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;

(18) Zheng Liu, Microsoft, Beijing, China;

(19) Mingqin Li, Microsoft, Redmond, United States;

(20) Chuanjie Liu, Microsoft, Beijing, China;

(21) Zengzhong Li, Microsoft, Redmond, United States;

(22) Rangan Majumder, Microsoft, Redmond, United States;

(23) Jennifer Neville, Microsoft, Redmond, United States;

(24) Andy Oakley, Microsoft, Redmond, United States;

(25) Knut Magne Risvik, Microsoft, Oslo, Norway;

(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;

(27) Manik Varma, Microsoft, Bengaluru, India;

(28) Yujing Wang, Microsoft, Beijing, China;

(29) Linjun Yang, Microsoft, Redmond, United States;

(30) Mao Yang, Microsoft, Beijing, China;

(31) Ce Zhang, ETH Zürich, Zürich, Switzerland and the work was done at Microsoft.

This paper is available on arXiv under the CC BY 4.0 DEED license.