Open Datasets Compiled by HackerNoon's blog

References for Web-Scale Information Retrieval Challenges

2 Jul 2025

A list of scholarly references at the intersection of deep learning in information retrieval and large-scale approximate nearest neighbor search

Navigating Skew: Addressing Language & Domain Biases in Web Data

2 Jul 2025

Explore the challenges posed by highly skewed language and topic distributions in web data, acknowledging potential model biases

Mind the Gap: End-to-End Quality Drop with ANN in Web Search AI

2 Jul 2025

Discover how integrating ANN indices leads to a substantial drop in final retrieval quality compared to brute-force search

From Embeddings to ANN: Practical Performance on MS MARCO Web Search

1 Jul 2025

Dive into the practical evaluation of embedding models and ANN algorithms on MS MARCO Web Search, revealing insights into real-world search system behavior.

Measuring Search Excellence: Result Quality and System Performance

1 Jul 2025

Explore the robust evaluation framework for MS MARCO Web Search baselines, covering both result quality and system performance under resource constraints.

Establishing Baselines: MS MARCO Web Search's Foundational Methods

1 Jul 2025

Explore the cutting-edge embedding models and disk-based ANN algorithms selected as initial baselines for the new MS MARCO search benchmark.

MS MARCO Web Search: Unveiling Initial Benchmark Results

1 Jul 2025

Explore the foundational benchmark results on the MS MARCO Web Search 100M dataset, featuring state-of-the-art embedding models

Unlocking Web Search AI: MS MARCO's Three Grand Challenges

1 Jul 2025

Discover how MS MARCO Web Search sparks new research, posing formidable challenges in large-scale embedding model generalization

Deep Dive into MS MARCO Web Search: Unpacking Dataset Characteristics

29 Jun 2025

Explore a comprehensive analysis of the MS MARCO Web Search dataset, detailing its multilingual distribution and significant data skew