Yahoo makes its largest-ever machine learning dataset available for researchers

Yahoo's dataset sets a benchmark for large-scale machine learning and recommender systems (Reuters)

Yahoo has announced the release of its largest-ever machine learning dataset, which totals 13.5TB. The dataset is fully anonymised and covers the activity of millions of users who visit its news website.

The interaction data, collected from about 20 million users between February and May 2015, covers activity on the Yahoo homepage, news, sports, finance, movies and real estate pages. In addition to the interaction data, the dataset contains demographic information such as age, gender and geographic data. Yahoo is also releasing the titles, summaries and key-phrases of the news articles.

Yahoo is not the only tech major to open up its machine learning resources to outsiders. In November 2015, Google open-sourced TensorFlow, the machine learning technology that powers a number of its products, such as search in Google Photos, speech recognition in Google apps and the Smart Reply feature in its email app.

Tom Mitchell, chair of the machine learning department at Carnegie Mellon University, said that Yahoo's News Feed dataset is a significant contribution to the research community, which will gain access to data at a realistic scale. Researchers can use it to study which news articles interest which users.

"Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly 'big' data. At the Jacobs School of Engineering at UC San Diego, it will directly and significantly benefit the wide variety of ongoing research in machine learning, artificial intelligence, information retrieval, and big data applications," said Gert Lanckriet, professor, department of Electrical and Computer Engineering, University of California, San Diego.

The dataset is available through the Yahoo Labs Webscope data-sharing programme, a reference library of interesting datasets that academics and scientists can use for non-commercial purposes. The primary aim of the release is to promote independent research in large-scale machine learning.

Yahoo believes the dataset creates a benchmark for large-scale machine learning. It adds that access to truly large-scale datasets is a privilege that has traditionally been reserved for machine learning researchers and data scientists at large companies, and has been out of reach for most academic researchers.

Suju Rajan, director of research for personalisation science at Yahoo Labs, said: "We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, 'real-world' dataset. We strongly believe that this dataset can become the benchmark for large-scale machine learning and recommender systems, and we look forward to hearing from the community about their applications of our data."
