Datasets

It’s become commonplace to point out that machine learning models are only as good as the data they’re trained on. The old slogan “garbage in, garbage out” no doubt applies to machine learning practice, as does the related catchphrase “bias in, bias out”. Yet, these proverbs still understate—and somewhat misrepresent—the significance of data for machine learning.

It’s not only the output of a learning algorithm that may suffer with poor input data. A dataset serves many other vital functions in the machine learning ecosystem. The dataset itself is an integral part of the problem formulation. It implicitly sorts out and operationalizes what the problem is that practitioners end up solving. Datasets have also shaped the course of entire scientific communities in their capacity to measure and benchmark progress, support competitions, and interface between researchers in academia and practitioners in industry.

If so much hinges on data in machine learning, it might come as a surprise that there is no simple answer to the question of what makes data good for what purpose. The collection of data for machine learning applications has not followed any established theoretical framework, certainly not one that was recognized a priori.

In this chapter, we take a closer look at popular datasets in the field of machine learning and the benchmarks that they support. We will use this to tease apart the different roles datasets play in scientific and engineering contexts. Then we will review the harms associated with data and discuss how they can be mitigated based on the dataset’s role. We will conclude with several broad directions for improving data practices.

We limit the scope of this chapter in some important ways. Our focus will be largely on publicly available datasets that support training and testing purposes in machine learning research and applications. Our focus excludes large swaths of industrial data collection, surveillance, and data mining practices. It also excludes data purposefully collected to test specific scientific hypotheses, such as experimental data gathered in a medical trial.

A tour of datasets in different domains

The creation of datasets in machine learning does not follow a clear theoretical framework. Datasets aren’t collected to test a specific scientific hypothesis. In fact, we will see that there are many different roles data plays in machine learning. As a result, it makes sense to start by looking at a few influential datasets from different domains to get a better feeling for what they are, what motivated their creation, how they organized communities, and what impact they had.

TIMIT

Automatic speech recognition is a machine learning problem of significant commercial interest. Its roots date back to the early 20th century (Xiaochang Li and Mara Mills, “Vocal Features: From Voice Identification to Speech Recognition by Machine,” Technology and Culture 60, no. 2 (2019): S129–60).

Interestingly, speech recognition also features one of the oldest benchmark datasets, the TIMIT (Texas Instruments/Massachusetts Institute of Technology) data. The creation of the dataset was funded through a 1986 DARPA program on speech recognition. In the mid-eighties, artificial intelligence was in the middle of a “funding winter” in which many governmental and industrial agencies were hesitant to sponsor AI research because it often promised more than it could deliver. DARPA program manager Charles Wayne proposed that a way around this problem was to establish more rigorous evaluation methods. Wayne enlisted the National Institute of Standards and Technology to create and curate shared datasets for speech, and he graded success in his program based on performance on recognition tasks on these datasets.

Many now credit Wayne’s program with kick-starting a revolution of progress in speech recognition (Mark Liberman, “Fred Jelinek,” Computational Linguistics 36, no. 4 (2010): 595–99; Kenneth Ward Church, “Emerging Trends: A Tribute to Charles Wayne,” Natural Language Engineering 24, no. 1 (2018): 155–60; Mark Liberman and Charles Wayne, “Human Language Technology,” AI Magazine 41, no. 2 (2020)). According to Kenneth Ward Church,

It enabled funding to start because the project was glamour-and-deceit-proof, and to continue because funders could measure progress over time. Wayne’s idea makes it easy to produce plots which help sell the research program to potential sponsors. A less obvious benefit of Wayne’s idea is that it enabled hill climbing. Researchers who had initially objected to being tested twice a year began to evaluate themselves every hour.

A first prototype of the TIMIT dataset was released in December of 1988 on a CD-ROM. An improved release followed in October 1990. TIMIT already featured the training/test split typical for modern machine learning benchmarks. There’s a fair bit we know about the creation of the data due to its thorough documentation (John S Garofolo et al., “DARPA TIMIT Acoustic-Phonetic Continous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1,” NASA STI/Recon Technical Report n 93 (1993): 27403).

TIMIT features a total of about 5 hours of speech, composed of 6300 utterances, specifically, 10 sentences spoken by each of 630 speakers. The sentences were drawn from a corpus of 2342 sentences such as the following.

She had your dark suit in greasy wash water all year. (sa1)
Don't ask me to carry an oily rag like that. (sa2)
This was easy for us. (sx3)
Jane may earn more money by working hard. (sx4)
She is thinner than I am. (sx5)
Bright sunshine shimmers on the ocean. (sx6)
Nothing is as offensive as innocence. (sx7)

The TIMIT documentation distinguishes between 8 major dialect regions in the United States, documented as New England, Northern, North Midland, South Midland, Southern, New York City, Western, Army Brat (moved around). Of the speakers, 70% are male and 30% are female. All native speakers of American English, the subjects were primarily employees of Texas Instruments at the time. Many of them were new to the Dallas area where they worked.

Racial information was supplied with the distribution of the data and coded as “White”, “Black”, “American Indian”, “Spanish-American”, “Oriental”, and “Unknown”. Of the 630 speakers, 578 were identified as White, 26 as Black, 2 as American Indian, 2 as Spanish-American, 3 as Oriental, and 17 as unknown.

Demographic information about the TIMIT speakers
Race	Male	Female	Total (%)
White	402	176	578 (91.7%)
Black	15	11	26 (4.1%)
American Indian	2	0	2 (0.3%)
Spanish-American	2	0	2 (0.3%)
Oriental	3	0	3 (0.5%)
Unknown	12	5	17 (2.6%)

The documentation notes:

In addition to these 630 speakers, a small number of speakers with foreign accents or other extreme speech and/or hearing abnormalities were recorded as “auxiliary” subjects, but they are not included on the CD-ROM.

It comes as no surprise that early speech recognition models had significant demographic and racial biases in their performance.

Today, several major companies, including Amazon, Apple, Google, and Microsoft, use speech recognition models in a variety of products from cell phone apps to voice assistants. There is no longer a major open benchmark that would support training models competitive with the industrial counterparts. Industrial speech recognition pipelines are generally complex and use proprietary data sources that we don’t know a lot about. Nevertheless, today’s speech recognition systems continue to exhibit performance disparities along racial lines (Allison Koenecke et al., “Racial Disparities in Automated Speech Recognition,” Proceedings of the National Academy of Sciences 117, no. 14 (2020): 7684–89).

UCI Machine Learning Repository

The UCI Machine Learning Repository currently hosts more than 500 datasets, mostly for different classification and regression tasks. Most datasets are relatively small, consisting of a few hundred or a few thousand instances. The majority are structured tabular datasets with a handful or a few tens of attributes.

The UCI Machine Learning Repository contributed to the adoption of the train-test paradigm in machine learning in the late 1980s. Pat Langley recalls:

The experimental movement was aided by another development. David Aha, then a PhD student at UCI, began to collect data sets for use in empirical studies of machine learning. This grew into the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/), which he made available to the community by FTP in 1987. This was rapidly adopted by many researchers because it was easy to use and because it let them compare their results to previous findings on the same tasks. (Pat Langley, “The Changing Science of Machine Learning,” Springer, 2011.)

The most popular dataset in the repository is the Iris Data Set containing taxonomic measurements of 150 iris flowers, 50 from each of 3 species. The task is to classify the species given the measurements.

As of October 2020, the second most popular dataset in the UCI repository is the Adult dataset. Extracted from the 1994 Census database, it features nearly 50,000 instances describing individuals in the United States, each having 14 attributes. The task is to classify whether or not an individual earns more than 50,000 US dollars. The Adult dataset remains popular in the algorithmic fairness community, largely because it is one of the few publicly available datasets that features demographic information including gender (coded in binary as male/female), as well as race (coded as Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, and White).

Unfortunately, the data has some idiosyncrasies that make it less than ideal for understanding biases in machine learning models. Due to the age of the data, and the income cutoff at $50,000, almost all instances labeled Black are below the cutoff, as are almost all instances labeled female. Indeed, a standard logistic regression model trained on the data achieves about 85% accuracy overall, while the same model achieves 91% accuracy on Black instances, and nearly 93% accuracy on female instances. Likewise, the ROC curves for the latter two groups actually enclose more area than the ROC curve for male instances. This is an atypical situation: more often, machine learning models perform worse on historically disadvantaged groups.
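To see how a skewed base rate alone can inflate accuracy on a group, consider the following minimal Python sketch. The data are synthetic and purely illustrative, and the helper `accuracy_by_group` and the group names `a` and `b` are hypothetical; nothing here is computed from the Adult dataset. Even a trivial classifier that always predicts "below the cutoff" scores higher on the group in which almost everyone is below the cutoff.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions, and groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Synthetic labels: 1 = income above the cutoff, 0 = below.
# Group "a": 30 of 100 above the cutoff; group "b": 5 of 100 above.
y_true = [1] * 30 + [0] * 70 + [1] * 5 + [0] * 95
groups = ["a"] * 100 + ["b"] * 100

# A trivial classifier that always predicts "below the cutoff".
y_pred = [0] * len(y_true)

acc = accuracy_by_group(y_true, y_pred, groups)
print(acc)  # {'a': 0.7, 'b': 0.95}
```

The gap between the two groups here comes entirely from the label distribution, not from anything the classifier learned, which is one reason group-wise accuracy on such data is hard to interpret.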

MNIST

The MNIST dataset contains images of handwritten digits. Its most common version has 60,000 training images and 10,000 test images, each having 28x28 grayscale pixels.

MNIST was created by researchers Burges, Cortes, and LeCun from an earlier dataset released by the National Institute of Standards and Technology (NIST). The dataset was introduced in a research paper in 1998 to showcase the use of gradient-based deep learning methods for document recognition tasks (Yann LeCun et al., “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE 86, no. 11 (1998): 2278–2324). Cited over 30,000 times since then, MNIST became a highly influential benchmark in the computer vision community. Two decades later, researchers continue to use the data actively.

The original NIST data had the property that training and test data came from two different populations. The former featured the handwriting of two thousand American Census Bureau employees, whereas the latter came from five hundred American high school students (Patrick J Grother, “NIST Special Database 19,” Handprinted Forms and Characters Database, National Institute of Standards and Technology, 1995, 10). The creators of MNIST reshuffled these two data sources and split them into a training set and a test set. Moreover, they scaled and centered the digits. The exact procedure to derive MNIST from NIST was lost, but was recently reconstructed by matching images from both data sources (Chhavi Yadav and Léon Bottou, “Cold Case: The Lost MNIST Digits,” arXiv Preprint arXiv:1905.10498, 2019).

The original MNIST test set was of the same size as the training set, but the smaller test set became standard in research use. The 50,000 digits in the original test set that didn’t make it into the smaller test set were later identified and dubbed the lost digits (Yadav and Bottou).

From the beginning, MNIST was intended to be a benchmark used to compare the strengths of different methods. For several years, LeCun maintained an informal leaderboard on a personal website that listed the best accuracy numbers that different learning algorithms achieved on MNIST.

A snapshot of the original MNIST leaderboard from February 2, 1999. Source: Internet Archive (Retrieved: December 4, 2020)
Method	Test error (%)
linear classifier (1-layer NN)	12.0
linear classifier (1-layer NN) [deskewing]	8.4
pairwise linear classifier	7.6
K-nearest-neighbors, Euclidean	5.0
K-nearest-neighbors, Euclidean, deskewed	2.4
40 PCA + quadratic classifier	3.3
1000 RBF + linear classifier	3.6
K-NN, Tangent Distance, 16x16	1.1
SVM deg 4 polynomial	1.1
Reduced Set SVM deg 5 polynomial	1.0
Virtual SVM deg 9 poly [distortions]	0.8
2-layer NN, 300 hidden units	4.7
2-layer NN, 300 HU, [distortions]	3.6
2-layer NN, 300 HU, [deskewing]	1.6
2-layer NN, 1000 hidden units	4.5
2-layer NN, 1000 HU, [distortions]	3.8
3-layer NN, 300+100 hidden units	3.05
3-layer NN, 300+100 HU [distortions]	2.5
3-layer NN, 500+150 hidden units	2.95
3-layer NN, 500+150 HU [distortions]	2.45
LeNet-1 [with 16x16 input]	1.7
LeNet-4	1.1
LeNet-4 with K-NN instead of last layer	1.1
LeNet-4 with local learning instead of ll	1.1
LeNet-5, [no distortions]	0.95
LeNet-5, [huge distortions]	0.85
LeNet-5, [distortions]	0.8
Boosted LeNet-4, [distortions]	0.7

In its capacity as a benchmark, it became a showcase for the emerging kernel methods of the early 2000s that temporarily achieved top performance on MNIST (Dennis DeCoste and Bernhard Schölkopf, “Training Invariant Support Vector Machines,” Machine Learning 46, no. 1 (2002): 161–90). Today, it is not difficult to achieve less than 0.5% classification error with a wide range of convolutional neural network architectures. The best models classify all but a few pathological test instances correctly. As a result, MNIST is widely considered too easy for today’s research tasks.
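To make these error rates concrete, the following small Python sketch converts a test error percentage into a count of misclassified digits. It assumes the standard 10,000-image test set; the helper name `misclassified` is ours and purely illustrative.

```python
TEST_SET_SIZE = 10_000  # assumption: the standard MNIST test set


def misclassified(error_percent, n=TEST_SET_SIZE):
    """Number of test images a model gets wrong at a given test error (%)."""
    return round(error_percent / 100 * n)


print(misclassified(12.0))  # 1200 (linear classifier on the 1999 leaderboard)
print(misclassified(0.7))   # 70   (boosted LeNet-4 on the 1999 leaderboard)
print(misclassified(0.5))   # 50   (the 0.5% mark mentioned above)
```

At these levels, the difference between two strong models amounts to a few dozen images, which is part of why the benchmark no longer discriminates well between methods.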

MNIST wasn’t the first dataset of handwritten digits in use for machine learning research. Earlier, the US Postal Service (USPS) had released a dataset of 9298 images (7291 for training, and 2007 for testing). The USPS data was actually a fair bit harder to classify than MNIST. A non-negligible fraction of the USPS digits look unrecognizable to humans (J Bromley and E Sackinger, “Neural-Network and k-Nearest-Neighbor Classifiers,” Rapport Technique, 1991, 11359–910819), whereas humans recognize essentially all digits in MNIST.

ImageNet

ImageNet is a large repository of labeled images that has been highly influential in computer vision research over the last decade. The image labels correspond to nouns from the WordNet lexical database of the English language (George A Miller, WordNet: An Electronic Lexical Database (MIT Press, 1998)). WordNet groups nouns into cognitive synonyms, called synsets. The words car and automobile, for example, would fall into the same synset. On top of these categories WordNet provides a hierarchical tree structure according to a super-subordinate relationship between synsets. The synset for chair, for example, is a child of the synset for furniture in the WordNet hierarchy. WordNet existed before ImageNet and in part inspired the creation of ImageNet.
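The structure just described can be sketched in a few lines of Python. This is a toy stand-in, not the WordNet database or its API: the identifiers such as `car.n` and the helpers `synset_of` and `ancestors` are hypothetical, and the only entries are the two examples from the text.

```python
# Toy synset table: each synset has a set of synonymous words and one parent.
# Ancestors beyond those named in the text are omitted (parent = None).
synsets = {
    "car.n": {"words": {"car", "automobile"}, "parent": None},
    "chair.n": {"words": {"chair"}, "parent": "furniture.n"},
    "furniture.n": {"words": {"furniture"}, "parent": None},
}


def synset_of(word):
    """Return the id of the first synset containing the word, or None."""
    for sid, entry in synsets.items():
        if word in entry["words"]:
            return sid
    return None


def ancestors(sid):
    """Follow parent links upward and return the chain of ancestor ids."""
    chain = []
    parent = synsets[sid]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = synsets[parent]["parent"]
    return chain


print(synset_of("car") == synset_of("automobile"))  # True
print(ancestors("chair.n"))  # ['furniture.n']
```

An image category in ImageNet corresponds to one key of such a table, so every label inherits a position in the hierarchy.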

The initial release of ImageNet included about 5000 image categories, each corresponding to a synset in WordNet. These ImageNet categories averaged about 600 images per category (Jia Deng et al., “ImageNet: A Large-Scale Hierarchical Image Database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2009), 248–55). ImageNet grew over time and its Fall 2011 release had reached about 32,000 categories.

The construction of ImageNet required two essential steps: retrieving candidate images for each synset, and labeling the retrieved images. The first step used online search engines and photo-sharing platforms with a search interface, specifically Flickr. Candidate images were taken from the image search results associated with the synset nouns for each category.

For the second step, labeling, the creators of ImageNet turned to Amazon’s Mechanical Turk platform (MTurk). MTurk is an online labor market that allows individuals and corporations to hire on-demand workers to perform simple tasks. In this case, MTurk workers were presented with candidate images and had to decide whether or not each candidate image indeed showed the category that it was putatively associated with.

It is important to distinguish between this ImageNet database and a popular machine learning benchmark and competition, called ImageNet Large Scale Visual Recognition Challenge (ILSVRC), that was derived from it (Olga Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision 115, no. 3 (2015): 211–52). The competition was organized yearly from 2010 until 2017, reaching significant prominence in both industry and academia, especially as a benchmark for emerging deep learning models.

When machine learning practitioners say “ImageNet” they typically refer to the data used for the image classification task in the 2012 ILSVRC benchmark. The competition included other tasks, such as object recognition, but image classification has become the most popular task for the dataset. Expressions such as “a model trained on ImageNet” typically refer to training an image classification model on the benchmark dataset from 2012.

Another common practice involving the ILSVRC data is pre-training. Often a practitioner has a specific classification problem in mind whose label set differs from the 1000 classes present in the data. It’s possible nonetheless to use the data to create useful features that can then be used in the target classification problem. Where ILSVRC enters real-world applications it’s often to support pre-training.

This colloquial use of the word ImageNet can lead to some confusion, not least because the ILSVRC-2012 dataset differs significantly from the broader database. It only includes a subset of 1000 categories. Moreover, these categories are a rather skewed subset of the broader ImageNet hierarchy. For example, of these 1000 categories only three are in the person branch of the WordNet hierarchy, specifically, groom, baseball player, and scuba diver. Yet, more than 100 of the 1000 categories correspond to different dog breeds. The number is 118, to be exact, not counting wolves, foxes, and wild dogs that are also present among the 1000 categories.

What motivated the exact choice of these 1000 categories is not entirely clear. The apparent canine inclination, however, isn’t just a quirk either. At the time, there was an interest in the computer vision community in making progress on prediction with many classes, some of which are very similar. This reflects a broader pattern in the machine learning community. The creation of datasets is often driven by an intuitive sense of what the technical challenges are for the field. In the case of ImageNet, another important consideration was scale, both in terms of the number of images and the number of classes.

The large-scale annotation and labeling that went into ImageNet falls into a category of labor that Gray and Suri call ghost work in their book of the same name (Mary L Gray and Siddharth Suri, Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass (Eamon Dolan Books, 2019)). They point out:

MTurk workers are the AI revolution’s unsung heroes.

Indeed, ImageNet was labeled by about 49,000 MTurk workers from 167 countries over the course of multiple years.

The Netflix Prize

The Netflix Prize was one of the most famous machine learning competitions. Starting on October 2, 2006, the competition ran for nearly three years ending with a grand prize of $1M, announced on September 18, 2009. Over the years, the competition saw 44,014 submissions from 5169 teams.

The Netflix training data contained roughly 100 million movie ratings from nearly 500 thousand Netflix subscribers on a set of 17770 movies. Each data point corresponds to a tuple <user, movie, date of rating, rating>. At about 650 megabytes in size, the dataset was just small enough to fit on a CD-ROM, but large enough to pose a challenge at the time.
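As a rough illustration of this record format, the sketch below stores a few made-up rating tuples and keeps only the observed user-movie pairs. The `Rating` type, the field names, and the sample records are hypothetical and do not reflect the actual file layout of the Netflix data; the final calculation uses only the rounded figures quoted above.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Rating(NamedTuple):
    """One data point: <user, movie, date of rating, rating>."""
    user: int
    movie: int
    day: date
    stars: int


# Made-up records for illustration only.
records = [
    Rating(1, 10, date(2005, 1, 3), 4),
    Rating(1, 11, date(2005, 2, 9), 2),
    Rating(2, 10, date(2005, 3, 1), 5),
]

# Sparse storage: only observed (user, movie) pairs take up space.
by_user = defaultdict(dict)
for rec in records:
    by_user[rec.user][rec.movie] = rec.stars

# Back-of-the-envelope share of all user-movie pairs that carry a rating,
# using the rounded figures from the text.
n_ratings = 100_000_000
n_users = 500_000
n_movies = 17_770
share = n_ratings / (n_users * n_movies)
print(f"{share:.1%}")  # 1.1%
```

Roughly one in a hundred possible user-movie pairs has a rating, so any practical representation of the data has to be sparse.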

The Netflix data can be thought of as a matrix with n = 480189 rows and m = 17770 columns. Each row corresponds to a Netflix subscriber and each column to a movie. The only entries present in the matrix are those for which a given subscriber rated a given movie with a rating in {1, 2, 3, 4, 5}. All other entries, that is, the vast majority, are missing. The objective of the participants was to predict the missing entries of the matrix, a problem known as matrix completion, or collaborative filtering somewhat more broadly. In fact, the Netflix challenge did so much to popularize this problem that it is sometimes called the Netflix problem. The idea is that if we could predict missing entries, we’d be able to recommend unseen movies to users accordingly.
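One well-known way to approach matrix completion is to fit a low-rank factorization to the observed entries with stochastic gradient descent. The sketch below illustrates that idea on a tiny made-up matrix. It is our own illustrative choice, not a method described in this chapter and not any competition entry; the function names, hyperparameters, and ratings are all assumptions.

```python
import random


def factorize(ratings, n_users, n_movies, rank=2, lr=0.02, reg=0.02,
              epochs=500, seed=0):
    """Fit user factors P and movie factors Q so that P[u] . Q[m] ~ rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(P, Q, u, m)
            for k in range(rank):
                pu, qm = P[u][k], Q[m][k]  # update both from the old values
                P[u][k] += lr * (err * qm - reg * pu)
                Q[m][k] += lr * (err * pu - reg * qm)
    return P, Q


def predict(P, Q, u, m):
    """Predicted rating: inner product of user and movie factors."""
    return sum(pu * qm for pu, qm in zip(P[u], Q[m]))


def rmse(P, Q, ratings):
    """Root mean squared error over the given (user, movie, rating) triples."""
    se = sum((r - predict(P, Q, u, m)) ** 2 for u, m, r in ratings)
    return (se / len(ratings)) ** 0.5


# Observed entries of a made-up 4 x 4 rating matrix; all others are missing.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5),
]

P0, Q0 = factorize(ratings, 4, 4, epochs=0)  # untrained baseline
P, Q = factorize(ratings, 4, 4)
print(rmse(P0, Q0, ratings), rmse(P, Q, ratings))  # error before and after fitting
print(predict(P, Q, 3, 1))  # a prediction for an entry that was never observed
```

Note that the error printed here is measured on the observed entries only. Judging how well the missing entries are predicted requires data that was held out from fitting.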

The holdout data that Netflix kept secret consisted of about three million ratings. Half of them were used to compute a running leaderboard throughout the competition. The other half determined the final winner.
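The two-way use of the holdout data can be sketched as follows. How Netflix actually assigned ratings to the two halves is not stated here, so the random shuffle below is an assumption, and the function name `split_holdout` is hypothetical.

```python
import random


def split_holdout(holdout, seed=0):
    """Shuffle the held-out items and split them into two halves.

    The first half would score the running leaderboard; the second half
    would be consulted only once, to determine the final winner.
    """
    items = list(holdout)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]


# Stand-in for held-out ratings: ten placeholder ids.
holdout = list(range(10))
leaderboard_half, final_half = split_holdout(holdout)
print(len(leaderboard_half), len(final_half))  # 5 5
```

Keeping the second half untouched until the end means that repeated submissions tuned against the leaderboard cannot directly adapt to the data that decides the winner.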

The Netflix competition was hugely influential. Not only did it attract significant participation, it also fueled much academic interest in collaborative filtering for years to come. Moreover, it popularized the competition format as an appealing way for companies to engage with the machine learning community. A startup called Kaggle, founded in April 2010, organized hundreds of machine learning competitions for various companies and organizations before its acquisition by Google in 2017.

But the Netflix competition became infamous for another reason. Although Netflix had replaced usernames with pseudonymous numbers, researchers Narayanan and Shmatikov were able to re-identify some of the Netflix subscribers whose movie ratings were in the dataset (Arvind Narayanan and Vitaly Shmatikov, “Robust de-Anonymization of Large Sparse Datasets,” in 2008 IEEE Symposium on Security and Privacy (SP 2008) (IEEE, 2008), 111–25) by linking those ratings with publicly available movie ratings on IMDB, an online movie database. Some Netflix subscribers had also publicly rated an overlapping set of movies on IMDB under their real identities. In the privacy literature, this is called a linkage attack and it’s one of the ways that seemingly anonymized data can be de-anonymized (Cynthia Dwork et al., “Exposed! A Survey of Attacks on Private Data,” Annual Review of Statistics and Its Application 4 (2017): 61–84).
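The basic idea of a linkage attack can be conveyed with a toy Python sketch. This is a deliberately naive exact-match version on made-up data, not the algorithm from the cited paper; the function `link`, the threshold `min_overlap`, and all names and ratings are hypothetical.

```python
def link(pseudonymous, public, min_overlap=3):
    """Match pseudonymous ids to public identities by shared (movie, rating) pairs.

    An id is linked to the public profile with the largest overlap, provided
    the overlap reaches min_overlap; otherwise it stays unmatched.
    """
    matches = {}
    for pid, ratings in pseudonymous.items():
        best_name, best_overlap = None, 0
        for name, public_ratings in public.items():
            overlap = len(set(ratings.items()) & set(public_ratings.items()))
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_overlap >= min_overlap:
            matches[pid] = best_name
    return matches


# "Anonymized" release: pseudonymous id -> {movie: rating}.
pseudonymous = {
    101: {"A": 5, "B": 1, "C": 4, "D": 2},
    102: {"A": 3, "E": 4},
}
# Public profiles under real names -> {movie: rating}.
public = {
    "alice": {"A": 5, "B": 1, "C": 4},
    "bob": {"E": 2, "F": 5},
}

print(link(pseudonymous, public))  # {101: 'alice'}
```

Once id 101 is linked to a name, every other rating under that id, here the rating for movie D, is attributed to that person as well, which is what makes the release a privacy risk.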

What followed were multiple class action lawsuits against Netflix, as well as an inquiry by the Federal Trade Commission over privacy concerns. As a consequence, Netflix canceled plans for a second competition, which it had announced on August 6, 2009.

To this day, privacy concerns are a legitimate obstacle to public data release and dataset creation. Deanonymization techniques are mature and efficient. There provably is no algorithm that could take a dataset and provide a rigorous privacy guarantee to all participants, while being useful for all analyses and machine learning purposes. Dwork and Roth call this the Fundamental Law of Information Recovery: “overly accurate answers to too many questions will destroy privacy in a spectacular way” (Cynthia Dwork, Aaron Roth, et al., “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science 9, no. 3-4 (2014): 211–407).

Summary

Benchmark datasets are central to machine learning. They play many roles including enabling algorithmic innovation, measuring progress, and providing training data. Since its systematization in the late 1980s, performance evaluation on benchmarks has gradually become a ubiquitous practice because it makes it harder for researchers to cheat intentionally or unintentionally.

But an excessive focus on benchmarks brings many drawbacks. Researchers spend prodigious amounts of effort optimizing models to achieve state-of-the-art performance. The results are often both scientifically uninteresting and of little relevance to practitioners because benchmarks omit many real-world details. The approach also amplifies the harms associated with data, including downstream harms, representational harms, and privacy violations.

As we write this book, the benchmark approach is coming under scrutiny because of these ethical concerns. While the benefits and drawbacks of benchmarks are both well known, our overarching goal in this chapter has been to provide a single framework that can help analyze both. Our position is that the core of the benchmark approach is worth preserving, but we envision a future where benchmarks play a more modest role as one of many ways to advance knowledge. To mitigate the harms associated with data, we believe that substantial changes to the practices of dataset creation, use, and governance are necessary. We have outlined a few ways to do this, adding to the emerging literature on this topic.

References

Amatriain, Xavier, and Justin Basilico. “Netflix Recommendations: Beyond the 5 Stars (Part 1).” Netflix Tech Blog 6 (2012).

Bandalos, Deborah L. Measurement Theory and Applications for the Social Sciences. Guilford Publications, 2018.

Bao, Michelle, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, and Suresh Venkatasubramanian. “It’s COMPASlicated: The Messy Relationship Between RAI Datasets and Algorithmic Fairness Benchmarks.” arXiv Preprint arXiv:2106.05498, 2021.

Bender, Emily M, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Conference on Fairness, Accountability, and Transparency, 610–23, 2021.

Billsus, Daniel, Michael J Pazzani, et al. “Learning Collaborative Information Filters.” In ICML, 98:46–54, 1998.

Blodgett, Su Lin, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. “Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1004–15, 2021.

Blum, Avrim, and Moritz Hardt. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions.” In International Conference on Machine Learning, 1006–14. PMLR, 2015.

Bouthillier, Xavier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, et al. “Accounting for Variance in Machine Learning Benchmarks.” Proceedings of Machine Learning and Systems 3 (2021).

Bowker, Geoffrey C., and Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. MIT Press, 2000.

boyd, danah, and Kate Crawford. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15, no. 5 (2012): 662–79.

Branwen, Gwern. “The Neural Net Tank Urban Legend,” 2011.

Breiman, Leo et al. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16, no. 3 (2001): 199–231.

Bromley, J, and E Sackinger. “Neural-Network and k-Nearest-Neighbor Classifiers.” Rapport Technique, 1991, 11359–910819.

Buolamwini, Joy, and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” In Conference on Fairness, Accountability and Transparency, 77–91, 2018.

Chaaban, Ibrahim, and Michael R Scheessele. “Human Performance on the USPS Database.” Report, Indiana University South Bend, 2007.

Church, Kenneth Ward. “Emerging Trends: A Tribute to Charles Wayne.” Natural Language Engineering 24, no. 1 (2018): 155–60.

Cortes, Corinna, and Vladimir Vapnik. “Support-Vector Networks.” Machine Learning 20, no. 3 (1995): 273–97.

Couldry, Nick, and Ulises A. Mejias. “Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject.” Television & New Media 20, no. 4 (2019): 336–49.

Crawford, Kate. The Atlas of AI. Yale University Press, 2021.

Crawford, Kate, and Trevor Paglen. “Excavating AI: The Politics of Training Sets for Machine Learning.” Excavating AI (Www.excavating.ai), 2019.

Dawes, Robyn M, David Faust, and Paul E Meehl. “Clinical Versus Actuarial Judgment.” Science 243, no. 4899 (1989): 1668–74.

DeCoste, Dennis, and Bernhard Schölkopf. “Training Invariant Support Vector Machines.” Machine Learning 46, no. 1 (2002): 161–90.

Deerwester, Scott, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41, no. 6 (1990): 391–407.

Dehghani, Mostafa, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. “The Benchmark Lottery.” arXiv Preprint arXiv:2107.07002, 2021.

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. IEEE, 2009.

Denton, Emily, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole. “On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet.” Big Data & Society 8, no. 2 (2021): 20539517211035955.

Duda, Richard O, Peter E Hart, and David G Stork. Pattern Classification and Scene Analysis. Vol. 3. Wiley New York, 1973.

Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. “The Reusable Holdout: Preserving Validity in Adaptive Data Analysis.” Science 349, no. 6248 (2015): 636–38.

Dwork, Cynthia, Aaron Roth, et al. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Theoretical Computer Science 9, no. 3-4 (2014): 211–407.

Dwork, Cynthia, Adam Smith, Thomas Steinke, and Jonathan Ullman. “Exposed! A Survey of Attacks on Private Data.” Annual Review of Statistics and Its Application 4 (2017): 61–84.

Evans, Richard. “RA Fisher and the Science of Hatred.” The New Statesman, 2020.

Fabris, Alessandro, Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto. “Algorithmic Fairness Datasets: The Story so Far.” arXiv Preprint arXiv:2202.01711, 2022.

Fisher, Ronald A. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7, no. 2 (1936): 179–88.

Funk, Simon. “Try This at Home.” Http://Sifter.org/~Simon/Journal/2006, 2006.

Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv Preprint arXiv:2101.00027, 2020.

Garofolo, John S, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. “DARPA TIMIT Acoustic-Phonetic Continous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1.” NASA STI/Recon Technical Report n 93 (1993): 27403.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. “Datasheets for Datasets.” arXiv Preprint arXiv:1803.09010, 2018.

Gonen, Hila, and Yoav Goldberg. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings but Do Not Remove Them.” arXiv Preprint arXiv:1903.03862, 2019.

Gray, Mary L, and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, 2019.

Grother, Patrick J. “NIST Special Database 19.” Handprinted Forms and Characters Database, National Institute of Standards and Technology, 1995, 10.

Hand, David J. Measurement: A Very Short Introduction. Oxford University Press, 2016.

Hand, David J. Measurement Theory and Practice: The World Through Quantification. Wiley, 2010.

Hardt, Moritz, and Benjamin Recht. Patterns, Predictions, and Actions: Foundations of Machine Learning. Princeton University Press, 2022.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” In International Conference on Computer Vision, 1026–34, 2015.

Henrich, Joseph, Steven J Heine, and Ara Norenzayan. “The Weirdest People in the World?” Behavioral and Brain Sciences 33, no. 2-3 (2010): 61–83.

Herlocker, Jonathan L, Joseph A Konstan, Loren G Terveen, and John T Riedl. “Evaluating Collaborative Filtering Recommender Systems.” ACM Transactions on Information Systems (TOIS) 22, no. 1 (2004): 5–53.

Huang, Gary B., Manu Ramesh, Tamara Berg, and Erik Learned-Miller. “Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.” University of Massachusetts, Amherst, 2007.

Jacobs, Abigail Z, and Hanna Wallach. “Measurement and Fairness.” In Conference on Fairness, Accountability, and Transparency, 375–85, 2021.

Jo, Eun Seo, and Timnit Gebru. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In Conference on Fairness, Accountability, and Transparency, 306–16, 2020.

Johnson, Melvin. “A Scalable Approach to Reducing Gender Bias in Google Translate.” Google Blog, 2020.

Kapoor, Sayash, and Arvind Narayanan. “Leakage and the Reproducibility Crisis in ML-Based Science.” arXiv Preprint arXiv:2207.07048, 2022.

Kaufman, Shachar, Saharon Rosset, Claudia Perlich, and Ori Stitelman. “Leakage in Data Mining: Formulation, Detection, and Avoidance.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6, no. 4 (2012): 1–21.

Kiritchenko, Svetlana, and Saif Mohammad. “Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems.” In Conference on Lexical and Computational Semantics, 43–53. Association for Computational Linguistics, 2018.

Koch, Bernard, Emily Denton, Alex Hanna, and Jacob G Foster. “Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research.” arXiv Preprint arXiv:2112.01716, 2021.

Koenecke, Allison, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel. “Racial Disparities in Automated Speech Recognition.” Proceedings of the National Academy of Sciences 117, no. 14 (2020): 7684–89.

Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, et al. “WILDS: A Benchmark of in-the-Wild Distribution Shifts.” arXiv Preprint arXiv:2012.07421, 2020.

Koren, Yehuda, Robert Bell, and Chris Volinsky. “Matrix Factorization Techniques for Recommender Systems.” Computer 42, no. 8 (2009): 30–37.

Kuczmarski, James. “Reducing Gender Bias in Google Translate.” Google Blog 6 (2018).

Kumar, Neeraj, Alexander Berg, Peter N Belhumeur, and Shree Nayar. “Describable Visual Attributes for Face Verification and Image Search.” IEEE Transactions on Pattern Analysis and Machine Intelligence 33, no. 10 (2011): 1962–77.

Langley, Pat. “Machine Learning as an Experimental Science.” Springer, 1988.

———. “The Changing Science of Machine Learning.” Springer, 2011.

LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86, no. 11 (1998): 2278–2324.

Li, Xiaochang, and Mara Mills. “Vocal Features: From Voice Identification to Speech Recognition by Machine.” Technology and Culture 60, no. 2 (2019): S129–60.

Liao, Thomas, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. “Are We Learning yet? A Meta Review of Evaluation Failures Across Machine Learning.” In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.

Liberman, Mark. “Reproducible Research and the Common Task Method.” Simons Foundation Lecture, https://www.simonsfoundation.org/lecture/reproducible-research-and-the-common-task-method, 2015.

Liberman, Mark. “Fred Jelinek.” Computational Linguistics 36, no. 4 (2010): 595–99.

Liberman, Mark, and Charles Wayne. “Human Language Technology.” AI Magazine 41, no. 2 (2020).

Louçã, Francisco. “Emancipation Through Interaction–How Eugenics and Statistics Converged and Diverged.” Journal of the History of Biology 42, no. 4 (2009): 649–84.

Marie, Benjamin, Atsushi Fujita, and Raphael Rubino. “Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers.” arXiv Preprint arXiv:2106.15195, 2021.

Messick, Samuel. “Test Validity: A Matter of Consequence.” Social Indicators Research 45, no. 1 (1998): 35–44.

Miller, George A. WordNet: An Electronic Lexical Database. MIT Press, 1998.

Miller, John P, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. “Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and in-Distribution Generalization.” In International Conference on Machine Learning, 7721–35. PMLR, 2021.

Mitchell, Tom M. The Need for Biases in Learning Generalizations. Department of Computer Science, Laboratory for Computer Science Research …, 1980.

Narayanan, Arvind, and Vitaly Shmatikov. “Robust de-Anonymization of Large Sparse Datasets.” In 2008 IEEE Symposium on Security and Privacy (SP 2008), 111–25. IEEE, 2008.

Olteanu, Alexandra, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. “Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.” Frontiers in Big Data 2 (2019): 13.

Paullada, Amandalynne, Inioluwa Deborah Raji, Emily M Bender, Emily Denton, and Alex Hanna. “Data and Its (Dis)contents: A Survey of Dataset Development and Use in Machine Learning Research.” arXiv Preprint arXiv:2012.05345, 2020.

———. “Data and Its (Dis)contents: A Survey of Dataset Development and Use in Machine Learning Research.” Patterns 2, no. 11 (2021): 100336.

Prabhu, Vinay Uday, and Abeba Birhane. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” arXiv Preprint arXiv:2006.16923, 2020.

Radin, Joanna. “‘Digital Natives’: How Medical and Indigenous Histories Matter for Big Data.” Osiris 32, no. 1 (2017): 43–64.

Raji, Inioluwa Deborah, Emily M Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. “AI and the Everything in the Whole Wide World Benchmark.” arXiv Preprint arXiv:2111.15366, 2021.

Recht, Benjamin, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. “Do ImageNet Classifiers Generalize to ImageNet?” In International Conference on Machine Learning, 2019.

Reichman, Nancy E, Julien O Teitler, Irwin Garfinkel, and Sara S McLanahan. “Fragile Families: Sample and Design.” Children and Youth Services Review 23, no. 4-5 (2001): 303–26.

Rosenblatt, Frank. “Perceptron Simulation Experiments.” Proceedings of the IRE 48, no. 3 (1960): 301–9.

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115, no. 3 (2015): 211–52.

Saitta, Lorenza, and Filippo Neri. “Learning in the ‘Real World’.” Machine Learning 30, no. 2 (1998): 133–63.

Salganik, Matthew J, Ian Lundberg, Alexander T Kindel, Caitlin E Ahearn, Khaled Al-Ghoneim, Abdullah Almaatouq, Drew M Altschul, et al. “Measuring the Predictability of Life Outcomes with a Scientific Mass Collaboration.” Proceedings of the National Academy of Sciences 117, no. 15 (2020): 8398–8403.

Salzberg, Steven L. “On Comparing Classifiers: A Critique of Current Research and Methods.” Data Mining and Knowledge Discovery 1, no. 1 (1999): 1–12.

Shankar, Vaishaal, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, and Ludwig Schmidt. “Evaluating Machine Accuracy on ImageNet.” In International Conference on Machine Learning, 8634–44. PMLR, 2020.

Steed, Ryan, and Aylin Caliskan. “Image Representations Learned with Unsupervised Pre-Training Contain Human-Like Biases.” In Conference on Fairness, Accountability, and Transparency, 701–13, 2021.

Stevenson, Megan T, and Jennifer L Doleac. “Algorithmic Risk Assessment in the Hands of Humans.” Available at SSRN, 2022.

The Federal Reserve Board. “Report to the Congress on Credit Scoring and Its Effects on the Availability and Affordability of Credit.” https://www.federalreserve.gov/boarddocs/rptcongress/creditscore/, 2007.

Tufekci, Zeynep. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls.” In Proc. 8th International AAAI Conference on Weblogs and Social Media, 2014.

———. “Engineering the Public: Big Data, Surveillance and Computational Politics.” First Monday, 2014.

Veale, Michael, Reuben Binns, and Lilian Edwards. “Algorithms That Remember: Model Inversion Attacks and Data Protection Law.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 376, no. 2133 (2018): 20180083.

Wang, Angelina, Arvind Narayanan, and Olga Russakovsky. “REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets.” In European Conference on Computer Vision, 733–51. Springer, 2020.

Yadav, Chhavi, and Léon Bottou. “Cold Case: The Lost MNIST Digits.” arXiv Preprint arXiv:1905.10498, 2019.

Yang, Kaiyu, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. “Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy.” In Conference on Fairness, Accountability, and Transparency, 547–58, 2020.

Last updated: Wed Dec 13 14:44:39 CET 2023
