How can independent researchers reliably detect bias, discrimination, and other systematic errors in software-based decision-making systems? 

Austin Hounsel, Nick Feamster (@feamster), and I help answer that question in a new peer-reviewed paper just accepted by the computer science conference CSCW (pre-print here).

We wrote this paper to offer a helpful guide to anyone who wants to conduct audit studies, recruit volunteers into those studies, and create novel software to provide nuanced, fast analyses of decision-making systems.

Auditing Decision-Makers

Decision-making software is now a common part of our lives, influencing who gets hired, who receives social services, and what we are allowed to say online. These systems can dramatically scale an organization’s decision-making capacity with a relatively small number of human workers. For example, Facebook alone needs to review millions of advertisements per week for content that violates their policies. To achieve that scale, they combine AI systems with human moderators.

Like any decider, these systems also make errors—problems and injustices that can quickly add up at massive scales. How can independent researchers observe those errors, hold institutions accountable, and motivate change? Ever since the 1964 Civil Rights Act, US social scientists have used field experiments to detect discrimination. These audit studies involve sending multiple testers to present a decision-making system with choices that are nearly identical—except for characteristics like gender, race, or some other potential source of error. If the decider consistently produces different outcomes than their policies dictate, we can use statistics to estimate the bias in a system.
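To make the statistics concrete, here is a minimal sketch of how an auditor might estimate the difference in outcomes between two groups of otherwise-identical submissions. The function name and the counts are hypothetical and are not taken from our study. The sketch uses a simple normal-approximation confidence interval rather than the analysis in the paper.

```python
import math

def rate_difference(blocked_a, total_a, blocked_b, total_b, z=1.96):
    """Return the difference in blocking rates (A minus B) and an
    approximate 95% confidence interval (normal approximation)."""
    p_a = blocked_a / total_a
    p_b = blocked_b / total_b
    diff = p_a - p_b
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    return diff, diff - z * se, diff + z * se

# Hypothetical counts, for illustration only:
# 30 of 100 ads blocked in group A, 10 of 100 blocked in group B.
diff, low, high = rate_difference(30, 100, 10, 100)
print(f"difference: {diff:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the two groups were treated differently more often than chance alone would easily explain.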

Crowd-Sourced Audit Studies

Audit studies have most often been used to test for discrimination along binary dimensions such as race or gender. That’s because more complex questions (or more fluid ideas of race and gender) require more testers. There’s another problem: as the question becomes more complex, with more dimensions, it becomes harder for researchers to present decision-makers with realistic options as part of the audit study.

That’s the challenge our team faced during the US midterm elections in 2018 when trying to audit Facebook and Google’s political advertising policies. After foreign governments attempted to influence US voters in 2016, Facebook created policies to prevent unauthorized accounts from publishing political ads. Complaints about this system abounded—especially about legitimate, non-political ads that were being blocked by Facebook. Concerned by those complaints, we decided to audit the political advertising policies of both Facebook and Google.

Because content moderation systems could make many kinds of systematic errors, auditing them means testing along many dimensions at once. In our study of political advertising policies, we were concerned with variations in:

  • The account posting the ad:
    • The location the ad was posted from 
    • The currency used to pay for the ad
  • The content of the ad:
    • Whether the ad had any relation at all to political topics
    • The commonly-perceived political leaning of the topics
  • The context that the ad appeared in:
    • Whether the ad was related to a federal or state election
    • Which region the ad was targeted to

With so much possible variation, a realistic audit study might require hundreds to thousands of decision-making results. That’s a big reason for researchers to crowd-source audit studies, where volunteers combine their histories and identities to detect patterns and hold decision-makers accountable. By working with people who have different locations, currencies, and histories, researchers can orchestrate a more realistic audit.
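A rough sketch shows how quickly the number of required results grows. The levels listed for each dimension and the number of ads per combination are assumptions chosen for illustration, not the design we actually used. A full crossing like this also overcounts, since some combinations (such as a political leaning for a non-political ad) would not make sense in practice.

```python
from itertools import product

# Hypothetical levels for each dimension; the study's actual levels may differ.
factors = {
    "poster_location": ["inside US", "outside US"],
    "currency": ["USD", "other"],
    "political_content": ["political", "non-political"],
    "perceived_leaning": ["left", "right"],
    "election_level": ["federal", "state"],
    "target_region": ["region A", "region B", "region C"],
}

# Every combination of one level per dimension
conditions = list(product(*factors.values()))

replicates_per_condition = 10  # assumed number of ads per combination

print(len(conditions))                             # 96
print(len(conditions) * replicates_per_condition)  # 960
```

Even with only two or three levels per dimension, the audit already needs close to a thousand ad submissions.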

Software-Supported Volunteer Audit Studies

Working with volunteers also makes audit studies harder to coordinate. In the paper, we summarize the following areas where software can improve the realism and efficiency of audit studies, and we describe the prototype software we built to do just that:

  • Choosing what characteristics to test in the audit, based on estimates of cost and complexity.
  • Generating realistic audit prompts. We created software that auto-generated advertisements using public APIs for products, public events, and local election details.
  • Choosing sample sizes. We created software to simulate possible research outcomes, decide how many testers and ads we needed, and budget the project, inspired by the MIDA technique in the social sciences.
  • Assigning testers to actions. We prototyped software to match testers with the ads they needed to submit to Facebook.
  • Generating statistical results and illustrations. By creating software to automate the generation of reports before we started collecting data, we were able to pre-register the analysis and produce our findings rapidly, during an election season, in time to make a difference.
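As an illustration of the sample-size step, the sketch below simulates many hypothetical audits and counts how often a simple two-proportion test would detect a difference in blocking rates. It is only loosely in the spirit of simulation-based design and is not the software we built. The function name, the assumed blocking rates, and the choice of test are all assumptions made for this example.

```python
import math
import random

def simulate_power(n_per_group, rate_a, rate_b, n_sims=2000, seed=0):
    """Estimate how often a two-proportion z-test (5% level, two-sided)
    detects a difference, given assumed true blocking rates."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sims):
        # Simulate the number of blocked ads in each group
        a = sum(rng.random() < rate_a for _ in range(n_per_group))
        b = sum(rng.random() < rate_b for _ in range(n_per_group))
        p_a, p_b = a / n_per_group, b / n_per_group
        pooled = (a + b) / (2 * n_per_group)
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
        if se > 0 and abs(p_a - p_b) / se > 1.96:
            detected += 1
    return detected / n_sims

# Assumed rates for illustration: 30% of ads blocked in one group, 10% in the other.
for n in (25, 50, 100, 200):
    print(n, simulate_power(n, rate_a=0.30, rate_b=0.10))
```

Running the simulation before collecting data lets a team pick the smallest number of ads that detects a plausible difference most of the time, and then budget for that number.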

Ethical and Legal Challenges for Volunteer Audit Studies

Crowdsourced audit studies, especially those involving elections, present a number of unique legal and ethical issues. When designing these studies, researchers should be careful not to expose volunteers to any risks that take them by surprise. Audit studies also represent a risk to workers at the organizations being tested. Although audits summarize errors on average across an organization, we worried that Facebook or Google might punish individual workers (often poorly compensated) who were unlucky enough to encounter our study in one of their work shifts.

Researchers also face a number of direct legal and ethical risks if the organizations being audited decide to fight back. That’s a real problem in the United States, where there aren’t enough protections for independent research. Our paper summarizes these risks and what we did to manage them.

So, What’s the Verdict on Crowd-sourced Audit Studies?

What did we learn by doing this? To understand our answer, it’s helpful to think of our paper as a turducken of computer science, social science, and policy evidence. 

Our main goal (the turkey) is to outline the potential of volunteer-based audit studies and thoroughly describe the design challenge of supporting and automating high-quality crowdsourced audits. Based on our experience, we believe that large-scale volunteer audits could lead to breakthroughs in the nuance and realism of evidence for accountability. We hope this paper will introduce computer scientists to the audit study method and inspire them to develop more software tools.

We also describe the specific software we built (the chicken) to support a specific study and report on what we learned from this experience. In computer science Systems papers like this one, making something is considered a valuable form of intellectual inquiry in itself. In this case, we wanted to see how software could help break through some of the traditional limitations of audit studies. At the end of the paper, we report on what problems can be solved with software, and what technical, legal, and ethical aspects of audits might be much more difficult to automate.

Systems papers often present a case study in the actual use of the software (the duck in this multi-layered delight). For us, this was our 2018 audit of Facebook and Google’s political advertising policies. You can read our results in The Atlantic and see our code and data on GitHub. While companies tried to downplay our results, our findings were especially useful to state governments that were trying to understand why Facebook’s advertising policies were blocking them from reaching residents with information about public services.

To learn more, see our pre-print on arXiv. The final version will be published and presented at CSCW in November 2022.

Matias, J. N., Hounsel, A., & Feamster, N. (2022). Software-Supported Audits of Decision-Making Systems: Testing Google and Facebook’s Political Advertising Policies. The 25th ACM Conference on Computer-Supported Cooperative Work and Social Computing.

Acknowledgements

We are grateful to M.R. Sauter, who provided logistical support to this study, and to Jon Penney, who provided helpful advice and feedback. We are also grateful to Chris Peterson, Melissa Hopkins, Jason Griffey, and Ben Werdmuller for participating in our study as testers. This research project was supported financially and logistically (the stories we could tell!) by the Princeton University Center for Information Technology Policy and their amazing staff.