How can researchers verify the claims made by technology platforms about data access when the realities of access may depend on nuanced details of implementation?

Consider the case of Reddit. Many news reports about the Reddit IPO have touted the valuable role community leaders play in supporting and preserving successful community conversations. That work depends on the quality of data the company provides to its volunteer moderators and to researchers. In this technical report, we analyze the reliability of the data Reddit currently provides, along with methods for independently determining data quality on Reddit that we hope will apply to other platforms.

Here at CAT Lab, we have worked with Reddit communities for almost a decade to collect data, analyze it together, and collaborate on experiments to improve people’s experiences on the platform. Thanks to the industry-independent research we’ve done directly with many of Reddit’s most visible communities, the platform is measurably safer for millions of people—preventing harassment, reducing the spread of unreliable news, advancing research ethics, and making moderation more humane.

When Reddit announced extensive restrictions on academic research last summer, we got questions from regulators, community leaders, and other researchers about whether the era of independent research with Reddit is over. People feared that Reddit’s data restrictions might undermine the reliability of any research about the platform.

When Reddit officially responded to our (and our colleagues’) initial inquiries about increased access with an automated message rather than a conversation, we realized we would have to independently test the API capabilities that were provided instead. This technical report includes the results of that analysis. (We’re relieved and cautiously hopeful to share that staff at the company have said they’re reviewing our findings.)

Our top-line findings are:

  • Working within Reddit’s API restrictions, we can do research with one large community at a time, but we can’t promise that there won’t be missing data at some of the most important moments in our study.
  • We will be able to work concurrently with multiple communities if we create a new API endpoint for each community. CAT Lab’s software has conducted as many as six large-scale studies at a time, and we have plans to grow that number much further. This method, which will require new development, should work so long as Reddit doesn’t consider this a circumvention of their API policies and restrict our access.
  • Overall, the design of Reddit’s APIs makes their systems vulnerable to missing data problems on high-stakes issues like online harassment and hate speech (see below), especially as Reddit grows. We have raised with Reddit the long-term value of architectures that are more fit-for-purpose.

