When Facebook changes how it works, how does that affect our lives and those of our friends? When we test a technology in our own life or community to find out, how can we know if it’s a wider issue for others?

In a series of posts, I’ve shown how anyone can audit Facebook’s NewsFeed and how careful research design can identify the causes behind observed effects. Throughout the series, I’ve focused on field experiments, tools of research that help us monitor, explain, and wisely intervene on outcomes that matter.

In a personal experiment, I found that using colored backgrounds caused poems I shared on Facebook to receive 2.1x more likes, comments, and shares on average. After I found that Facebook may promote images differently from text, I wondered: were those effects specific to me and my friends?

What we needed, in the language of research, were replications: further experiments to learn whether my findings also held for others.

Replicating the Social Media Color Experiment

This semester, the long-suffering students in my class on the art and ethics of field experiments replicated the study I did in my own Facebook feed. Over two weeks, we randomly assigned status updates to receive colored or white backgrounds and counted the number of shares, likes, and comments by friends in the next 24 hours (full details here).

In the Social Media Color Experiment, students studied the effect of using colorful backgrounds on their posts, which ranged from lyrics to poems to what fruit they ate that day.

Replications, Accountability, and the Search for General Knowledge

At first glance, replications seem simple: if someone does an experiment, we can find out if the results are applicable elsewhere by doing more experiments in other settings. Then, if we compare or add up the results, we can find out if the original study was a fluke, a context-dependent discovery, or something general to a wider group.

Researchers care about replication because many scientists prioritize general knowledge: explanations of the world that apply in a wide range of settings. Similarly, policymakers care about replication because they yearn for common solutions that they can apply anywhere. General knowledge isn’t the only kind of knowledge or even the most important. Yet when we’re investigating the impacts of products in our lives, discoveries that apply widely are especially important to observe.

I’ve argued that tech firms should be accountable to the public based on the magnitude and the scale of any harms they introduce: we should care if a product introduces great harms to a small number of people, and we also should care if a product introduces smaller harms to a great number of people. Since replications help us understand how widely an effect is experienced, they are essential for behavioral consumer protection.

There’s one catch: with algorithmic systems like aggregators, recommenders, and machine learning, we don’t yet know how to think about general knowledge, because systems like the Facebook NewsFeed, Google Search, and YouTube recommendations behave differently with each of us. Replications are supposed to help us map the terrain, but it’s possible that with algorithms, each of us may be in a completely different territory (a topic I discuss in an upcoming paper). Where we can’t rely on maps from others, we may need our own experiments even more.

Two Questions For Any Replication

When planning any replication, we should ask two basic questions:

  • What should be different?
  • What should be the same?

Replications need to be different so we can say that the effect applies in a variety of situations. On the other hand, if too many things change, we might wonder whether the new studies are still asking the same question. For example, replications of studies about how people perceive famous people shouldn’t use the same list of famous people when conducted in different countries, since we wouldn’t expect everyone worldwide to recognize Canadian celebrities. Yet if the study focused on family members rather than celebrities, it might be too different to be a replication.

With the Social Media Color Experiment, students changed several things about the study when conducting it in their own social media feeds:

  • Whose social media feed was involved (their own, rather than mine)
  • When the study happened: March 2018, rather than October 2017 (potentially important, since we know Facebook changed the NewsFeed algorithm in January 2018)
  • What kind of posts they shared (like me, some used poems. Others shared Cardi B lyrics, what fruit they ate, or top posts from reddit)
  • Which social media platform (most used Facebook, some used Instagram or Twitter)

We also kept some things the same:

  • We all tested the same basic intervention: randomly assigning some posts to have a colorful background and others to use black on a white background
  • We all measured the same thing: the number of likes, shares, and comments on a post over 24 hours
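For readers who want to try the intervention themselves, here is a minimal sketch of the random assignment step. It assumes a balanced design (half the planned posts get a colored background, half stay plain) and a one-post-per-day schedule; the function name `assign_backgrounds` and the seed are hypothetical, and the class may have randomized differently.

```python
import random


def assign_backgrounds(num_posts, seed=None):
    """Randomly assign each planned post to a 'color' or 'plain' background.

    Assumption: a balanced design, so the two conditions get (nearly) equal
    numbers of posts. This is an illustration, not the procedure used in class.
    """
    rng = random.Random(seed)
    half = num_posts // 2
    arms = ["color"] * half + ["plain"] * (num_posts - half)
    rng.shuffle(arms)  # the shuffle is what makes the assignment random
    return arms


# Hypothetical two-week schedule with one post per day
schedule = assign_backgrounds(14, seed=412)
for day, arm in enumerate(schedule, start=1):
    print(f"day {day}: {arm} background")
```

Deciding the schedule before posting anything keeps the choice of background independent of how promising a given post feels on the day.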

By testing colored backgrounds in different people’s feeds, we were changing something that often changes between replications: the context of research. Our decision to try different kinds of content could be debated, but it was also an important change: we care about effects that apply to more than just poems. Finally, the decision to test the idea on different platforms might be so far away from the original experiment that we wouldn’t consider it to be the same anymore. Yet if we were to find similar effects on Twitter and Instagram as we did on Facebook, it would definitely raise new questions.

What We Learned: Do Facebook and Facebook Friends Prioritize Posts with Colorful Backgrounds?

Over two weeks, each student posted colored or plain posts to social media and tracked the number of likes, shares, and comments that the post received. At the end of the experiment, each student calculated their own result. We then pooled our results into a single model (code, methods, and data here).

Across all our studies, two found a statistically significant result, and both were positive. More importantly, when we combined everyone’s results, we found that using colored backgrounds caused a 19% increase in the number of likes, shares, and comments that a post received on average, adjusting for whether the post was shared on the weekend.

By combining our results, we had conducted what researchers call a meta-analysis. Normally, researchers who analyze multiple experiments have to worry about publication bias from studies that people chose not to report, something that can lead to inflated, over-confident estimates. With the Social Media Color Experiment, we have full data from every single time this study was ever conducted (if you try it yourself, please share your data with us).
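To give a feel for what combining results can look like, here is a rough sketch of one common approach: a fixed-effect, inverse-variance meta-analysis of log rate ratios. This is not the model we used (our pooled model and data are linked above), and every number in `studies` is made up purely for illustration.

```python
import math

# Hypothetical per-study results: (log rate ratio, standard error).
# These values are invented for illustration and are NOT the class data.
studies = [
    (0.40, 0.30),
    (-0.10, 0.35),
    (0.25, 0.20),
]


def pool_fixed_effect(results):
    """Inverse-variance weighted average of study estimates.

    More precise studies (smaller standard errors) get more weight.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in results]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, results)) / total
    return pooled, math.sqrt(1.0 / total)


def to_percent_change(log_ratio):
    """Convert a log rate ratio into a percent change."""
    return (math.exp(log_ratio) - 1.0) * 100.0


pooled, pooled_se = pool_fixed_effect(studies)
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled change: {to_percent_change(pooled):.1f}% "
      f"(95% CI {to_percent_change(low):.1f}% to {to_percent_change(high):.1f}%)")
```

A fixed-effect model assumes every study estimates the same underlying effect. If effects genuinely differ between people or platforms, a random-effects or multilevel model would be the more cautious choice.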

On many questions, having 11 experiments would be a rare area of scientific strength. Yet as the class discovered, our replications also left many questions unanswered.

How Many Replications Are Enough?

By replicating this experiment, students developed stronger evidence that the effect could be more widespread than just me. But just what can we claim from these replications?

Notice that while our question was focused on Facebook, two studies were on Instagram and one was on Twitter. If we want to ask about the effect on Facebook, it’s best to leave out the Twitter and Instagram studies. When we focus just on Facebook, the average effect is 11 percentage points higher.

Based on these findings, can we say that the effects are different between Facebook and other platforms? Unfortunately, we have too few replications to differentiate between platforms, and the individual studies are too small.

First, most of the studies are too small on their own. Since we only had two weeks for this assignment, most of the experiments had inconclusive results, as you can see from error bars that span positive and negative values.

Since the studies had small samples, our results presented some students with an ambiguous decision. How should they interpret the results if they found a negative estimate that wasn’t statistically significant? Is their negative result the opposite from others on average, or was their sample too small to observe the true, positive effect? Our results can’t answer that question.
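A small simulation can show why a negative estimate from a two-week study is so hard to interpret. The sketch below assumes engagement counts follow a Poisson distribution with a baseline of 5 interactions per post and a true 20% increase from color. Those numbers, and the helper names, are assumptions chosen for illustration; real engagement counts are usually noisier than Poisson, which would make small studies even less reliable.

```python
import math
import random


def draw_poisson(rng, mean):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def share_of_negative_estimates(posts_per_arm, simulations, rng,
                                baseline=5.0, ratio=1.2):
    """Fraction of simulated experiments where colored posts get FEWER
    interactions than plain posts, even though the true effect is positive."""
    negative = 0
    for _ in range(simulations):
        plain = sum(draw_poisson(rng, baseline) for _ in range(posts_per_arm))
        color = sum(draw_poisson(rng, baseline * ratio) for _ in range(posts_per_arm))
        if color < plain:
            negative += 1
    return negative / simulations


rng = random.Random(2018)
small = share_of_negative_estimates(posts_per_arm=7, simulations=500, rng=rng)
large = share_of_negative_estimates(posts_per_arm=200, simulations=500, rng=rng)
print(f"7 posts per condition:   {small:.1%} of experiments point the wrong way")
print(f"200 posts per condition: {large:.1%} of experiments point the wrong way")
```

Under these assumptions, a noticeable share of two-week experiments point in the wrong direction by chance alone, which is why a single negative, non-significant estimate can't tell us much.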

Second, any attempt to compare between platforms needs a large enough number of replications from each platform. Even though our sample includes 162 posts across all experiments, we only have three non-Facebook studies, too few for reliable comparison. Any difference between studies might come from differences between people or topics or from chance, not platforms.

Without a solid basis for comparison, a too-small collection of studies could easily lead us astray. For experiments to guide how we manage the impact of tech platforms in our lives, we need to scale experimentation by 100x or more. And if we hope to reach even 1/10th the scale of research that is routine for tech companies, we will also need to re-make the politics and ethics of behavioral research and redesign the oversight of industry-independent experiments.

The Relationship of Situated and General Knowledge in Citizen Behavioral Science

In the Social Media Color Experiment, students tried an example of citizen behavioral science: an experiment that anyone can do in their own social media feed. Wherever experiments are rare, the safest bet will always be an experiment you did yourself. If many of us conduct experiments and share our results, we can also contribute to general knowledge while learning what works for us.

In a world where personalized algorithms may put us into unique bubbles, society might benefit from the situated knowledge of many localized experiments. Yet no matter how easy we make citizen behavioral science, only some people will be patient enough to run behavioral studies consistently for weeks or months. In future blog posts, I plan to discuss practical steps we can take to address that issue.

These are all active questions that I’m asking at Princeton and at CivilServant, the nonprofit I lead that organizes citizen behavioral science. I also organize industry-independent evaluations of the social impact of tech companies’ design and policy ideas. If you’re interested in learning more about our work, drop me a note!

Teaching the Social Media Color Experiment

This experiment was an optional assignment for students in my class on designing field experiments at scale, starting in the second week of class. My goals for the assignment were to:

  • Motivate students to complete their IRB training
  • Start experimenting right away
  • Confront students with questions of ethics, power, and risk in their own lives before asking them to design experiments that affect others
  • Provide a practical anchor for our discussions of methodology, statistics, and research ethics

The Princeton ethics board worked with me to review the study in advance, and as the remaining students completed their IRB training, the board added them to the study within days. The board also reviewed and supported another option I gave students (the Cornhole Challenge) so grades weren’t dependent on their choice to participate in human subjects research.

For those of you who teach social media breaching experiments in your classes, I would encourage you to try this one. If you do, here are some things I plan to change, and that I would suggest you do differently:

  • plan for students to conduct the study for 30 days rather than 14
  • conduct your own experiment before class or use our dataset to introduce students to analysis methods before they finish their own study
  • provide clear instructions for what to do if students miss a day
  • pre-register your study and meta-analysis plan
  • send me your results and anonymized data → let’s try to avoid creating publication bias!


I am profoundly grateful to the thoughtful and creative students of SOC412 who undertook this project with me: Amanda, Austin, Dennis, Emily, Frances, Kevin, Monica, Ryan, Tyler, and Zenobia. I’m also grateful to Matt Salganik, who encouraged me to teach a class with lots of hands-on experimentation, and to the amazing Princeton ethics board, who worked with me to review and refine an endless cascade of research designs on short deadlines throughout the semester.