Ranking Content On Signals Other Than User Engagement
A new paper lays out the options, and points to how they could optimize for prosocial outcomes
ORIGINALLY POSTED ON TECH POLICY PRESS, transcribed from the latest podcast episode of The Sunday Show
JUSTIN HENDRIX / FEB 18, 2024
Today's guests are Jonathan Stray, a senior scientist at the Center for Human-Compatible AI at the University of California, Berkeley, and Ravi Iyer, managing director of the Neely Center at the University of Southern California's Marshall School. Both are keenly interested in what happens when platforms optimize for variables other than engagement, and whether they can in fact optimize for prosocial outcomes.
With several coauthors, they recently published a paper based in large part on discussion at an 8-hour working group session featuring representatives from seven major content-ranking platforms and former employees of another major platform, as well as university and independent researchers. The authors say "there is much unrealized potential in using non-engagement signals. These signals can improve outcomes both for platforms and for society as a whole."
What follows is a lightly edited transcript of the discussion.
Justin Hendrix:
You've both been working on a set of problems around how social media platforms operate and the degree to which they are either contributing to or detracting from some aspects of social cohesion, questions around polarization, et cetera. Explain to me how you would characterize your work right now. Jonathan, we'll start with you.
Jonathan Stray:
So I am interested in the ways that social media algorithms, the algorithms that select and order content, have effects on individuals in society in a bunch of different ways. Conflict and polarization, wellbeing, addiction, that kind of stuff. Also, news knowledge, misinformation, that whole sphere. A bunch of people are concerned about these effects, and I think correctly, so then the question for me is always, okay, if we're worried about this stuff, how do we design the platforms to have better effects? And so that's really what a lot of my work is about and what this paper is about.
Ravi Iyer:
And I'll dovetail with that. Jonathan and I have worked together a lot on design, especially on recommender systems. I've been on this podcast before talking about design as a better solution than moderation, which has failed in many cases in my experience working at Meta. So for those of you who don't know, I worked at Meta for about four and a half years before I had my current job. And right now there's a lot of interest in this; a lot of people have independently come to the conclusion that moderation wasn't going to solve the problems they were seeing, and that they needed more upstream solutions. And so I've been trying to be more specific about the kinds of design solutions.
We released the design code, which we talked about on this podcast. And this paper I think is really interesting because sometimes I will say to people, "There were things that we did at Facebook that many other companies have also found," and there are a number of things in this paper that dovetail with that idea: these aren't just things that Meta found. These are kind of basic physics-of-social-media questions that we are all starting to understand the answers to.
Justin Hendrix:
So let's get into some aspects of this paper, what we know about using non-engagement signals in content ranking, which I believe has just been made public on arXiv. First off, let's just talk about how this came together, because this is not a normal set of authors. You are two among many, but there are also folks represented from the Integrity Institute, others from UC Berkeley, and then also platforms, people from LinkedIn and Pinterest, as well as other academics from Cornell Tech, for instance. How did this all come together? How did you put this inquiry together?
Jonathan Stray:
So many people have noticed that optimizing for engagement, which is the dominant way that most, if not all, large platforms select content, has some drawbacks. It's useful in a number of ways, but it also has some bad side effects. And I'm sure you can read a dozen articles on Tech Policy Press on that particular question. So if that's the case, what else is there? What can we do other than optimizing for engagement? And what we did in this paper is we got together some academic experts, as you just talked about, but also people from eight different platforms, in a day-long, off-the-record workshop. And we tried to come to an agreement on what it is that we collectively, as an industry and as researchers studying the area, actually know about how engagement relates to other signals, what other signals you can use to rank content, and what actually is the state of best practice here.
And then we turned that into a series of propositions, I think there's 20 of them in the paper, and we documented the best public evidence for each of those propositions. And so that's a combination of papers that people have written publicly, company blog posts, news reports, and also quite a lot of material from what is now known as the Facebook Archive. That's the leaks that came out of Facebook via Frances Haugen in 2021. In some sense, we are just documenting conventional wisdom, but it is conventional wisdom that is not widely known and has never been publicly documented before to this level.
Justin Hendrix:
So we're trying to square the circle here somewhat. I mean, the argument against the platforms, as you point out, is always, "You're just optimizing for engagement. You just want more profit. You just want more people to stay on your sites as long as possible so that you can show them more advertisements. And that's how it works. Senator, we run ads." That's the basic premise. And maybe, Ravi, I'll come to you. Why is it more complicated than that?
Ravi Iyer:
I mean, the dominant narrative is that companies are optimizing for engagement and they maybe don't care and they're just doing it for profit. But there are a lot of good people at companies, many of whom are authors on this paper, who are working on that problem. So it's not like companies don't understand some of the solutions to these problems or don't have good people working on them. Companies admit that they often optimize for engagement. It improves retention. It improves business metrics. But there are many cases where engagement does trade off with quality, and this is a known problem inside companies, so the paper documents some of that evidence. And there are known solutions to some of those problems, things like content-level surveys.
I think in some ways it reinforces the dominant narrative that, yes, there is a problem with engagement based ranking often. It often trades off against quality, but I think it complicates the narrative in terms of the responsibility of companies in the sense that companies are working on this and hopefully we can all work together in a constructive way to figure out the best ways we can mitigate those negatives in a world where we all share some common goals.
Justin Hendrix:
So where do we start in terms of going through some of the findings, or the consensus that you arrived at with your partners?
Jonathan Stray:
Let's start with the top-line ones. So first of all, ranking by engagement is good for long-term user retention. That's why companies do it. And it's not universally bad for users either. There's a pretty good argument that if someone is returning to a product over and over, over a period of months or years, they're getting something out of it. And indeed, we genuinely do spend more of our time on things that matter to us. That's why people are doing it at all. It's not just about the economic benefits to the companies. It's that it actually makes a better product in many cases. So we start there, but then we go to the problems. And so one of our propositions is that there seems to be, in many cases, a negative correlation between engagement and what we call quality. And so part of what we do in this paper is we try to define terms to provide a common language, and we say quality is a way you can rate a particular item of content, independent of context, not thinking about showing it to any particular person, as good or bad in some way.
It's clickbait. It's inspiring. It's misinformation. It's some sort of thing you can say about a piece of content. And it seems to be the case that low-quality items, in many senses, have high engagement, and nobody's going to be shocked by this. [inaudible 00:12:25], you click on it. But we have meticulously documented this in all of the cases where we found it to be true, and some cases where we see the other direction, where in fact you get lower-quality stuff if you don't optimize for engagement. But mostly it's the other way around. And then another proposition is that when you put more weight on quality content, you get better long-term retention.
This was really one of the goals of this effort as well: to document the places where it's a win, where the problem is not that companies don't want to do the right thing, but that they are not necessarily aware that the right thing will produce better retention for them, in other words, better business results and growth in the long term. We wanted to pay special attention to the cases where you can actually do the right thing that is good for you and good for users, and there's lots of evidence that optimizing for quality is better for platforms. So we wanted to put that evidence together.
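To make that "more weight on quality" idea concrete, here is a minimal sketch of the kind of weighted ranking score the discussion describes, written in Python. The signal names and weights are hypothetical illustrations, not values taken from any platform.

```python
# Illustrative sketch of a weighted "value model" for ranking.
# Signal names and weights are hypothetical; real systems combine many more
# predictions and tune the weights through long-running experiments.

def ranking_score(item_predictions: dict) -> float:
    """Combine predicted engagement and predicted quality into one score."""
    weights = {
        "p_click": 1.0,        # predicted probability the user clicks
        "p_comment": 2.0,      # predicted probability the user comments
        "p_clickbait": -3.0,   # predicted probability raters would call it clickbait
        "p_informative": 2.5,  # predicted probability the user would rate it informative
    }
    return sum(weights[name] * item_predictions.get(name, 0.0) for name in weights)

# Rank a candidate feed: the highest combined score appears first.
candidates = [
    {"p_click": 0.30, "p_comment": 0.05, "p_clickbait": 0.80, "p_informative": 0.10},
    {"p_click": 0.20, "p_comment": 0.04, "p_clickbait": 0.05, "p_informative": 0.60},
]
ranked = sorted(candidates, key=ranking_score, reverse=True)
print(ranked)
```

In this toy framing, increasing the magnitude of the quality weights relative to the engagement weights is what "putting more weight on quality content" amounts to.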
Justin Hendrix:
I'm sure there are skeptics amongst my listeners who are listening to you and wondering before they've read the paper, of course, what is that evidence? What do you point to that would support that claim? And if it's the case, why aren't more platforms already putting in place measures that would perhaps drive towards quality over perhaps the more salacious or conflict oriented content?
Jonathan Stray:
So for each of these propositions, we have a list, a section called public evidence, and in this one, the evidence is things like: Google has long used paid rater search quality guidelines. There's this 200-page document that you can read which tells people how to rate search results for quality, and that's been part of their pipeline for a long time. YouTube predicts whether people would rate something as sensationalistic or authoritative and uses that as a ranking signal. All of these have public references; we have citations for each of these. LinkedIn predicts whether posts share "knowledge and advice." Facebook uses professional rater scores and a PageRank algorithm. And then there's a bunch of stuff that we know from the leaks.
One of them is there was this experiment that Facebook did in 2019 called the Minimum Integrity Holdout, which is: let's remove all of the quality terms that already existed, prediction of clickbait, prediction of authoritativeness, all of that, and just look at what the minimal feed is without any of these quality scores. And what they found is that people used the platform less. By these platforms' own experience, both public and leaked to this point, they find that this is generally the case. So now your second question is really interesting. That's true. Why aren't they doing this already? And there are two answers. One is they already are doing a lot of this stuff. Nobody has ranked by engagement alone since, I don't know, probably 2014, 2015, somewhere around there. But there's another answer, which is that nobody talks about this stuff.
The companies don't talk about it with each other for trade secret and competitive reasons, and academics and researchers don't talk about it because they don't know, because they can't run these experiments. Particularly for smaller or, let's say, mid-sized platforms that aren't running these huge experiments, or are maybe newer and don't have 15 years of experience, they just don't know. And that was part of what we were trying to do here: come up with some way where the people who actually have experience in this stuff can share it publicly. And so we have this weird process where we had a very candid conversation in a room full of people, and we are not publishing that conversation, but what we are publishing is all of the public data points that represent the consensus of the people in that room, in an effort to make this stuff not secret.
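As a rough illustration of how rater judgments like "sensationalistic" or "authoritative" can become a ranking signal, here is a sketch that fits a simple text classifier on hypothetical rater labels and then uses its predicted probability as one more input to ranking. The data, features, and model choice are assumptions for illustration, not a description of any platform's actual pipeline.

```python
# Sketch: turn human rater labels into a quality prediction usable at ranking time.
# The labeled examples and the model are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical rater-labeled headlines: 1 = rated sensationalistic, 0 = not.
texts = [
    "You won't BELIEVE what happened next!!!",
    "City council approves new transit budget after public hearing",
    "Doctors HATE this one weird trick",
    "Study finds modest link between sleep and memory consolidation",
]
labels = [1, 0, 1, 0]

quality_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_model.fit(texts, labels)

# At ranking time, the predicted probability becomes one more signal, e.g. a
# negatively weighted term alongside the engagement predictions.
p_sensationalistic = quality_model.predict_proba(["SHOCKING footage emerges"])[0, 1]
print(f"p_sensationalistic = {p_sensationalistic:.2f}")
```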
Justin Hendrix:
Ravi, you were one of the people whose work on these questions essentially came to light in the leaks that Frances Haugen supplied to the Wall Street Journal and to other journalists. Is it the case that you were working on these issues that long ago, in the way that Jonathan suggests? Was this a common area that researchers at Facebook were concerned about?
Ravi Iyer:
This is definitely a common area that people are concerned about, but the dominant terms in ranking are still engagement, have always been engagement, and remain engagement. If you read public blog posts by the company, there are some quality signals that are added on. And part of the public work that I do is to push the companies to do even more of that, because certainly we know that it can be beneficial, and there's evidence in the paper about it being beneficial, but there's a lot more that's left to be done. And you asked, why don't companies do this more? I think what Jonathan mentioned about the Minimum Integrity Holdout is a good case. Here's a case where these protections, these quality terms in the optimization, come with a short-term hit to engagement and business metrics, but in the long term, over a two-year period, you get a benefit. It just takes two years to see that benefit.
It's a question of short-term versus long-term benefit. It's hard to convince someone that the thing that is bad for their current metrics will be good in two years, and nobody knows what actually will be good in two years. A lot of people optimize for the short-term benefit because it's just more convenient, and we as a society, and the people who are more thoughtful about this, have a role to play to push companies, and I'm hopeful this paper can do that, to say: look, actually, think about the long term. Don't just optimize for these short-term metrics.
Justin Hendrix:
A lot of the conversation over the last couple of weeks, where it regards social media, has been around the choices of certain platforms to diminish or otherwise push away political content. Presumably that also means, in some cases, news. Is there a political dimension to not optimizing for engagement? How do you think about that? How do you think about the possible consequences of that for public discourse?
Jonathan Stray:
Yeah, very interesting question. It's a huge question. I'm going to narrow it down to just some of the things we talk about in this paper. I'm going to note that we have public evidence that Facebook removed some of the engagement terms, I think particularly re-shares and comments, for content that was classified as political or civic content, and I think also health. So that says that for this particular type of content, they are reducing the effect of optimizing for engagement, because they see that it's associated with lower quality in this domain. And one of the interesting things with this particular effort is, in the course of this workshop, we discovered that two more platforms said that they were doing the same thing. Now, because of the nature of the discussion, we can't name those platforms, so we only have public evidence for the Facebook case, but that should hopefully be a signal to other people in the industry that this is a reasonable and useful thing to do.
Now, is it bad to have less news content on platforms generally? Possibly. I think that's a very complicated question, which starts to get into issues like: which types of news are productive? What is the news that individual people actually need to see? And I teach a whole course on this, so we're not going to do it in 10 minutes, but certainly we know that certain types of news are better than other types. In particular, one of the things that we seem to be getting more consensus and evidence for is that diversity of engagement is a quality signal. If there's a news article that Republicans love and Democrats hate, or vice versa, it's probably not as good an article, in some societal sense, as an article that both Democrats and Republicans say, "Hey, this is pretty good."
And so that is one of our propositions as well. The way we put it is: quality is often positively related to diversity of engagement across viewers. And so we have some evidence from the Facebook Archive on this. We also cite Twitter's Community Notes program, which explicitly does this, and this is an idea that has been brewing in the research community for a few years now called bridging-based ranking. To take it all the way back, I think it's less about the amount of news; the algorithms used to select which news you see are the important parameter here.
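As a minimal sketch of what a "diversity of engagement" signal could look like, the snippet below scores an item by agreement across viewer groups rather than by total engagement. The grouping and the formula are illustrative assumptions loosely inspired by the bridging-based ranking idea, not the actual method used by Community Notes or any platform.

```python
# Sketch: reward items that are rated positively across different viewer groups,
# not just items with high total engagement. Groups and formula are illustrative.
from statistics import mean

def bridging_score(ratings_by_group: dict) -> float:
    """Take the minimum of per-group average ratings: an item only scores
    highly if every group, on average, rates it well."""
    group_means = [mean(r) for r in ratings_by_group.values() if r]
    return min(group_means) if group_means else 0.0

# An article loved by one group and disliked by the other scores low...
partisan_item = {"group_a": [0.90, 0.80, 0.95], "group_b": [0.10, 0.20, 0.15]}
# ...while an article both groups rate moderately well scores higher.
bridging_item = {"group_a": [0.70, 0.65, 0.75], "group_b": [0.60, 0.70, 0.68]}

print(bridging_score(partisan_item))   # ~0.15
print(bridging_score(bridging_item))   # ~0.66
```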
Justin Hendrix:
So I guess implicit in this, of course, is that the platforms are making essentially social engineering decisions. They're deciding what we see, how much of what we see, the extent to which we perhaps engage with certain ideas over others, or certain groups or certain people over others. We're at a different time now. It's 2024. We've had the crises of 2016, we've had 2020, and the insurrection on Capitol Hill and the attack on the capitol in Brazil and various other things that have occurred around the world. How are we thinking about this at this point? Do these companies begin to recognize their responsibility? Do you think they're going to take seriously the types of ideas that are in a paper like this?
Ravi Iyer:
This paper was written in collaboration with many people at companies, so clearly there are many people at companies taking these ideas seriously. There are also many people at these companies who are not, and so hopefully, by having these conversations publicly, and through the many avenues that we on the outside have, we can push these discussions in the right direction. As far as social engineering, I think it's important just to realize that everything is a choice. Not making a choice is still a choice. It's not the natural state of things that political and health content should be optimized for engagement. Many people have observed that it effectively turns those things into entertainment, and that is not necessarily the incentive that we want to set in the ecosystem.
And so when I was at Meta, we had many publishers who would say, "We're publishing things that we're not proud of because of the incentives that you guys are setting." And that inspired a lot of this work, and these things can be undone. And to your point about having less political content because we're not optimizing for engagement as much, another thing that we haven't talked about, which is also in the paper, is: what can you optimize for instead? It talks a little bit about content-level or item-level surveys versus user-level surveys, and what are the actual practical things you can do to replace engagement incentives with quality incentives. And hopefully, if we can get those best practices right, then we can not just remove the political content but actually incentivize higher-quality political content to replace it.
Justin Hendrix:
There's a whole lot of folks right now excited about the idea of decentralization and federation in part because they believe that possibly giving people access to more user controls or even a layer of middleware that provides them with more opportunity may be one big answer to problems of content moderation and other problems on the platforms. What does your look into user controls tell you about how it relates to this overall question around quality and the choices that people make in terms of what they want to see?
Jonathan Stray:
So I think the first thing to say about controls is that there's widespread evidence that most people don't use them. We're talking 1% or 2% of users. We collect all the evidence for that. And so there's a bunch of ways you can respond to that. One is you can say, "Well, that means we have to get the defaults right," which is true. I do think we have to get the defaults right because most people are going to run the defaults. The other way you can go is, okay, well, maybe we just haven't put enough thought and energy into building controls that people will use. And in particular, there's one type of control that has been pretty successful for addressing some of these quality problems, and in the paper we call that lightweight negative feedback.
One of the better-documented examples of this is that there's a little X on posts on both Facebook and Instagram now. And when you hit that X, it removes the post from your feed, but it's also a control signal, which is interpreted as "this item's no good," or at least, "it was no good for me." And because that's right out there on every post, people use that control quite a bit more than other types of controls. There's good evidence that this is strongly correlated with information that is much more expensive to get. For example, if you were to ask that user, "What do you think of this item?" they would say, "Well, this is trash. I don't want to see this." This is a much cheaper and more scalable way to get that type of information.
We also speculate a little bit at the end of the paper about how generative AI, in particular large language models, may enable new types of controls. And so I think in the next few years we are going to be living in a world where you can say, "Hey, TikTok, less videos about trains," or, "Hey, TikTok, this is some depressing stuff. Find me something happier," where you can start to talk to the machine in this way. And so that potentially could open up a new era of user controls.
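As a rough sketch of how the lightweight negative feedback described above might feed back into ranking, the snippet below down-weights items in proportion to their observed "hide" rate. The smoothing and penalty weight are illustrative assumptions, not any platform's actual formula.

```python
# Sketch: use the rate of lightweight negative feedback (e.g., clicks on the
# "X" / hide control) as a penalty in the ranking score. Numbers are illustrative.

def hide_rate(hides: int, impressions: int, prior_hides: float = 1.0,
              prior_impressions: float = 1000.0) -> float:
    """Smoothed hide rate, so items with few impressions aren't judged too harshly."""
    return (hides + prior_hides) / (impressions + prior_impressions)

def adjusted_score(base_score: float, hides: int, impressions: int,
                   penalty_weight: float = 50.0) -> float:
    """Subtract a penalty proportional to the smoothed hide rate."""
    return base_score - penalty_weight * hide_rate(hides, impressions)

# An item with strong engagement but a high hide rate can fall below a
# less-engaging item that almost nobody hides.
print(adjusted_score(base_score=2.0, hides=120, impressions=5000))  # heavily penalized
print(adjusted_score(base_score=1.5, hides=2, impressions=5000))    # barely penalized
```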
Justin Hendrix:
So one thing you also look at is user-level surveys, the extent to which platforms are actually asking folks to rate certain aspects of their experience, or talk about that experience, and how that ends up feeding back into the algorithms that choose which type of content they see.
Jonathan Stray:
This was definitely one of the things we wanted to do with this effort: document the now fairly widespread use of surveys to provide information on what people want to see, what gives them positive experiences. And there are two kinds that we discuss. One is surveys about an item. There's quite a bit of this now. Common ones are: do you want to see more or fewer posts like this? Is this comment high or low quality? What do you think of this video? YouTube has a survey where they ask if a video was informative, inspiring, funny, misleading, and so on. Or even Facebook's famous "is this good for the world?" survey. And the basic idea here is that if someone clicks on an item, you don't really know what that click meant. You don't know what their experience was after they read that article or watched that video.
This is basically why optimizing for engagement falls down: it's an ambiguous signal. So what you need to do to fix this is get better information about what was going on there. And one way to get better information is just to ask people. So all of these companies use these surveys to ask, one way or another, what did you think of this item? And then what they do is they use that as training data to build a predictive model. And when they're trying to decide what to show you, they actually try to guess what you would answer on that survey if they asked you about the item, which sounds ridiculous, but actually, if you think about it, it isn't really any different from trying to guess whether you're going to click on it.
It's the same technology. The difference is, by choosing the survey wording appropriately, you can get at very complex and subtle constructs, at what actually matters to people and what makes a better experience for them and for society. This is very poorly known, so one of the things we're trying to do is, we have a whole section where we document the evidence that people do this, that it does work, and here's how to do it.
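To illustrate the "predict the survey answer" idea, here is a minimal sketch: fit a model on the small fraction of impressions that did get a survey response, then rank everything else by the predicted probability of a positive answer. The features, labels, and model are hypothetical illustrations, not a real platform's setup.

```python
# Sketch: learn to predict an item-level survey response ("was this worth your
# time?") and rank by the predicted answer instead of, or alongside, predicted
# clicks. Features and data are hypothetical.
from sklearn.linear_model import LogisticRegression

# Each row describes an item a surveyed user saw:
# [fraction_of_headline_in_caps, article_word_count, is_from_followed_source]
X = [
    [0.60,  150, 0],
    [0.02, 1200, 1],
    [0.45,  200, 0],
    [0.05,  900, 1],
    [0.50,  100, 0],
    [0.03,  700, 0],
]
# 1 = the user answered "yes, this was worth my time"; 0 = "no".
y = [0, 1, 0, 1, 0, 1]

survey_model = LogisticRegression(max_iter=1000).fit(X, y)

# At ranking time, most items have no survey answer from this user, so we rank
# by the model's predicted probability of a positive answer.
candidates = [[0.55, 180, 0], [0.04, 800, 1]]
p_worth_time = survey_model.predict_proba(candidates)[:, 1]
print(p_worth_time)  # higher means more likely to be rated "worth my time"
```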
Ravi Iyer:
I often talk about this in terms of giving people what they want versus what they want to want. There's a myth at platforms that what people engage with is what they want, and to some degree it is what they want, in the sense that it's what they will consume, it's what they will engage with. But when you ask people in a survey, you get more at their aspirational preference: what is it they want to consume? People will often say that they want more informative content, or say that they want content that is higher quality, even though their engagement signals will indicate lower-quality content. And so that's one reason why surveys are often helpful.
The other thing is, we talk a little bit about different kinds of surveys. So you can ask people things like, how is your wellbeing today? And you can optimize ranking for a question like that, one that improves wellbeing or reduces [inaudible 00:29:03]. Broadly, that has not been successful at companies. Those kinds of measures don't move, even measures of things like user satisfaction. So it's really these content-level surveys that have been helpful, as opposed to these more user-level surveys, across companies.
Justin Hendrix:
There are a bunch of opportunities for the folks who are listening to this, if you are researchers or folks who are studying these questions in different ways, to pick up on some of the open questions that you collect in this paper. I want to give you guys the opportunity to challenge Tech Policy Press listeners to perhaps pick up some of these questions. What are some of the ones that you think are entry points for potential new research or new investigations that need to be done?
Jonathan Stray:
Well, as Ravi said, it's hard to move broad indicators. For some of the things that are hottest in the policy discussions, polarization and wellbeing, for example, there hasn't been a lot of success, either publicly or privately, in moving these broad indicators by changing ranking algorithms. Now, that does not at all mean it's impossible. It's a very complicated thing to do, because, for example, you can change something on one platform, but people might be getting similarly harmful content on another platform, and people publishing content might have incentives to produce bad stuff. Anyway, it's scientifically hard to do. But I would say the effect of these platforms on wellbeing, polarization, false beliefs, all that stuff, is more than the platforms would have you believe and perhaps less than the regulators would have you believe. So there's broad space there to do science and to figure out, can we actually learn anything about what moves an anxiety indicator?
And I just want to give one shout out to a project that we're running out of Berkeley called the Prosocial Ranking Challenge, where we're actually soliciting new ranking algorithms and we're going to test them with a browser extension and give prizes for the people who submit the best algorithm. So there is science to be done here and papers like this, by bringing this knowledge into the public, really set the groundwork for what's been tried, what questions are still open, and where do we go from here? We want the basic science of this stuff to be public, not private. And in fact right now it's worse than private. It's not just that a bunch of companies know this stuff and we don't. It's that the individual companies know things, know parts of the story and aren't talking to each other, and that's a bad place to be as a society.
Ravi Iyer:
I'd love to challenge the Tech Policy Press listeners, and this is speaking for myself, not every author might agree with this, but not to engage in debates about whether social media as a whole is improving or harming mental wellbeing or polarization, because a lot of those measures are not very sensitive. Companies have trouble even moving satisfaction with their own products in ranking experiments. So we're going to get a lot of null results, and we're going to be tempted to interpret those null results if that's the question we ask. Instead, there's been a lot more success with ranking changes that move indicators of quality, that move experiences of bullying and harassment, that move distribution of misinformation. And if we can just agree that these are things that we don't want these platforms to incentivize, we can work on the specific design decisions, the specific ranking algorithm choices, that incentivize or disincentivize these things, rather than these broader questions, to which we'll never have an answer, and which will just distract us from the more specific design changes that are talked about in this paper that could actually improve quality.
Jonathan Stray:
And I know, Ravi, for you, that also means making a connection between those things and the types of policies that we think folks should be advancing.
Ravi Iyer:
Yeah. I mean, I am working on policy, so I've talked about our design code. We are working with the state of Minnesota on specific legislation that might take some of those design principles and move them into policy. And we're hopeful that, whether voluntarily or through legislation, some of these ideas in the paper, the idea that quality is often negatively related to engagement and that therefore in many domains we should be optimizing for quality, and that there are some specific ways companies know how to do this, can become the reality for most of us in society.
Justin Hendrix:
Well, I want to thank you both for speaking to me about this paper and thank all of the authors who are not represented on this podcast but are listed just at the top of the paper, which you can find linked in the show notes. Thank you, Jonathan, thank you, Ravi, for talking to me today.
Jonathan Stray:
Thank you so much. And I also want to give a shout-out to the people who engage in these conversations and whose names are not listed anywhere. I think it's really important to have these conversations, even if we can't always have them as publicly as we want to.
Ravi Iyer:
Yeah. Thank you, Jonathan, and thank you, Justin.
Justin Hendrix is CEO and Editor of Tech Policy Press, a new nonprofit media venture concerned with the intersection of technology and democracy. Previously, he was Executive Director of NYC Media Lab. He spent over a decade at The Economist in roles including Vice President, Business Development & Innovation. He is an associate research scientist and adjunct professor at NYU Tandon School of Engineering. Opinions expressed here are his own.