Fairness in Question: Do Music Recommendation Algorithms Value Diversity?


This piece is a follow-up to the panel discussion with streaming executives on Fairness in Discovery Algorithms, which I organized for Réseau MAP’s JIRAFE Paris event in September 2021. Even though the perspective presented in this article is largely shared across the streaming market, all opinions are my own, and they don’t represent the official views of any of the companies mentioned in this text.

As a former product director at a streaming platform, I often receive questions like “do streaming services choose to promote popular artists over indies and niche music?” Intuitively, people assume that on these big platforms “the rich get richer”: algorithms prefer artists that are already popular, which, in turn, makes them even more popular.

Those concerns are even more relevant now that Spotify has released its Discovery Mode. This new feature enables artists to boost tracks across Spotify’s algorithm-driven properties in exchange for a commission on the revenue, making its recommendation system even less transparent than it already was.

Concerns about algorithmic bias and discrimination aren’t limited to the music industry. Algorithmic fairness has recently received a lot of attention in the field of artificial intelligence, with numerous papers investigating the extent to which recommender systems might discriminate against specific subsets of the population. Spotify also contributed to that body of research, publishing an article on the impact of diverse listening on business metrics. It stood out from the crowd as the only one that accounted for business KPIs as well as raw consumption.

In today’s piece, we’ll go through the main elements of this problem of fair and diverse recommendations within the streaming system: 

  • What’s the definition of “fair”?
  • Can fairness be measured?
  • Do streaming platforms have an incentive to favor some creators over others, i.e., decrease diversity?
  • If not, why does it still happen? 
  • What can we do to prevent it?

If you’re interested in digging a little bit deeper, here is a quick summary of the research papers that form the basis of my analysis.

The Definition of “Fair”

In the Oxford Dictionary, fairness is defined as “impartial and just treatment or behavior without favoritism or discrimination”. In the context of music recommendation, however, fairness is often defined in terms of exposure or attention. Streaming services are also a two-sided marketplace, which means that “impartial and just treatment” must apply to both streaming services’ users and artists. In this piece, we will mainly focus on fairness in the context of artist exposure — equality of opportunity for all artists would be considered fair.

Can fairness be measured?

Diversity metrics have been defined to measure and quantify the extent of bias and discrimination within a particular system. When it comes to streaming recommendations, these metrics usually take into account such factors as:

  • How many artists users engage with: the more artists users listen to, the more diverse their experience is.
  • How similar these artists are: adding music similarity to the diversity metric makes it more accurate, as it considers not only the raw number of recommended artists but also how different these artists are from one another. The more users listen to artists from different genre/mood categories, the more diverse their experience is.
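
To make these two factors concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not a metric used by any streaming service: the function names, the toy artist tags, and the choice of Jaccard distance between tag sets as the similarity measure are all hypothetical.

```python
from itertools import combinations


def unique_artist_count(plays):
    """Breadth: how many distinct artists appear in a listening history."""
    return len(set(plays))


def jaccard_distance(tags_a, tags_b):
    """Dissimilarity between two tag sets: 0 = identical, 1 = nothing in common."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return 1.0 - len(tags_a & tags_b) / len(union)


def mean_pairwise_distance(plays, artist_tags):
    """Spread: average dissimilarity over all pairs of distinct artists played."""
    artists = sorted(set(plays))
    pairs = list(combinations(artists, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_distance(artist_tags[a], artist_tags[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical genre/mood tags, made up for this example.
artist_tags = {
    "A": {"rock", "indie"},
    "B": {"rock", "indie"},
    "C": {"jazz"},
    "D": {"techno"},
    "E": {"rock", "indie"},
}

narrow = ["A", "B", "E", "A"]  # three artists, all tagged alike
broad = ["A", "C", "D", "C"]   # three artists, no tags in common

for name, plays in [("narrow", narrow), ("broad", broad)]:
    print(name, unique_artist_count(plays), mean_pairwise_distance(plays, artist_tags))
```

Both listeners play three distinct artists, so a count-only metric rates them the same. The similarity-aware score separates them: 0.0 for the first listener and 1.0 for the second.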

Research has shown that digital music consumption does entail broader listening behavior, but that doesn’t necessarily mean that the resulting artist sets are very diverse. Still, the number of unique artists played per year keeps growing. For those my age or older: remember how we could only buy a couple of CDs every year?

Do streaming platforms have an incentive to favor some creators over others, i.e., decrease diversity?

In my own experience, and Spotify’s paper confirms it, key business metrics like conversion to premium offers and retention correlate with diverse listening. This means that streaming services have an incentive to increase the diversity of consumption on the platform in the long term.

The paper quantifies something we already knew: as you discover music you enjoy through a platform, you become more likely to stay with that service. The opposite is also true: if you get stuck with the same old set of recommendations, you are more likely to get frustrated and churn.

Figure: Recommendation diversity leads to a higher chance of premium conversion on streaming services. Users with more diverse listening behavior are much more likely to get a Premium subscription. Source: Spotify Research

The graph above also illustrates a major dilemma for people working on music recommendations. To optimize short-term satisfaction, they have to play it safe and recommend music users will surely love. Very likely, these recommendations will be very similar to their recent preferences. On the other hand, to optimize long-term satisfaction, they need to make sure that users don’t get stuck in their music bubble and their listening remains diverse.

“This is a tension between short- and long-term goals: if we need to recommend content urgently, a good strategy is to promote relevance (and thus discourage diversity), but if we want to attract and retain users, ensuring that consumption is sufficiently diverse appears important. The challenge going forward will be to develop methods and algorithms that can simultaneously deal with these conflicting incentives.”

Spotify

Why are recommender systems unfair?

Several previous investigations found that, regardless of the streaming platform, recommender systems are biased and unfair to specific groups of both users and artists.

Bias in recommendations can originate from different sources, primarily: 

  1. From the datasets used to train the algorithms: if there is more consumption or more data on a specific group of users or artists;
  2. From algorithmic bias: recommendation algorithms might propagate the existing bias in the data, or even worse, intensify it; for example, algorithms might recommend popular content even to users who are not interested in popular content.

The feedback loop

Popularity is probably the most apparent bias known in recommender systems. If your goal is to optimize for short-term satisfaction, taking the song’s popularity into account is usually a good bet, especially if you know little about users’ tastes (an issue known as the cold start problem). These popular recommendations are then passed down to users, and their feedback (like play counts and listening histories) is logged and added to databases, which algorithms then use to build a new set of recommendations. This “self-feeding” system is known as a feedback loop, and as a consequence of it, the rich indeed get richer as smaller artists get poorer.

Figure: The feedback loop problem in music recommendation algorithms
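
The loop is easy to reproduce in a toy simulation. The sketch below is deliberately simplistic and entirely hypothetical (the artist names, starting play counts, and the number of plays a recommendation generates are all made up): each round, the "algorithm" recommends the most-played artists, and the plays those recommendations generate are logged back into the same counts that drive the next round.

```python
def simulate_feedback_loop(play_counts, rounds, top_k=2, rec_plays=100, organic_plays=10):
    """Recommend the top_k most-played artists each round and log the resulting plays.

    Every artist receives `organic_plays` streams per round; recommended
    artists receive an extra `rec_plays` on top. Returns the final counts
    without modifying the input.
    """
    counts = dict(play_counts)
    for _ in range(rounds):
        # The recommender only looks at logged play counts.
        recommended = sorted(counts, key=counts.get, reverse=True)[:top_k]
        for artist in counts:
            counts[artist] += organic_plays
        for artist in recommended:
            counts[artist] += rec_plays
    return counts


def top_share(counts, top_k=2):
    """Fraction of all plays captured by the top_k artists."""
    ranked = sorted(counts.values(), reverse=True)
    return sum(ranked[:top_k]) / sum(ranked)


# Hypothetical starting point: a fairly even catalogue.
start = {"star": 120, "rising": 110, "indie_a": 100, "indie_b": 90, "indie_c": 80}
end = simulate_feedback_loop(start, rounds=10)

print(round(top_share(start), 2))  # 0.46
print(round(top_share(end), 2))    # 0.81
```

The two leading artists start with a modest head start and 46% of all plays. After ten rounds of being recommended because they were already ahead, they hold 81%.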

This diversity problem is also relevant to gender bias, at both the artist and the user level. Several investigations showed that most of the recommendation algorithms studied tend to be unfair towards female users, who receive less accurate recommendations. As for female creators, I couldn’t summarize it better than Ferraro, Serra, and Bauer in their research paper about the gender imbalance in music recommenders:

“Female artists are less likely to reach an audience only for being female, they are underrepresented in charts and awards nominations, and less radio air time is dedicated to females. This bias is also present in streaming services. For example, the largest proportion of female and mixed-gender artists appear in the lower levels of popularity.”

Ferraro et al., 2021

In the best-case scenario, algorithms take this imperfect reality of the music business and reproduce it, treating these imbalances as if they were based on actual user preferences. In the worst case, these biases are fed into the feedback loop and further amplified.

Filter bubbles

The term “filter bubble” refers to the results of the algorithms guiding our online media consumption. According to Eli Pariser, personalized algorithms create “a unique universe of information for each of us … which fundamentally alters the way we encounter ideas and information.” In the context of music consumption, that means focusing recommendations on what users already like and leaving out unsure or unproven music styles, thus “trapping” users in their own musical universe. If users don’t go out of their way to break that bubble, they will miss out on other styles they might like to discover.

Spotify’s research demonstrates this very point: algorithm-driven listening through recommendations is associated with reduced consumption diversity. When users become more diverse in their listening, they shift away from algorithmic consumption and increase their “organic” consumption. Why? Recommender systems encourage people to concentrate on an overly narrow set of content and settle them in these music “filter bubbles”. And if users stick with algorithmic-only listening, they risk getting over-specialized recommendations that only worsen over time.

What can we do to counteract recommendation bias?

How can platforms recommend content that users are likely to enjoy in the short term while simultaneously ensuring they can remain diverse in their long-term consumption? The good news is that algorithms are very good at doing what we set them out to do. 

Including diversity measures alongside accuracy and the other metrics used to judge the algorithms is a good starting point. Fortunately, as we’ve seen above, DSPs (digital service providers) have solid business reasons to develop such diversity-aware recommendation methods.

TikTok’s recommendation algorithm was recently named one of the top 10 breakthrough technologies by MIT Technology Review. What’s innovative in their approach isn’t the algorithm itself but the metrics they’re optimizing for, which weight diversity more heavily than other factors. TikTok broke traditional rules by deliberately interrupting repetitive patterns to avoid filter bubbles and favoring emerging creators over the most popular ones to avoid feedback loops. That said, such an approach is not necessarily a silver bullet for streaming services and music recommendations. The immersive nature of video consumption makes it a lot easier to apply such concepts than in the realm of off-screen audio, where repeat listens and familiarity play a prominent role in song adoption.

When it comes to algorithmic music consumption, there are a few debiasing techniques currently being investigated:

  • Re-ranking: shuffling the songs within the final recommendation lists to build a more diverse listening experience, e.g., bumping up less popular artists or placing the first female artist within the top-3, etc. So far, re-ranking has proved to be the most effective way of building a more diverse set of recommendations. 
  • Rebalancing: relabelling data and resampling training datasets to get an equal number of labels for all the various style/genre/gender groups.
  • Regularization: adding a regularization term for fairness and diversity among the target metrics used to judge the performance of recommender systems.
  • Adversarial training: introducing entirely new training datasets: for example, limiting the dataset to organic-only listening to avoid the feedback loop problem.
  • Counterfactual intervention: testing the algorithm’s bias by manually switching some of the sensitive attributes within the dataset. For example, you could set all artists labeled as male to female (I’m afraid that’s the level of detail for gender metadata currently available to streaming services). If the algorithm is unbiased with regard to gender, that intervention shouldn’t affect the final recommendation output.
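
To give a feel for the first technique on this list, here is a minimal re-ranking sketch in Python. It is only one of many possible approaches and rests on my own assumptions: the `rerank` function, the `popularity_weight` parameter, the track names, and the scores are all hypothetical, and real systems use far richer signals. The idea is simply to take the list the recommender already produced and reorder it so that very popular tracks pay a small penalty.

```python
def rerank(candidates, popularity_weight=0.3):
    """Reorder candidates by relevance minus a popularity penalty.

    `candidates` is a list of (track, relevance, popularity) tuples, with
    both scores in [0, 1]. A popularity_weight of 0 keeps the pure
    relevance order; higher values push less popular tracks up the list.
    """
    def adjusted_score(item):
        _, relevance, popularity = item
        return relevance - popularity_weight * popularity

    ranked = sorted(candidates, key=adjusted_score, reverse=True)
    return [track for track, _, _ in ranked]


# Hypothetical output of a recommender: (track, relevance, popularity).
candidates = [
    ("hit_single", 0.90, 1.00),
    ("chart_regular", 0.85, 0.80),
    ("niche_gem", 0.80, 0.10),
    ("deep_cut", 0.70, 0.05),
]

print(rerank(candidates, popularity_weight=0.0))
# ['hit_single', 'chart_regular', 'niche_gem', 'deep_cut']
print(rerank(candidates, popularity_weight=0.3))
# ['niche_gem', 'deep_cut', 'chart_regular', 'hit_single']
```

Nothing is removed from the list and the underlying model is untouched, which is a large part of re-ranking's appeal: the trade-off between relevance and diversity sits in a single, adjustable weight.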

The other way of creating a more diverse recommendation system is to influence the non-algorithmic listening within the platform. Consumption stemming from editorial recommendations usually accounts for about 15% of total streams on a streaming platform, which is on par with the share of algorithm-mediated consumption. So, optimizing editorial programming with regard to diversity can have a significant impact on other types of recommendations and overall consumption on the platform.

The biggest and brightest example of that technique can be found in Spotify’s EQUAL initiative, designed to create more space for female artists within Spotify’s editorial landscape. To find out a bit more about how Spotify is tackling the problem of fair recommendation, I reached out to Antoine Monin, Head of Music at Spotify France & Benelux, and asked him to share his perspective on the topic:

“If you approach the recommendation fairness as just a product problem, you really are stuck between a rock and a hard place keeping the users happy with their recommendations while also driving them to experience new music and expand their tastes. We know that diverse listening is a good thing, even from a strictly business perspective, but we also can’t force our users into it. So the whole new way to approach that problem is to try and change the overall consumption landscape itself and plug into the editorial side of Spotify. 

If we influence that editorial-mediated listening by creating a more diverse set of playlists that our users love and go back to, we can use that data to better our recommendations. That way, we can gradually guide both our users and our recommender systems into a more diverse place. That’s one of the reasons behind the launch of our EQUAL program, and we are already seeing its first impact on listening diversity across the platform.”

Antoine Monin, Head of Music at Spotify France & Benelux

— 

A mix of these debiasing techniques and overarching initiatives designed to influence listening across entire platforms might prove to be a solution to the problem of fairness and diversity in music recommender systems. That said, we have to be careful not to fall into yet another trap while reaching for more diversity. Some musical genres are often perceived as more innovative and noble than others: for example, it might seem more diverse and “cultured” to listen to classical music than hard rock. Does that mean that algorithms should suggest free jazz to everyone to avoid trapping us in our filter bubbles and satisfy a normative view of what should be admired? I’m not sure either 🙂

So, can algorithms be blamed for recommending similar content? As a product person, I would argue that making a good recommendation is as much a matter of user experience as an algorithmic/technical one. What matters is setting expectations: users are much more likely to enjoy an adventurous recommendation that takes them out of their bubble if they expect it. Is it reasonable to expect an algorithm to guess which music style we’ll like next? Probably. Is it reasonable to expect an algorithm to guess when to suggest such adventurous recommendations? Maybe, but it’s a high expectation.