What Makes A Song Popular?

by Justin Fink

Popular music has evolved through many different forms in the last several decades: from the Beatles, to Led Zeppelin, to Madonna, to Nirvana, to Kanye West, and to Rihanna. There is no one singular sound or approach to popular music. With such huge variation in instruments, lyrical content, and genre, it's amazing that such different songs can all break into the mainstream. But is there something in common among all of these songs? Do Cardi B's songs and Harry Styles' songs have anything in common that might explain or predict their popularity? That is what I will be investigating in this tutorial: are there any characteristics that can predict a song's popularity?

Why ask this question? The answer would be beneficial to music streaming services and radio stations, who would have better insight into which songs are more likely to be hits. It would also be beneficial to artists, who would gain powerful insight into which aspects of a song are more predictive of its success. For musicians looking to break into radio play or popular Spotify playlists, this would be a crucial insight to acquire.

Data Acquisition and Tidying

First, I need to import the appropriate packages that will be used in this project.

Now, I will import my data. My data comes from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=tracks.csv. Kaggle has a dataset of 600 thousand songs on Spotify, with their popularity and a variety of other characteristics of each song. The dataset can be downloaded from Kaggle by making a free account.
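A minimal sketch of the import and loading steps, assuming pandas and that the Kaggle file has been saved locally as tracks.csv. The helper name load_tracks is my own placeholder, and the column names in the in-memory sample are assumed to match the Kaggle file; the sample is there only so the sketch runs without the download.

```python
import io

import pandas as pd


def load_tracks(source):
    """Read the Spotify tracks table from a file path or file-like object."""
    return pd.read_csv(source)


# Tiny in-memory stand-in for tracks.csv (column names are assumed).
sample_csv = io.StringIO(
    "id,name,popularity,duration_ms,release_date,danceability,energy\n"
    "a1,Song A,6,126903,1922-02-22,0.645,0.445\n"
    "b2,Song B,0,98200,1922,0.695,0.263\n"
)

tracks = load_tracks(sample_csv)
print(tracks.shape)
```

With the real download, the call would simply be load_tracks("tracks.csv").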

Now that we have the data imported, it's time to clean it up. First, I will remove the columns that will not be considered in the analysis at all. Since we are just looking at the characteristics of the song, identifying aspects of the song such as name, artist, etc. can be dropped. I will also rename the remaining columns just for appearance.
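The drop-and-rename step might look roughly like this. The raw column names are assumed to match tracks.csv, the two rows are made up, and the new display names are my own guesses at the renaming.

```python
import pandas as pd

# Hypothetical two-row sample; column names are assumed to match tracks.csv.
tracks = pd.DataFrame({
    "id": ["a1", "b2"],
    "name": ["Song A", "Song B"],
    "artists": ["['X']", "['Y']"],
    "popularity": [6, 0],
    "duration_ms": [126903, 98200],
    "release_date": ["1922-02-22", "1922"],
    "danceability": [0.645, 0.695],
    "energy": [0.445, 0.263],
    "acousticness": [0.674, 0.797],
    "valence": [0.127, 0.655],
    "tempo": [104.851, 102.009],
})

# Identifying columns say nothing about how the song sounds, so drop them.
tidy = tracks.drop(columns=["id", "name", "artists"])

# Rename the remaining columns purely for appearance.
tidy = tidy.rename(columns={
    "popularity": "Popularity",
    "duration_ms": "Duration",
    "release_date": "Release Date",
    "danceability": "Danceability",
    "energy": "Energy",
    "acousticness": "Acousticness",
    "valence": "Valence",
    "tempo": "Tempo",
})
print(list(tidy.columns))
```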

One final touch for data tidying will be to convert the original release dates to integers of just the year. This will be used in future analysis.
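One way to do the year conversion, assuming the release dates are strings of mixed precision (full date, year and month, or year only) that always begin with a four-digit year. The column names "Release Date" and "Year" are placeholders of mine.

```python
import pandas as pd

# Made-up dates covering the precisions assumed above.
tidy = pd.DataFrame({"Release Date": ["1922-02-22", "1922", "2008-06", "2020-11-13"]})

# The year is always the first four characters, whatever the precision.
tidy["Year"] = tidy["Release Date"].astype(str).str[:4].astype(int)
tidy = tidy.drop(columns=["Release Date"])

print(tidy["Year"].tolist())
```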

Now, we have our tidied data. Here is a list of the columns and their meanings:

1) Popularity: a measure from 0-100 of how popular the song is

2) Duration: how long the song is in milliseconds

3) Danceability: a continuous scale from 0-1 of how conducive the song is to dancing

4) Energy: a continuous scale from 0-1 of how energetic the song is

5) Acousticness: a continuous scale from 0-1 of how acoustic the song is

6) Valence: a continuous scale from 0-1 of how positive and upbeat the song is

7) Tempo: a measure of the BPM of the song

More info can be found at https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks.

Exploratory Data Analysis

Now let's take a look at the actual distribution of popularity by plotting a histogram.

There are a very large number of entries that have 0 popularity, and that is greatly skewing the distribution. To remedy this, I am going to remove the 0 popularity entries.
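A small sketch of the filtering step on synthetic data (300 zero-popularity rows plus 700 random scores, not the real dataset). It uses NumPy histogram counts in place of a plot so the effect on the first bin is visible as numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 zero-popularity songs plus 700 songs scored 1-100.
popularity = np.concatenate([np.zeros(300, dtype=int), rng.integers(1, 101, size=700)])
tidy = pd.DataFrame({"Popularity": popularity})

# Histogram counts in ten equal-width bins over 0-100, before filtering.
before, edges = np.histogram(tidy["Popularity"], bins=10, range=(0, 100))

# Keep only songs with non-zero popularity.
nonzero = tidy[tidy["Popularity"] > 0]
after, _ = np.histogram(nonzero["Popularity"], bins=10, range=(0, 100))

print("first bin before:", before[0], "after:", after[0])
print("rows removed:", len(tidy) - len(nonzero))
```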

The data is still very skewed to the right, but this is certainly an improvement.

Now, I will plot popularity against the remaining characteristics. This is a preliminary step to get a sense of whether there are any visible relationships between the song characteristics and popularity. From here on out, I will also be using a subset of the data, since the full dataset is too large to work with. I defined a new dataframe rand that is a random subset of 10000 entries from the original table.
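Drawing the random subset could be sketched as follows. The table here is synthetic, and the random_state value is an arbitrary choice of mine so that the same rows come back on every run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50_000

# Synthetic stand-in for the tidied table.
full = pd.DataFrame({
    "Popularity": rng.integers(1, 101, size=n),
    "Energy": rng.random(n),
})

# Random subset of 10000 rows, sampled without replacement.
rand = full.sample(n=10_000, random_state=42)
print(rand.shape)
```

Each scatter plot can then be drawn from rand rather than the full table, for example with pandas' DataFrame.plot.scatter.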

Not much luck here. Other than potentially energy and danceability, no visible correlations exist within the data. As a last-ditch effort to find any correlations, I am going to plot the Spearman coefficients between all of the characteristics. I am using the Spearman coefficient because I am treating popularity as ordinal rather than continuous, and Spearman's rank-based measure does not assume a linear relationship the way the Pearson coefficient does.

None of the correlations here seem overly strong, but it is important to keep in mind that song popularity is ultimately based on human choice and opinion, which is extremely variable. With this in mind, I will continue to consider any characteristic whose absolute value of correlation with popularity is greater than 0.1. That leaves just three characteristics: energy, danceability, and acousticness.
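The Spearman matrix and the 0.1 cutoff can be sketched with pandas' DataFrame.corr. The data below is synthetic: popularity is built from energy plus noise, and tempo is unrelated by construction, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2_000
energy = rng.random(n)

rand = pd.DataFrame({
    "Energy": energy,
    "Tempo": rng.uniform(60, 180, size=n),  # unrelated to popularity by construction
    "Popularity": np.clip(np.round(40 * energy + rng.normal(20, 10, size=n)), 1, 100),
})

# Rank-based correlation between every pair of columns.
corr = rand.corr(method="spearman")

# Keep characteristics whose |correlation| with popularity exceeds 0.1.
with_popularity = corr["Popularity"].drop("Popularity")
kept = with_popularity[with_popularity.abs() > 0.1].index.tolist()

print(with_popularity.round(2))
print("kept:", kept)
```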

With this in mind, I am going to attempt to make some preliminary regression models of these three characteristics against popularity. To do this, I used the sklearn regression functions and separate testing and training sets. It took some trial and error to determine the sizes of the testing and training sets, but I finally settled on a 30/70 split (30% testing, 70% training) because it yielded the best results. For each model, I will print out some defining features such as the slope and intercept, as well as the coefficient of determination and mean squared error, to get an idea of the predictive power of the models.
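A sketch of one simple linear regression per characteristic with a 30% test split, using scikit-learn's train_test_split, LinearRegression, r2_score, and mean_squared_error. The data is synthetic (popularity depends only on energy here), and the random_state values are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1_000
rand = pd.DataFrame({
    "Energy": rng.random(n),
    "Danceability": rng.random(n),
    "Acousticness": rng.random(n),
})
# Synthetic popularity driven only by energy plus noise (not the real data).
rand["Popularity"] = 30 + 20 * rand["Energy"] + rng.normal(0, 5, size=n)

results = {}
for feature in ["Energy", "Danceability", "Acousticness"]:
    # 70% of the rows train the model, 30% are held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        rand[[feature]], rand["Popularity"], test_size=0.3, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    results[feature] = {
        "slope": model.coef_[0],
        "intercept": model.intercept_,
        "r2": r2_score(y_test, predicted),
        "mse": mean_squared_error(y_test, predicted),
    }

for feature, stats in results.items():
    print(feature, {key: round(float(value), 3) for key, value in stats.items()})
```

The residuals for any one model are just y_test minus predicted, which is what the later residual checks work from.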

So as you can see, the coefficients of determination are rather low. This could imply that despite any possible correlation, these variables have very little predictive power when it comes to predicting a song's popularity. However, as I noted before, popularity is inherently reflective of human behavior and opinions. So, while these coefficients of determination are certainly low, that may not necessarily rule out any predictive power just yet.

The residuals do appear to be approximately normally distributed as well, which is in keeping with the assumptions of linear regression.

Testing Different Time Periods

Where to go now? As a next step, I am going to see if the residuals differ at all by year. To do this, I will divide up the years into 5 distinct categories and make a violin plot of the residuals to see if it is worth testing whether or not separate models for the different time periods should be created. I will use the pandas cut function to accomplish this.
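The binning step might be sketched like this with pandas' cut function. I assume five equal-width bins, since the exact bin edges are not stated, and the residuals are random placeholders rather than output from a fitted model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 2_000
rand = pd.DataFrame({
    "Year": rng.integers(1922, 2021, size=n),
    "Residual": rng.normal(0, 10, size=n),  # placeholder for actual minus predicted
})

# Split the range of years into 5 equal-width periods.
rand["Period"] = pd.cut(rand["Year"], bins=5)

# Numeric summary of the residuals per period (what a violin plot would show).
summary = rand.groupby("Period", observed=True)["Residual"].agg(["count", "mean", "std"])
print(summary.round(2))
```

A violin plot of Residual against Period, for instance with seaborn's violinplot, would show the same comparison graphically.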

The distribution of residuals changes drastically over time, with the model consistently underpredicting older song popularity and a more even spread as time goes on. This warrants a deeper investigation into whether or not these three song characteristics have varying predictive power throughout time. To accomplish this, I will start by making separate regression analyses for the different time periods.

First, I am going to check if the characteristic values differ significantly throughout time. If this is the case, it may be necessary to standardize energy, danceability, and acousticness.

Energy and acousticness vary noticeably over the different time periods so I will standardize them so they can be compared across different time periods. Danceability remains fairly consistent but for consistency, I will standardize this variable as well. In order to standardize, I will calculate the mean and standard deviation of these values for each individual year and subtract the mean from values in that particular year and divide that result by the standard deviation. I use dictionaries to create a map of the means and standard deviations for each year.
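The per-year standardization described above can be sketched with dictionaries of yearly means and standard deviations. The data is synthetic, and the "Standardized ..." column names are placeholders of mine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 600
rand = pd.DataFrame({
    "Year": rng.integers(2018, 2021, size=n),  # three years: 2018, 2019, 2020
    "Energy": rng.random(n),
    "Danceability": rng.random(n),
    "Acousticness": rng.random(n),
})

for feature in ["Energy", "Danceability", "Acousticness"]:
    by_year = rand.groupby("Year")[feature]
    means = by_year.mean().to_dict()  # {year: mean of the feature in that year}
    stds = by_year.std().to_dict()    # {year: standard deviation in that year}
    # Subtract the song's own year mean, then divide by that year's deviation.
    rand["Standardized " + feature] = (
        rand[feature] - rand["Year"].map(means)
    ) / rand["Year"].map(stds)

# Within each year the standardized values should have mean 0 and deviation 1.
check = rand.groupby("Year")["Standardized Energy"].agg(["mean", "std"])
print(check.round(6))
```

A year containing a single song has no defined standard deviation, so its standardized values would come out as NaN under this approach.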

Next, I create a new dataframe, drop all of the columns that are no longer going to be used in the analysis, and standardize energy, danceability, and acousticness. I also drop the original energy, danceability, and acousticness columns since the unstandardized values will no longer be necessary.

Now, in order to complete this regression analysis, I will need to split the data into 5 separate tables for the 5 different time periods.

And finally, I will create the regression models for each time period following the same procedure as before with sklearn.
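Splitting into one table per period and fitting a model on each could look roughly like this. The data is synthetic with a weak built-in effect, only one standardized characteristic is shown, and the five equal-width periods and random_state are assumptions of mine.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 2_000
rand = pd.DataFrame({
    "Year": rng.integers(1922, 2021, size=n),
    "Standardized Energy": rng.normal(0, 1, size=n),
})
# Synthetic popularity with a weak dependence on the standardized feature.
rand["Popularity"] = 30 + 3 * rand["Standardized Energy"] + rng.normal(0, 10, size=n)
rand["Period"] = pd.cut(rand["Year"], bins=5)

# One table per time period.
tables = {period: table for period, table in rand.groupby("Period", observed=True)}

r2_by_period = {}
for period, table in tables.items():
    X_train, X_test, y_train, y_test = train_test_split(
        table[["Standardized Energy"]], table["Popularity"],
        test_size=0.3, random_state=0,
    )
    model = LinearRegression().fit(X_train, y_train)
    r2_by_period[str(period)] = r2_score(y_test, model.predict(X_test))

for period, r2 in r2_by_period.items():
    print(period, round(r2, 3))
```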

As you can see, the coefficients of determination are actually much lower than they were without the separate time periods. This would indicate that these characteristics do not wield predictive power in these separated time periods.

Conclusion

So what is our final answer to the original question? Are there certain characteristics that can predict whether or not a song will be popular? Well, there might be, but we have not found them here today. Based on the dataset I used and my analysis of the different characteristics, there is little to no evidence to suggest significant (or any) predictive power for any of the characteristics tested. There was very little predictive power, or even correlation, between the song characteristics in the dataset and the popularity of a song. Even after testing whether these characteristics were more meaningful in determining song popularity within different time periods, there was actually less predictive power in each of the individual time periods than there was in the aggregate dataset.

So what does this actually mean? It could mean a number of things. There could be issues with the dataset. I did not personally collect this data, so how characteristics such as energy, danceability, or even popularity were actually quantified is not immediately obvious to me, and there could certainly be biases or even flaws in that methodology. It could also be that some combination of the characteristics predicts a song's popularity, or that some quality (or combination of qualities) not present in this dataset does. Aspects of a song that cannot easily be quantified, such as lyrical content, melody, harmony, cadence, etc., can also absolutely play a crucial role in predicting a song's popularity. With more time and more people, all of these possibilities could be explored in greater detail to see if there is a more satisfying answer to my original question. But for now, I can safely say that based on this particular dataset and methodology, these tested characteristics do not explain a song's popularity.

Thank you so much for reading through this tutorial! I hope you enjoyed it. If you are interested in the topic of predicting song popularity, I have linked a paper from Stanford University about using machine learning to predict song popularity.