Popular music has evolved through many forms over the last several decades: from the Beatles, to Led Zeppelin, to Madonna, to Nirvana, to Kanye West, to Rihanna. There is no single sound or approach to popular music. With such variation in instruments, lyrical content, and genre, it's remarkable that such different songs can all break into the mainstream. But do these songs have something in common? Do Cardi B's songs and Harry Styles' songs share anything that might explain or predict their popularity? That is what I will be investigating in this tutorial: are there any characteristics that can predict a song's popularity?
Why ask this question? The answer would be valuable to music streaming services and radio stations, which would gain better insight into which songs are likely to be hits. It would also benefit artists, who would gain powerful insight into which aspects of a song are most predictive of its success. For musicians looking to break into radio play or popular Spotify playlists, this would be a crucial insight to acquire.
First, I need to import the appropriate packages that will be used in this project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import seaborn as sb
Now, I will import my data. It comes from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=tracks.csv: a Kaggle dataset of roughly 600 thousand songs on Spotify, with each song's popularity and a variety of other characteristics. The dataset can be downloaded from Kaggle after making a free account.
songs = pd.read_csv('archive/tracks.csv')
songs.head()
id | name | popularity | duration_ms | explicit | artists | id_artists | release_date | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 35iwgR4jXetI318WEWsa1Q | Carve | 6 | 126903 | 0 | ['Uli'] | ['45tIt06XoI0Iio4LBEVpls'] | 1922-02-22 | 0.645 | 0.4450 | 0 | -13.338 | 1 | 0.4510 | 0.674 | 0.7440 | 0.151 | 0.127 | 104.851 | 3 |
1 | 021ht4sdgPcrDgSk7JTbKY | Capítulo 2.16 - Banquero Anarquista | 0 | 98200 | 0 | ['Fernando Pessoa'] | ['14jtPCOoNZwquk5wd9DxrY'] | 1922-06-01 | 0.695 | 0.2630 | 0 | -22.136 | 1 | 0.9570 | 0.797 | 0.0000 | 0.148 | 0.655 | 102.009 | 1 |
2 | 07A5yehtSnoedViJAZkNnc | Vivo para Quererte - Remasterizado | 0 | 181640 | 0 | ['Ignacio Corsini'] | ['5LiOoJbxVSAMkBS2fUm3X2'] | 1922-03-21 | 0.434 | 0.1770 | 1 | -21.180 | 1 | 0.0512 | 0.994 | 0.0218 | 0.212 | 0.457 | 130.418 | 5 |
3 | 08FmqUhxtyLTn6pAh6bk45 | El Prisionero - Remasterizado | 0 | 176907 | 0 | ['Ignacio Corsini'] | ['5LiOoJbxVSAMkBS2fUm3X2'] | 1922-03-21 | 0.321 | 0.0946 | 7 | -27.961 | 1 | 0.0504 | 0.995 | 0.9180 | 0.104 | 0.397 | 169.980 | 3 |
4 | 08y9GfoqCWfOGsKdwojr5e | Lady of the Evening | 0 | 163080 | 0 | ['Dick Haymes'] | ['3BiJGZsyX9sJchTqcSA7Su'] | 1922 | 0.402 | 0.1580 | 3 | -16.900 | 0 | 0.0390 | 0.989 | 0.1300 | 0.311 | 0.196 | 103.220 | 4 |
Now that we have the data imported, it's time to clean it up. First, I will remove the columns that will not be considered in the analysis. Since we are looking only at the characteristics of the song, identifying aspects such as name and artist can be dropped. I will also rename the remaining columns for appearance.
drop_columns = ['name', 'artists', 'id', 'explicit', 'id_artists', 'key', 'loudness', 'mode', 'instrumentalness', 'liveness', 'time_signature']
songs = songs.drop(drop_columns, axis=1)
songs = songs.rename(columns ={'popularity': 'Popularity', 'duration_ms':'Duration', 'release_date': 'Year', 'danceability': 'Danceability', 'energy': 'Energy', 'speechiness': 'Speechiness', 'acousticness': 'Acousticness', 'valence': 'Valence', 'tempo': 'Tempo'})
songs.head()
Popularity | Duration | Year | Danceability | Energy | Speechiness | Acousticness | Valence | Tempo | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 126903 | 1922-02-22 | 0.645 | 0.4450 | 0.4510 | 0.674 | 0.127 | 104.851 |
1 | 0 | 98200 | 1922-06-01 | 0.695 | 0.2630 | 0.9570 | 0.797 | 0.655 | 102.009 |
2 | 0 | 181640 | 1922-03-21 | 0.434 | 0.1770 | 0.0512 | 0.994 | 0.457 | 130.418 |
3 | 0 | 176907 | 1922-03-21 | 0.321 | 0.0946 | 0.0504 | 0.995 | 0.397 | 169.980 |
4 | 0 | 163080 | 1922 | 0.402 | 0.1580 | 0.0390 | 0.989 | 0.196 | 103.220 |
One final tidying touch is to convert the original release dates to integer years, which will be used in later analysis.
for index, rows in songs.iterrows():
    s = songs.at[index, 'Year']
    songs.at[index, 'Year'] = int(s[0:4])
songs.head()
Popularity | Duration | Year | Danceability | Energy | Speechiness | Acousticness | Valence | Tempo | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 126903 | 1922 | 0.645 | 0.4450 | 0.4510 | 0.674 | 0.127 | 104.851 |
1 | 0 | 98200 | 1922 | 0.695 | 0.2630 | 0.9570 | 0.797 | 0.655 | 102.009 |
2 | 0 | 181640 | 1922 | 0.434 | 0.1770 | 0.0512 | 0.994 | 0.457 | 130.418 |
3 | 0 | 176907 | 1922 | 0.321 | 0.0946 | 0.0504 | 0.995 | 0.397 | 169.980 |
4 | 0 | 163080 | 1922 | 0.402 | 0.1580 | 0.0390 | 0.989 | 0.196 | 103.220 |
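The row-by-row loop works, but on a table of this size a vectorized version is much faster. A sketch of an equivalent one-liner, with a hypothetical helper name, assuming every `release_date` string begins with a four-digit year (which holds for this dataset):

```python
import pandas as pd

def year_from_release(dates):
    """Slice the leading four characters of each release-date string and
    cast to int; handles both 'YYYY-MM-DD' and bare 'YYYY' entries."""
    return dates.astype(str).str[:4].astype(int)

# e.g. songs['Year'] = year_from_release(songs['Year'])
years = year_from_release(pd.Series(['1922-02-22', '1922', '2014-06-01']))
```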
Now, we have our tidied data. Here is a list of the columns and their meanings:
1) Popularity: a measure from 0-100 of how popular the song is
2) Duration: how long the song is in milliseconds
3) Year: the year the song was released
4) Danceability: a continuous scale from 0-1 of how conducive the song is to dancing
5) Energy: a continuous scale from 0-1 of how energetic the song is
6) Speechiness: a continuous scale from 0-1 of how much spoken word the song contains
7) Acousticness: a continuous scale from 0-1 of how acoustic the song is
8) Valence: a continuous scale from 0-1 of how positive and upbeat the song is
9) Tempo: a measure of the BPM of the song
More info can be found at https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks.
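Before plotting, it's worth checking that the columns actually stay inside their documented ranges. A small sanity-check sketch (`out_of_range` is a hypothetical helper, illustrated here on a toy frame; on the real data you would pass `songs`):

```python
import pandas as pd

# Documented bounds: popularity is 0-100, the perceptual scales are 0-1.
BOUNDS = {
    'Popularity': (0, 100),
    'Danceability': (0, 1),
    'Energy': (0, 1),
    'Speechiness': (0, 1),
    'Acousticness': (0, 1),
    'Valence': (0, 1),
}

def out_of_range(df):
    """Count, for each documented column present in df, the rows outside bounds."""
    return {col: int(((df[col] < lo) | (df[col] > hi)).sum())
            for col, (lo, hi) in BOUNDS.items() if col in df.columns}

# Toy check; on the real data every count in out_of_range(songs) should be 0.
counts = out_of_range(pd.DataFrame({'Popularity': [6, 101], 'Energy': [0.4, 0.2]}))
```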
Now let's take a look at the actual distribution of popularity by plotting a histogram.
ax = songs['Popularity'].plot.hist(title='Distribution of Popularity', bins=20)
ax.set(xlabel='Popularity', ylabel='Frequency')
There are a very large number of entries with 0 popularity, which greatly skews the distribution. To remedy this, I am going to remove the 0-popularity entries.
songs = songs[songs['Popularity'] > 0]
ax = songs['Popularity'].plot.hist(title='Distribution of Popularity', bins=20)
ax.set(xlabel='Popularity', ylabel='Frequency')
The data is still very skewed to the right, but this is certainly an improvement.
Now, I will plot popularity against the remaining characteristics. This is a preliminary step to get a sense of any visible relationships between the characteristics and popularity. From here on out, I will also be working with a subset of the data, since the full dataset is too large to work with comfortably. I define a new dataframe rand that is a random sample of 2,500 entries from the original table.
no_plot_array = ['Popularity', 'Year']
rand = songs.sample(n = 2500)
for column, items in rand.items():
    if column not in no_plot_array:
        plt.scatter(x = column, y = 'Popularity', data = rand)
        plt.title("Popularity vs. {}".format(column))
        plt.xlabel('{}'.format(column))
        plt.ylabel('Popularity')
        plt.show()
Not much luck here. Other than possibly energy and danceability, no visible correlations exist in the data. As a last-ditch effort to find correlations, I am going to compute the Spearman coefficients between all of the characteristics. I use the Spearman coefficient because popularity is not continuous but ordinal, so the Pearson coefficient is not appropriate.
subset = songs[['Popularity', 'Energy', 'Danceability', 'Valence', 'Tempo', 'Acousticness']]
subset.corr(method='spearman')
Popularity | Energy | Danceability | Valence | Tempo | Acousticness | |
---|---|---|---|---|---|---|
Popularity | 1.000000 | 0.253049 | 0.180074 | -0.013187 | 0.050153 | -0.297017 |
Energy | 0.253049 | 1.000000 | 0.202052 | 0.367993 | 0.223387 | -0.699404 |
Danceability | 0.180074 | 0.202052 | 1.000000 | 0.502802 | -0.050562 | -0.187001 |
Valence | -0.013187 | 0.367993 | 0.502802 | 1.000000 | 0.121502 | -0.167450 |
Tempo | 0.050153 | 0.223387 | -0.050562 | 0.121502 | 1.000000 | -0.204376 |
Acousticness | -0.297017 | -0.699404 | -0.187001 | -0.167450 | -0.204376 | 1.000000 |
None of the correlations here is overly strong, but it is important to keep in mind that song popularity ultimately reflects human choice and opinion, which is extremely variable. With this in mind, I will continue to consider any characteristic whose absolute correlation with popularity is greater than 0.1. That leaves just three characteristics: energy, danceability, and acousticness.
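The |ρ| > 0.1 cutoff can also be applied programmatically rather than by eye. A sketch with a hypothetical `select_features` helper, illustrated on a toy frame with known rank correlations (on the real data you would pass the `subset` frame and `'Popularity'`):

```python
import pandas as pd

def select_features(df, target, threshold=0.1):
    """Columns whose absolute Spearman correlation with `target` exceeds
    `threshold` (the target column itself is excluded)."""
    rho = df.corr(method='spearman')[target].drop(target)
    return rho[rho.abs() > threshold].index.tolist()

# Toy frame: 'a' is perfectly monotone with 'y', 'b' perfectly
# anti-monotone, 'c' uncorrelated in rank.
toy = pd.DataFrame({'y': [1, 2, 3, 4, 5],
                    'a': [1, 2, 3, 4, 5],
                    'b': [5, 4, 3, 2, 1],
                    'c': [4, 3, 2, 1, 5]})
picked = select_features(toy, 'y')
```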
With this in mind, I am going to make some preliminary regression models of these three characteristics against popularity. To do this, I use the sklearn regression functions with separate training and testing sets. It took some trial and error to determine the split sizes, but I settled on a 70/30 train/test split because it yielded the best results. For each model, I print some defining features such as the slope and intercept, along with the coefficient of determination and mean squared error, to get an idea of the model's predictive power.
new_subset = rand[['Energy', 'Danceability', 'Acousticness']]
fig, ax = plt.subplots(1, 3, sharey=True, tight_layout=True)
fig.suptitle("Deviation From Linear Regression Model")
# enumerate gives the subplot index directly (Energy=0, Danceability=1, Acousticness=2)
for i, s in enumerate(new_subset):
    y = rand['Popularity'].to_numpy().reshape((-1, 1))
    X = rand[s].to_numpy().reshape((-1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    intercept = reg.intercept_
    coefficient = reg.coef_
    r_squared = reg.score(X_test, y_test)
    pred = reg.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print("Regression analysis for Popularity vs. {}".format(s))
    print("Intercept: \n", intercept)
    print("Slope: \n", coefficient)
    print("Coefficient of Determination: \n", r_squared)
    print("Mean Squared Error: \n", mse)
    print("\n\n")
    rand['{} Residuals'.format(s)] = rand['Popularity'] - (intercept[0] + coefficient[0, 0] * rand[s])
    ax[i].hist(rand['{} Residuals'.format(s)].to_numpy(), bins=10)
    ax[i].set_xlabel("Residuals for {}".format(s))
    ax[i].set_ylabel("Frequency")
plt.show()
Regression analysis for Popularity vs. Energy
Intercept: [20.35687817]
Slope: [[16.88840861]]
Coefficient of Determination: 0.09350707221084065
Mean Squared Error: 292.054481299843

Regression analysis for Popularity vs. Danceability
Intercept: [18.02216442]
Slope: [[20.79434411]]
Coefficient of Determination: 0.04694697775983081
Mean Squared Error: 307.88966827878494

Regression analysis for Popularity vs. Acousticness
Intercept: [37.12370761]
Slope: [[-17.79805842]]
Coefficient of Determination: 0.08850936708549184
Mean Squared Error: 294.5388335741142
So as you can see, the coefficients of determination are rather low. This could imply that despite any possible correlation, these variables have very little predictive power when it comes to predicting a song's popularity. However, as I noted before, popularity is inherently reflective of human behavior and opinions. So, while these coefficients of determination are certainly low, that may not necessarily rule out any predictive power just yet.
The residuals also appear to be approximately normally distributed, which is in keeping with the assumptions of linear regression.
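Eyeballing histograms can be backed up with a formal check; a sketch using SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest`), illustrated here on synthetic residuals (on the real data you would pass `rand['Energy Residuals'].to_numpy()` instead):

```python
import numpy as np
from scipy import stats

def residuals_normality_pvalue(residuals):
    """p-value of the D'Agostino-Pearson normality test; small values
    (e.g. < 0.05) suggest the residuals are not normally distributed."""
    statistic, pvalue = stats.normaltest(residuals)
    return float(pvalue)

# Synthetic illustration: normal draws should pass, uniform draws should fail.
rng = np.random.default_rng(0)
p_normal = residuals_normality_pvalue(rng.normal(size=2000))
p_uniform = residuals_normality_pvalue(rng.uniform(size=2000))
```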
Where to go now? As a next step, I am going to see if the residuals differ by year. I will divide the years into 5 distinct periods using the pandas cut function and make violin plots of the residuals, to see whether separate models for the different time periods are worth creating.
rand['Time Period'] = pd.cut(x=rand['Year'], bins=5, labels = ['1922-1942', '1943-1962', '1963-1981', '1982-2001', '2002-2021'])
fig, ax = plt.subplots()
sb.violinplot(x='Time Period', y='Energy Residuals', data=rand)
ax.set_title("Energy Residuals vs. Time")
ax.set_xlabel('Time Period')
ax.set_ylabel('Residuals')
plt.show()
fig, ax = plt.subplots()
sb.violinplot(x='Time Period', y='Danceability Residuals', data=rand)
ax.set_title("Danceability Residuals vs. Time")
ax.set_xlabel('Time Period')
ax.set_ylabel('Residuals')
plt.show()
fig, ax = plt.subplots()
sb.violinplot(x='Time Period', y='Acousticness Residuals', data=rand)
ax.set_title("Acousticness Residuals vs. Time")
ax.set_xlabel('Time Period')
ax.set_ylabel('Residuals')
plt.show()
The distribution of residuals changes drastically over time, with the model consistently underpredicting the popularity of older songs and the spread evening out as time goes on. This warrants a deeper investigation into whether these three song characteristics have varying predictive power over time. To accomplish this, I will start by making separate regression analyses for the different time periods.
First, I am going to check whether the characteristic values differ significantly over time. If so, it may be necessary to standardize energy, danceability, and acousticness.
plot = ['Energy', 'Danceability', 'Acousticness']
for p in plot:
    fig, ax = plt.subplots()
    sb.violinplot(x='Time Period', y=p, data=rand)
    ax.set_title("{} vs. Time".format(p))
    ax.set_xlabel('Time Period')
    ax.set_ylabel('{}'.format(p))
    plt.show()
Energy and acousticness vary noticeably over the different time periods, so I will standardize them so they can be compared across periods. Danceability remains fairly consistent, but for consistency I will standardize it as well. To standardize, I compute the mean and standard deviation of each characteristic within each individual year, subtract that year's mean from each value, and divide the result by that year's standard deviation. I use dictionaries to map each year to its means and standard deviations.
e_average = {}
e_std = {}
d_average = {}
d_std = {}
a_average = {}
a_std = {}
for y in rand.Year.unique():
    # comp holds only the songs from year y; note the per-year statistics
    # come from comp, not from the whole sample
    comp = rand.loc[rand['Year'] == y]
    e_average[y] = comp['Energy'].mean()
    e_std[y] = comp['Energy'].std()
    d_average[y] = comp['Danceability'].mean()
    d_std[y] = comp['Danceability'].std()
    a_average[y] = comp['Acousticness'].mean()
    a_std[y] = comp['Acousticness'].std()
Next, I create a new dataframe, drop all of the columns that are no longer needed in the analysis, and standardize energy, danceability, and acousticness. I also drop the unstandardized energy, danceability, and acousticness columns since they will no longer be necessary.
stdzd_rand = rand.copy(deep=True)
stdzd_rand = stdzd_rand.reset_index()
stdzd_rand = stdzd_rand.drop(['Energy Residuals', 'Danceability Residuals', 'Acousticness Residuals', 'Valence', 'Tempo', 'Speechiness', 'Duration'], axis=1)
stdzd_rand['Standard Energy'] = float(0)
stdzd_rand['Standard Danceability'] = float(0)
stdzd_rand['Standard Acousticness'] = float(0)
for index, rows in stdzd_rand.iterrows():
    stdzd_rand.at[index, 'Standard Energy'] = (stdzd_rand.at[index, 'Energy'] - e_average[stdzd_rand.at[index, 'Year']])/e_std[stdzd_rand.at[index, 'Year']]
    stdzd_rand.at[index, 'Standard Danceability'] = (stdzd_rand.at[index, 'Danceability'] - d_average[stdzd_rand.at[index, 'Year']])/d_std[stdzd_rand.at[index, 'Year']]
    stdzd_rand.at[index, 'Standard Acousticness'] = (stdzd_rand.at[index, 'Acousticness'] - a_average[stdzd_rand.at[index, 'Year']])/a_std[stdzd_rand.at[index, 'Year']]
stdzd_rand = stdzd_rand.drop(['Energy', 'Danceability', 'Acousticness', 'index'], axis=1)
stdzd_rand.head()
stdzd_rand.head()
Popularity | Year | Time Period | Standard Energy | Standard Danceability | Standard Acousticness | |
---|---|---|---|---|---|---|
0 | 6 | 1945 | 1943-1962 | -2.058710 | -0.456082 | 1.719453 |
1 | 13 | 1983 | 1982-2001 | -0.865288 | 0.053157 | 0.035528 |
2 | 27 | 1998 | 1982-2001 | -1.504548 | -1.370257 | 1.260471 |
3 | 39 | 2014 | 2002-2021 | 0.938486 | -0.701498 | -0.947110 |
4 | 8 | 1957 | 1943-1962 | -0.136449 | 0.010209 | 1.025020 |
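The per-year dictionaries plus iterrows loop can be collapsed into a single groupby-transform call, which is both shorter and far faster. A sketch of the equivalent z-scoring with a hypothetical helper, illustrated on a toy frame (on the real data: `standardize_within_year(rand, ['Energy', 'Danceability', 'Acousticness'])`):

```python
import pandas as pd

def standardize_within_year(df, cols):
    """Z-score each column in cols within its release year: subtract the
    per-year mean and divide by the per-year (sample) std.  Years with a
    single row yield NaN, since the sample std is undefined there."""
    out = df.copy()
    grouped = out.groupby('Year')
    for c in cols:
        out['Standard ' + c] = grouped[c].transform(lambda s: (s - s.mean()) / s.std())
    return out

# Toy illustration: two years, three songs each.
toy = pd.DataFrame({'Year': [1990, 1990, 1990, 2000, 2000, 2000],
                    'Energy': [0.2, 0.4, 0.6, 0.1, 0.5, 0.9]})
zscored = standardize_within_year(toy, ['Energy'])
```

Within each year the standardized column then has mean 0 and unit sample variance, which is exactly what the dictionary-based loop computes.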
Now, in order to complete this regression analysis, I will need to split the data into 5 separate tables for the 5 different time periods.
oldest = stdzd_rand[stdzd_rand['Time Period'] == '1922-1942']
oldest = oldest.reset_index()
older = stdzd_rand[stdzd_rand['Time Period'] == '1943-1962']
older = older.reset_index()
mid = stdzd_rand[stdzd_rand['Time Period'] == '1963-1981']
mid = mid.reset_index()
newer = stdzd_rand[stdzd_rand['Time Period'] == '1982-2001']
newer = newer.reset_index()
newest = stdzd_rand[stdzd_rand['Time Period'] == '2002-2021']
newest = newest.reset_index()
And finally, I will create the regression models for each time period following the same procedure as before with sklearn.
df_collection = [oldest, older, mid, newer, newest]
characteristics = ['Standard Energy', 'Standard Danceability', 'Standard Acousticness']
for table in df_collection:
    for c in characteristics:
        X = table['{}'.format(c)].to_numpy().reshape((-1, 1))
        y = table['Popularity'].to_numpy().reshape((-1, 1))
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        reg = LinearRegression()
        reg.fit(X_train, y_train)
        intercept = reg.intercept_
        coefficient = reg.coef_
        r_squared = reg.score(X_test, y_test)
        pred = reg.predict(X_test)
        mse = mean_squared_error(y_test, pred)
        print("Regression analysis for Popularity vs. {}".format(c))
        print("Time Period: {}".format(table.at[1, 'Time Period']))
        print("Intercept: \n", intercept)
        print("Slope: \n", coefficient)
        print("Coefficient of Determination: \n", r_squared)
        print("Mean Squared Error: \n", mse)
        print("\n\n")
Regression analysis for Popularity vs. Standard Energy
Time Period: 1922-1942
Intercept: [12.49431334]
Slope: [[4.54656561]]
Coefficient of Determination: 0.18155939408165145
Mean Squared Error: 9.219983968712825

Regression analysis for Popularity vs. Standard Danceability
Time Period: 1922-1942
Intercept: [7.08484265]
Slope: [[2.08344521]]
Coefficient of Determination: -0.38946676686365245
Mean Squared Error: 11.909715144545592

Regression analysis for Popularity vs. Standard Acousticness
Time Period: 1922-1942
Intercept: [9.58255028]
Slope: [[-2.67498598]]
Coefficient of Determination: -0.1274914616698104
Mean Squared Error: 63.36962215180933

Regression analysis for Popularity vs. Standard Energy
Time Period: 1943-1962
Intercept: [10.28241854]
Slope: [[0.17517256]]
Coefficient of Determination: -0.009481035987703335
Mean Squared Error: 125.71763330117209

Regression analysis for Popularity vs. Standard Danceability
Time Period: 1943-1962
Intercept: [10.16056537]
Slope: [[-0.19262239]]
Coefficient of Determination: -0.0019238394954044225
Mean Squared Error: 71.9625334424359

Regression analysis for Popularity vs. Standard Acousticness
Time Period: 1943-1962
Intercept: [10.08572289]
Slope: [[0.09417206]]
Coefficient of Determination: -0.005195675083753404
Mean Squared Error: 104.20725684784058

Regression analysis for Popularity vs. Standard Energy
Time Period: 1963-1981
Intercept: [23.34635326]
Slope: [[1.14051521]]
Coefficient of Determination: -0.006060557028874758
Mean Squared Error: 192.68521990466982

Regression analysis for Popularity vs. Standard Danceability
Time Period: 1963-1981
Intercept: [23.0745588]
Slope: [[-0.54227011]]
Coefficient of Determination: -0.031246719287566727
Mean Squared Error: 193.5006526599412

Regression analysis for Popularity vs. Standard Acousticness
Time Period: 1963-1981
Intercept: [24.48165504]
Slope: [[-3.49816119]]
Coefficient of Determination: 0.003350552357349823
Mean Squared Error: 156.49143746476483

Regression analysis for Popularity vs. Standard Energy
Time Period: 1982-2001
Intercept: [30.14607761]
Slope: [[1.25014216]]
Coefficient of Determination: -0.02093746566480359
Mean Squared Error: 199.37460315641493

Regression analysis for Popularity vs. Standard Danceability
Time Period: 1982-2001
Intercept: [29.99454088]
Slope: [[0.59770366]]
Coefficient of Determination: 0.0006718394974339903
Mean Squared Error: 212.75094865473588

Regression analysis for Popularity vs. Standard Acousticness
Time Period: 1982-2001
Intercept: [29.40304929]
Slope: [[-2.00655711]]
Coefficient of Determination: 0.04194263786777608
Mean Squared Error: 185.7982853401452

Regression analysis for Popularity vs. Standard Energy
Time Period: 2002-2021
Intercept: [39.55543553]
Slope: [[-0.53096196]]
Coefficient of Determination: -0.0259333793409251
Mean Squared Error: 315.6165145156702

Regression analysis for Popularity vs. Standard Danceability
Time Period: 2002-2021
Intercept: [39.17027201]
Slope: [[3.26479831]]
Coefficient of Determination: 0.03735380573057523
Mean Squared Error: 245.24007643203137

Regression analysis for Popularity vs. Standard Acousticness
Time Period: 2002-2021
Intercept: [39.54968209]
Slope: [[-0.34562031]]
Coefficient of Determination: -0.003743746785304536
Mean Squared Error: 308.6305598070967
As you can see, the coefficients of determination are actually much lower than they were without the time-period split. This indicates that these characteristics do not wield predictive power within the separated time periods either.
So what is our final answer to the original question: are there certain characteristics that can predict whether or not a song will be popular? Well, there might be, but we have not found them here. Based on the dataset I used and my analysis of the different characteristics, there is little to no evidence of significant (or any) predictive power for any of the characteristics tested. There was very little predictive power, or even correlation, between the song characteristics in the dataset and a song's popularity. Even after testing whether these characteristics were more meaningful in determining popularity within different time periods, there was actually less predictive power in each individual time period than in the aggregate dataset.
So what does this actually mean? It could mean a number of things. There could be issues with the dataset. I did not personally collect this data, so how they actually quantified qualitative characteristics such as energy, danceability, or even popularity is not immediately obvious to me and there could certainly be biases or even flaws in this methodology. It could also possibly be some combination of the characteristics that can predict a song's popularity. It could also be some quality (or combination of qualities) that were not present in this dataset. Aspects of a song that would not be able to be quantified such as lyrical content, melody, harmony, cadence, etc. can also absolutely play a crucial role in predicting a song's popularity. With more time and more people, all of these possibilities could be explored in greater detail to see if there is a more satisfying answer to my original question. But for now, I can safely say that based on this particular dataset and methodology, these tested characteristics do not explain a song's popularity.
Thank you so much for reading through this tutorial! I hope you enjoyed it. If you are interested in the topic of predicting song popularity, I have linked a paper from Stanford University about using machine learning to predict song popularity.