Predicting Hit Singles

26 January 2019

This was an interesting two weeks! Many lessons learned, primarily regarding the data science pipeline, statistical regression, relational databases, and the many technologies required to handle large amounts of data.

I set out to find which song ‘features’ the US national audience enjoys most in their favorite hit singles. This is easy to write in a sentence but (somewhat obviously) much more difficult to execute in practice. There’s an enormous amount of money to be made in this domain so it wasn’t terribly surprising that one of the hardest parts of this project was simply getting the data and making sure it was in a usable format.

I used two primary data sources for this project:

  • The Million Song Dataset, a “freely-available collection of audio features and metadata for a million contemporary popular music tracks.”
  • BillboardTop100of, someone’s very useful compendium of the year-by-year Billboard Top 100 rankings.

I scraped the Billboard site with BeautifulSoup to gather rankings for 5,300 songs from 1960 to 2011. Many of these also appear in the Million Song Dataset, since its curators used “popular” songs as a primary source for collection. Very convenient.
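The scraping step looked roughly like the sketch below. The table layout and column order are assumptions for illustration — the real site's markup differs year to year, so the selectors would need adjusting — and the fetching itself (e.g. with `requests`) is omitted.

```python
from bs4 import BeautifulSoup

def scrape_year(html):
    """Parse one year's Billboard Top 100 page into rank/artist/title rows.

    Assumes a simple three-column table (rank, artist, title); the
    actual page structure is a guess and may need different selectors.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Keep only data rows whose first cell is a numeric rank
        if len(cells) >= 3 and cells[0].isdigit():
            rows.append({"rank": int(cells[0]),
                         "artist": cells[1],
                         "title": cells[2]})
    return rows
```

Looping this over the 1960–2011 year pages yields one dataframe-ready list of ranked songs.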

Scraping turned out to be the easiest part of this project by far. The Million Song Dataset is rather dated (released in 2011, which is why my scraping stopped there) and all of the universities and services that used to host it are now defunct - all except for a single public AWS snapshot. Because of this I needed to sign up for AWS for the first time, spin up my own instance, attach the snapshot as a storage volume, mount it, then pull the data down to my local machine with rsync. This took a few days to figure out with some help from a friend.
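The mount-and-pull step boiled down to a few commands. Device names, paths, and the instance address here are placeholders, not the real ones — check `lsblk` on the instance for the actual device:

```shell
# On the EC2 instance, after creating a volume from the public MSD
# snapshot and attaching it (device name is an assumption):
sudo mkdir -p /mnt/msd
sudo mount -o ro /dev/xvdf /mnt/msd      # read-only is enough here

# From the local machine, pull the tree down over SSH:
rsync -avz --progress ec2-user@<instance-ip>:/mnt/msd/ ./msd-local/
```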

I needed to generate a SQL database in order to pull anything useful from the .h5 files on the snapshot - they were stored in a three-level, single-letter alphabetical directory tree, which is basically useless in terms of organization - and grepping that much data would have taken forever without the aid of a relational database. So down the SQL rabbit hole I went, also for the first time. The creators of the MSD provide a few scripts for generating SQL databases, and I modified the Python script they provided in order to extract more song features than were included by default.
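In spirit, the modified script walks the .h5 tree, reads each song's features, and loads them into one SQLite table. This is a simplified sketch, not the MSD's actual script: the field list is illustrative, and `hdf5_getters` (the helper module distributed with the MSD code) is imported lazily so the database helpers work on their own.

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS songs (
    track_id TEXT PRIMARY KEY, title TEXT, artist TEXT,
    key INTEGER, mode INTEGER, time_signature INTEGER,
    duration REAL, song_hotttnesss REAL)"""

FIELDS = ("track_id", "title", "artist", "key", "mode",
          "time_signature", "duration", "song_hotttnesss")

def insert_song(conn, song):
    """Insert one song's feature dict into the songs table."""
    placeholders = ",".join("?" * len(FIELDS))
    conn.execute("INSERT OR REPLACE INTO songs VALUES (%s)" % placeholders,
                 tuple(song[f] for f in FIELDS))

def read_h5_features(path):
    """Pull extra features out of one MSD .h5 file.

    The exact getters used here are an assumption based on the fields
    the dataset documents - swap in whichever features you need.
    """
    import hdf5_getters  # shipped with the MSD code repository
    h5 = hdf5_getters.open_h5_file_read(path)
    try:
        return {
            "track_id": hdf5_getters.get_track_id(h5).decode(),
            "title": hdf5_getters.get_title(h5).decode(),
            "artist": hdf5_getters.get_artist_name(h5).decode(),
            "key": int(hdf5_getters.get_key(h5)),
            "mode": int(hdf5_getters.get_mode(h5)),
            "time_signature": int(hdf5_getters.get_time_signature(h5)),
            "duration": float(hdf5_getters.get_duration(h5)),
            "song_hotttnesss": float(hdf5_getters.get_song_hotttnesss(h5)),
        }
    finally:
        h5.close()
```

Globbing the `*/*/*/*.h5` tree and feeding each file through `read_h5_features` into `insert_song` gives a single queryable table instead of a million scattered files.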

The Million Song Dataset provides us with a large number of useful features, a subset of which are pictured here:

Of these, I ended up paying the most attention to those that might uniquely identify traits in popular songs: key, mode, time signature, duration, artist and song “hotttnesss”, and a few others.

Cross-referencing the two dataframes I created was time consuming (again, data scientists spend ~80% of their time simply cleaning data to the point that it’s usable), so I worked with the subset of Billboard songs that matched the MillionSongs artist name column exactly - about 1,400 in total. This approach leaves a lot of useful data on the table, but with only a week and a half to complete the project and present it to my class, I opted to build a functional pipeline first and optimize later. (I do intend to revisit this in the near future, as I’m not yet satisfied with my results.)
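The exact-match join is one pandas call. The frames and column names below are toy stand-ins for the real tables, but the mechanic — an inner join on the artist column that silently drops anything without a perfect match — is the same:

```python
import pandas as pd

# Toy stand-ins for the two real tables; column names are illustrative.
billboard = pd.DataFrame({
    "artist": ["Chubby Checker", "Cream", "Unknown Act"],
    "title":  ["The Twist", "White Room", "Obscure Song"],
    "rank":   [1, 6, 88],
})
msd = pd.DataFrame({
    "artist": ["Chubby Checker", "Cream"],
    "song_hotttnesss": [0.72, 0.90],
})

# Inner join keeps only rows whose artist matches exactly, mirroring
# the ~1,400-song subset used in the project.
matched = billboard.merge(msd, on="artist", how="inner")
```

Fuzzy matching on artist *and* title would recover much of the data this leaves behind, which is part of the planned revisit.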

Even before running any regression, we can pull some interesting superlatives out of this data: Kanye is the hottest artist (which passes my personal sniff test), Cream produced the hottest single (“White Room”), and some songs actually charted in multiple years, such as “The Twist” by Chubby Checker in 1960 and 1962.
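Those superlatives fall out of a couple of one-liners. The hotttnesss numbers here are made up for illustration (only the rankings mirror the findings above):

```python
import pandas as pd

# Toy slice of the matched table; hotttnesss values are invented.
songs = pd.DataFrame({
    "artist": ["Kanye West", "Cream", "Chubby Checker", "Chubby Checker"],
    "title":  ["Stronger", "White Room", "The Twist", "The Twist"],
    "year":   [2007, 1968, 1960, 1962],
    "artist_hotttnesss": [0.95, 0.82, 0.70, 0.70],
    "song_hotttnesss":   [0.85, 0.90, 0.80, 0.80],
})

hottest_artist = songs.loc[songs["artist_hotttnesss"].idxmax(), "artist"]
hottest_single = songs.loc[songs["song_hotttnesss"].idxmax(), "title"]

# Songs charting in more than one year show up as repeated (artist, title) pairs.
repeats = songs[songs.duplicated(["artist", "title"], keep=False)]
```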

This is, unfortunately, where my 6-month-old MacBook Pro suffered a logic board short and I lost the ability to back up my work (large databases, CSV files, and Jupyter notebooks) or charge the machine through its Thunderbolt ports, so I lost a bit of time running between Apple Stores and trying to recover what I could before presenting in class. This meant my regression results looked like this on presentation day:

It’s worth clarifying here what some of these numbers mean. Adjusted R-squared is one of the fields I pay attention to when measuring the predictive power of a line fit by ordinary least squares regression - needless to say, a score this low indicates the features explain very little of the variance in chart ranking.
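For the curious, here is the arithmetic behind that score as a minimal numpy sketch (this is the standard formula, not the code from my notebook): R-squared measures how much variance the fit explains, and the adjusted version penalizes it for the number of predictors, so piling on junk features can't inflate the score for free.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS via least squares; return (R^2, adjusted R^2).

    adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    where n = observations and p = predictors.
    """
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares fit
    resid = y - Xd @ beta
    ss_res = resid @ resid                         # unexplained variance
    ss_tot = ((y - y.mean()) ** 2).sum()           # total variance
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj
```

With a handful of weak song features and ~1,400 rows, both numbers sit close to zero - hence the disappointing slide.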

This low number does not indicate failure, much as it may seem to. There is still a lot of data on the table, and I had very little time for feature engineering - combining a bit of domain knowledge with some math to improve the predictive power of the fit. With more time, I’d also include genre analysis, incorporate lyrical analysis with NLP (natural language processing) techniques, and try to revive this aging dataset by contacting the original creators to fold my additions back into the core dataset so others can build on the work.
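To make "feature engineering" concrete, here is the flavor of transform I have in mind - these particular features are hypothetical examples, not ones I actually tested:

```python
import numpy as np
import pandas as pd

def engineer(df):
    """Illustrative derived features for the song table.

    - log-duration tames the long right tail of song lengths
    - a minor-key flag and a key x mode interaction encode a bit of
      music-theory domain knowledge as plain numeric columns
    """
    out = df.copy()
    out["log_duration"] = np.log(out["duration"])
    out["is_minor"] = (out["mode"] == 0).astype(int)
    out["key_x_mode"] = out["key"] * out["mode"]
    return out
```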

This was a tough project to complete in two weeks, and it taught me a lot about selecting projects of an appropriate scope for my skill level. I’m excited to keep working in this domain and learn more about how to engineer songs for popularity based on key insights I believe I can still derive from the Million Song Dataset.