Data

For my machine learning class this semester, my team and I developed a model that predicts song "moods." It feeds on Spotify's Track Feature API. We composed songs from popular Spotify-created playlists that exhibit a certain mood (e.g. Happy Hits) and fetched every track's attributes, followed by one-hot encoding, normalizing, label mapping, and cleaning the data. We had about 3000 songs total that were randomly shuffled and merged into a single dataset.

To ensure all song moods are represented fairly, we chose data from playlists with a consistent and single mood. Playlists chosen are either published by Spotify or were viral user-generated playlists. We also tried to ensure that each playlist has a consistent size for each mood category (~100 songs/playlist). Each category had a different total number of songs, so we capped all at the minimum across all categories.

Which playlists do we use?

Is it exhibiting a single and consistent category?
Is it popular?
Does it have a consistent size with the rest of the playlists across the same category?
Has it been manually checked for quality?

Feature Selection

Even with normalized data. Certain features held no significant pattern to their respective labels. We ran a forward-searching Wrapper algorithm to find the optimized feature set, and we discovered that 4 features had bad accuracy when combined together and decreased overall accuracy when combined with the rest of the features. These four features were : Duration (4), tempo (4), time signature (2), and key (3), mode (3). Final feature set includes 8 features: ['loudness', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'valence']

Mistakes were made

Incorrect to use a single label prediction. As certain labels are less specific than others. For example, a song can be both happy and chill, sad and romantic, romantic and energetic and happy. After careful deliberation, we eventually changed the structure of our prediction system entirely. Instead of a single classifier, we run a model for each opposing pair of moods and got a confidence prediction. This way, the model can output multiple moods by fairly comparing class pairs separately.

Instead of predicting a single mood, we predict a confidence % between two contrasting mood pairs. Which are Happy vs. Sad and Chill vs. Energetic. This model is beneficial in two ways: Firstly, the model can output multiple moods as a confidence percentage. Second, the model is allowed to fairly predict between two polar opposite labels at each specificity level, starting from very broad (happy vs. sad) and drilling down to more specific pairs after.

MLP Optimized Parameters

Best learning rate: 0.03
Best momentum: 0.45
Optimal number of hidden layers: 4
Best number of nodes per layer: 80
Early stopping: False

Conclusion

78%

Happy vs. Sad test set average accuracy for a 10-fold CV run for both models

88%

Energetic vs. Chill test set average accuracy for a 10-fold CV run for both models