Improving a Music Website's User Experience
In the beginning, a music website was born...
For our capstone project, we had the exciting opportunity to provide data science consulting services to a music website startup (the "Company"). The Company is an invite-only online music rating community that connects artists and fans. The website provides a platform for artists to share their work with music fans and gain valuable feedback before releasing their songs to a general audience. Meanwhile, music fans get to preview songs from their favorite artists with the prospect of receiving special reward points based on their activity on the website. These points can be redeemed for rewards such as special invites to promotional events by the artists. A win-win situation for both artists and fans.
Project Scoping in Collaboration with Client
Since the Company was a pre-seed stage startup whose product was still in beta phase, we had wide latitude to explore, in consultation with the client, potential project ideas. This freedom was a great opportunity to develop client management skills for the data science team, which consisted of eight talented individuals from a diverse range of backgrounds (physics, quantitative finance, philosophy, social-organizational psychology, economics, operations research, and software engineering).
For the project, the client provided us with access to their MySQL database. Our first task was to examine the data, explore the website, and consult with our client on what directions we should ideally take. Our initial meeting with the client focused on the goal of the Company, which was to create a website that promoted the work of up-and-coming artists. The website was unique in that it allowed artists and users to interact with each other so that artists could receive constructive feedback from their users. The client seemed interested in finding ways for artists to receive quality feedback from their users. Another idea discussed during this meeting was a dashboard that could filter users by their demographic profile to provide artists with more detailed information on how they were being rated across each segment of the listening population. We also discussed the creation of a comment filter using natural language processing ("NLP") in conjunction with, for example, a Naïve Bayes algorithm, to ensure that artists could receive the most constructive feedback possible.
After consulting with the client a second time, our team’s goals ended up revolving around three themes. One was to improve the user experience of the website. A second was to improve the artist experience of the website. Lastly, we thought our insights could assist in guiding the direction of the growth of the website. With these broad themes in mind, we settled on two concrete project tracks, based on the data available and what could deliver the most value for the client in the time given to complete the project. For both project tracks, it was more important to deliver a set of tools that could serve as a broad foundation for future analytics as the Company attracts more artists and users, rather than to gain immediate insight from the data currently available.
Project Track 1: Social Network Analysis
The first project track was to map out user relationships on the website by performing various social network analyses. This knowledge could be used to improve both artist and user experience by exposing the most influential users and artists within the website. More specifically, this information could be used to help artists market their music more effectively by identifying the most influential users. For users, the information could eventually be incorporated into a song recommendation system, our second project track.
Project Track 2: Recommendation System
The second project track was to develop a recommendation system that could improve the user experience by exposing users to music that they haven’t heard of but would most likely enjoy listening to. We used a collaborative filtering framework that considers both explicit (user ratings) and implicit feedback (music spotlighted and music saved) to derive a list of recommended songs for each user.
As part of our project deliverable, we created a Flask app to present our work. This involved creating a data pipeline in which we extracted data from the client's database into our analysis environments (Jupyter Notebook, RStudio). We then performed our various analyses under both project tracks and saved the results as pickle objects. Then, using Flask, we developed an API that could retrieve our analyses based on the client's query parameters and return the results on localhost or a web server. The following provides further detail on our work.
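As a rough illustration of the pickle-plus-Flask pattern described above, the sketch below saves a small set of precomputed results to a pickle file, loads it back, and exposes it through a single endpoint. The route name, the `n` query parameter, and the sample data are hypothetical and are not taken from the project's actual app.

```python
import os
import pickle
import tempfile

from flask import Flask, jsonify, request

# Hypothetical precomputed results: user id -> ranked list of song ids.
precomputed = {27: [101, 205, 33], 42: [7, 8]}

# Offline step: persist the analysis results as a pickle file.
results_path = os.path.join(tempfile.mkdtemp(), "recommendations.pkl")
with open(results_path, "wb") as f:
    pickle.dump(precomputed, f)

# Online step: load the pickle once at startup and serve it through an API.
with open(results_path, "rb") as f:
    recommendations = pickle.load(f)

app = Flask(__name__)


@app.route("/recommendations/<int:user_id>")
def get_recommendations(user_id):
    """Return up to `n` precomputed recommendations for one user."""
    n = request.args.get("n", default=10, type=int)
    if user_id not in recommendations:
        return jsonify({"error": "unknown user"}), 404
    return jsonify({"user_id": user_id, "songs": recommendations[user_id][:n]})
```

Because the expensive analysis happens offline, the endpoint only performs a dictionary lookup. The app can be exercised without a running server through `app.test_client()`, or served locally with the `flask run` command.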
Social Network Visualizations
We created three types of social network graphs summarized below.
|Type of Social Graph|Description|
|Undirected||No visual indication of the direction of connection between nodes|
|Directed||Direction of connection between two nodes indicated by broader edge close to the receiving node|
|Bimodal||Two types of nodes in the network, such as user and song|
For each type of social network graph, we created visualizations of four different modes of connections on the website. They are summarized below:
|Type of Connection|
|Follower ---> Followed|
|Message Sender ---> Message Recipient|
|Music Viewer ---> User Viewed|
|Social Link Clicker ---> Social Link Clicked|
For illustrative purposes, the following are some examples of directed and undirected graphs. The first pair of graphs are directed and undirected graphs mapping the messages sent from one person to another. The artists are represented as red nodes and the users are represented as blue nodes. The directed graph on the right shows who received the message for each connection, as represented by the darker edges close to the node. One can see that some people are messaging each other (dark edges on both ends of the connection) while others are not. Furthermore, one can see that some nodes have multiple dark edges around them, indicating that the person is receiving messages from multiple users. Knowledge of communication patterns can give an idea of who the more influential members of a network are.
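The two patterns described above (mutual messaging and members who receive messages from several senders) can also be read directly off a directed graph. The following is a minimal NetworkX sketch; the sender/recipient pairs are invented for illustration and are not the client's data.

```python
import networkx as nx

# Hypothetical (sender, recipient) pairs; not data from the client's database.
messages = [(1, 2), (2, 1), (3, 2), (4, 2), (4, 5)]

# Directed graph keeps track of who sent the message to whom.
directed = nx.DiGraph()
directed.add_edges_from(messages)

# Undirected view only records that two members are connected.
undirected = directed.to_undirected()

# Pairs who message each other (an edge exists in both directions).
mutual = {
    frozenset((sender, recipient))
    for sender, recipient in directed.edges
    if directed.has_edge(recipient, sender)
}

# Members who receive messages from more than one distinct sender.
popular = [node for node, in_deg in directed.in_degree() if in_deg > 1]

print("mutual pairs:", mutual)
print("multiple senders:", popular)
```

Note that the undirected graph collapses the two messages between members 1 and 2 into a single edge, which is why the directed view is needed to see reciprocity.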
On the website, one can also follow other users, similar to the feature on other social network websites such as Twitter, Instagram, and Facebook. Unlike these other social network websites, however, as of the writing of this blog post, the followed cannot broadcast his or her activities to the follower(s). Nevertheless, one can see the potential influence in the network that a person with many followers can have, such as bringing attention to a song that the person is listening to. Below are two graphs representing followers and the members they follow. Given that every person who joins the website automatically follows the CEO, one can see a dark mass surrounding the node representing the CEO in the directed graph on the right. One can also detect that there are some artists, represented by the red nodes, starting to amass their own groups of followers.
The next set of graphs shows connections based on who was viewing whose music. It makes sense that artists are in the center with all edges connecting to them. As the Company expands, one would expect the groups centered around the artists to more visibly differentiate themselves in the network. For example, one might see artists representing a certain music genre at the center of one cluster of users and artists representing another music genre at the center of another cluster.
The last set of undirected and directed graphs represents connections in which the social media link of a member has been clicked. Each member of the website can add link(s) to his or her profiles on other websites such as Twitter, Facebook, Instagram, YouTube, and Spotify. As can be seen in the graphs, this form of connection has not seen as much activity as the previous connections. Furthermore, one can see that some artists' links are only being clicked by the artists themselves (no edges) while others are getting clicked by other artists and users as well. As the Company grows, this form of connection can provide useful information on which artists and users are successfully generating interest in their other digital media platforms, an indication of influence beyond the Company's website.
The previous graphs were created using Python's NetworkX library. We also created social network visualizations in R using more advanced graphics (D3), an example of which is the following:
Lastly, we experimented with bimodal graphs which consist of individuals and songs. Because of issues with interpretability of the graphs, given the current data available, we decided not to include those graphs in the blog. As the volume of data increases and certain meaningful patterns between individuals and songs emerge, these graphs should provide more useful insight into the activity of users on the website.
In summary, although the data was very limited, one can already discern from the previous graphs certain clusters of connections between different users and artists as well as individuals who are better connected than others. However, social network graphs alone don't necessarily make it easy to identify influence, which brings us to our next set of analyses.
Social Network Centrality Measures
To gain a more precise measure of an individual's influence or centrality, in the jargon of social network analysis, we calculated various centrality metrics for each of the four connections illustrated previously. The following is a brief explanation of the four types of centrality metrics examined with a fictitious diagram(1) illustrating each.
Degree Centrality ("The Celebrities")
Degree centrality is determined by how many edges are connected to a node. Alice, in this case, has the highest degree centrality with five connections while Raphael has the next highest degree centrality with four connections. In terms of the Company's website, if an influential user, whether artist or music fan, broadcasts information, many users will immediately see it because of the many direct connections.
Betweenness Centrality ("The Bridges")
Betweenness centrality measures, intuitively, the importance of individuals in forming a connection between cliques of individuals. More precisely, it is the average proportion of shortest paths between any two nodes in a network that pass through a given user. For example, in the diagram below, Alice’s clique is on the left and Aldo’s clique is on the right. Raphael has the highest betweenness centrality score because he lies between these two cliques. In terms of the Company's website, if an influential user, as measured by the betweenness centrality measure, broadcasts information, it will reach users otherwise not connected.
Closeness Centrality ("The Networkers")
Closeness centrality measures the average degrees of separation to any other individual in the network. In this example, Raphael is the most influential because he is closer, on average, to each member of the network. This could be useful for understanding which user has the ability to broadcast information to all members of a network the quickest.
Eigenvector Centrality ("Close to Influencers")
Lastly, eigenvector centrality (sometimes called eigenvalue centrality) measures how close members of a network are to other influential users. In the example below, Alice and Raphael have the highest eigenvector centrality scores because they are connected to other influential people, including each other and Frederica and Bob. Also, although Aldo has as many connections as Bob and Frederica, he has a lower score because he is further removed from Alice, who is the most well-connected member of the network. In terms of the Company's website, if a person with a high eigenvector centrality score broadcasts information in the network, it would reach other well-connected people in the shortest amount of time.
Using Python's NetworkX library, we computed centrality measures for every member of the network. Later in this blog, we will provide some examples of how we presented this information in the Flask app to provide the most useful insights for the client.
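To make the four measures concrete, here is a minimal NetworkX sketch on a toy graph. The graph is hypothetical (two fully connected groups joined by a single "bridge" member) and is not the diagram or the data from this project; it is chosen so that the member with the fewest connections still scores highest on betweenness and closeness.

```python
from itertools import combinations

import networkx as nx

# Hypothetical toy network: two tightly knit groups joined by one member.
left = ["a1", "a2", "a3", "a4"]
right = ["b1", "b2", "b3", "b4"]

G = nx.Graph()
G.add_edges_from(combinations(left, 2))   # everyone on the left knows each other
G.add_edges_from(combinations(right, 2))  # same on the right
G.add_edges_from([("a1", "bridge"), ("bridge", "b1")])

# One dictionary of {member: score} per centrality measure.
measures = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}


def top_member(scores):
    """Return the member with the highest score (first one on ties)."""
    return max(scores, key=scores.get)


for name, scores in measures.items():
    print(f"{name:12s} top member: {top_member(scores)}")
```

In this toy graph, "bridge" has only two connections, yet every shortest path between the two groups passes through it, so it leads on betweenness and closeness while "a1" and "b1" lead on degree and eigenvector centrality.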
The second part of our project consisted of the development of a recommendation system. Before we built the recommendation system, however, we performed a series of exploratory data analyses to get a better idea of the activity of users on the website that will inform our recommendation system. The following are some examples of analyses we performed.
Exploratory Data Analysis of Various Music Features
To get a sense of how the music ratings were distributed, we created the violin plot below. Members of the website can rate songs from 1 (a "miss") through 5 (a "hit"). The graph shows that more songs were rated a 5 than those on the lower end of the ratings spectrum, indicating that the ratings on the website tended to skew positive.
The scatterplot below shows the rating of each user (x-axis) for each song (y-axis) on the Company's website. Each color indicates the rating that the user gave. One can see that 4 (purple) and 5 (beige) are the most frequent ratings for many users.
On the Company's website, users can also save songs on their profile to listen to later. These saved songs are not visible to other users. The bar graph below shows the number of songs saved (y-axis) for each listener (x-axis). It doesn't appear that many users have saved music on the website yet.
The Nuts and Bolts of Building a Recommendation System
Recommendation systems are a relatively recent innovation in data science that allows software to “learn” about the user so that the user can better enjoy a digital product. A recommendation system is a sub-class of information filtering systems that seeks to predict a user’s preference for an item. Spotify, Pandora, and other digital music services have found ways to improve their product by personalizing the user experience based on factors such as user history, the similarity between songs, and the similarity between users.
For our client, we laid the foundation for improving user experience by building a recommendation system based on collaborative filtering, which makes predictions of a person's preferences based on the preferences of similar users or the similarity of other items. An important assumption of collaborative filtering is that users with similar preferences in the past are likely to have similar preferences in the future. The collaborative filtering algorithm that we built is based on three features: user ratings, user spotlights of music, and music saves. It involves the creation of a probabilistic model in which each feature plays a role in influencing future recommendations. The features we used can be further categorized into explicit feedback and implicit feedback, each category requiring a different model.
For our recommendation system, the explicit feedback we used was user ratings. A user's rating is considered explicit feedback because it directly expresses the preference of a user for a particular song. It is generally viewed as reliable because it provides transparency in the recommendation process. To predict a rating that a user will give to a song, we first constructed a ratings matrix with users as rows, songs as columns, and music ratings as entries, illustrated in the diagram below. (Please note that the ratings are not binary but integers from 1 through 5.)
To predict a user's rating for a song, we used a matrix factorization technique called "singular value decomposition" ("SVD"). In our case, SVD decomposes the ratings matrix described previously into three components: 1) latent factors, 2) the users' relationship to the latent factors, and 3) the songs' relationship to the latent factors. The SVD algorithm then proceeds to minimize, through an iterative process such as gradient descent, the aggregate difference between the observed ratings and the predicted ratings for every user/song pair. The algorithm can also incorporate a learning rate and a penalty term. The final output of the algorithm is a predicted rating for each user/song pair from 1 through 5. The following diagram and equation provide a general summary of the process.
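As a rough sketch of the iterative procedure described above, the code below factorizes a tiny ratings table with stochastic gradient descent, using a learning rate and an L2 penalty. It is a simplified, pure-Python stand-in written for illustration: the ratings, the hyperparameter values, and the choice to predict the global mean plus a dot product of latent factors are all assumptions, not the project's actual implementation.

```python
import math
import random

# Hypothetical (user, song, rating) triples with ratings from 1 to 5.
ratings = [
    ("u1", "s1", 5), ("u1", "s2", 4), ("u1", "s3", 1),
    ("u2", "s1", 4), ("u2", "s2", 5), ("u2", "s4", 2),
    ("u3", "s3", 5), ("u3", "s4", 4), ("u3", "s1", 1),
    ("u4", "s3", 4), ("u4", "s4", 5),
]


def train(ratings, n_factors=2, lr=0.02, reg=0.02, n_epochs=300, seed=0):
    """Learn latent factors for users (P) and songs (Q) by gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    songs = sorted({s for _, s, _ in ratings})
    mean = sum(r for _, _, r in ratings) / len(ratings)
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {s: [rng.gauss(0, 0.1) for _ in range(n_factors)] for s in songs}
    for _ in range(n_epochs):
        for u, s, r in ratings:
            pred = mean + sum(p * q for p, q in zip(P[u], Q[s]))
            err = r - pred  # observed minus predicted rating
            for k in range(n_factors):
                p_k, q_k = P[u][k], Q[s][k]
                # Step along the error gradient, shrunk by the penalty term.
                P[u][k] += lr * (err * q_k - reg * p_k)
                Q[s][k] += lr * (err * p_k - reg * q_k)
    return mean, P, Q


def predict(model, user, song):
    """Predicted rating for a user/song pair, clipped to the 1-5 scale."""
    mean, P, Q = model
    raw = mean + sum(p * q for p, q in zip(P[user], Q[song]))
    return min(5.0, max(1.0, raw))


def rmse(model, ratings):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(model, u, s)) ** 2 for u, s, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


model = train(ratings)
print("training RMSE:", round(rmse(model, ratings), 3))
print("predicted rating for u4 on s1:", round(predict(model, "u4", "s1"), 2))
```

The last line predicts a rating for a user/song pair that does not appear in the training data, which is the situation a recommender cares about.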
Implicit feedback is an indirect indication of a user's preference for an item. In our case, implicit feedback was based upon two inputs: what music users spotlighted and what music users saved. Implicit feedback systems are useful because they avoid potential user bias by looking at how users behave rather than at what they claim to desire. For each implicit feedback feature that we considered, we constructed a sparse matrix of user IDs (rows) and songs (columns) with each matrix entry a 1 or 0, depending on whether the user did or did not save/spotlight the song, respectively. For our recommendation system, we used item-based filtering, meaning that we compared the similarity of two items or songs. The similarity statistic was computed using the Jaccard similarity (the complement of the Jaccard distance), a decimal measure that ranges from 0 (least similar) to 1 (most similar). The Jaccard similarity is defined by the following formula, where A and B are the sets of users who saved/spotlighted each of two different songs:
J(A, B) = |A ∩ B| / |A ∪ B|
For example, if we are comparing two songs based on the spotlight feature, then the similarity of songs A and B is defined as the number of users who spotlighted both A and B divided by the number of users who spotlighted A or B. So if a network has 10 listeners, and five of the listeners spotlighted both A and B, and the other five listeners spotlighted A or B, but not both, then the Jaccard similarity is 5/10 = 0.50.
The Jaccard similarities for both implicit feedback features are then averaged to arrive at an implicit score for each song. The diagram below provides a summary of the process, where the output is a sorted list of songs based on the Jaccard score for a certain user:
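The Jaccard similarity in the worked example above can be reproduced in a few lines of Python. In this sketch each song is represented by the set of user IDs who spotlighted it; the user IDs and the split of the five non-overlapping listeners (three for A, two for B) are made up for illustration.

```python
def jaccard_similarity(users_a, users_b):
    """Share of users who interacted with both songs among those who
    interacted with at least one of them (0 = disjoint, 1 = identical)."""
    union = users_a | users_b
    if not union:
        return 0.0  # neither song has any interactions yet
    return len(users_a & users_b) / len(union)


# Ten hypothetical listeners: users 1-5 spotlighted both songs,
# users 6-8 spotlighted only A, and users 9-10 spotlighted only B.
spotlighted_a = {1, 2, 3, 4, 5, 6, 7, 8}
spotlighted_b = {1, 2, 3, 4, 5, 9, 10}

similarity = jaccard_similarity(spotlighted_a, spotlighted_b)
print(similarity)  # 5 shared users out of 10 in total gives 0.5
```

The same function applies unchanged to the save feature, since a row of the 0/1 matrix carries the same information as a set of user IDs.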
Hybrid Collaborative System: Putting It All Together
The scores from the explicit and implicit feedback systems were combined to form a single score for each user/song pair. The explicit feedback scores were weighted more heavily as the scores ranged from 1 through 5 whereas each implicit feedback score ranged from 0 through 1. This makes intuitive sense as one would like to place more emphasis on feedback that is directly related to a user's preference for an item. Ultimately the song recommendations will tend to improve overall user experience of the website by exposing users to music that they will likely enjoy. The following is a summary of the hybrid collaborative system with a specific user as an example.
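To make the combination step concrete, the sketch below adds a predicted rating (1 through 5) to the average of the two implicit scores (each 0 through 1). The exact combination rule is not spelled out in this post, so the simple sum used here is an assumption, and the candidate songs and scores are hypothetical.

```python
def hybrid_score(predicted_rating, spotlight_score, save_score):
    """Combine explicit and implicit feedback into a single score.

    Assumption: the implicit score is the mean of the two similarity-based
    scores, and it is simply added to the predicted rating. Because ratings
    span 1-5 and implicit scores span 0-1, explicit feedback dominates.
    """
    implicit = (spotlight_score + save_score) / 2
    return predicted_rating + implicit


# Hypothetical candidates for one user:
# song -> (predicted rating, spotlight score, save score)
candidates = {
    "song_a": (4.2, 0.10, 0.00),
    "song_b": (3.9, 0.80, 0.60),
    "song_c": (2.5, 1.00, 1.00),
}

ranked = sorted(
    candidates,
    key=lambda song: hybrid_score(*candidates[song]),
    reverse=True,
)
print(ranked)
```

Here strong implicit signals lift song_b above song_a despite its slightly lower predicted rating, while song_c stays last because implicit feedback alone cannot make up for a low predicted rating.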
The last part of this project involved the development of an application programming interface ("API") to demonstrate our work. This completed our data pipeline, which started with data generated by users on the website, flowed to the client's database server, was extracted into our Jupyter notebooks/RStudio, and finally returned to the web. The following flow chart summarizes the overall process:
The Flask app allowed the client to query our analyses by specifying various parameters. So for example, the previous social network graphs could be displayed based on the specification of a connection type. In addition, social network graphs can be displayed for each user (ego graphs). The following graph shows the followers of user 27 based on 1 degree of separation (direct connection).
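An ego graph of direct followers like the one described above could be extracted along these lines. This is a sketch with invented (follower, followed) pairs and a hypothetical helper name, `follower_ego_graph`; NetworkX also offers `nx.ego_graph` for radius-based neighborhoods.

```python
import networkx as nx

# Hypothetical (follower, followed) pairs; an edge points at the followed member.
follows = nx.DiGraph()
follows.add_edges_from([(3, 27), (8, 27), (27, 1), (8, 1), (5, 3)])


def follower_ego_graph(graph, user_id):
    """Subgraph made of a user and everyone who directly follows that user."""
    followers = set(graph.predecessors(user_id))
    return graph.subgraph(followers | {user_id}).copy()


ego = follower_ego_graph(follows, 27)
print(sorted(ego.nodes))
print(sorted(ego.edges))
```

Members that user 27 follows (member 1 in this toy data) are left out on purpose, because only incoming "follows" edges count as followers.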
If one would like to know the centrality scores and rankings of a specific user across all types of connections, then one would need to specify the user ID number and the following tables will be returned:
It should be noted that the ranking is based on the ranking among either artists or music fans. In the example above, user 27 is an artist whose centrality rankings are relative to other artists.
If one would like to know the top N artists or top N music fans for a certain connection type and centrality measure, then one would need to enter these parameters and tables of the following form will be returned:
The example above shows the top three music fans with the most connections based on the follower/followed connection type.
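The "top N within a group" lookup can be sketched as a small ranking function. The scores, the role labels, and the function name `top_n` below are hypothetical; the point is only that artists and music fans are ranked separately, as described above.

```python
def top_n(scores, roles, role, n=3):
    """Rank members of one group (for example "artist" or "fan") by score.

    Returns a list of (rank, user_id, score) tuples, best first.
    """
    members = [
        (user, score) for user, score in scores.items() if roles[user] == role
    ]
    members.sort(key=lambda pair: pair[1], reverse=True)
    return [
        (rank, user, score)
        for rank, (user, score) in enumerate(members[:n], start=1)
    ]


# Hypothetical degree centrality scores for the follower/followed connection.
degree_scores = {27: 0.40, 1: 0.55, 3: 0.25, 5: 0.30, 8: 0.10}
roles = {27: "artist", 1: "artist", 3: "fan", 5: "fan", 8: "fan"}

print(top_n(degree_scores, roles, "fan", n=2))
print(top_n(degree_scores, roles, "artist", n=2))
```

Filtering by role before ranking means an artist's rank is always relative to other artists, even when a music fan has a higher raw score.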
Finally, one can enter a user ID and get the top ten song recommendations for that user. The following is an example:
Overall, our data science consulting engagement was an immense success because our client found our work to be useful in laying the groundwork for understanding the website's users and improving their experience. The project was a success in another sense: the team taught itself and created social network analyses, a recommendation system, and a Flask app in about three weeks, with no prior experience in any of these areas. When the client attended our graduation ceremony to hand us his own certificates of appreciation, it capped a rewarding and productive three-month experience at the NYC Data Science Academy.