Clustering Artworks by AI-Quantified Visual Qualities & Content Recommendation App
By Gabriel del Valle
10/08/24
www.linkedin.com/in/gabrielxdelvalle
Introduction
One of the most transformative ways that Data Science has brought value to major web platforms like Amazon and Spotify is by modeling user preferences to make recommendations that users find relevant. For content-driven platforms like TikTok or Meta, the quality of the recommendation algorithm is a core business strategy for attracting and retaining users, as well as keeping them engaged minute to minute.
I wanted to practice modeling user preferences for content recommendation, so for my R-focused project I designed an interactive Shiny app where users can like or dislike the artworks they are presented with.
Try the app yourself! https://gabrielxdelvalle.shinyapps.io/algo_gallery/
Problem Framing and Web Scraping
Web scraping public domain modern artworks provided me with a large dataset of very diverse content which users could feel a preference for at a glance. In the case of a large web platform with many users, comparing user profiles' likes and dislikes would be sufficient data to build a nuanced model of user preferences. Large platforms can group profiles that have liked and disliked the same content, and when one profile in the group likes a post the others haven't seen, that post makes a good recommendation for the others.
However, I didn't have a large amount of user engagement data to use for this project. That's why I planned this project around generating data from the visual qualities of the artworks themselves, distinguishing a user's preferences based on the subjective content of the artworks they liked and disliked. OpenAI's CLIP image classification tool made this possible, allowing me to define a set of categories on which to rate each artwork. In the Category Selection section of this blog post, I'll detail how I chose my set of variables from the endless possible combinations of categories, and how I measured their performance at describing the subjective qualities of artworks.
Purpose
Using image classification to quantify subjective qualities of artworks is not a typical approach for user preference models. It would be computationally expensive to scale to a large content platform and likely would capture less subjective nuance than simply comparing user profiles.
Despite its limitations, in the context of this project the image score approach allowed me to apply content recommendation strategies that would also apply to user profile comparisons, such as calibrating user preference in as few clicks as possible by first presenting the user with content that captures the diversity of the whole collection, and assessing the accuracy and bias of predictions. For content platforms like TikTok or Meta, which compete for users' attention, delivering relevant content to new users as quickly as possible is a critical strategy.
Value
In a content recommendation context, while not ideal as the primary method, visual content analysis could complement user-profile comparisons to capture the aesthetic nuance of what users prefer.
Analysis of users' visual preferences may also have applications in e-commerce and marketing.
Example 1: A targeted ad is likely to be more effective at attracting a user's attention if it aligns with their sense of aesthetics.
Example 2: A business that sells sneakers makes suggestions on its website mostly by functional categories, such as shoes for running, skateboarding, and basketball. But shoppers may have aesthetic preferences that supersede functional categories; for instance, they may like a shape that certain basketball and skateboard shoes have in common. Image classification scores can easily be applied to a few thousand products and stored in a product dataset, allowing the sneaker business in this example to compare the aesthetics of the individual products each shopper took interest in, informing both product recommendations and internal data on which styles are popular.
App Logic & Strategies
To calibrate user preferences to the diverse range of content in as few clicks as possible, I computed a set of 20 artworks that captured the widest spread of variance in the dataset, presented to the user as the first set of images to like or dislike. To do this I applied k-means clustering, an unsupervised machine learning technique which groups observations based on their distance in n-dimensional space, allowing me to capture the diversity of all the artworks from the centers of the clusters, as well as from the minimum and maximum ends of the PCA dimensions.
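A minimal sketch of how such a calibration set could be assembled, assuming a hypothetical data frame `scores` with one row of CLIP scores per artwork (the app's actual selection logic may differ in detail):

```r
set.seed(42)
k <- 5
km <- kmeans(scores, centers = k, nstart = 25)

# For each cluster, take the artwork closest to its center
closest_to_center <- sapply(seq_len(k), function(i) {
  members <- which(km$cluster == i)
  dists <- apply(scores[members, ], 1, function(row) {
    sqrt(sum((row - km$centers[i, ])^2))
  })
  members[which.min(dists)]
})

# Add the extremes of the first two principal components
pca <- prcomp(scores, scale. = TRUE)
extremes <- c(which.min(pca$x[, 1]), which.max(pca$x[, 1]),
              which.min(pca$x[, 2]), which.max(pca$x[, 2]))

calibration_set <- unique(c(closest_to_center, extremes))
```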
The prediction algorithm uses a cosine similarity matrix to compare each new artwork presented to the user against all the images they have previously interacted with, and predicts a like or dislike depending on whether the new image is most similar to a previously liked or disliked image. Cosine similarity was very effective at identifying similar images from their image scores, and it makes these comparisons very quickly by taking advantage of R's vector and matrix operations. A prediction is made in real time, almost instantly, for each new random artwork queried after a like or dislike. This aspect of the project showcased the advantage of using R wherever quick, lightweight vector and matrix calculations are needed.
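As an illustration, a nearest-neighbor prediction of this kind could look like the following sketch, where `history`, `labels`, and `new_img` are hypothetical names for the matrix of the user's rated score vectors, their like/dislike labels, and the newly queried artwork's score vector:

```r
cosine_sim <- function(A, b) {
  # Row-wise cosine similarity between matrix A and vector b,
  # computed with a single matrix product for speed
  as.vector(A %*% b) / (sqrt(rowSums(A^2)) * sqrt(sum(b^2)))
}

predict_preference <- function(history, labels, new_img) {
  sims <- cosine_sim(history, new_img)
  labels[which.max(sims)]  # label of the most similar rated artwork
}
```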
The next strategy I employed which is relevant to content recommendation is checking each prediction against the user's true preference for accuracy, and also identifying bias in the algorithm toward either likes or dislikes that would invalidate it. (For example, if the algorithm predicts the user will like every image and the user tends to like most artworks, prediction accuracy is artificially inflated, since the model never predicts dislikes.)
To track prediction accuracy the app plots a line graph where the x axis is the number of interactions and the y axis is correct predictions / total interactions. To identify bias the app also plots the interaction type (like or dislike) of the incorrect predictions. Rather than try to actively recommend content to the user in this app, I decided presenting users with random artworks after the initial calibration would be more statistically valid for assessing the accuracy.
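A minimal sketch of this bookkeeping, assuming hypothetical parallel vectors `predicted` and `actual` of "like"/"dislike" values that grow with each interaction:

```r
# Running accuracy: correct predictions / total interactions at each step
correct <- predicted == actual
running_accuracy <- cumsum(correct) / seq_along(correct)
plot(running_accuracy, type = "l",
     xlab = "Interactions", ylab = "Correct / total")

# Bias check: tally the true interaction types among the incorrect predictions.
# A heavy skew toward one type suggests the model cannot predict the other.
table(actual[!correct])
```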
The algorithm's accuracy consistently hovers around 60% after 10 to 100 predictions and tends to improve as interactions accumulate, often with an equal ratio of likes to dislikes among the incorrect predictions, indicating no bias. This makes the predictions only slightly better than random, and thus in need of improvement.
Potential Improvements: Nuances of Single-label vs Multi-label Classification
The accuracy of the algorithm could be improved if the image scores could be made to capture more of the subjective nuance of each artwork. This could be done by using a multi-label classification approach rather than the single-label classification approach I used for this project.
When a single-label approach is used for multiple categories with CLIP, the image scores for each category are dependent and sum to 1. This limited the number of categories I could viably rate the images on. With the exception of the broad stylistic categories Impressionist, Many_Colors, and Highly_Detailed, I selected subject categories rather than stylistic categories, which would be widely applicable. Using a multi-label classification approach, on the other hand, would allow a much larger list of categories, since their scores would be calculated independently. This could allow more subtlety to be described.
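A toy example with made-up logits illustrates the difference between the two normalization schemes (a softmax coupling the scores in the single-label case, an independent sigmoid per category in the multi-label case):

```r
# Hypothetical raw logits for one image, three categories
logits <- c(Impressionist = 2.1, Landscape = 1.8, Abstract = -0.5)

# Single-label: softmax couples the scores so they sum to 1
softmax <- exp(logits) / sum(exp(logits))
round(softmax, 3)   # 0.551 0.408 0.041 -- raising one score must lower the rest

# Multi-label: an independent sigmoid per category; scores need not sum to 1
sigmoid <- 1 / (1 + exp(-logits))
round(sigmoid, 3)   # 0.891 0.858 0.378 -- each category uses its full range
```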
Using single-label classification in this project had potential benefits for making k-means cluster models more stable by reducing the variance in the range of scores and the number of dimensions. This approach also allowed me to practice comparing the relative performance of different sets of categories and be precise in identifying categories which were effective. However, the single-label approach limited the effectiveness of k-means clustering by limiting the extent to which each category could be expressed to signify each artwork's unique qualities. Even without adding more categories, the multi-label classification approach would allow each category to represent a wider range of distinct values.
Because the category scores for each image sum to 1 in single-label classification, each image's score vector is effectively normalized, making approaches that use Euclidean distance, like k-means clustering, less effective, since the magnitudes of the vectors have been scaled down into a limited range.
By contrast, cosine similarity worked well with this data because it measures the angle between vectors, making it unaffected by changes in vector magnitude, unlike methods that rely on distance.
So for a future version of this project, multi-label classification could not only increase the number of variables each artwork is assessed by, but also improve the effectiveness of identifying similar artworks using Euclidean distance and k-means clusters. An improved cluster model would not only provide better calibration images, but could also be used to nuance predictions by comparing both the angle of vectors (with cosine similarity) and the magnitude of each vector.
Category Selection
Image classification with OpenAI’s CLIP was conducted in Python. Besides web scraping and use of CLIP, all other analysis was performed in R.
Here are some examples which showcase the scores assigned to different images. I’ve highlighted the top 3 scores in each of these for easier comparison.
This image is a rare example where scores are close to even across the board:
In this last example we would hope to see Human_Subject scored higher than Animal_Subject, but more abstract styles challenge CLIP. It will often assign an artwork a high Impressionist or Landscape score if the subject is difficult to identify or obscured by the artwork's texture or style. This observation is supported by the PCA graph, which shows the Impressionist and Landscape vectors dominating the left quadrants, while identifiable subjects like humans, architecture, animals, and ornament are all in the right quadrants.
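Sampling an artwork and inspecting its highest scores can be done with a few lines of R; a sketch, again assuming a hypothetical `scores` data frame (rows = artworks, columns = categories):

```r
# Pull the top 3 category scores for a randomly sampled artwork
top3 <- function(row) sort(unlist(row), decreasing = TRUE)[1:3]
top3(scores[sample(nrow(scores), 1), ])
```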
This process of sampling individual images, comparing them to their scores, and considering how each added category changed the overall quality of the scores was how I first began identifying a set of variables that would be effective with CLIP. Through this initial process I identified the following 8 variables as consistently performing well at describing a wide range of artworks:
- Impressionist
- Landscape
- Abstract
- Human_Subject
- Animal_Subject
- Architecture_Subject
- Ornamental_Pattern
- Highly_Detailed
I also tried using Plant_Subject, Organic_Subject, and Flower_Subject, each with and without the Landscape category (and in various other combinations), but found these categories could not properly identify their subjects. Rather, Landscape picks up on the presence of plant life, so that a painting of just a flower will have a high Landscape score.
After identifying the 8 strong categories, I knew I wanted to add a dimension for color, but was debating between three categories that could apply: Many_Colors, Color_Contrast, and Minimalist. The impact of these three variables would be hard to differentiate by comparison to sampled artworks alone. I recognized it would be valuable to find a method to compare sets of variables systematically and identify better-performing sets with certainty.
Since my intention was to use the image scores to cluster artworks, my (misled!) idea was to evaluate sets of categories based on their cluster model metrics. I created a CLIP dataset combining each of the new test categories (Many_Colors, Color_Contrast, and Minimalist) with the list of strong categories, and created a benchmark to measure against, composed of only the strong categories. I then fit each to multiple cluster models (to take an average and account for random variance between models) and evaluated the following metrics to see if they could give me insight into which set of variables performed best:
This graph of the clusters produced by my Shiny app is a 2D representation of a 9D space, which means that you cannot tell just by looking at the graph whether it is well fit. Data points which appear close in 2D may be much farther apart in n-dimensional space.
Silhouette scores measure model fit by scoring each observation on how well it belongs to its cluster, on a scale from -1 to 1. A score of -1 indicates that a data point is in the wrong cluster, a score of 0 indicates that a data point is on the boundary of two clusters, and a score of 1 indicates that an observation is perfectly fit to its cluster.
Another means of assessing a cluster model's fit is to measure its variance, which can be tuned through the choice of the number of clusters (a sketch computing these metrics follows the list):
WSS: Within Sum of Squares
- variance within clusters
- lower is better
- decreases with more clusters
BSS: Between Sum of Squares
- variance between clusters
- higher is better
- increases with more clusters
TSS: Total Sum of Squares
- total variance of the dataset
- constant for dataset
- determines the feasibility of clustering the dataset
BSS / TSS Ratio
- describes how distinct each cluster is from each other
- higher is better
- increases with more clusters
TSS = WSS + BSS
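For a single k-means fit in R, these metrics can be read directly off the model object; a sketch, using the same hypothetical `scores` data frame:

```r
library(cluster)  # for silhouette()

km <- kmeans(scores, centers = 5, nstart = 25)

wss <- km$tot.withinss   # WSS: lower is better
bss <- km$betweenss      # BSS: higher is better
tss <- km$totss          # TSS: constant for the dataset (WSS + BSS)
ratio <- bss / tss       # how distinct the clusters are from each other

# Average silhouette score, range -1 to 1
sil <- silhouette(km$cluster, dist(scores))
avg_sil <- mean(sil[, "sil_width"])
```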
Although measures of variance improve as the number of clusters increases, so do the chances of overfitting the model and decreasing accuracy. To avoid overfitting, the number of clusters should only be increased while it results in a significant reduction in WSS, and not past that point (the logic of the "elbow" method).
Each time the k-means cluster model is fit (for example, each time the app is run) there is random variance in how the model is fit, and thus also variance in the silhouette scores. The score pictured here of 0.42 is typical of this dataset (indicating a subpar but passable fit). To account for random variance, when measuring metrics for category selection I fit each dataset to a cluster model 6 times and took the average WSS, silhouette score, and BSS/TSS ratio across the sample. At the time, 6 fits per dataset was as many as my computer could handle, but the number of samples could be increased by saving the metrics to a dataset and clearing the load on the system between the generation of each model, rather than doing my analytics on the models directly.
When comparing the different image score datasets based on cluster metrics, 4 clusters with Color_Contrast seems to be the best choice. However, I concluded from this experiment that cluster metrics are not a valid method for choosing the best set of categories to describe the subjective qualities of artworks using CLIP single-label classification. For the app I chose the Many_Colors dataset, and as a result selected 5 clusters.
Cluster models and their metrics mostly capture variance among a combination of variables within a dataset. My idea was based on the assumption that a more effective set of variables for describing artworks would have more coherent scores, therefore less noise, and therefore more consistent variance and better clustering.
However, even if that assumption were true, variance among a set of variables is not the same as the ability of those variables to describe reality (hence the fault in this approach). In fact, the opposite could be true! The better a set of variables is at describing the visual qualities of artworks, the more variance you could expect to find among artworks. For example, if applying multi-label classification to provide a full range of scores for each category (making variables independent, no longer constrained to sum to 1) were proven to be much more effective at distinguishing each artwork, it would be a case of variance increasing.
On the other hand, a better way to judge the effectiveness of the selected categories was PCA graphs, which visualize in 2D how much each variable describes the variance of the dataset. Each vector has a different angle around the origin relative to the dimensional space it helps define (the two axes representing Dim.1 and Dim.2, onto which the true 9D vectors are projected for summary and visualization), and the magnitude of each vector reflects its contribution to the variance of those dimensions (its cos2 value).
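A sketch of how such a variable graph can be produced in R, assuming the hypothetical `scores` data frame and the factoextra package:

```r
library(factoextra)

# PCA on the standardized image scores; fviz_pca_var() draws each variable's
# vector on Dim.1 and Dim.2, colored by its cos2 value
pca <- prcomp(scores, scale. = TRUE)
fviz_pca_var(pca, col.var = "cos2", repel = TRUE)
```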
Beyond this technical explanation, PCA graphs can illuminate the associative relationships between variables. In the context of this project, a strong set of descriptive categories reveals a cohesive spectrum of the kinds of artworks being identified. For example, in the following PCA graph of the app's dataset (Many_Colors with 5 clusters), the following relationships can be observed:
Dim1 (the y axis) is a spectrum from abstract shapes at the top towards living subjects with more identifiable details at the bottom
- Abstract, which as an art category is defined as artworks about no physical or specific subject, is on the opposite end of the spectrum from Animal_Subject and Human_Subject, which are clearly defined subjects. Abstract paintings also often explore color, aligning Abstract closely with Many_Colors
- Architecture is often an exploration of geometric shapes similar to abstract paintings and has less clear rules for distinguishing it from other subjects
- Highly_Detailed shares the lower right quadrant with Animal_Subject and Human_Subject, which have more clearly defined details to distinguish than other subjects
- Ornament is by definition highly detailed, but it is also highly associated with architecture, thus Ornamental_Pattern is found between Architecture_Subject and Highly_Detailed
Dim2 (the x axis) is a spectrum from less identifiable subjects on the left to more identifiable subjects on the right
- Impressionist paintings bend the rules of depicting reality and obscure their subjects by playing with color, texture, and shape
- While landscapes are a readily identifiable subject and their own distinct genre in art, the Landscape category may be interpreted by CLIP to mean any environment with an outdoor setting, any plant subject, or any texture or composition reminiscent of a landscape
- These are the only two variables which describe the left hemisphere of the PCA graph, and what they have in common is that they can be broadly applied to textures which are difficult to distinguish a clear subject from.
- By comparison, all categories on the right are associated with more clearly identified subjects, including Abstract paintings, which have clearly defined shapes rather than textures within which objects are difficult to distinguish
- While landscape paintings of the 17th to 19th centuries are highly detailed and realistic, all of the landscape paintings in this modern art dataset would be considered more impressionistic than realistic
Here we see a cohesive relationship between the orientation of variables in this PCA graph, indicating that the selected categories describe the set of paintings relatively well.
Let's compare this to the PCA of the Color_Contrast model with 4 clusters:
- Overall this graph is less balanced, with Ornamental_Pattern having an outsized cos2 value at the expense of other variables which are made less impactful
- The bottom left quadrant is entirely empty, suggesting that a significant range of the variance is unaccounted for by the variables
- Human_Subject and Architecture_Subject are collinear, with almost the same magnitude, meaning they identify almost exactly the same visual qualities despite being two distinct subjects
Perhaps PCA metrics could be put to better use comparing the effectiveness of sets of categories for image classification. Though an algorithm could not narrativize the graph as I have done here to portray the subjective relationships between variables and dimensions, one could calculate the angular distribution of the PCA vectors to select for the most evenly distributed set of variables.
For each variable's vector to have an even angular distribution, each variable would need to describe a distinct quality in CLIP. This distinction is especially important in the single-label classification approach, where variables are dependent, but would likely matter less with multi-label classification, where a large list of more specific variables could describe more nuanced qualities. With multi-label classification, one could use PCA metrics to prune variables with small cos2 values and ensure no variables share the same angle, as sketched below.
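A sketch of that idea, reusing the `pca` object from the earlier prcomp() call (the 0.2 pruning threshold is arbitrary):

```r
# Variable coordinates on the first two dimensions: loadings scaled by sdev
coords <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

angles <- atan2(coords[, 2], coords[, 1])  # angle of each variable's vector
cos2 <- rowSums(coords^2)                  # quality of representation on Dim.1-2

sort(angles)             # gaps in the sorted angles reveal unevenly covered directions
names(cos2)[cos2 < 0.2]  # candidate variables to prune
```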
Conclusion
This project has exposed me to several data science techniques and strategies relevant to web platforms and businesses. I plan to advance the work I’ve started here to see if I can increase the accuracy of my predictions with the following strategies:
- Use multi-label classification to accommodate a longer list of more nuanced categories, each with fully scaled ranges
- Leverage PCA rather than cluster metrics to optimize category selection
- Nuance the prediction algorithm by taking advantage of the full magnitude of each observation’s vector for distance comparisons, as well as any other patterns which may arise from the use of multi-label classification
Thank you for reading!
If you are interested in this work feel free to connect or reach out on LinkedIn: www.linkedin.com/in/gabrielxdelvalle