Data on Amper Music and Icelandic Indie Audio - Discrimination-Generation Feedback Loop
N.b.: This post pertains to a project completed in December 2019. The write-up and presentation corresponding to this post are available in this GitHub repository.
Abstract
In this study, music descriptor classification models were trained and tested on music data generated via Amper Music's system. These models, as well as tempo/key induction and KMeans-based Laplacian form segmentation algorithms, were applied to a compilation of Icelandic indie tracks that, for the most part, represented electronica and indie rock genres currently supported by Amper.
In turn, the resulting analyses for each input track were employed to generate single-region and multi-region ("stitched") renders corresponding to that track. A given region was assigned one of the most frequently occurring descriptor classifications.
In the case of multi-region renders, the region lengths corresponded to the durations of induced form segments. Subsequently, descriptor classification models were applied to single-region and multi-region renders. While the classification results of the former were evaluated for accuracy, those of the latter were compared with the input track classifications to determine consistency/stability.
Introduction
Amper Music
Amper Music is an AI-driven music software company specializing in the generation of flexible royalty-free stock music for content creators (e.g., filmmakers, game designers, podcast producers, etc.).
All of Amper's instrumental samples are professionally performed, recorded, and produced. Music categories are defined in terms of descriptors. A descriptor consists of a genre, subgenre, mood, and sentiment (negative, neutral, positive), and is defined in terms of the musical, instrumental, and audio (digital signal processing, or DSP) constraints (represented as tags) to which it is mapped.
A team of asset creators continuously researches trending musical styles and generates data conforming to the stylistic norms of the given descriptor. Currently, Amper supports seven unique genres, ranging from folk to hip-hop, each containing between one and four subgenres.
Users may engage with the Amper ecosystem through two means: Score, a web-based app, and the public API. Within the Score app, the user is asked to specify a duration for a given render (or may upload a video or audio file from which the duration may be extracted), select a desired genre, subgenre, and mood, and choose from a list of "bands" (groups of instruments), which may be previewed prior to rendering.
In addition, the user may adjust the lengths of the introduction and ending, and the location of the climax within the segment. Within seconds, a render complying with user specifications is generated via the Composer, a generative (rule-based) algorithm engine, and Inferno, the audio rendering engine. Instrumentation, tempo, key, structure (intro/climax/outro placement), and mood may be adjusted, enabling the user to iteratively generate and audition variations while remaining within viable descriptor boundaries. Preferred renders may be downloaded and archived.
Machine Listening: Use-Cases and Motivations
Although Amper's music generation system employs symbolic, as opposed to machine learning, techniques (for a variety of reasons), and is fueled by human-generated data--an expert system, so to speak--there have recently arisen use-cases that would best be addressed by applying machine learning--or, more accurately, machine listening--models.
One objective is to provide the asset creation team with a classification toolkit that would facilitate descriptor research (e.g., confirming tempo, tonic, formal analysis, and other musical characteristics), A/B testing, QC, measuring relative descriptor "distances," and so on. As the scale and complexity of the database increases (as well as the demand for new descriptors), such a time-saving toolkit will become increasingly necessary.
Another pertains to end-user interaction. A few current users have requested the ability to generate renders without specifying instrumentation or clicking through the genre-subgenre-mood tree, along with a more convenient method of exploring the Amper descriptor space. As such, various efforts are underway to introduce more flexibility to the Score input model. A longstanding (and longer-term) goal has been to enable the user to upload a reference track, and for the system to return a render similar in nature with respect to style, structure, and instrumentation.
Project Methodology: A High-Level View - Pt. 1
Achieving the implementation of the reference track paradigm would require not only the integration of source separation, instrument recognition, style classification, tempo/key induction, and form segmentation, but also determining reasonable criteria for input/output similarity.
As shall become evident throughout this paper, deriving such criteria is not a straightforward process, since a) the ultimate concern is generation (as opposed to discrimination), b) the Amper system is trained to produce style/use-case-specific music adhering to a unique (and evolving) set of quality standards, and c) there is in general a broad range of "acceptable" musical output for a given input (as compared to the domains of machine translation, recommender systems, or even image generation, for instance).
Project Methodology: A High-Level View - Pt. 2
As is indicated in the diagram below, the project workflow consisted of the following stages:
1) Training data was generated via the Amper render server, based upon a dictionary of active descriptors.
2) Data acoustic feature extraction was performed.
3) Descriptor classification models were trained on the data.
4) The trained models were applied to the external dataset (in this case, an Icelandic indie compilation), after removal of vocal tracks (Amper renders contain neither melodic/vocal lines nor lyrics).
5) Tempo and key induction models, as well as an (unsupervised) KMeans-based form segmentation model were run on the Icelandic indie tracks.
6) The output from these models was merged and binned.
7) The merged model predictions were used to generate:
- a) single-region renders (i.e., renders of the same duration as the reference tracks, each assigned one of the most frequently occurring descriptors, tempi, and keys output by the classification models);
- b) multi-region renders (i.e., renders of the same duration as the reference tracks, each divided into multiple regions whose lengths correspond to those of the input track's induced formal segments. Each region is assigned the most frequently occurring descriptor, tempo, and key output by the classification models for the corresponding formal segment in the input track).
8) In turn, classifiers were applied to both the single-region and multi-region renders (a "dog-fooding" procedure).
9) Classification results for single-region renders were evaluated for accuracy (as a sanity check/validation of model performance).
10) Classification results for multi-region renders were compared to those of the original input tracks.
11) Insights obtained from steps (9) and (10) were used to inform subsequent data collection, model selection, and hyperparameter tweaking.
EDA: Amper Training Data
In the interest of creating a dataset that was balanced amongst classes, and heterogeneous within each class, 33 renders of each active descriptor were generated at each of three durations: 10, 30, and 60 seconds, thus yielding 99 examples in total for each of the 204 active descriptors. And, in the interest of model robustness, a random tempo and instrumentation were selected for each render. In total, 20,196 examples were generated in the most recent batch.
Each descriptor is identified by an internal name key, and assigned values for genre, subgenre, mood, sentiment (negative, positive, or neutral), and a unique ID. E.g.: "cinematic_percussion_primal_pulsing": ["powerful", "neutral", 46, "cinematic", "percussion"]
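For illustration, here is a minimal sketch of how one entry of such a dictionary might be unpacked into labeled fields; the field order (mood, sentiment, ID, genre, subgenre) is inferred from the example above, and the dictionary and function names are hypothetical:

```python
# Hypothetical excerpt of the active descriptor dictionary; the field order
# (mood, sentiment, id, genre, subgenre) is inferred from the example entry above.
ACTIVE_DESCRIPTORS = {
    "cinematic_percussion_primal_pulsing": ["powerful", "neutral", 46, "cinematic", "percussion"],
}

def unpack_descriptor(name, values):
    """Convert one raw dictionary entry into a labeled record."""
    mood, sentiment, descriptor_id, genre, subgenre = values
    return {
        "name": name,
        "genre": genre,
        "subgenre": subgenre,
        "mood": mood,
        "sentiment": sentiment,
        "id": descriptor_id,
    }

records = [unpack_descriptor(k, v) for k, v in ACTIVE_DESCRIPTORS.items()]
```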
The graph below indicates the distribution of moods per genre/subgenre combination. It is evident that moods are far from uniformly distributed, with cinematic_ambient associated with the greatest number of moods:
With respect to sentiment, there is a clear imbalance within the dataset: of the 204 unique genre/subgenre/mood combinations, 104 are "positive," 51 are "neutral," and 49 are "negative." The ramifications of this skewed distribution will be addressed below.
Feature Extraction - Pt. 1
After experimenting with different quantities and combinations of acoustic features, the set was limited to 21 salient parameters, comprising three Mel frequency cepstral coefficients (MFCCs), three principal component analysis (PCA) coefficients, and 15 standard spectral and temporal features. Feature extraction was executed via the LibROSA Python audio/music analysis package. Each audio file in the dataset was analyzed in 10 second windows with a two second stride. As such, the most recently-generated post-extraction input CSV consists of ca. 240,000 rows.
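As a rough illustration of the windowed extraction described above, here is a minimal LibROSA sketch; it reproduces the 10-second window / 2-second stride scheme but only a hypothetical subset of the 21 features (the project's exact feature list and PCA step are not shown):

```python
import numpy as np
import librosa

WINDOW_S, STRIDE_S = 10.0, 2.0  # 10-second analysis windows, 2-second stride

def extract_windowed_features(path):
    """Return one feature row per 10 s window (2 s stride) of an audio file."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    win, hop = int(WINDOW_S * sr), int(STRIDE_S * sr)
    rows = []
    for start in range(0, max(1, len(y) - win + 1), hop):
        chunk = y[start:start + win]
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=3).mean(axis=1)
        centroid = librosa.feature.spectral_centroid(y=chunk, sr=sr).mean()
        rolloff = librosa.feature.spectral_rolloff(y=chunk, sr=sr).mean()
        zcr = librosa.feature.zero_crossing_rate(chunk).mean()
        tempo = librosa.beat.tempo(y=chunk, sr=sr)[0]
        rows.append(np.hstack([mfcc, centroid, rolloff, zcr, tempo]))
    return np.vstack(rows)
```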
Given the highly subjective nature of mood selection/assignment, and the varying degrees of proximity of moods to one another, two hand-engineered ordinal categorical features associated with each mood--valence and momentum--were then assigned to each respective descriptor. Initially, the valence scale consisted of four levels, while the momentum scale consisted of five (each level corresponding to relative activity level and event density).
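To make this concrete, here is a minimal sketch of attaching such ordinal features to the data with pandas; the mood-to-level mappings shown are hypothetical placeholders, not the project's actual assignments:

```python
import pandas as pd

# Hypothetical mood-to-ordinal mappings (illustrative values only):
# valence initially spans four levels (0-3) and momentum five levels (0-4).
VALENCE = {"idle": 1, "powerful": 2, "pulsing": 3}
MOMENTUM = {"idle": 0, "powerful": 3, "pulsing": 4}

df = pd.DataFrame({"mood": ["powerful", "idle", "pulsing"]})
df["valence"] = df["mood"].map(VALENCE)
df["momentum"] = df["mood"].map(MOMENTUM)
```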
The bar graphs below illustrate relationships amongst sentiment, valence and momentum within the dataset:
Feature Extraction - Pt. 2
Given valence distribution imbalances, it was determined that valence levels 0 and 1 should be merged, thereby collapsing the valence scale into three levels. Here are the same relationships post-rescaling:
Although valence, like sentiment, has three levels, the two features are not redundant: they exhibit a moderately strong positive correlation rather than a one-to-one correspondence. Likewise, momentum is not simply determined by either sentiment or valence.
As tempo is amongst the extracted features, it is possible to examine the relationship between tempo and genre, subgenre, sentiment, valence, and momentum. As a baseline: the minimum (induced) tempo in the dataset is ca. 49 BPM, the maximum is 234, the mean is 120, and the median is ca. 118.
Feature Extraction - Pt. 3
In the box plot below, tempo ranges for each genre/subgenre combination at each respective sentiment are depicted:
Most genre/subgenre combinations have median tempi of ca. 120 BPM. Hiphop_orchestral (positive sentiment) has the highest median tempo, and rock_indie (negative sentiment) the lowest. There is no clear strong correlation between tempo and sentiment.
Here are the ranges for each valence and momentum level, again segmented into sentiment categories:
The highest valence and momentum levels have slightly higher median tempi than the other levels, but the correlation with sentiment is likewise unclear. There is no obvious linear increase in median tempo/tempo minimum and maximum across the remaining valence and momentum levels.
In all three graphs, it is evident that there are generally more outliers at slower than faster tempi. (It is quite possible that the extreme upper range tempi represent miscalculations on the part of the LibROSA tempo-tracking function, as the maximum allowable tempo is 200 BPM). These observations indicate that sentiment, valence, and momentum are potentially meaningful features, in that distinctions amongst levels cannot simply be explained by tempo differences. Conversely, tempo proves to be a valid feature for the same reason.
Modeling
Descriptor, Sentiment, Valence, and Momentum Classification
As may be inferred from the above graphs, the data is hierarchical in nature. That is to say: each genre is connected to (the "parent" of) a subset of subgenres, and each subgenre to a subset of moods. By the same token, conditioning valence and momentum classification on sentiment would in principle improve model performance. As such, it was decided that: 1) binary (one-vs.-rest, or OVR) classifiers be applied to the top-level targets of genre and sentiment, and 2) lower-level target classifiers ("children") be conditioned on their respective "parents" (i.e., subgenre given genre, mood given subgenre, valence given sentiment, momentum given sentiment).
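The sketch below illustrates one way this conditioning scheme could be set up with scikit-learn, assuming a feature matrix X (NumPy array) aligned with a labels DataFrame that has a default integer index; only the genre/subgenre and sentiment/valence branches are shown, and the structure is illustrative rather than the project's exact code:

```python
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

def fit_hierarchical(X, labels: pd.DataFrame):
    """Fit OVR classifiers for the top-level targets, plus child classifiers
    trained only on the rows belonging to each value of their parent target."""
    models = {
        "genre": OneVsRestClassifier(KNeighborsClassifier()).fit(X, labels["genre"]),
        "sentiment": OneVsRestClassifier(KNeighborsClassifier()).fit(X, labels["sentiment"]),
        "subgenre": {},  # one classifier per genre
        "valence": {},   # one classifier per sentiment
    }
    for genre, idx in labels.groupby("genre").groups.items():
        models["subgenre"][genre] = KNeighborsClassifier().fit(
            X[idx], labels.loc[idx, "subgenre"])
    for sentiment, idx in labels.groupby("sentiment").groups.items():
        models["valence"][sentiment] = KNeighborsClassifier().fit(
            X[idx], labels.loc[idx, "valence"])
    return models
```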
KNN vs. MLP
After testing several models, it was determined that K-Nearest Neighbors (KNN) and Multi-Layer Perceptron (MLP) models demonstrated superior performance. For the MLP, a maximum of 800 iterations was necessary for the model to converge. Otherwise, all default scikit-learn hyperparameter settings (number of layers, optimizer, learning rate, etc.) were maintained.
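For reference, here is a minimal sketch of the two configurations (scikit-learn defaults everywhere except the MLP's iteration cap), with synthetic stand-in data in place of the 21 extracted acoustic features, and including the stratified 70/30 split described below:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real feature matrix (21 acoustic features).
X, y = make_classification(n_samples=500, n_features=21, n_informative=10,
                           n_classes=3, random_state=0)

# 70/30 split, stratified so that every class appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

knn = KNeighborsClassifier().fit(X_train, y_train)        # scikit-learn defaults
mlp = MLPClassifier(max_iter=800).fit(X_train, y_train)   # defaults, except max_iter

print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
print("MLP accuracy:", accuracy_score(y_test, mlp.predict(X_test)))
```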
For both models, a train/test split of 70/30 was selected, and the split was stratified to ensure that all classes would be represented in the test set. The full classification reports are available upon request, but here are the average accuracy scores:
Target | KNN | MLP |
---|---|---|
Binary Genre | 0.999 | 0.990 |
Binary Sentiment | 0.992 | 0.924 |
Subgenre | 0.530 | 0.535 |
Mood | 0.983 | 0.972 |
Valence | 0.995 | 0.973 |
Momentum | 0.995 | 0.943 |
From the summary statistics, there are two obvious results: 1) the KNN model's performance is superior to that of the MLP; and 2) accuracy scores are quite high for all targets except subgenre (the reasons for which remain unknown).
External Test Tracks: Icelandic Indie Compilation
To date, numerous non-Amper tracks have been tested. In the scope of this project, I have focused on the compilation This is Icelandic Indie Music, Vol. 1 (Record Records, 2013), as well as the track "Crystals" off of Of Monsters and Men's album Beneath the Skin (Republic Records, 2015). These selections were made for two reasons: 1) the tracks included mostly represent genres supported by Amper (with two notable exceptions: one soul and one reggae track); and 2) convenience (i.e., they were in my iTunes library). It should be noted that these tracks were not labeled with relevant metadata.
As was mentioned above, Amper renders do not contain melodic lines or vocals. In order to avoid introducing the human voice as a confounding factor, I stripped the tracks of vocals using the Deezer Spleeter source separation library. No further modifications were made to these tracks.
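For reference, vocal removal of this kind can be done with Spleeter's Python API roughly as follows (file paths are placeholders); the 2-stems model splits each track into a vocals stem and an accompaniment stem, and only the latter would be retained:

```python
from spleeter.separator import Separator

# The 2-stems model separates each input into "vocals" and "accompaniment".
separator = Separator('spleeter:2stems')

# Placeholder paths: one subfolder per input track is written to output/,
# containing vocals.wav and accompaniment.wav (only the accompaniment is kept).
separator.separate_to_file('input/crystals.mp3', 'output/')
```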
Tempo, Tonic, Form, and Model Integration
Tempo and Tonic
For tempo and tonic (root note) induction, LibROSA utilities were employed. In the case of tempo, onsets were detected and static tempi inferred for each 10-second window. For the tonic, a chroma CQT (constant-Q transform) algorithm was applied, such that the most prominent chromatic pitch for each 10-second window would be returned. In both cases, a stride of 2 seconds was selected (for the sake of consistency with the descriptor classification output).
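A minimal sketch of this per-window estimation with LibROSA might look as follows (the file path is a placeholder, and taking the strongest average chroma bin as the tonic is a simplification of whatever aggregation the project actually used):

```python
import librosa

PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def tempo_and_tonic(y, sr):
    """Estimate a static tempo and the most prominent chroma pitch for one window."""
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    tempo = librosa.beat.tempo(onset_envelope=onset_env, sr=sr)[0]
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)      # shape: (12, frames)
    tonic = PITCHES[int(chroma.mean(axis=1).argmax())]   # strongest average bin
    return tempo, tonic

y, sr = librosa.load('input/crystals_accompaniment.wav', sr=22050)
win, hop = 10 * sr, 2 * sr                               # 10 s windows, 2 s stride
estimates = [tempo_and_tonic(y[s:s + win], sr) for s in range(0, len(y) - win, hop)]
```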
Below is an example of a CQT plot for an input track with a quite stable tonic:
Form Segmentation
A Laplacian segmentation algorithm, which detects formal boundaries based upon analysis of the audio signal, was applied to perform a task that is otherwise subjective and ambiguous. (What cues signal the beginning of a new section? Change of texture, instrumentation, key, mood, onset density...?) For this procedure, in which onsets are clustered into segments via a KMeans clustering model, it is necessary to supply a value of k.
But how does one select such a value? And when applying the model to multiple tracks, is it not possible that each will consist of a different number of discrete formal segments?
In order to avoid arbitrary decision-making, I applied silhouette analysis--essentially a "grid search" procedure for KMeans--to select a k value for each respective input soundfile. The range of possible k values was constrained by lower and upper limits derived from the soundfile's duration. (For instance: one would not expect a 10-second file to consist of 5 sections, and for a 4-minute file to consist of one or two sections would constitute a trivial application of form segmentation).
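The k-selection step (not the full Laplacian procedure) can be sketched as follows with scikit-learn, given per-window feature vectors for one soundfile; the duration-dependent bounds function is a hypothetical placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def k_bounds(duration_s):
    """Hypothetical duration-dependent limits on the number of form segments."""
    lo = 2
    hi = max(lo + 1, min(8, int(duration_s // 30)))
    return lo, hi

def select_k(features, duration_s):
    """Pick the k within the bounds that maximizes the mean silhouette score."""
    lo, hi = k_bounds(duration_s)
    scores = {}
    for k in range(lo, hi + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get)
```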
Below is an example of a form segmentation diagram, in which an introduction, main material, interlude (in which the introduction material or texture returns) and ending are inferred:
Here is a somewhat more complex example, with four form-segment types:
Merging and Reduction
Once all models had been applied to the input tracks, the model output labels for each 2-second increment were merged, as is depicted below:
In the interest of 1) observing dominant classification trends for each inferred formal segment of each track, and 2) reshaping the data in such a way that it would be conducive to creating a timeline from which renders could be generated (details to follow), I created a CSV consisting of the most frequently occurring targets (genre, subgenre, mood, tempo, and key) for each form segment, and calculated each segment's duration. Furthermore, through consulting the previously mentioned active descriptor dictionary, I created a column for the descriptor names corresponding to the identified genres, subgenres, and moods (a necessary step for timeline generation):
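A minimal pandas sketch of this reduction, assuming the merged two-second-increment predictions carry a form-segment label per row (column names are illustrative, not the project's actual schema):

```python
import pandas as pd

def reduce_per_segment(merged: pd.DataFrame) -> pd.DataFrame:
    """Most frequent prediction per target for each form segment, plus an
    approximate segment duration (2 s stride per row)."""
    mode = lambda s: s.mode().iloc[0]  # most frequently occurring value
    reduced = merged.groupby('segment').agg(
        genre=('genre', mode),
        subgenre=('subgenre', mode),
        mood=('mood', mode),
        tempo=('tempo', mode),
        key=('key', mode),
        duration_s=('segment', lambda s: 2 * len(s)),
    ).reset_index()
    # A descriptor-name column would then be added by looking up each
    # (genre, subgenre, mood) triple in the active descriptor dictionary.
    return reduced
```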
Classifier Model Selection
Although the KNN model outperformed the MLP on Amper test data, the MLP tended to yield more accurate genre and subgenre classifications for the Icelandic indie data. This became evident once the above reduction was generated for both classification models.
In general, the style predictions of the MLP were more on-target. More specifically, when any of the classification models tested was unable to make an accurate prediction, and/or became too sensitive to a dominant drum track, it would default to "cinematic_percussion," a genre/subgenre combination lacking any instruments that generate precise pitches.
For this dataset, the KNN output "cinematic_percussion" often enough for it to become a top-ranked candidate for a number of formal segments. This was never the case for the MLP. With the exception of the occasional brief drum fill, the tracks themselves never exhibited any purely percussive passages.
Render Generation - Pt. 1
As was explained in the Introduction, two sets of renders were generated for each input track, using values contained in the "majority_rules_mlp_icelandic_indie" CSV above: 1) single-region renders, each representing one of the top-ranked descriptors, tempi, and tonics for the given track; 2) multi-region (or "stitched") renders, the number of regions being determined by the number of induced formal segments. In both cases, instrumentation was randomized, for the purpose of producing heterogeneous output.
As no single render is an ideal representation of a given descriptor, each descriptor in the single-region case was rendered at each top-ranked tempo in each top-ranked key, while in the multi-region case, ten renders were generated for each input track (each with a different instrumentation).
Render Generation - Pt. 2
In order to create a render, it is necessary to enter data in a timeline JSON format, as depicted below:
"timeline": {
"spans": [
{
"actions": [
{
"time": 0,
"add_region": {
"key": {
"tonic": "C"
},
"id": 222,
"cut_at": 48,
"descriptor": "documentary_idiophonic_idle"
}
}],
"time": 0,
"tempo": 136.0,
"id": 111,
"type": "metered",
"instrument_groups": [{"instrument_group": "grand_piano"}]
},
{
"time": 344,
"type": "unmetered"
}
]
}
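As an illustration of how such a payload could be assembled from the reduced per-segment rows, here is a hypothetical helper; field names follow the example above, but span/region IDs, instrument_groups, and fields whose semantics are not documented in this post (e.g., cut_at) are omitted or simplified:

```python
import json

def build_timeline(segments):
    """Assemble a multi-region timeline from reduced per-segment rows.

    Each item in `segments` is a dict with keys: descriptor, tempo, key,
    duration_s (illustrative names from the reduction step above).
    """
    spans, t = [], 0
    for seg in segments:
        spans.append({
            "time": t,
            "tempo": float(seg["tempo"]),
            "type": "metered",
            "actions": [{
                "time": t,
                "add_region": {
                    "descriptor": seg["descriptor"],
                    "key": {"tonic": seg["key"]},
                },
            }],
        })
        t += seg["duration_s"]
    spans.append({"time": t, "type": "unmetered"})  # terminating, unmetered span
    return {"timeline": {"spans": spans}}

payload = build_timeline([
    {"descriptor": "documentary_idiophonic_idle", "tempo": 136.0, "key": "C", "duration_s": 48},
])
print(json.dumps(payload, indent=2))
```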
In the figure below, the original input track โCrystalsโ has been superimposed on a corresponding multi-region render. As one may observe, the region boundaries line up exactly (i.e., with significant shifts in amplitude reflecting changes in texture/activity levels):
Click here to listen to the original track (without vocals), and here for the corresponding multi-region render.
Single-Region Render Classification and Multi-Region Render Comparison - Pt. 1
In the following table are the accuracy scores for the single-region renders:
Classifier | Accuracy Score |
---|---|
Genre | 0.897 |
Subgenre | 0.858 |
Mood | 0.715 |
Sentiment | 0.821 |
Valence | 0.705 |
Momentum | 0.752 |
For the multi-region renders, the most frequently occurring value of each target variable for each formal segment, amongst each set of ten examples, was compared to the corresponding value for the respective input track. The table below reports the mean agreement scores for multi-region renders across titles (i.e., the mean fraction of formal segments for which the render and input track classifications match).
Classifier | Mean Agreement Score |
---|---|
Genre | 0.900 |
Subgenre | 0.855 |
Mood | 0.755 |
Sentiment | 0.635 |
Valence | 0.686 |
Momentum | 0.413 |
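The comparison itself can be sketched as follows (data structures are illustrative): for each formal segment, the majority label across the ten renders is checked against the input track's label, and the per-title score is the fraction of segments that match:

```python
from collections import Counter

def segment_agreement(input_labels, render_labels):
    """Fraction of formal segments whose majority label across renders
    matches the input track's label for that segment.

    input_labels: one label per formal segment of the input track.
    render_labels: list of lists; render_labels[r][s] is the label assigned
    to segment s of render r.
    """
    matches = 0
    for s, truth in enumerate(input_labels):
        votes = Counter(render[s] for render in render_labels)
        majority = votes.most_common(1)[0][0]
        matches += int(majority == truth)
    return matches / len(input_labels)

# Toy example: a 4-segment track with three hypothetical renders.
score = segment_agreement(
    ["rock", "rock", "electronic", "electronic"],
    [["rock", "rock", "rock", "rock"],
     ["rock", "pop", "electronic", "rock"],
     ["rock", "rock", "electronic", "folk"]],
)
print(score)  # 0.75 in this toy example
```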
Single-Region Render Classification and Multi-Region Render Comparison - Pt. 2
Agreement scores by title for each target variable:

Title | Genre | Subgenre | Mood | Sentiment | Valence | Momentum |
---|---|---|---|---|---|---|
agent_fresco | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 0.0 |
crystals | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.5 |
ensimi | 0.75 | 0.75 | 0.5 | 1.0 | 0.75 | 0.5 |
fm_belfast | 0.5 | 0.5 | 0.5 | 0.0 | 0.5 | 0.0 |
kirayama_family | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
lionheart | 1.0 | 1.0 | 1.0 | 1.0 | 0.556 | 0.778 |
lockerbie | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 |
mummut | 1.0 | 0.8 | 0.8 | 0.6 | 0.8 | 0.6 |
ojba_rasta | 0.75 | 0.75 | 0.25 | 1.0 | 0.25 | 0.75 |
retro_stefson | 1.0 | 0.75 | 0.5 | 0.75 | 0.5 | 0.5 |
Single-Region Render Classification and Multi-Region Render Comparison - Pt. 3
For both the single-region and multi-region renders, scores for genre and subgenre are reasonably high and comparable. Mood scores are relatively low in both cases. While sentiment classification remains in a reasonable range in the former, it is surprisingly low in the latter. Most alarming is the low momentum score for multi-region renders.
As may be observed from the multi-region score distributions, there is a high degree of variance amongst titles across all target variables except genre and subgenre. This variance, as well as the relatively low mean scores, can be explained by several factors, but one crucial factor is the sentiment class imbalance mentioned in the EDA section. Sentiment misclassifications would in turn influence valence and momentum classification accuracy, as the latter are conditioned on the former. Another consideration is the multi-region render sample size. It would be worth repeating the comparison procedure with a greater number of multi-region renders per input track.
Conclusion and Future Work on Amper Music and Icelandic Indie Audio Data
This project constituted the initial phase in an endeavor to generate Amper renders based upon feature extraction and classification of music data "from the wild." Through fusing descriptor classification, tempo and tonic induction, and form segmentation models, it was possible to generate single-region and multi-region renders that embodied structural and stylistic characteristics of the input tracks. When presented with the renders, the classification models produced generally accurate predictions, with a few notable exceptions. The reasons for prediction anomalies are currently under investigation.
In the future, it would first and foremost be of interest to train the classifiers on more Amper data. (The resource limitations of render time and competing server requests impose perennial constraints on dataset size.) Along similar lines, the intention is to expand from the sandbox of the Icelandic indie compilation to the Free Music Archive (FMA) dataset, which would be advantageous given both its size and its accompanying metadata. As distances amongst genres, subgenres, and moods are not equal, establishing a descriptor space distance metric would be instructive in establishing a classifier weighting scheme.
These weights could be learned and would inform optimization. In addition, such a scheme would allow for an "UNK" (unknown) class for the genre, subgenre, and mood target variables. Eventually, descriptor classification and instrument recognition models will be integrated, and will likely fall into a symbiotic relationship--especially given the crucial role that instrumentation plays in differentiating descriptors from one another. Prior to these developments, however, a web-based classification/generation prototype app will be launched and tested within the company, as a means of validating and gathering feedback on the current state of the project.
Acknowledgements
The author wishes to acknowledge Cole Ingraham and Nate Moon for their efforts, insights, and support, and Adam Gardner for his pioneering contributions.