Music recognition was one of the first everyday applications of artificial intelligence. Music recognition software belongs to the first or second wave of AI technology in the 2010s, which also saw the development of well-known voice assistants such as Siri or Alexa. It therefore corresponds to ‘weak AI’ or Artificial Narrow Intelligence (ANI), which has been used to support everyday processes such as listening to music. In this part of our series on AI in the music industry, we look at how music recognition and music recommendation work in music streaming services.

AI in the Music Industry – Part 4: AI in Music Recognition and Recommendation

Shazam’s music recognition, as impressive as it may seem, was not even originally based on artificial intelligence, as Jovan Jovanovic explains in a very detailed article on the expert platform Toptal.[1] Back in 2003, Shazam co-founder Avery Wang published a paper describing the fingerprinting algorithm. Put simply, the analogue sound wave is converted into a digital signal whose frequency and amplitude can be measured and clearly defined at any point in time. A discrete Fourier transform is used to convert the signal from the time domain into the frequency domain, which is much easier to handle mathematically. Instead of the time dimension, only the frequencies contained in a song and their magnitudes are visible; this is what makes a song musically distinctive and recognisable. The resulting frequency pattern is used to create a fingerprint of each song, which can then be compared against a song database to uniquely identify the track. Of course, the whole process is far more complicated than described here, but this outlines the basic principle.
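
To make the idea more concrete, here is a minimal Python sketch of this kind of fingerprinting, assuming NumPy and SciPy are available. The window size, peak selection and hashing are illustrative simplifications, not Shazam’s actual parameters or constellation algorithm:

```python
import numpy as np
from scipy.signal import spectrogram

def fingerprint(samples, sample_rate=44100):
    """Greatly simplified audio fingerprint: find the dominant frequency
    in each short time window and hash pairs of neighbouring peaks.
    Illustrative only, not Shazam's actual algorithm."""
    # Short-time Fourier transform: time domain -> frequency domain
    freqs, times, spec = spectrogram(samples, fs=sample_rate, nperseg=4096)
    # Dominant frequency bin per time window
    peaks = np.argmax(spec, axis=0)
    hashes = set()
    for i in range(len(peaks) - 1):
        # Pair each peak with its successor so the hash does not depend
        # on where in the song the snippet starts
        hashes.add(hash((int(peaks[i]), int(peaks[i + 1]))))
    return hashes

def best_match(snippet_hashes, database):
    """Identify the song whose stored fingerprint shares the most
    hashes with the recorded snippet."""
    return max(database, key=lambda song: len(snippet_hashes & database[song]))
```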

AI comes into play when pieces of music not only need to be recognised but also categorised. In a 2009 article, Tao Li et al. described various applications of machine learning for the collection and processing of music data.[2] A first area of application is the categorisation of individual pieces of music into genres. The first step is to extract the characteristics of a piece of music, which does not yet require the use of AI, as we have already seen in the case of Shazam. However, the next step, assigning the song to a genre, does require artificial intelligence. This involves supervised learning processes that are carried out using labelled data; the data is labelled in such a way that music genres can be distinguished from each other. Statistical methods such as Gaussian Mixture Models (GMM), Nearest Neighbour Classification, Linear Discriminant Analysis (LDA) or Support Vector Machines (SVM) are used.[3] When a new song is added to the database, the AI system recognises similarities and assigns the song to a genre. The more songs the AI has already analysed, the more reliable the genre assignment becomes.
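
As an illustration, here is a minimal sketch of such a supervised genre classifier using scikit-learn’s SVM implementation. The feature vectors and genre labels are randomly generated stand-ins for real, human-labelled audio features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature vector per song (e.g. averaged spectral
# features) and a genre label assigned by a human annotator.
X = np.random.rand(500, 20)          # 500 songs, 20 audio features each
y = np.random.choice(["rock", "jazz", "classical"], size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Supervised learning on labelled data: the SVM learns boundaries
# between the genres from the labelled examples.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# A new song is assigned to the genre whose examples it resembles most.
new_song = np.random.rand(1, 20)
print(clf.predict(new_song))
```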

A second application of AI music recognition is to identify emotions in music and categorise them into emotion clusters. In principle, this classification is very similar to genre categorisation. A database needs to be created in which music tracks are labelled according to emotional categories such as ‘dreamy’, ‘happy’ or ‘sad’. The same statistical methods are used to create the clusters to which new tracks can be added.[4]

Identifying an artist’s musical style is much more difficult. Subjective judgements, which are not easily reproducible, play a large role. To make matters worse, not only the compositional style but also the lyrics have an influence on the artistic style. To account for this, Chen & Chen propose a binomial cluster algorithm that uses numerous parameters to determine style for both variables (composition and lyrics). In particular, multivariate statistical methods are used, which cannot be explained in detail here. In any case, the aim is to be able to assign a new song to one of the style clusters created by the AI system.[5]
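
The following is only a rough sketch of the general idea, not Chen & Chen’s actual algorithm: composition features and lyric features (here random placeholders) are combined and clustered, and a new song is then assigned to the nearest style cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features: one block describes the composition (tempo,
# harmony, etc.), the other the lyrics (e.g. word statistics).
composition_features = np.random.rand(200, 10)
lyric_features = np.random.rand(200, 15)

# Treat both variables jointly by concatenating the feature blocks.
X = np.hstack([composition_features, lyric_features])

# Unsupervised clustering groups the songs into style clusters.
styles = KMeans(n_clusters=5, n_init=10).fit(X)

# A new song is assigned to the nearest existing style cluster.
new_song = np.hstack([np.random.rand(1, 10), np.random.rand(1, 15)])
print(styles.predict(new_song))
```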

From a purely technical point of view, the step from music recognition to music recommendation is not very big. However, it requires usage data to make appropriate recommendations to music consumers. Research into music recommendation dates back to the early 2000s[6] and has exploded since then, especially after music streaming became the most important music distribution channel.

A music recommendation system requires three components: (1) the user, (2) the track and (3) the algorithm to find the perfect track for the user. To achieve this, as much user data as possible is needed. This can be obtained from the user’s personality profile, which consists of demographic (age, gender, marital status, education level, etc.), psychographic (views, opinions, needs, etc.) and geographical characteristics (place of residence, country/city, distance from urban centres, etc.). All this information can be obtained either through explicit feedback such as star ratings or likes from users, or through implicit feedback derived indirectly from user behaviour. The second component, the piece of music, is described on the one hand by the metadata, i.e. title, artist, composer, lyricist, genre, release date, etc., and on the other hand by the acoustic properties of the piece, such as volume and frequency. Finally, there is the algorithm, which works with the available data and also processes feedback from new data in order to generate suitable music suggestions for the user.[7]
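
A schematic sketch of the first two components as simple Python data structures may help; the field names are illustrative and not taken from any particular system:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    # Demographic, psychographic and geographical characteristics
    age: int
    country: str
    interests: list[str]
    # Explicit feedback (ratings, likes) and implicit feedback
    # (e.g. play counts) derived from user behaviour
    ratings: dict[str, float] = field(default_factory=dict)
    play_counts: dict[str, int] = field(default_factory=dict)

@dataclass
class Track:
    # Metadata describing the piece of music
    title: str
    artist: str
    genre: str
    release_year: int
    # Acoustic properties, e.g. loudness and frequency features
    features: list[float] = field(default_factory=list)

# The third component, the algorithm, connects the two; the basic
# approaches (collaborative and content-based filtering) follow below.
user = UserProfile(age=29, country="AT", interests=["indie"],
                   ratings={"song_a": 5.0}, play_counts={"song_a": 12})
track = Track(title="song_a", artist="Some Band", genre="indie",
              release_year=2021, features=[0.7, 0.4, 0.9])
```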

There are basically two methods on which music recommendation systems are based: collaborative filtering and content-based filtering. The term collaborative filtering first appeared in a 1992 article describing ‘Tapestry’, an email system from the Xerox Palo Alto Research Center that could distinguish between important and unimportant emails – an early form of spam filtering. For the ‘Tapestry’ developers, collaborative filtering means “(…) that people collaborate to help one another perform filtering by recording their reactions to documents they read.”[8] The underlying assumption of collaborative filtering is that two people who listen to the same songs are also likely to want to hear the songs the other listens to that they do not yet share. Applied to large amounts of data, this basic principle makes it increasingly likely that people who listen to the same music actually have similar tastes. Ultimately, collaborative filtering is about training the algorithm with input data so that it can make the most accurate predictions possible about a person’s music preferences. Su & Khoshgoftaar identify the following basic algorithmic techniques that enable collaborative filtering, which can be further technically differentiated:[9]

  • Memory-based filtering calculates the distances between the data collected and tries to identify similar users or products, as in the case of the online retailer Amazon, which suggests to its users: “You might also be interested in this”. The similarities can be determined using various mathematical-statistical methods and then used to predict neighbouring data points. This can also be used to generate top lists that best match the user’s own behaviour (see the sketch below this list).[10]
  • Model-based filtering uses machine learning based on Bayesian models, statistical clustering or dependency networks to identify complex structures and correlations in the data. This is already a learning algorithm that can adapt to new data to improve its recommendations.[11]
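
As announced above, here is a minimal sketch of memory-based collaborative filtering: a toy user–item matrix of play counts, user-to-user cosine similarity, and a prediction for a song the user has not yet heard. The numbers are invented for illustration:

```python
import numpy as np

# Hypothetical user-item matrix: rows are users, columns are songs,
# entries are play counts (0 = never listened).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def predict(user_idx, item_idx):
    """Memory-based prediction: weight the other users' values for this
    song by how similar their listening history is to this user's."""
    others = [u for u in range(R.shape[0]) if u != user_idx]
    sims = np.array([cosine_similarity(R[user_idx], R[u]) for u in others])
    values = np.array([R[u, item_idx] for u in others])
    return sims @ values / (sims.sum() + 1e-9)

# Estimated interest of user 0 in the third song, which they never played.
print(predict(user_idx=0, item_idx=2))
```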

Content-based filtering gathers the characteristics of a product, such as a song, and links them to the user’s preferences and needs. This method is also about recognising usage patterns to make predictions about future usage. A distinction is made between low-level and high-level filtering. Low-level filtering uses only the metadata of a song, such as title, artist, composer/author, etc., while high-level filtering also includes acoustic characteristics such as tempo, pitch, volume and instrumentation in the analysis.[12]
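
A minimal sketch of content-based filtering under these assumptions: each song is represented by a small, made-up feature vector, the user’s taste profile is the average of the songs they liked, and candidates are ranked by their similarity to that profile:

```python
import numpy as np

# Hypothetical high-level feature vectors: tempo, pitch, loudness,
# instrumentation score (all normalised to the range 0..1).
catalogue = {
    "song_a": np.array([0.8, 0.4, 0.9, 0.2]),
    "song_b": np.array([0.3, 0.7, 0.2, 0.9]),
    "song_c": np.array([0.7, 0.5, 0.8, 0.3]),
}

# The user's taste profile is the average of the tracks they liked.
liked = ["song_a"]
profile = np.mean([catalogue[s] for s in liked], axis=0)

def score(track_vec, profile_vec):
    """Content-based score: cosine similarity between a track's
    features and the user's taste profile."""
    return track_vec @ profile_vec / (
        np.linalg.norm(track_vec) * np.linalg.norm(profile_vec) + 1e-9)

recommendations = sorted(
    (s for s in catalogue if s not in liked),
    key=lambda s: score(catalogue[s], profile),
    reverse=True)
print(recommendations)  # song_c resembles song_a more closely than song_b
```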

In addition to the two basic forms – collaborative filtering and content-based filtering – there are other methods, such as hybrid collaborative filtering, which combines the advantages of collaborative and content-based filtering, or emotion-based filtering, which attempts to differentiate emotional states on the basis of large amounts of data and derive music consumption behaviour from this. Finally, there is context-based filtering, which gathers published opinions and information about the music tracks, their artists or genres to derive predictions about user behaviour. All these music recommendation algorithms can be further developed using the artificial intelligence methods we have already discussed, such as artificial neural networks (ANN), recurrent neural networks (RNN) and convolutional neural networks (CNN), i.e. deep learning AI.
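
How a hybrid system might combine the two basic approaches can be sketched in a single function; the weighting below is a tuning parameter chosen purely for illustration, not a value from the literature:

```python
def hybrid_score(cf_score, content_score, alpha=0.6):
    """One simple way to combine the two approaches: a weighted sum of
    the collaborative-filtering score and the content-based score."""
    return alpha * cf_score + (1 - alpha) * content_score

# A track that scores well on either signal can still be recommended.
print(hybrid_score(cf_score=0.2, content_score=0.9))
```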


Endnotes

[1] Jovan Jovanovic, “How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing”, Toptal Blog, n.d., accessed: 2024-02-01.

[2] Tao Li et al., 2009, “Machine Learning Approaches for Music Information Retrieval”, in: Meng Joo Er & Yi Zhou (eds.), Theory and Novel Applications of Machine Learning, Vienna: I-Tech, pp 259-278.

[3] Ibid., pp 261-263.

[4] Ibid., pp 263-264.

[5] Ibid., pp 264-269.

[6] Worth mentioning are the articles of Hung-Chen Chen & Arbee L.P. Chen, 2001, “A music recommendation system based on music data grouping and user interests”, CIKM ’01: Proceedings of the 10th International Conference on Information and Knowledge Management, October 2001, pp 231-238; Alexandra L. Uitdenbogerd & Ron G. van Schyndel, 2002, “A Review of Factors Affecting Music Recommender Success”, Proceedings of the ISMIR 2002, 3rd International Conference on Music Information Retrieval, Paris, pp 204-208 and John Platt et al., 2002, “Learning a Gaussian Process Prior for Automatically Generating Music Playlists”, Advances in Neural Information Processing Systems, vol 14, pp 1425-1432.

[7] See Dushani Perera et al., 2020, “A Critical Analysis of Music Recommendation Systems and New Perspectives”, in: Human Interaction, Emerging Technologies and Future Applications II, Proceedings of the 2nd International Conference on Human Interaction and Emerging Technologies: Future Applications (IHIET – AI 2020), April 23-25, 2020 in Lausanne, pp 82-87.

[8] David Goldberg et al., 1992, “Collaborative Filtering to Weave an Information Tapestry”, Communications of the ACM, vol 35(12), pp 61-70.

[9] A detailed overview of the three basic techniques of collaborative filtering, including a consideration of the advantages and disadvantages of each technique, can be found in Xiaoyuan Su & Taghi M. Khoshgoftaar, 2009, “A Survey of Collaborative Filtering Techniques”, Advances in Artificial Intelligence, article ID 421425, p 3, https://doi.org/10.1155/2009/421425.

[10] Ibid., pp 5-8.

[11] Ibid., pp 8-11.

[12] See Dushani Perera et al., 2020, “A Critical Analysis of Music Recommendation Systems and New Perspectives”, in: Human Interaction, Emerging Technologies and Future Applications II, Proceedings of the 2nd International Conference on Human Interaction and Emerging Technologies: Future Applications (IHIET – AI 2020), April 23-25, 2020 in Lausanne, p 85.
