Human-AI Partnerships in Generative Music

This research aims to explore relationships between human performers and AI-based NIMEs, through the design of interactive generative music systems. This includes research studies that apply human-computer interaction evaluation and co-creative frameworks to musical performance.

Published on Jun 22, 2022

Author keywords

Artificial Intelligence, Machine Learning, Human-Computer Interaction, Generative Music

Context/theory

In musical performance, machine learning can be seen as a “tool” to support performers, while an AI system can be considered an autonomous “actor” [1]. The distinction in this context is that “machine learning” refers to the technological process of a system learning from data, while “AI” refers to cases in which the algorithm has a significant impact on the system’s behavior.

Creative autonomy refers to independent, creative decision-making by an AI system through its self-evaluation [2]. In a co-creative system [3] such as an interactive generative music system, this includes independent action during a performance with a user. However, the system’s output must remain related to the user’s input so that it is not perceived as random [4]. In this research, I aim to examine how users perceive NIMEs or generative music systems as “tools” or “instruments” versus “AI” or “agents” by determining how creatively autonomous they perceive the systems to be.

Explainable AI techniques are intended to improve user understanding of system behavior [5]. In doing so, they allow users to better understand the differences between models and to make the distinction between “machine learning” and “AI” - as well as “tool” and “actor” - on a system-by-system basis. Explainable AI techniques can be used in musical applications for visualization and user input in order to achieve increased control and understanding [6]. This research explores how users’ understanding of a co-creative system impacts their perceptions of its creative autonomy.

Other frameworks originating in co-creative systems research can be applied to NIMEs. The framework of creative sense-making is characterized by “clamping” and “unclamping,” or gradual deviation from and return to expected outcomes over the course of a creative collaboration session [7]. Mutual theory of mind is the shared ability between collaborative partners to understand each other’s cues as well as the creative task at hand [8]. In conjunction with creative sense-making, collaborators must exhibit a shared understanding of their creative goals in order to intentionally deviate from and return to expectations. Musical improvisation by a group encompasses both of these domains. First, improvisers share responsibility that alternates between partners, resulting in gradual clamping to and unclamping from an original musical goal. Second, ensemble members must understand each other’s cues to trade leading and following roles, and must share a mutual concept of their performance goals. I aim to apply these co-creative frameworks to human-AI musical performance by evaluating how users perceive a system taking control in the creative process, and whether or not that leads them to a satisfying experience.

Research question

This research aims to answer the following broad question:

  • How can the design of an AI affect the way musicians interact with and perceive the interactive music systems they use to create music?

I have divided this goal into four research questions, based on the different creativity frameworks used to evaluate human-AI musical interaction:

  1. How does AI affect user perceptions of creativity and autonomy in a gesture-controlled generative music system?

  2. How does explainability affect user understanding, self-reported expressive ability, and musical satisfaction in a gesture-controlled generative music system?

  3. How can online machine learning enable an interactive generative music system to adapt to user behavior over a single session? Does the inclusion of online machine learning increase users' ratings for anthropomorphism, likeability, and perceived intelligence and creativity?

  4. How do users' perceptions of an AI-based generative music system change across multiple sessions of usage? To what degree do users attribute these changes to increased system familiarity and/or adaptation by the system to them?

Methods

Research study 1: Effects of Deep Neural Networks on the Perceived Creative Autonomy of a Generative Music System

In a 2020 study [9], I aimed to answer research question 1: How does the depth of a neural network affect user perceptions of creativity and autonomy in a gesture-controlled, deep learning-based generative music system? How do users' perceptions of autonomy in a co-creative musical system relate to the level of control they take while collaborating with it, and to their sense of expression while using it?

For this study, I created an AI-based NIME that uses motion-recognizing neural networks to manipulate filter parameters for looping, ambient samples. Users can perform movements that trigger musical changes by the AI.

The system is hosted in a Python application that uses a laptop webcam to capture video and detect motion in real time. One of two models can be used in the system: a shallow linear model or a deeper convolutional model. Input video is frame-subtracted to isolate motion and downsampled to an 8x8 grid of motion information, which serves as input to the models. Both models were trained on a collection of gestures paired with audio parameter changes, and they output gain, frequency, and bandwidth values for two bandpass filters.
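The preprocessing step can be illustrated with a minimal sketch, assuming OpenCV and NumPy; the function and variable names are illustrative, not the study's actual implementation.

```python
import cv2
import numpy as np

def motion_grid(prev_gray, curr_gray, grid_size=8):
    """Frame-subtract two grayscale frames and downsample to a grid_size x grid_size motion map."""
    diff = cv2.absdiff(curr_gray, prev_gray)            # per-pixel motion energy
    grid = cv2.resize(diff, (grid_size, grid_size),     # 8x8 summary of where motion occurred
                      interpolation=cv2.INTER_AREA)
    return grid.astype(np.float32) / 255.0              # normalized input for the shallow/deep models

# Example usage with a laptop webcam.
cap = cv2.VideoCapture(0)
_, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    curr_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    features = motion_grid(prev_gray, curr_gray)        # fed to a model that returns filter parameters
    prev_gray = curr_gray
```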

Video demonstration of the RQ1 generative music system.

Five subjects completed this study, all of whom were music technology students. Each subject reported their experience with machine learning, deep learning, and music, both to establish their background and as a minimum criterion for discussing AI systems. They then practiced with each system version for 5 minutes, followed by a 2-minute recorded performance.

Subjects watched their recordings and rated, on a 7-point Likert scale, how much autonomy the system exhibited and how much creative control it took during their interactions. They also completed a modified version of the Creativity Support Index [10] that uses its individual scales (a user's sense of collaboration, enjoyment, expression, desire to use again, a high standard of output, and human-like output). This modified version also changed the collaboration question's language to refer to collaboration with the system rather than with other people. These changes were made to compare specific areas of subjects' co-creative dynamics with the AI between the two system versions, rather than to compare a composite score against other systems.

Creativity Support Index questionnaire scores, with average responses between subjects for the shallow and deep models on a 7-point scale.

Subjects' ratings for system autonomy, creative control, and user-reported expressive ability for the shallow (S) and deep (D) models respectively, on a 7-point scale.

Because the two versions used identical loops and filter parameters, they achieved the same scores for the questions related to the quality of their musical outputs. However, the version with the deeper network saw higher ratings for control, autonomy, expression, and collaboration.

Subjects reported a variety of creative goals when using the system, which affected their sense of the system's autonomy. This, combined with the fact that some subjects reported difficulty in understanding the system's inner workings, led me to investigate how users' understanding of a system affects their perceptions of its autonomy.

Research study 2: Exploring the Relationship Between Creativity, Autonomy, and Explainability in Machine Learning-Based Generative Music Systems

In a 2021 study, I created a second NIME to examine the effects of explanation, in the form of visualizations, on user understanding, perceived creativity, and trust in a generative music system.

System flow for the generative music system. The outputs of both the pose recognition and motion description models are used to create visual explanations as well as in a set of musical mappings.

A Python implementation of PoseNet [11] is used to capture a human hand's position in a laptop camera's continuous video stream. These x-y coordinates are used to create the first visualization, form the input to the second model, and are sent to Max/MSP to manipulate audio parameters.
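As a hedged sketch of this pipeline, the snippet below forwards hand coordinates to Max/MSP over OSC using the python-osc library. The OSC transport, address pattern, port, and the `estimate_hand_position` helper are assumptions for illustration, not the system's documented interface.

```python
from pythonosc.udp_client import SimpleUDPClient

# Assumed setup: Max/MSP listens with [udpreceive 7400]; the address pattern is illustrative.
client = SimpleUDPClient("127.0.0.1", 7400)

def forward_hand_position(frame):
    # estimate_hand_position is a hypothetical wrapper around the PoseNet inference call,
    # returning normalized x-y coordinates of the tracked hand.
    x, y = estimate_hand_position(frame)
    client.send_message("/hand/xy", [float(x), float(y)])   # drives the audio parameter mappings
```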

A second motion description model, a linear neural network written in PyTorch, then uses the PoseNet output coordinates to describe the motion. Its outputs are a bounding box and a classification of the straightness of the gesture. These are used to create a second visualization, as well as further musical parameter changes.
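A sketch of what such a linear motion-description model might look like in PyTorch is shown below. The window length, layer sizes, and output layout are assumptions; the study's exact architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class MotionDescriptor(nn.Module):
    """Linear model mapping a window of (x, y) hand positions to a bounding box and a straightness score."""
    def __init__(self, window=20):
        super().__init__()
        self.box_head = nn.Linear(window * 2, 4)        # bounding box: x_min, y_min, x_max, y_max
        self.straight_head = nn.Linear(window * 2, 1)   # logit for "straight gesture" vs. not

    def forward(self, coords):                          # coords: (batch, window * 2) flattened positions
        return self.box_head(coords), torch.sigmoid(self.straight_head(coords))

model = MotionDescriptor()
bbox, straightness = model(torch.rand(1, 40))           # one 20-sample gesture window
```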

Visualizations accompanying various versions of the generative music system. A (left) depicts pose estimation results over a 20-sample window, B (center) depicts the motion detection model’s detected bounding box and motion trajectory, and C (right) combines A and B.

The visualizations designed to explain this system display the input and output to the neural network models used to map motion to musical parameters. The pose estimation output (visualization A) is represented by a green line following the user’s hand. The motion description output (visualization B) consists of a rectangular bounding box encompassing the whole gesture, a set of rings following the hand that shrink over time, and a color change from magenta to cyan when a straight line is detected. Because the pose estimation output is used as input to the motion description model, A represents the input to the description model while B represents the output. Visualization C combines A and B, displaying the input and output of the two models.
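To make the composition of the visualizations concrete, here is a minimal OpenCV drawing sketch. The colors (in BGR order), line widths, and the omission of the shrinking rings are simplifications and assumptions rather than the system's actual rendering code.

```python
import cv2

def draw_explanation(frame, trail, bbox, is_straight):
    """trail: list of integer (x, y) hand positions; bbox: (x_min, y_min, x_max, y_max)."""
    # Visualization A: green line tracing the recent hand positions (the pose model's output).
    for p0, p1 in zip(trail, trail[1:]):
        cv2.line(frame, p0, p1, (0, 255, 0), 2)
    # Visualization B: bounding box around the gesture, colored by the straightness classification
    # (cyan when a straight line is detected, magenta otherwise).
    color = (255, 255, 0) if is_straight else (255, 0, 255)
    x_min, y_min, x_max, y_max = bbox
    cv2.rectangle(frame, (x_min, y_min), (x_max, y_max), color, 2)
    return frame   # Visualization C is the combination of both overlays on one frame
```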

Mappings between model output and musical parameters.

This system maps the outputs of each model to musical parameters in a Max/MSP patch. The x-y coordinates of the points found by the pose estimation model are used in one-to-one mappings to filters on the bass and drum loops. The motion description model's outputs, a bounding box and a gesture straightness classification, are used in two one-to-many mappings: roughness and rate. Rate controls the musical subdivision of random notes generated by the pitched droplet sound effect, which acts as a melodic component. Roughness controls the gain and resonance of a series of independent bandpass filters applied to the bass, synthesizer, and melodic droplet loops.
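The mapping logic can be summarized with a small sketch. The parameter names, OSC-style addresses, and scaling ranges below are illustrative assumptions; only the one-to-one versus one-to-many structure reflects the description above.

```python
def map_to_parameters(x, y, bbox, is_straight):
    """x, y: normalized hand position; bbox: (x_min, y_min, x_max, y_max); is_straight: bool."""
    x_min, y_min, x_max, y_max = bbox
    area = max((x_max - x_min) * (y_max - y_min), 0.0)    # rough measure of gesture size

    return {
        # One-to-one: hand position -> filters on the bass and drum loops
        "/bass/filter_freq": 200 + x * 1800,
        "/drums/filter_freq": 200 + y * 1800,
        # One-to-many: motion description -> rate (melodic subdivision) and roughness
        "/droplet/rate": 1 if is_straight else int(1 + area * 15),
        "/roughness/amount": area,                        # scales gain/resonance of several bandpass filters
    }
```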

Video demonstration of the RQ2 generative music system.

This study followed a similar protocol to the 2020 study. Each of the 15 subjects, 10 of whom were music technology students, completed the same pre-questionnaire as before. They also completed the guided performance task using versions A, B, and C of the system in random order. In addition to the control and autonomy ratings and Creativity Support questions, the Explanation Satisfaction Scale [12] and a Trust survey [13] were added to the post-questionnaire.

Average user ratings for each version of the system, using individual scales of the Creativity Support Index.

Average user ratings for Explainability per system version, determined by the Explanation Satisfaction Scale questions in the post-questionnaire.

Average user ratings for Trust per system version, as determined by individual questions in an affective human-computer interaction survey.

Our users rated the visualization that displayed the system's interpretation of the current state (B) higher than the one that only showed the system's detection (A) in terms of creativity, explainability, and trust. On average, subjects were not more likely to rate any version of the system higher than the others in terms of autonomy or control. However, they did rate the combined visualization (C), which displayed the most information, closer to their preferences or expectations for a generative music system's autonomy. Their expectations were heavily dependent on their backgrounds, as were their methods of controlling the system once they had learned the mappings from the audio and visualizations. Some participants who preferred instrument-like, predictable behavior (low autonomy) generally used slower motions to manipulate the filters mapped to the horizontal and vertical position of their hand. Others who preferred AI-like, explorative behavior (high autonomy) performed large, varied gestures to allow the AI to make more decisions.

Research study 3: Using Online Machine Learning to Adjust to User Expectations in a Generative Music System

The previous two studies revealed that users' musical goals affect their perceptions of a system's autonomy, and that their expectations about how an AI-based generative music system should act influence their actions when responding to system behavior. The relationship between expectation and perceived autonomy was strengthened when users understood how the system worked, as demonstrated by the visualization conditions. We are planning a third study to answer RQ3 and RQ4: How can online machine learning, a technique in which a machine learning model is updated over time, be used to allow a NIME to adapt to user expectations? How does extended use impact users' understanding of an adaptive NIME?

We are assessing the ability of the system to attain mutual theory of mind, or the mutual understanding of cues and shared goals between collaborators. To do so, we adopt three metrics used in the evaluation of a conversational agent: anthropomorphism, likeability, and perceived intelligence [8]. These scores are based on the Godspeed Questionnaire Series, a set of five questionnaires that also includes animacy and perceived safety [14]. However, animacy and safety are more closely related to robotic motion and are less relevant to evaluating a software agent, so they were removed from the questionnaire. These scores, together with metrics from previous questionnaires - the creativity and autonomy ratings, modified Creativity Support Index, Explanation Satisfaction Scale, and Trust survey - will be recorded alongside qualitative interview responses to assess how subjects perceive and wish to interact with the system.

This study will also incorporate a revised NIME that subjects will practice with, perform with, and reflect upon. This system will include online machine learning, the practice of re-training a model over time to adjust to incoming data [15]. Online machine learning requires user input with which to retrain the model hosted by the system; the next iteration of the system will therefore allow users to record gestures, label them, and prompt the re-training of the model. To answer research question 4, subjects will complete the performance and questionnaire using this system over multiple sessions.
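A hedged sketch of the planned online-learning step is given below: fine-tuning a motion-description model (such as the MotionDescriptor sketch above) on gestures a user has just recorded and labeled. The loss function, optimizer settings, and data format are assumptions, not the planned system's final design.

```python
import torch
import torch.nn as nn

def retrain_on_user_gestures(model, gesture_windows, straight_labels, epochs=5, lr=1e-3):
    """gesture_windows: (N, window * 2) float tensor; straight_labels: (N, 1) float tensor of 0/1 labels."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                                # matches the sigmoid straightness output
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        _, straightness = model(gesture_windows)          # ignore the bounding-box head here
        loss = loss_fn(straightness, straight_labels)
        loss.backward()
        optimizer.step()
    model.eval()
    return model
```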

Expected outcomes

Contributions to the NIME Doctoral Consortium

My research stems from a multidisciplinary background that combines music technology with Artificial Intelligence, Computational Creativity, Human-Computer Interaction, and Robotics.

One example of the insights my background can bring to the Consortium is the use of distinctive methodologies. In my studies, I have used evaluations and design heuristics from co-creative AI research, covering creativity, autonomy, and explainability. I am also looking to continue extending evaluations originating in robotics to AI-based musical applications, an approach that has shaped the core concepts of my research questions. I hope to promote the benefits of considering co-creative musical systems as autonomous agents that exhibit their own musical expressivity.

What I hope to get from the Doctoral Consortium

I am about to embark on this third, advanced stage of the research I have proposed. I am hoping to receive feedback on both my proposed study methodology and design of a third-stage generative music system.

I hope to receive mentorship from senior researchers in the field of music technology - not only in this specific research application, but also in the broader analysis of machine learning and artificial intelligence as they relate to music technology. As I combine AI research with music technology, I hope to gain insights from those at the forefront of NIME development and performance.

As I aim to complete and defend my dissertation, I am reaching the conclusion of my doctoral studies. I would greatly appreciate learning about opportunities for music technology research and work after graduation. Finally, I hope to use this Doctoral Consortium to build a professional network with members of the NIME community who have research interests similar and/or complementary to my own.
