An easy-to-use AI-assisted interface for music creation empowered by harmonic style transfer
Performing and improvising on the piano is a lot of fun but requires extensive training. In this paper, we present an AI-empowered piano performance interface, which makes piano performance much more accessible and approachable for non-pianists. Our system takes chords as input and utilizes harmonic style transfer, a deep music generative method, to render new piano performances while requiring only an elementary understanding of harmony. Compared with existing music interfaces for laypersons, our system produces much more realistic and higher-quality music (as the pieces are style-transferred from human-composed music). In addition, the control of our interface is explicit and effective in the sense that users can always anticipate that the rendered texture will realize the input chords. Case studies show that our interface not only serves as an easy-to-use system for creating and performing piano pieces but also as a friendly tool to learn and develop basic harmonization knowledge and skills.
NIME, collaborative interface, harmonic style transfer, chord progression
•Applied computing → Sound and music computing; •Human-centered computing → Human computer interaction (HCI) → Interaction devices;
Creating and performing music is a rewarding experience reserved mostly for musicians with extensive practice. Whereas fundamental music concepts (e.g. scales and triads) are relatively easy to grasp, meaningfully applying them in music composition or improvisation requires proficient musical knowledge and practised motor skills on an instrument, both of which are difficult to acquire.
In order to allow a larger audience to enjoy music performance, various toy instruments [1][2][3] and human-computer co-creation and performance systems [4][5][6] have been developed. Our work follows this path and aims to build an intuitive piano performance interface for laypersons. Among existing systems, we see Piano Genie [6] as the most relevant: it allows non-musicians to improvise piano music through an interface of eight keys that specify a pitch contour, taking into consideration both musicality and intuitive user control. However, since its underlying model is inherently monophonic, the system cannot generate meaningful polyphony. More importantly, while the sense of pitch contour is innate to almost everyone, control via pitch contour is highly stochastic: it is difficult for users to anticipate the returned pitch, and inputting the same key sequence may not return the same notes. These weaknesses limit Piano Genie’s capacity as a performance system, especially in a collaborative performance scenario.
In this paper, we consider harmonic progression a more ideal medium to connect non-experts and intelligent music systems. Chords are not only musically well-defined but also quite meaningful to human perception. When a user specifies a chord progression, say, C–Am–F–G with a certain polyphonic texture, one can almost imagine the music outcome (as opposed to specifying a melody line that goes up and then a bit down, which is still quite ambiguous and leaves a large degree of freedom of interpretation). The preference for chord progressions as a control for music interfaces can also be seen in popular commercial software such as Hookpad [7] and GarageBand [8]. These tools mainly rely on limited sets of static textures and are quite useful for accompaniment pattern generation. In contrast, our goal is to enable more expressive and realistic music rendering by combining chord inputs and arbitrary polyphonic textures in a flexible way, turning the interface into an enjoyable performance tool.
To this end, we explore using chords as an input medium and harmonic style transfer as a generation method to render expressive and realistic polyphonic texture while requiring no prior training on the piano. The harmonic style transfer module is adopted from a state-of-the-art study in music representation learning [9], which disentangles polyphonic music into two interpretable factors: chord and texture. By manipulating these two factors, we can change the harmonic progression of a music piece while retaining its texture, i.e., perform harmonic style transfer.
Based on the deep music generative model, we build an intuitive human-computer interface, a piano performance system tailored for non-pianists who understand fundamental concepts of chord progressions but lack the keyboard mastery to perform on the instrument. Specifically, the interface provides two modes: the exploration mode and the improvisation mode. The exploration mode guides the user in building and exploring chord progressions and generates music offline, whereas the improvisation mode generates music in real time. Case studies show that our interface not only serves as an easy-to-use system for creating and performing piano pieces but also as a friendly tool to learn and develop basic harmonization knowledge and skills.
In sum, the contributions of this study are:
We build a powerful and intuitive piano performance interface by leveraging deep music representation learning, in which users can steer the performance with different chord progression inputs.
We show that the interface is an effective entertainment and performance tool for non-pianists, as long as they have basic harmony knowledge.
We show that the interface also has merit in music education, serving as an auxiliary tool for ear training and practising harmonization.
In this section, we first introduce the deep generative model that serves as the backend of our system (Section 2.1). We then describe the visual display as well as how to control the interface (Section 2.2).
We follow the method introduced in [9]. As shown in Figure 1, the model is a tailored variational autoencoder consisting of a pair of chord and texture encoders, a chord decoder for chord reconstruction, and a PianoTree [10] decoder for reconstructing the original music. (We refer the readers to the original paper for more technical details.) Among all the variables, the two that matter most for building a controllable user interface are the chord representation and the texture representation: the former is a latent code describing the overall chord progression, while the latter captures the polyphonic texture, e.g., arpeggios, Alberti bass, or block chords.
During inference time, we recombine the encoded user chord input and the texture representation from a reference music phrase, decoding the recombined latent representation with the PianoTree decoder to obtain realistic new music. The produced music, in most cases, follows the user-specified chord progression and the textural style of the reference music phrase. In other words, the harmonic style of the reference is transferred according to the user input.
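As a concrete illustration, the recombination step can be sketched as follows. The encoder and decoder callables stand in for the chord encoder, texture encoder, and PianoTree decoder of [9]; their names and signatures here are assumptions for illustration, not the original implementation.

```python
from typing import Any, Callable

def style_transfer(user_chords: Any,
                   reference_phrase: Any,
                   encode_chord: Callable[[Any], Any],
                   encode_texture: Callable[[Any], Any],
                   decode: Callable[[Any, Any], Any]) -> Any:
    """Render a new phrase that follows the user's chord progression
    with the reference phrase's polyphonic texture."""
    z_chord = encode_chord(user_chords)            # harmony from the user input
    z_texture = encode_texture(reference_phrase)   # texture from the reference
    return decode(z_chord, z_texture)              # new, style-transferred phrase
```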
As for the source of the reference music phrases, we allow the user to select from a set of 20 curated four-measure phrases from POP909 [11], a dataset of popular songs in piano arrangement. We select phrases that are aligned with music segments specified in [12] and start on downbeats. We only include phrases in 4/4. Also, following [9], we apply harmonic style transfer for each two-measure window with no overlap.
Figure 2 and Figure 3 show the visual interface. The top navigation bar contains a dropdown menu for reference phrase selection and also shows the phrase information. Both the reference phrase and the generated (harmonic-style-transferred) phrase are visualized in the upper half of the main interface. The lower half shows the two interaction modes: the exploration mode and the improvisation mode. The former guides the user in composing a chord input offline, while the latter is designed for real-time control. In the rest of this section, we first describe how phrase visualization works in Section 2.2.1, and then introduce more details of the two interaction modes in Sections 2.2.2 and 2.2.3 respectively.
We use piano-roll visualizations for the reference and the generated new phrases. In addition, we use time rulers and bar lines for better readability.
As we use chords as a means of input, we intend for the user to focus on chords rather than individual notes. We include two modes of chord-conditioned pitch overlays as shown in Figure 4. For each timestep, the root mode overlays the root of the current chord with blue bars. The chord mode overlays all chord tones with bars of different colors. In this way, the note-chord relationships and inter-chord relationships are further visualized.
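To make the overlay logic concrete, the sketch below shows one way the highlighted pitch classes could be computed per timestep; the chord vocabulary (natural-note roots only, for brevity) and interval tables are illustrative assumptions rather than the interface's actual code.

```python
# A minimal sketch of chord-conditioned pitch overlays.
NOTE_TO_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
QUALITY_TO_INTERVALS = {
    "maj": (0, 4, 7), "min": (0, 3, 7),
    "maj7": (0, 4, 7, 11), "min7": (0, 3, 7, 10), "dom7": (0, 4, 7, 10),
}

def overlay_pitch_classes(root: str, quality: str, mode: str = "chord"):
    """Return the pitch classes to highlight for one timestep.

    mode == "root" highlights only the chord root (blue bars);
    mode == "chord" highlights all chord tones (bars of different colors).
    """
    root_pc = NOTE_TO_PC[root]
    if mode == "root":
        return [root_pc]
    return [(root_pc + step) % 12 for step in QUALITY_TO_INTERVALS[quality]]

# Example: the chord-mode overlay for an A minor chord highlights A, C, E.
print(overlay_pitch_classes("A", "min"))   # [9, 0, 4]
```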
A user can interact with the system offline in the exploration mode: as shown in Figure 5, the user inputs chords through the canvas in the middle, with additional controls on the right. We use a tree diagram to represent chord progressions, where each node represents a chord lasting two beats and nodes are connected by directed edges, allowing the exploration of different continuations of a given chord sequence. Clicking a chord node expands it into a chord dial where the user can further edit the chord, as shown in Figure 6. In essence, we allow the user to edit chords without worrying about individual notes.
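The chord tree can be represented by a simple in-memory structure; the sketch below is one possible realization, with node contents and helper names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ChordNode:
    symbol: str                                  # e.g. "C", "Am"; spans two beats
    children: List["ChordNode"] = field(default_factory=list)

    def add_continuation(self, symbol: str) -> "ChordNode":
        """Branch off a new continuation from this chord."""
        child = ChordNode(symbol)
        self.children.append(child)
        return child

    def progressions(self, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
        """Enumerate every complete progression from this node to a leaf."""
        path = prefix + (self.symbol,)
        if not self.children:
            return [path]
        result: List[Tuple[str, ...]] = []
        for child in self.children:
            result.extend(child.progressions(path))
        return result
```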
We guide the exploration by suggesting common continuations of a user-inputted chord sequence. The user can select an incomplete chord progression and select “suggest chord” in the control panel. The system then retrieves the most likely next chords (those with conditional relative frequency > 0.05) from the HookTheory API [13] and displays them as branches, as shown in Figure 7. We choose this approach over template-based approaches such as [5] due to its higher flexibility.
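The filtering step can be sketched as follows, assuming a list of (chord, conditional relative frequency) pairs has already been obtained for the selected prefix (e.g., from the HookTheory API); the frequencies in the usage example are hypothetical.

```python
from typing import Iterable, List, Tuple

def suggest_next_chords(stats: Iterable[Tuple[str, float]],
                        threshold: float = 0.05) -> List[str]:
    """Keep only the likely continuations, most frequent first."""
    likely = [(chord, freq) for chord, freq in stats if freq > threshold]
    likely.sort(key=lambda cf: cf[1], reverse=True)
    return [chord for chord, _ in likely]

# Hypothetical frequencies for continuations of C-Am-F:
print(suggest_next_chords([("G", 0.42), ("C", 0.21), ("Em", 0.03)]))  # ['G', 'C']
```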
All complete chord progressions the user enters are stored as rows in the table below the altered phrase. The user can freely combine different chord progressions with different textures. Video 1 demonstrates first editing the chords and then generating the performance.
While the exploration mode fully utilizes harmonic style transfer with fine-grained chord editing, the delay between user input and altered-phrase playback makes it unsuitable for real-time improvisation. To address this, we implement the improvisation mode. We select a vocabulary of all major and minor triads and all major, minor and dominant seventh chords in root position, and enumerate their combinations in one-measure steps (e.g., C-F-G-C, C-C-C-Cm). We then generate phrases for all such combinations and all reference phrases beforehand so that they can be retrieved in real time. During performance, the user selects the next chord to be played through the chord input interface to queue the next measure of music.
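The pre-computation can be sketched as below. We assume a render callable standing in for the style-transfer pipeline and a two-measure window (matching the windows used for style transfer); the exact granularity and cache layout are assumptions for illustration, not the interface's actual implementation.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple

ROOTS = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]
QUALITIES = ["", "m", "maj7", "m7", "7"]   # major/minor triads and seventh chords
VOCAB = [root + quality for root in ROOTS for quality in QUALITIES]

def precompute(reference_ids: Sequence[str],
               render: Callable[[Tuple[str, ...], str], Any],
               window: int = 2) -> Dict[Tuple[Tuple[str, ...], str], Any]:
    """Render every chord window for every reference phrase ahead of time,
    so playback in the improvisation mode only needs a dictionary lookup."""
    cache = {}
    for chords in product(VOCAB, repeat=window):   # one chord per measure
        for ref in reference_ids:
            cache[(chords, ref)] = render(chords, ref)
    return cache

def queue_next(cache, chords, ref):
    """Retrieve a pre-rendered window in real time."""
    return cache[(tuple(chords), ref)]
```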
As shown in Figure 8, we design two modes for chord input: the fixed mode and the movable mode. (The relation between these two modes is analogous to the relation between the fixed-do and the movable-do systems.) The fixed mode features a visual interface similar to that of a music keyboard. The user enters a chord by selecting its root and specifying its quality through a key combination. This mode is suited for users familiar with music theory and allows for entering borrowed chords and modulations. The movable mode prompts the user to select a key and guides the user in entering diatonic chords. We map chord entries to number keys for intuitive use. For instance, pressing 1, 4, 5, 6 on the keyboard enters a I-IV-V-vi progression. Video 2 demonstrates the improvisation mode using these two input modes.
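The movable-mode mapping from number keys to diatonic triads can be sketched as follows; the chord-spelling convention (flats only) is an illustrative assumption.

```python
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]                 # semitones above the tonic
DEGREE_QUALITY = ["", "m", "m", "", "", "m", "dim"]  # I ii iii IV V vi vii(dim)
PITCH_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

def diatonic_chord(key_root: str, degree: int) -> str:
    """Map a number key (1-7) to the diatonic triad of the selected major key."""
    tonic = PITCH_NAMES.index(key_root)
    pc = (tonic + MAJOR_SCALE[degree - 1]) % 12
    return PITCH_NAMES[pc] + DEGREE_QUALITY[degree - 1]

# Pressing 1, 4, 5, 6 in C major yields a I-IV-V-vi progression:
print([diatonic_chord("C", d) for d in (1, 4, 5, 6)])   # ['C', 'F', 'G', 'Am']
```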
To validate the effectiveness of our interface, we conducted two user studies. The first one considers our interface as a pure computer-aided piano performance system, while the second one considers our interface as a music education and ear-training tool.
We invite eight participants (5 male and 3 female, aged from 19 to 23) to perform with our interface in individual sessions. Among the eight participants, one is a proficient piano player who can ably improvise on the keyboard; three are somewhat familiar with music theory, knowing some common progressions but unable to fluently improvise; and four have very limited musical experience and cannot spell basic triads.
For each session, we first introduce or review the concepts of diatonic triads in the major scale and provide the participant with a list of common four-chord diatonic triad progressions. We then introduce the control of our interface and ask the participant to perform on it for a maximum of 15 minutes. The participant can freely switch between the exploration mode and the improvisation mode in the process. Afterwards, we interview the participant and ask them to rate our system in terms of overall usability, musicality, expressiveness, accessibility and music skill engagement on a scale of 0 to 10. Detailed rating rubrics are shown in Table 1.
Rating rubrics.
Criteria | Description |
---|---|
Overall Usability | Making music with the system is enjoyable. |
Musicality | The generated music arrangement sounds realistic and of good quality. |
Expressiveness | The system is capable of generating the music I want. I can anticipate the result of my inputs. |
Accessibility | The interface is easy to use. I can produce meaningful music on it with little practice. |
Music Skill Engagement | Using the system engages my music skills. The system feels like a proper instrument rather than a toy. |
The participant ratings are shown in Figure 9, with the averages shown in Table 2. We see that the overall performance experience is enjoyable, and the generated music is of high quality. Most participants also agree that the interface is controllable and offers good variety in terms of its output. The ratings on accessibility have a notably higher variance compared with the other categories: while most participants report that the input method is intuitive and easy to learn, some musically inexperienced participants comment that performing in real time feels overwhelming because they are not familiar with chord progressions. Finally, all participants report that their music skills are actively engaged when using the interface, with six of the eight participants noting that they feel more comfortable with chord progressions after using the interface.
Average ratings for each category.
Criteria | Average |
---|---|
Overall Usability | 7.875 |
Musicality | 7.875 |
Expressiveness | 8 |
Accessibility | 7.625 |
Music Skill Engagement | 8.25 |
We select and report representative interview responses.
Q1. Describe what you did with the interface. Did you enjoy the process?
“I was a bit confused with the improvisation mode at first as I am not familiar with chord progressions, so I tried out the exploration mode and looked at some progressions.”
“I started by playing the reference phrase in the original progression in the improvisation mode. I knew some of the reference songs, so it was fun. Afterwards, I tried some other progressions.”
“I experimented with some techniques I know from music theory like using secondary dominants for modulation or playing V7–i instead of v–i in the minor scale. I think this can be very helpful for quickly trying out different progressions when writing a song.”
Q2. Are you comfortable with inputting chords to guide the music? Do you feel in control when improvising music with it?
“I didn’t have to worry about playing individual notes or structure of the music phrase so I can focus more on what chords to play next.”
“It was easy to enter different chords with key combinations. I can easily try out different chord progressions despite not knowing how to play them on the piano.”
“Sometimes I get confused and don’t know what chord to play next. I think the system is more suitable for people who are familiar with chords.”
“I don’t feel that the generated music is really my work because it just alters the chords of the reference music.”
Q3. Can you suggest any improvements for the interface?
“Better visualization for the generated phrase might be helpful. I want to see the exact pitch of the notes.”
“Some kind of guidance for the improvisation mode would definitely help. At first, I looked at the buttons and didn’t know what to do.”
“I don’t have complete control over the rhythm. It'd be better if I can control the rhythm and make the music go faster or slower as I like.”
In general, the responses confirm that our interface is an easy-to-use and effective performance tool, as long as the user has basic harmony knowledge. By taking the cognitive load off music texture and motor skills, we make it easy for non-pianists to improvise meaningful music while engaging and practising their understanding of harmony. This circumvents the typical frustration when an inexperienced musician faces a bottleneck in instrument proficiency.
Since our interface is designed to involve concrete music skills, we are interested in investigating how such skills can be developed through using the interface. Specifically, we focus on the identification of chord progressions, a practical ear-training task.
We choose a vocabulary of four common major-key progressions for our participants to learn: I–vi–IV–V, I–V–vi–IV, I–V–vi–iii and IV–V–iii–vi. We invite two music novices with no prior experience in harmonic listening and task them with learning to distinguish these four progressions over three two-hour sessions.
To track the learning progress, we design a learning process consisting of four levels of escalating difficulty: static, melody with chords, piano arrangement, and (original) full arrangement. The four common major-key progressions are realized differently in the four levels and thus impose different learning difficulties. Specifically, the static part consists of static chords played in root position on the piano. The melody-with-chords part consists of melodies played on top of static chords, both played on the piano. The piano arrangement part consists of piano arrangements of popular music, and the full arrangement part consists of full arrangements of popular music phrases. Each part consists of eight music phrases (as concrete cases to learn), and the participant attempts to identify each phrase as one of the four progressions above. All test audio, apart from the static chords, is selected from the POP909 dataset [11] and does not intersect with the reference phrases used in the interface. We evaluate the participants’ performance by the number of correctly identified phrases in each part.
To further facilitate active learning, we add a self-test feature similar to that in Earmaster [14] and Tueno [15]. The user can specify a set of progressions as the test vocabulary. The system randomly selects a progression from the vocabulary and generates a phrase accordingly while withholding the answer. The user can then reveal the answer and verify whether it is in line with their prediction. Our rationale is that this feature encourages the user to actively predict and listen for the characteristics of the chord progression of interest, serving as a form of supervision. We demonstrate this feature in Video 3.
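The self-test logic can be sketched as follows, assuming a render callable that produces a phrase from a progression and a reference phrase; the class and method names are illustrative, not the interface's actual code.

```python
import random
from typing import Any, Callable, Sequence

class SelfTest:
    """Pick a progression at random, render it, and withhold the answer."""

    def __init__(self, vocabulary: Sequence[str],
                 render: Callable[[str, str], Any]):
        self.vocabulary = list(vocabulary)   # e.g. ["I-vi-IV-V", "I-V-vi-IV"]
        self.render = render
        self._answer = None

    def next_question(self, reference_id: str) -> Any:
        """Generate a phrase for a hidden progression from the vocabulary."""
        self._answer = random.choice(self.vocabulary)
        return self.render(self._answer, reference_id)

    def reveal(self) -> str:
        """Reveal the withheld progression so the user can check their guess."""
        return self._answer
```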
For each session, we start by reviewing the relevant music theory and interface control with the participant. We then ask the participant to try to learn the specified progressions using the improvisation mode for ~1.5 hours. During this process, the participant can freely access the self-test and can listen to the static chord progressions. We conduct the test at the end of each session.
Individual performance of the two participants over the three sessions is shown in Figure 10. We observe that variations in music texture can be a major obstacle in learning to identify harmonic progressions, and our interface provides targeted practice for this skill.
One of the participants only managed to correctly identify two out of the eight phrases in the static part by the end of the first session. The participant then requested to listen to the audio again and reattempt the test. After listening to the static progressions, the participant seemed to grasp the characteristics of the progressions and proceeded to correctly identify all progressions in the static part. However, when faced with the melody-with-chords part, his performance regressed to one out of eight, and he commented that he was distracted by the melody. This demonstrates that learners may easily learn to memorize and identify static chord progressions, but this ability does not constitute general musicality, since a change in texture as minor as adding a melody can throw the learner off balance.
After learning to identify the static progressions, both participants moved on from simple textures to more complex textures with melodies at varying points in later sessions. Both participants achieved correct ratios of around five out of eight in the more challenging piano arrangement and full arrangement parts. Both participants report that trying out different textures for the same progression and different progressions with the same texture helps them in learning to discern the textural invariance of chord progressions. Interestingly, the participants report that in some cases, the phrases in full arrangement are easier to identify than the ones in piano arrangement as the bass lines can be more easily picked out.
We can see our interface as a complementary method to both learning from existing ear-training applications (e.g., Earmaster [14] and Tueno [15]) and learning from analyzing songs. Most existing ear-training applications rely on static textures and might be ineffective for learning chord progressions as a part of general musicality. Learning from fully arranged songs might prove too challenging for beginners and lead to a sparse learning experience, as musical phrases of a specified chord progression might be difficult to find. Learning from harmonic style-transferred phrases offers a suitable difficulty for intermediate learners, and this difficulty can be adjusted by changing the reference phrases.
We developed a powerful piano performance interface based on chord input and empowered by harmonic style transfer. We designed intuitive visual components and control methods to make the interface accessible to musically inexperienced users. Our user studies show that the interface facilitates enjoyable and expressive piano performance for non-pianists with basic harmony knowledge and engages music skills. We further show that our interface can complement existing methods of ear-training for harmonic listening and holds merit as an education tool.
We identify several potential improvements to our interface. Incorporating methods to control the rhythm and tempo might allow a greater sense of control and expressiveness. Providing real-time guidance for chord entry might make the interface more accessible for inexperienced users. Finally, implementing a physical interface and allowing users to quickly swap between reference phrases through visualized previews might constitute a more versatile interface for live performance. We leave these avenues for future work.
[redacted for double-blind review]
All procedures in this study that involve human participants were conducted in accordance with the NIME Principles & Code of Practice on Ethical Research. All participants gave informed consent. All in-person activities were conducted in accordance with the local and institutional COVID policies.