Computational Analysis of Style in Traditional Fiddle Playing

Published on Jun 22, 2022

Title

Computational Analysis of Style in Traditional Fiddle Playing

Author keywords

Violin

Gestural Analysis

Electromyography

EMG

IMU

MIR

Classification

MFCC



Research question/s/problem

How do contemporary regional fiddling styles differ, and how may these distinctions be best observed?

A multidisciplinary approach is proposed by which this research question may be addressed. Incorporated Music Information Retrieval (MIR) and gestural analysis techniques aim to quantify, respectively, how and why these styles differ. The efficacy of both audio- and gesture-based approaches for the analysis of classical violin playing has been demonstrated extensively. To a lesser extent, the efficacy of combined audio-gestural approaches has also been explored (Dalmazzo & Ramírez, 2019). Such studies have employed feature extraction and machine learning techniques to classify both gestural and audio time-series data with demonstrable success; however, these approaches are largely yet to be implemented in the study of traditional fiddling.



Context/theory

The history of the fiddle spans both centuries and continents. Regional styles of contemporary prominence may be considered both a product and a reflection of the cultural, political, and demographic shifts their communities have endured. Conventionally, analysis of these styles has been conducted through qualitative musicological evaluation.

The audible content of a violin performance may be considered a product of the performer’s gestural execution. These two aspects - the audible and the gestural - may be quantified, respectively, through the use of Music Information Retrieval (MIR) and gestural analysis techniques. Prior studies have shown machine learning techniques to be effective in automating classification tasks based upon both audio and gestural data.

Gestural Analysis:

Gestures may be recorded and quantified through the use of gestural sensors, yielding time-series gestural data. Gestural sensors are typically Inertial Measurement Unit (IMU) or Electromyography (EMG) based.

An object within three-dimensional space has both a location and an orientation, each of which may be described in three dimensions. An object’s location may be described by its translational position relative to a set of X, Y, and Z axes. The orientation of an object may be similarly described by the object’s rotation around each axis, conventionally termed ‘Roll’, ‘Pitch’, and ‘Yaw’ (Craig, 2005). Each of these six metrics is termed a ‘Degree of Freedom’ (DoF). IMU sensors quantify movement by recording changes in each DoF over time, yielding acceleration and gyroscopic data respectively.
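
By way of illustration, a 6-DoF IMU sample stream might be represented as follows; this is a minimal sketch with hypothetical array shapes, not the format of any particular sensor:

```python
import numpy as np

# A hypothetical 11-second IMU recording at 50 Hz: three translational
# axes (accelerometer) plus three rotational axes (gyroscope).
n_samples = 50 * 11
accel = np.zeros((n_samples, 3))  # acceleration along X, Y, Z
gyro = np.zeros((n_samples, 3))   # angular velocity about X (roll), Y (pitch), Z (yaw)
imu = np.hstack([accel, gyro])    # shape (n_samples, 6): one column per DoF
```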

Castellini & van der Smagt (2009) summarise EMG as “a technique by which muscle activation potentials are gathered by electrodes placed on the [...] skin”. Raez et al. (2006) describe raw EMG signals as consisting of electrical wave-packets ranging between -5 and +5 mV, attributing these to the electrical field generated by muscle fibres during contraction.

Citing a number of prior studies, Castellini & van der Smagt (2009) discussed the efficacy of using forearm surface EMG in combination with machine learning algorithms for the classification of hand posture. They attributed the success of prior implementations to an existing relationship between the force applied by a muscle and the amplitude of the resultant EMG signal; in implementation, the use of multiple sensors allows for the identification of ‘precise force configurations’ associated with specific hand postures.
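
Muscle activation level is commonly estimated from EMG amplitude, reflecting the force-amplitude relationship described above. The following is a minimal sketch of a moving-RMS amplitude estimate; the window length is an illustrative choice, not a value drawn from the cited studies:

```python
import numpy as np

def emg_rms_envelope(emg, window=40):
    """Moving RMS amplitude of a single raw EMG channel - a common
    proxy for muscle activation level. `window` is in samples."""
    squared = np.square(emg.astype(float))
    kernel = np.ones(window) / window
    return np.sqrt(np.convolve(squared, kernel, mode="same"))
```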

Music Information Retrieval:

Low-level descriptors derived through feature extraction provide a representation of a signal’s timbral and rhythmic characteristics. Schedl et al. (2014) note limitations surrounding the interpretability of these descriptors, favouring instead their implementation within computational classification systems.

Ali-MacLachlan (2019) asserts the utility of Mel Frequency Cepstral Coefficients (MFCCs) as a representation of timbre, terming these a “compact feature representation used in audio signal classification”. Zheng et al. (2001) define MFCCs as “the results of a cosine transform of the real logarithm of the STFT expressed on a Mel-frequency scale”. A noted benefit of the Mel scale’s application in such tasks is its approximation of human frequency perception.
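
This definition can be followed step by step in code. The sketch below assumes the librosa API (the file name is hypothetical): a Mel-scaled spectrogram is taken, its real logarithm computed, and a discrete cosine transform applied; librosa.feature.mfcc wraps the same steps in a single call:

```python
import scipy.fftpack
import librosa

y, sr = librosa.load("fiddle_excerpt.wav", sr=None)  # hypothetical file

mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)  # STFT power on a Mel scale
log_mel = librosa.power_to_db(mel_spec)                # real logarithm (dB)
mfccs = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:13]

# Equivalent one-liner:
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```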

While acknowledging their usefulness as an indicator of timbre, McFee et al. (2015) contend that MFCCs are flawed in their depiction of pitch, considering them to offer “poor resolution of pitches and pitch-classes”. Instead, the authors suggest the use of Chroma representations for this purpose, purporting them to “encode harmony while suppressing variations in octave height, loudness, or timbre.” Stein et al. (2009) identify a number of techniques by which Chroma representations may be calculated, noting each as derived from the Pitch-Class-Profile (PCP) technique. An FFT of an input signal is first taken, after which the frequency bin magnitudes within each semitone boundary are summed. The resultant semitone magnitudes are then summed by pitch class across octaves, providing an instantaneous indicator of perceived pitch content.
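
A minimal sketch of this folding procedure is given below. It is a simplified illustration rather than the exact formulation of Stein et al.; the reference frequency and nearest-semitone mapping are assumptions:

```python
import numpy as np

def pitch_class_profile(frame, sr, f_ref=27.5):
    """Fold FFT bin magnitudes into 12 pitch classes. `f_ref` is a
    hypothetical reference frequency (A0)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    valid = freqs > 0                      # exclude the DC bin
    # Nearest semitone relative to f_ref, folded across octaves.
    semitones = np.round(12 * np.log2(freqs[valid] / f_ref)).astype(int)
    pcp = np.zeros(12)
    for s, mag in zip(semitones, spectrum[valid]):
        pcp[s % 12] += mag
    return pcp
```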

Neural Network Classification:

Alpaydin (2020) describes the Single-Layer Perceptron as “the basic processing element” of any neural network, comprising a single node which may receive any number of numerical inputs. A weight is ascribed to each input, and the node produces an output value by summing the products of each input and its ascribed weight. Russell & Norvig (2020) describe the Multi-Layer Perceptron (MLP) as an expansion of the Single-Layer Perceptron comprising multiple layers of nodes, decreasing in quantity and linked by interconnected weights. Weights are initialised randomly and refined through training upon labelled data, through which input data may be classified to an output. While a conventional MLP comprises weights connecting in only one direction (and is thus termed a feed-forward network), Russell & Norvig (2020) identify the Recurrent Neural Network (RNN) as a variant of the MLP incorporating recurrent connections, wherein the output of an intermediate node may be fed back towards its own input, or those of preceding nodes.
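
The weighted summation at the heart of the Single-Layer Perceptron can be illustrated in a few lines; the step activation, bias term, and random initialisation below are illustrative choices rather than details from the cited texts:

```python
import numpy as np

def perceptron(inputs, weights, bias=0.0):
    """Single-node perceptron: sum the products of each input and its
    ascribed weight, then threshold the result."""
    activation = np.dot(inputs, weights) + bias
    return 1 if activation > 0 else 0

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.3])       # three numerical inputs
w = rng.normal(size=3)               # randomly initialised weights
print(perceptron(x, w))
```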

Applications of neural network classification upon both MIR and gestural data have been documented extensively; the following publications are deemed of key relevance to the proposed methodology:

  • Dalmazzo et al. (2021) demonstrated a high degree of accuracy using Convolutional Neural Networks (CNNs) trained upon IMU data for the classification of eight bowing techniques: martelé, staccato, detaché, ricochet, legato, trémolo, collé, and col legno. Reported recognition rates ranged between 97.147% and 99.234% across a variety of CNN-based models - the former a conventional CNN and the latter a CNN Long Short-Term Memory network.

  • Dalmazzo & Ramirez (2017) investigated the efficacy of employing forearm-surface EMG alongside IMU data for the off-line recognition of fingering gestures during violin performance; classification using a Hidden Markov Model (HMM) yielded gestural recognition accuracies of between 89.44% and 99.23%.

  • Zheng et al. (2001) demonstrated the efficacy of using extracted MFCCs to train a HMM for the purposes of speech recognition.

  • Miotto & Orio (2008) utilized Chroma representations in their development of an automated music identification system, proposing their use as ‘indexes’ in an HMM-based retrieval system.


Methods

During a preliminary study, a series of multi-class classification tasks were completed using the open-source Violin Gesture Dataset published by Sarasúa et al. (2017). The dataset contains simultaneous IMU (50Hz), EMG (200Hz) and audio (48kHz) recordings for 880 performances of an excerpt from Kreutzer’s Etude No. 2 in C Major, with a typical duration of around 11 seconds. Each recording is labelled by both participant and a bow-articulation condition (martelé, staccato, detaché, spiccato, legato) - to be subsequently termed ‘Style’.

Figure 1 depicts a chronology of the implemented method, which aimed to classify data associated with an isolated bow stroke by participant or style. The three data types employed, as sourced from the dataset, are depicted at the far left of the figure; the processes subsequently applied in preparing the data for classification are identified in turn, with classification itself depicted at the far right.

Gestural Data Processing

Processing of the gestural data was performed upon the signals in their entirety, prior to their segmentation.

A linear de-trend function was first applied to each channel of IMU data, given the tendency of IMU sensors to exhibit drift over time as a result of accumulated error (Kok et al., 2017).

Proportional normalisation was applied to both the IMU and EMG data, such that the maximum magnitude of a signal was bounded by 1 while the proportional differences in maximum magnitude between concurrent channels of data (e.g. EMG channels 1-8) were maintained. This can be seen in Figure 2.

A low-pass filter with a cut-off of 10 Hz was subsequently applied to the EMG data, in implementation providing a simple amplitude envelope (Tanaka & Ortiz, 2017).
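
Taken together, the three preprocessing steps might be sketched as follows for (samples × channels) arrays. The 4th-order Butterworth design and the rectification of the EMG prior to filtering are assumptions for the purposes of illustration; the implemented processing may differ in detail:

```python
import numpy as np
from scipy.signal import detrend, butter, filtfilt

def preprocess(imu, emg, emg_rate=200, cutoff=10):
    # 1. Linear de-trend of each IMU channel, countering sensor drift.
    imu = detrend(imu, axis=0, type="linear")

    # 2. Proportional normalisation: divide all channels by the global
    #    maximum magnitude, bounding the signal by 1 while preserving
    #    inter-channel proportions.
    imu = imu / np.max(np.abs(imu))
    emg = emg / np.max(np.abs(emg))

    # 3. Rectify, then low-pass at 10 Hz to obtain a simple amplitude
    #    envelope; filtfilt applies the filter without phase shift.
    b, a = butter(4, cutoff, btype="low", fs=emg_rate)
    emg = filtfilt(b, a, np.abs(emg), axis=0)

    return imu, emg
```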

Data Segmentation

Note-onset positions were first identified within the audio data through use of the onset-detection functionality of the Librosa Python library. Each onset position was returned as an index of the audio sample array. Given sample-rate discrepancies between the three data types, proportional scaling of each index was necessary to identify temporally equivalent indices within the gestural data. The recordings of each data type were then split, using their respective onset indices, into a series of inter-onset intervals; these were considered representative of singular bow strokes. Figure 2 depicts the data segmentation of a single recording.
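
A sketch of this segmentation step, assuming the librosa API, is given below; the file name is hypothetical, and `imu` and `emg` denote the preprocessed gestural arrays from the previous stage:

```python
import numpy as np
import librosa

y, sr = librosa.load("performance.wav", sr=48000)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")

# Proportionally scale each audio-sample index to the temporally
# equivalent index in the gestural streams (IMU 50 Hz, EMG 200 Hz).
imu_onsets = onsets * 50 // sr
emg_onsets = onsets * 200 // sr

# Split every stream into inter-onset intervals (one per bow stroke).
audio_strokes = np.split(y, onsets)
imu_strokes = np.split(imu, imu_onsets)
emg_strokes = np.split(emg, emg_onsets)
```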

Audio Feature Extraction

Feature extraction techniques were subsequently employed through use of Librosa, for the purposes of calculating low-level descriptors from the segmented audio data; namely 13 MFCCs, 13 Delta-MFCCs, 13 Delta-Delta-MFCCs, and 12 Chromas. The mean values of each feature were then calculated over time, such that a single set of MFCCs, Delta-MFCCs, Delta-Delta-MFCCs, and Chromas was produced for each bow stroke.
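
Under the assumption of the librosa API, the per-stroke feature set might be computed as below, yielding a single 51-value vector (13 + 13 + 13 + 12) per bow stroke:

```python
import numpy as np
import librosa

def stroke_features(stroke, sr=48000):
    """Mean MFCCs, Delta-MFCCs, Delta-Delta-MFCCs, and Chromas for one
    segmented bow stroke."""
    mfcc = librosa.feature.mfcc(y=stroke, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    chroma = librosa.feature.chroma_stft(y=stroke, sr=sr)
    return np.concatenate([m.mean(axis=1)
                           for m in (mfcc, delta, delta2, chroma)])
```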

Neural Network Classification

MLP networks were used to complete 12 separate multi-class classification tasks: two tasks (participant and style classification) per input data-type combination.

The number of input and output nodes of the MLP varied between tasks, given variation in the number of input data points per data type and in the number of classes per classification task. Despite this, the fundamental architecture of the MLP remained consistent: an input layer and two densely-connected hidden layers, each with a number of nodes equal to the number of input data points, followed by an output layer with a number of nodes equal to the number of classes. This architecture is depicted in Figure 3.
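
A minimal sketch of this architecture, assuming a Keras implementation, is given below; the activation functions, optimiser, and loss are illustrative assumptions, as the study's training configuration is not detailed here:

```python
import tensorflow as tf

def build_mlp(n_inputs, n_classes):
    """Input layer and two dense hidden layers sized to the input
    dimensionality, with one output node per class."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(n_inputs, activation="relu"),
        tf.keras.layers.Dense(n_inputs, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_mlp(n_inputs=51, n_classes=5)  # e.g. audio features, 5 styles
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```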

Results

Consistently higher classification accuracies were exhibited in the participant classification task, with an average accuracy of 85.52% across all data-type combinations. ‘Style’ classification accuracies were considerably lower for each data-type combination, with an average classification accuracy of 65.44%.

Table 1: Preliminary Study Classification Accuracies

Lone MFCC data demonstrated comparatively low accuracies in both classification tasks, although the inclusion of additional feature-extracted low-level descriptors (Delta-MFCCs, Delta-Delta-MFCCs, Chromas) resulted in an accuracy comparable to that of the gestural data types for participant classification. The inclusion of these did not prove similarly beneficial for style classification; while a significant increase in accuracy was observed, the resultant classification accuracy remained far below that of the gestural data types. It should be noted that this implementation of MIR feature extraction preserved no temporal information from the original audio signals, in contrast with the gestural data types, which remained time-series. Consideration of this, in the context of the aforementioned results, may suggest temporality to be a more crucial aspect in the classification of bowing technique than in participant identification.


Expected outcomes

I am hopeful that the subject of my research is of interest to the NIME community. While I am aware that musical applications of gestural techniques have been previously explored, efforts to implement these for the purposes of musicological and ethnomusicological analyses appear to be somewhat uncommon. I am similarly hopeful that, through this opportunity to share my research ideas with the NIME community, I would receive valuable feedback regarding implemented methods and my further development of these. Given the chance to share my ideas in a consortium setting, I feel I would benefit from the opportunity to discuss experiences, findings, and ideas with other PhD researchers working in related fields. Through participation in the doctoral consortium I also hope to further consider the value of my research within the context of creative applications, with the intention of maximizing this. I am certain that the NIME community would offer valuable insight in this regard.












Bibliography

Ali-MacLachlan, I. (2019). Computational Analysis of Style in Irish Traditional Flute Playing. PhD Thesis, Birmingham City University, Birmingham, UK.

Alpaydin, E. (2020). Introduction to Machine Learning (4th ed.). Massachusetts, USA: The MIT Press.

Castellini, C. & van der Smagt, P. (2009). Surface EMG in advanced hand prosthetics. Biological Cybernetics, 100(1), 35–47.

Craig, J. J. (2005). Introduction to Robotics - Mechanics and Control (3rd ed.). New Jersey, USA: Pearson Education, Inc.

Dalmazzo, D. & Ramírez, R. (2017). Air violin: A Machine Learning Approach to Fingering Gesture Recognition. In Proceedings of the 1st ACM International Workshop on Multimodal Interaction for Education (pp. 63–66). Glasgow, UK: ACM.

Dalmazzo, D. & Ramírez, R. (2019). Bowing Gestures Classification in Violin Performance: A Machine Learning Approach. Frontiers in Psychology, 10, 344.

Dalmazzo, D., Waddell, G., & Ramírez, R. (2021). Applying Deep Learning Techniques to Estimate Patterns of Musical Gesture. Frontiers in Psychology, 11, 575971.

Kok, M., Hol, J. D., & Schön, T. B. (2017). Using Inertial Sensors for Position and Orientation Estimation. Foundations and Trends® in Signal Processing, 11(1-2), 1–153. arXiv: 1704.06053.

McFee, B., Raffel, C., Liang, D., Ellis, D., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and Music Signal Analysis in Python. In Proceedings of the 14th Python in Science Conference (pp. 18–24). Austin, Texas.

Miotto, R. & Orio, N. (2008). A Music Identification System Based on Chroma Indexing and Statistical Modelling. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR). Philadelphia, Pennsylvania, USA.

Russell, S. & Norvig, P. (2020). Artificial Intelligence - A Modern Approach. (4th ed.). Pearson series in Artificial Intelligence. New Jersey, USA: Pearson Education, Inc.

Sarasúa, Á., Caramiaux, B., Tanaka, A., & Ortiz, M. (2017). Datasets for the Analysis of Expressive Musical Gestures. In Proceedings of the 4th International Conference on Movement Computing (pp. 1–4). London, UK: ACM.

Stein, M., Schubert, B. M., Gruhne, M., Gatzsche, G., & Mehnert, M. (2009). Evaluation and Comparison of Audio Chroma Feature Extraction Methods.

Tanaka, A. & Ortiz, M. (2017). Gestural Musical Performance with Physiological Sensors, Focusing on the Electromyogram. In The Routledge Companion to Embodied Music Interaction (pp. 422–430). Oxfordshire, England: Routledge.

Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of Different Implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582–589.


















Media (optional)


Figure 1: Preliminary Study Chronology


Figure 2: Signal Splitting Diagram

Figure 3: MLP Diagram















Supervisor’s recommendation letter (to be attached)

Letter of Recommendation

14 April 2022

NIME 2022 Doctoral Consortium

The University of Auckland

New Zealand

To Whom It May Concern:

It is my great pleasure to recommend STUDENT 1 for the NIME 2022 Doctoral Consortium. I was delighted to teach STUDENT over the course of several months from 2020 to 2021 in the New Interfaces for Musical Expression (NIME) module at UNIVERSITY, where students were encouraged to combine their software and hardware skills to produce novel musical interfaces. Following his graduation in Music Technology, STUDENT was awarded a funded PhD position at UNIVERSITY on the theme of Computational Analysis of Style in Traditional Fiddle Playing, the research project that I am currently supervising.

STUDENT has been an outstanding researcher and has made great progress in the first year of his PhD, completing and submitting his first paper to the Folk Music Analysis conference (FMA22) and confidently proceeding towards the creation of his own dataset of traditional fiddle playing styles.

STUDENT’s current research objectives, as stated in the doctoral consortium application, revolve around the analysis of traditional fiddle playing styles combining both audio and gesture-based approaches. STUDENT has recently completed his first exploratory study using an open-source violin dataset and the results look very promising. His future work will look at building his own traditional fiddle style dataset using a combination of audio recordings and biodata. Considering the number of NIME contributions that use musical gesture recognition as a core foundation, I truly believe STUDENT’s work could be of great interest to the NIME community.

STUDENT’s research strongly aligns with the work carried out in the DEPARTMENT at UNIVERSITY. The work of STUDENT’s Director of Studies focuses on the computational analysis of style in Irish traditional flute playing and has already been referenced in STUDENT’s first publication. From the gestural perspective, meanwhile, my research into the analysis of traditional piano technique for the creation of augmented instruments provides STUDENT with the needed support on the gesture recognition aspects of the project.

I truly believe that STUDENT’s participation in the NIME Doctoral Consortium will be extremely valuable to the development of his research project, and that the knowledge exchange with peers in the NIME community will be a great developmental opportunity from both a career and a personal point of view. Ultimately, I truly believe that STUDENT’s work will be an excellent addition to the doctoral consortium, especially considering his technical focus on musical gesture recognition.

Yours faithfully,

SUPERVISOR

TITLE

FACULTY

UNIVERSITY

