Transformer Neural Networks for Automated Rhythm Generation

We propose a novel approach to automated rhythm generation in which a Transformer-XL model is employed to model and generate rhythm from the Magenta Groove MIDI Dataset. Recent applications of this high-dimensional language framework in the field of music have demonstrated its ability to effectively capture and emulate long-term dependencies in musical sequences, dependencies characteristic of human notions of musicality and creative merit, making it an ideal candidate for the task of rhythm-specific generation. We evaluate hundreds of generations from our optimum model using a variety of methods (probabilistic, musicological and blind listening tests) to determine the extent to which our framework has learnt and reproduced the aspects of rhythm we understand to be valuable. Our model is able to achieve a standard of rhythmic production comparable to human playing across arbitrarily long time periods and multiple playing styles.


Dedication
To the Mighty Quinn, may this serve as a baseline for finding your own groove in life.

Chapter 1 Modelling and Rhythm Generation
In this paper we are concerned with musical rhythm, definitions for which are numerous and varied but all point towards a core concept of repetition, particularly of patterns in sound.
It should be obvious that where patterns exist, we can to some extent capture numerically (by fitting some model to example data) intrinsic properties of those patterns, and hence of rhythm. What might not be so obvious is why it is important to do so, or what can be achieved when we do.
Translating rhythm from performance to numbers in a statistical model can be thought of as a process of dimensionality reduction: we capture as much rhythmic essence as possible and project it into a lower-dimensional, machine-friendly space.
From a musicological perspective, this common abstract projection provides a shared language to describe, document, teach and transmit rhythmic structure. The reduction in complexity also makes rhythm easier to store, compare and explore analytically, through analyses such as performance evaluation or the quantification of creativity [1].
Perhaps most interesting of all, abstracting rhythmic input into this domain gives us a foundation from which we are able to create, generate or even compose new sounds whilst maintaining the spirit of the original rhythm. Not only is this exciting from an artistic viewpoint, it also provides us with a method by which we are able to evaluate our rhythmic representation: if we can use our learning to produce new sounds and new rhythms, we can hold our model to account just as we can any performance artist.
This process of abstraction, optimisation, reproduction and evaluation builds on a long history of increasingly sophisticated techniques in algorithmic musical composition.

Related Work
The task of algorithmic composition in music is not new. In fact, musicians have been attempting to create non-deterministic rule-based compositions for centuries.
Examples range from Mozart's use of dice to randomly dictate the order in which musical segments were stitched together in his Musikalisches Würfelspiel to John Cage's Reunion, composed whilst performed, with sounds triggered by ongoing games of chess [2,3]. Unfortunately, the vast majority of works in the field are concerned with pitch (rather than strictly rhythm); as such, it would be remiss to leave these out of the discussion.

Early Computational Methods
In more recent times, computers have been the tool of choice for automated music generation, bringing with them huge (and ever developing) advances in complexity and capability. Stochastic systems such as that outlined by Iannis Xenakis in Formalised Music [4] relied on digital random-number generators to achieve what Cage and Mozart did before him, whilst Anderson et al. rely on Markov chains to realize a more advanced level of stochasticity [5]. Conversely, projects like MUSICOMP [6] or William Schottstaedt's Automatic Counterpoint [7] relied on rule-based approaches in which, once initialised and set in motion, a piece behaves according to a set of rules prescribed by the designer [3]. An extension (in principle) to rule-based compositional techniques is the current state of the art in automated music generation: artificial intelligence or, more specifically (at the time of writing), neural networks.
The distinction in these modern systems is that they are able to learn their own rules to a level almost inconceivable to humans and hence have the ability to generate sound much more diverse and complex than any of their predecessors.

Neural Networks for Automated Composition
Neural networks (NNs) used for music generation in recent years come in a variety of flavours. The difficulty of evaluating them objectively, together with their wide range of use cases, makes it hard to single out any one as best.

Recurrent Neural Networks
One early variety of NN able to model time-series input is the recurrent neural network (RNN): a feed-forward neural network extended to ingest sequential data by including recurrent connections. At each time step, the input of the RNN is an element of the sequence together with the state from the step before it; that is to say, each element is recursively passed back to coincide with the one succeeding it. The expected output is the next element in the sequence. Hence we have a network that learns to predict the next step in a sequence using its current state, the one before it and therefore, in theory, all preceding states [2,8]. Recurrent layers can be conceptualised as occurring multiple times: the step of feeding the sequence back into itself can be pictured as feeding the sequence into another identical network with the same weights, whose learning is linked to the previous one. Figure 1 illustrates this: on the left is the recurrent layer, on the right is the layer unrolled. Each module in a recurrent neural network makes some transformation to its input data; in the case of RNNs this transformation is quite simple, for example a tanh as shown in Figure 2. One can see how this architecture is a natural choice for music, where the sequential nature of the data is explicitly learned by the model.
Efforts in algorithmic composition using RNNs were made as early as the 1980s [10], though a more recent notable attempt can be found in [11], where a network of restricted Boltzmann machines is used to learn harmonic and rhythmic probabilistic rules from polyphonic music scores. Both examples note difficulty in effectively learning and generating convincing long-term structure, with generated musical lines being impressive but short. This inability of RNNs to learn effectively in the long term stems from how the loss is propagated back through time: since the gradient decays exponentially in time, long-term dependencies are hidden [12].
This should not be hard to appreciate by looking at Figures 1 and 2: there is no explicit mechanism to identify where or how far back in the sequence relationships exist, and the setup is such that the further back in time we go, the less relevant the relationships are assumed to be. This poses a problem for modelling music, where relationships exist periodically over the entirety of the data, and it is addressed by long short-term memory networks (LSTMs).
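The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration under assumed names and sizes (`rnn_forward`, `W_x`, `W_h`, the dimensions and the random inputs are not taken from any cited work): the same weights are reused at every step, and each new state is a tanh of the current input combined with the previous state.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b, h0):
    """Run a vanilla RNN over a sequence and return the hidden state at every step."""
    h = h0
    states = []
    for x_t in inputs:
        # Combine the current element with the previous state, then squash with tanh.
        # The same weights (W_x, W_h, b) are reused at every time step.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Toy dimensions and random data, chosen purely for illustration.
rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 4, 3, 6
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
inputs = rng.normal(size=(steps, input_dim))

states = rnn_forward(inputs, W_x, W_h, b, np.zeros(hidden_dim))
```

Because every state is produced by repeatedly applying the same squashing transformation, information from early steps reaches later ones only through many such applications, which is one way to picture why distant relationships are hard to retain.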

Long Short-Term Memory Networks (LSTMs)
LSTMs are a variation on the traditional RNN with additional special units introduced to maintain information in memory for longer time periods, with gates [9] determining which information is stored and for how long (in contrast to vanilla RNNs, which replace the activation at each time step). Figure 3 shows how these gates are configured in the repeating module of an LSTM. The path across the top of the module is the cell state; this is largely unaffected by transformations on its own path but can have information added or removed by the gates feeding into it (the sigmoid neural network layer components). Gates output a number between 0 and 1 representing the extent of information let through and in doing so inform the cell state of what is important to remember at each time step in relation to other steps in the cell state (how many steps are considered in one hidden state is a parameter of the model and directly dictates the LSTM's ability to learn in the long term). This gating not only enhances the network's ability to preserve long-term temporal information, but also proves to be more computationally efficient [13].
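As a rough illustration of the gating just described, the following NumPy sketch implements a single step of a standard LSTM cell. The function names, the packing order of the gates inside `W` and `b`, and the toy values are assumptions made for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W and b pack four blocks in the (assumed) order: forget, input, output, candidate.
    """
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates lie between 0 and 1
    g = np.tanh(g)                                # candidate values to write
    c = f * c_prev + i * g                        # cell state: keep some, add some
    h = o * np.tanh(c)                            # exposed hidden state
    return h, c

# Toy setup: forget gate pushed fully open, input gate pushed fully closed.
hidden, n_in = 3, 2
W = np.zeros((4 * hidden, hidden + n_in))
b = np.concatenate([np.full(hidden, 50.0), np.full(hidden, -50.0), np.zeros(2 * hidden)])
c_prev = np.array([0.5, -1.0, 2.0])
h, c = lstm_step(np.ones(n_in), np.zeros(hidden), c_prev, W, b)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through the step essentially unchanged, which is the sense in which the top path of the module can carry information across many time steps.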
One of the first notable applications of LSTM networks to the task of automated composition was in [14], where Eck et al. produce a remarkable polyphonic blues improvisation using an LSTM in which chord and melody connections are decoupled, with chords influencing melody but not vice versa. A novel representation of musical sequence is used in [15], where chords are represented as strings of text in the input sequence rather than musical entities (such as [C, :, m, a, j, G, :, m, a, j] rather than [C:maj, G:maj]); the authors observe that fewer distinct characters in the input space reduced complexity and improved runtime in training, though they note that this method failed when applied to percussion. Transposition invariance is achieved in [16] by tying together multiple parallel LSTMs (that is, coupling the weights of each network), one for each note in the input sequence, and in [17] by representing input data as a two-dimensional matrix of playable notes and time, with a supplementary matrix indicating whether a note is repeated, held or neither. The Magenta project achieves expressive timing and dynamics with LSTMs in [18] by fixing the time "step" to 10 ms and allowing the model to skip forward in time if necessary (rather than tying the time step to the meter). There is limited contribution to the strictly rhythmic domain with LSTMs, though [19,20] both achieve low-dimensional generation (3 distinct sounds) on limited data. Two operational consequences of the increased complexity associated with the gated memory of LSTMs are that training is longer and more memory intensive, and that they are more prone to overfitting (as a consequence of the increase in parameters).
From a model performance perspective, they are still not considered perfect in how they model temporal dependency, in that there is still an emphasis on proximity in the input sequence. One imagines that in many cases this is important musically, but it comes as a trade-off against emphasis on distant relationships.

Attention-based Mechanisms
These drawbacks of LSTM networks are addressed by a more recent architecture that is also concerned with long-term memory: self-attention, first proposed in [21]. At a high level, the self-attention (or, more commonly, attention) mechanism achieves what gated memory does in LSTMs by learning a self-similarity matrix between the input sequence and itself. This matrix is easily conceptualised as a square similarity matrix where each element of the sequence is a row and a column on the grid, with the diagonal taking the maximum value (most similar/relevant). An example with a text sequence can be found in Figure 4, which displays the relationship between text tokens in a translation task that has been learnt via an attention mechanism [21].
Learning the temporal dependencies in the input sequence using this self-similarity approach removes the relationship between importance and proximity: any element in the sequence can be identified as relevant to any other within the constraints set by the dimensions of the grid. This dimension is a hyperparameter of models that use these mechanisms and can be thought of as analogous to the size of the hidden cell state in LSTMs (we will talk more about how this constraint is removed by our model of choice later on). Early descriptions of the mechanism can be found in [21,22,23].
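To make the self-similarity grid concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the form popularised by the Transformer. The projection matrices, dimensions and random input are placeholders chosen for illustration; the sketch makes no claim about which variant the works cited above use.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (length, dim)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Square grid of similarities: row i scores element i against every element j.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    # Each output element is a weighted mix of all elements, near or far.
    return weights @ V, weights

# Toy sequence of 5 elements with 8 features each (illustrative values only).
rng = np.random.default_rng(0)
length, dim = 5, 8
X = rng.normal(size=(length, dim))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(dim, dim)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

Nothing in the computation of `weights` depends on how far apart two elements are, which is the property the paragraph above relies on.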
Attention is at the heart of the transformer neural network, introduced in 2017 [24].
Transformers typically consist of an encoding module of feed-forward neural networks (FNNs) which takes sequential data as input and produces an embedding for each element. Self-attention is then applied, aggregating information from all other elements and generating a new per-element representation learned from the entire context; this is repeated to create successive generations of more nuanced embeddings.
A decoder then generates an output sequence element by element while consulting the representation generated by the encoder [24,25]. Figure 5 illustrates the basic units of a transformer (of which there are many stacked). The attention layer here [21] replaces the recurrent mechanism present in the previous two networks.
Due to the use of FNNs and the lack of recurrence (as in RNNs and LSTMs), computation can be done in parallel, making transformers much faster to train than their predecessors. In machine translation and NLP tasks, transformers have achieved state-of-the-art results in recent years (for example in [26]), and even more recently they have been applied to the task of music generation. Google was first with their Music Transformer in early 2019, showcasing state-of-the-art perplexity on the scale of minutes (as opposed to seconds as before) and demonstrating the ability to expand on a musical prime as input, noting that they foresee how this might be useful as a creative tool [27]. The MuseNet project from OpenAI has also succeeded in generating music at the minute scale using transformers and a novel input representation in which each token is encoded to combine pitch, volume and instrument [28].
Transformers have also been shown to do well in multi-instrumental applications, for example in [28]. Also notable in [28] is the use of the same pre-training method introduced in [26], where learning is performed on a large dataset to understand general musical structure and then fine-tuned on a smaller dataset to capture more style-specific relationships. They also utilise the Transformer-XL implementation [29], which allows learning of the input sequence to take place across arbitrary lengths.

Objectives and Expectations
Although this is still a rapidly developing field, there exists very little work dedicated solely to the modelling and generation of rhythm or percussion. The recently published Groove MIDI Dataset [30] provides a decent source of symbolic rhythmic data to work with, as is done in the accompanying paper [31], where LSTMs are used to infill a low-dimensional input sequence with more complex rhythms. Although impressive, one of the drawbacks stated in the paper is the difficulty and computational expense of learning in the long term. We believe that a Transformer architecture (particularly Transformer-XL) can offer a solution to this, producing similarly interesting results with a better understanding of long-term dependency. To this end we investigate the Groove MIDI Dataset, train an optimum Transformer model and assess its ability to learn and emulate the musicality of human rhythm across a variety of styles.
Our output will be a trained machine learning model with an interface to (1) generate new rhythms at a user-specified tempo and (2) continue a user-defined input rhythm (both seen and unseen at training time). We will subject our model to a series of tests and analyses to determine the extent to which it has effectively modelled consonant, interesting and musically valuable rhythm as we understand it. These tests include (1) a probabilistic evaluation of the model's ability to predict rhythm in a sequence, (2) Turing listening tests to determine the model's ability to imitate human drumming and (3) a musicological analysis of a sample of outputs to place the model's contributions amongst those that exist from non-machine methods.
The task we present here has not been achieved to date, and we hope that our efforts contribute to the ongoing body of work committed to the development of computational creativity and the statistical modelling of music and sound, grateful as we are for the countless contributions on which this work is built.

Chapter 2 Architecture
The project pipeline is constructed using the code found in the Transformer-XL repository as a template [29]. All configuration and model hyperparameters were abstracted to configuration files, and experiments in data representation and model training were carried out on two NVIDIA K80 GPUs to iterate towards our best model (achieved after approximately 4 hours of training). The code and instructions on how to use it can be found at: https://github.com/slimranking/bumblebeat/tree/master/bumblebeat

Groove MIDI Dataset
We rely on the Magenta Groove MIDI Dataset (GMD) for training: 13.6 hours (22,000 measures) of human-performed, tempo-aligned expressive drumming in MIDI format, played mostly by professional drummers [30]. The data is already split into train, test and validation sets, which we use here so as to make comparison with other works more tractable (see Table 1). All samples are matched with associated metadata including anonymized drummer identifiers, musical style annotations, and tempo. Almost all samples are played in

Sequence Tokenisation
The original MIDI representations can be thought of as a sequence of triples in which each element provides a value for pitch, velocity and start time (see Equation 2.1). Since we are dealing with rhythm only, duration is irrelevant and end time is discarded.
It is important to note that all sequences were quantized to 1/16th notes before training. Our desired format is a stream of tokens, i.e. a one-dimensional sequence representing all samples, their velocities, pitches and temporality. To achieve this we apply three transformations to the MIDI representation above.

Pitch Mapping
The Roland TD-11 drumkit that the dataset was collected on records 22 distinct pitches. Many of these pitches are very sparse in the dataset and can be naturally grouped, lowering the dimensionality of the input data and reducing the complexity for the model.
The grouping of pitches we adopt is almost identical to that used by Gillick et al. in [31] (see Table 3). Applying this to the entire dataset reduces it to 9 unique pitches in total: kick drum, snare drum, closed hi-hat, open hi-hat, low tom, mid tom, high tom, crash cymbal and ride cymbal. After applying the mapping, our sequence can be described by Equation 2.2.
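A pitch grouping of this kind amounts to a lookup table. The sketch below is illustrative only: the specific MIDI note numbers assigned to each class are an assumption made for the example, and Table 3 remains the authority on the grouping actually used. Notes are assumed to be (pitch, velocity, start time) triples, and `map_pitches` is a hypothetical helper name.

```python
# Assumed, illustrative grouping of 22 raw drum pitches into 9 classes.
# The exact note numbers here are NOT taken from Table 3.
PITCH_CLASSES = {
    "kick": [36],
    "snare": [38, 40, 37],
    "closed_hihat": [42, 22, 44],
    "open_hihat": [46, 26],
    "low_tom": [43, 58],
    "mid_tom": [47, 45],
    "high_tom": [50, 48],
    "crash": [49, 52, 55, 57],
    "ride": [51, 53, 59],
}

# Invert the table so each raw pitch points at its class.
RAW_TO_CLASS = {raw: name for name, raws in PITCH_CLASSES.items() for raw in raws}

def map_pitches(notes):
    """Replace the raw pitch of each (pitch, velocity, start) triple with its class."""
    return [(RAW_TO_CLASS[pitch], velocity, start) for pitch, velocity, start in notes]

mapped = map_pitches([(36, 100, 0.0), (40, 64, 0.25), (22, 30, 0.25)])
```

A raw pitch that is missing from the table raises a `KeyError`, which makes unmapped notes visible rather than silently dropping them.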

Velocity Representation
Our velocity values, v_n, lie in the range [0, 127]; these are bucketed to fall within B equally spaced bins. Our approach was to reduce B from B = 10 until we found a bucketing in which most buckets were occupied and generated into a large proportion of the time. This way we avoid needless added complexity for the model to learn or take into account, yet maintain sufficient variation in intensity to be interesting. We found B = 4 to be a good balance; this is also in line with the number of choices one might be provided on a more basic drum instrument or software (silence, low, medium, high).
Finally, every (pitch m_n, velocity bucket b_n) combination is assigned a unique token corresponding to that pair. With B = 4 and 9 pitch classes, we have 36 (9 × 4) unique tokens corresponding to every possible combination of (m_n, b_n). This is quite an important decision in the data representation, and an unintuitive one. By representing pitch and velocity like this we introduce complexity for the model to learn, i.e. as far as the model is concerned, (m_n = 35, b_n = 5) and (m_n = 35, b_n = 9) are unrelated since they have different unique tokens. However, we know that they are in fact the same instrument. We remove this information from the model and ask it to learn it. We experimented with representing the velocity and pitch as separate tokens but found the results (subjective listening and quantitative evaluation of our model) to be better with the combined representation. We will revisit exactly why we think that is later.
Equation 2.6 concludes the velocity representation of our sequences, where pv_n is the unique (pitch class, velocity bucket) token for the n-th note.
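The bucketing and token assignment can be sketched as follows. The equal-width bin edges, the zero-based indexing of pitch classes and buckets, and the helper names `velocity_bucket` and `pv_token` are all assumptions for illustration; the text does not specify how bin boundaries or token ids were chosen.

```python
def velocity_bucket(velocity, n_buckets=4):
    """Map a MIDI velocity in [0, 127] to one of n_buckets equal-width bins (assumed edges)."""
    if not 0 <= velocity <= 127:
        raise ValueError(f"velocity out of MIDI range: {velocity}")
    return velocity * n_buckets // 128

def pv_token(pitch_class, bucket, n_buckets=4):
    """Give every (pitch class, velocity bucket) pair its own integer id."""
    return pitch_class * n_buckets + bucket

# Nine pitch classes (0..8) and four buckets (0..3) yield 9 x 4 distinct ids.
all_tokens = {pv_token(p, b) for p in range(9) for b in range(4)}
```

Note that two tokens sharing a pitch class, such as `pv_token(2, 0)` and `pv_token(2, 3)`, are just different integers to the model; nothing in the id tells it they come from the same drum, which is the point made above.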

Time Representation
The time ordering of our current sequence can be deduced from the t_n values (the second dimension of the sequence elements). We want to reduce the number of dimensions at each element from two to one. To do this we insert special time tokens into the sequence, separating the pitch-velocity (pv_n) tokens by tokens that represent the time between them. After this step, the sequence ordering is integral to the interpretability of the sequence.
The transformation of the sequence in Equation 2.6 is as follows
Representing silence using ticks removes the implicit embedding of tempo from the sequence and is inspired by its successful application in a musical context using the Transformer-XL framework by Donahue et al. in [32].
The number of ticks between two pv_n events is designed to be represented as efficiently as possible, since we want to reduce unnecessary complexity for the model to learn. As such there are 5 unique tick time tokens, found in Table 4.
Silences are filled with as few tick tokens as possible for the duration. All of our sequences are converted to this one-dimensional format and joined together into one long stream. Each sequence is divided in the stream by a special dividing token. These joins are relatively infrequent in the stream as a whole and do not skew the model's learning of the tokens we care about. This approach is used to separate documents in the paper that introduced the Transformer-XL model [29] and to separate musical sequences in [32].
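Filling a silence with as few tick tokens as possible can be sketched as a greedy decomposition. The five tick sizes below are hypothetical placeholders (the real values are those listed in Table 4), as are the token spellings and the helper names; events are assumed to be (pv token, start time in ticks) pairs already sorted by time.

```python
# Hypothetical tick sizes; the actual five values are given in Table 4.
TICK_SIZES = (10000, 1000, 100, 10, 1)

def silence_to_ticks(n_ticks, sizes=TICK_SIZES):
    """Cover a silence of n_ticks with as few tick tokens as possible (largest first)."""
    tokens = []
    remaining = n_ticks
    for size in sorted(sizes, reverse=True):
        count, remaining = divmod(remaining, size)
        tokens.extend([f"<tick_{size}>"] * count)
    if remaining:
        raise ValueError("silence cannot be expressed with the given tick sizes")
    return tokens

def to_token_stream(events):
    """Flatten time-ordered (pv_token, start_tick) pairs into a one-dimensional stream."""
    stream, previous = [], 0
    for pv, start in events:
        stream.extend(silence_to_ticks(start - previous))
        stream.append(pv)
        previous = start
    return stream

stream = to_token_stream([("pv_3", 0), ("pv_7", 120), ("pv_3", 120)])
```

Taking the largest size first gives the minimum token count here because each assumed size is a multiple of the next smaller one; for arbitrary tick sizes a greedy choice is not guaranteed to be minimal.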

Modelling
For our modelling we use a Transformer-XL architecture [29]. From the original paper... Given previous success on similar tasks using the Transformer-XL architecture [32], we decided to use this model.

Transformer-XL Model
For a corpus of tokens x = (x_1, ..., x_T) the Transformer-XL model learns the joint probability P(x), auto-regressively expressed as
$$P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t}).$$
As with the original Transformer model [24], the conditional probability is learnt by training an encoder to map the context, x_{<t}, to a fixed-size hidden state, which is then multiplied by the token embeddings to return logits. A softmax is applied to the logits to give a categorical probability distribution over the next token [29].
The XL model is specifically interested in encoding arbitrarily long contexts (input sequences of arbitrary lengths). Traditionally this is achieved by breaking the input sequence into training segments and training the model individually on each. This results in the largest possible dependency length being dictated by the segment size and inevitably (more often than not) contexts being split up (in the event of a segment boundary falling in the middle of one of our concatenated input sequences).
To address these two limitations, the XL model implements a segment level recurrence mechanism, where the hidden state learnt for each segment is cached and made available to the next segment. Figure 6 illustrates this.
Applying this mechanism to every two consecutive segments creates a recurrence that effectively spans the length of all segments. This is reported to contribute to a large increase in dependency length over the original Transformer and previous RNN models (450% and 80% respectively) [29].
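The caching idea can be illustrated with a deliberately simplified, single-layer NumPy sketch. It omits most of what the real model does (multiple layers, relative positional encodings, training and gradient stopping), and every name and dimension in it is an assumption made for the example: the states of the previous segment are kept as memory and prepended to the keys and values of the current segment.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend_with_memory(segment, memory, W_q, W_k, W_v):
    """Causal attention for one segment whose keys/values also cover cached memory."""
    context = np.concatenate([memory, segment], axis=0)
    Q, K, V = segment @ W_q, context @ W_k, context @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    m, n = memory.shape[0], segment.shape[0]
    # Position i may look at all of memory and at segment positions <= i only.
    future = np.triu(np.ones((n, m + n), dtype=bool), k=m + 1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def run_segments(states, seg_len, W_q, W_k, W_v):
    """Process a long sequence segment by segment, caching the previous segment."""
    memory = np.zeros((0, states.shape[1]))
    outputs = []
    for start in range(0, len(states), seg_len):
        segment = states[start:start + seg_len]
        outputs.append(attend_with_memory(segment, memory, W_q, W_k, W_v))
        memory = segment  # cache for the next segment (held fixed, never updated)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
dim = 8
states = rng.normal(size=(8, dim))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(dim, dim)) for _ in range(3))
with_memory = run_segments(states, 4, W_q, W_k, W_v)
no_memory = attend_with_memory(states[4:], np.zeros((0, dim)), W_q, W_k, W_v)
```

In this toy version the second segment's outputs differ depending on whether the first segment was cached, showing context crossing a segment boundary. With a single layer the reach is only one extra segment; the longer dependency lengths discussed above come from stacking layers, which this sketch does not attempt.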

Sampling and Generation
We want to use our learnt model to generate new sequences. Equation 2.8 provides the foundation for both of our generation tasks. The model has no explicit musical knowledge; as such, the desired output length is specified in number of tokens. Given that these can be a mix of pitch and time tokens, it is impossible to specify to the model the length required in musical terms (such as a number of beats).

Task 1: Generation
The generation task is to create new sequences completely from scratch. To do this the model is primed with the special token used to delimit sub-sequences in our long one-dimensional training sequence (from Section 2.2).
As mentioned in the previous section, the current token (in this case the special delimiter) is encoded and multiplied by the existing token embeddings to produce a logit distribution over the next token. We sample from this distribution to select our next token, feed this back into the model to update the memory/add to context and repeat until a given generation length.
Adopted from [32], there are two user specified parameters that alter this process.
• Top K: before sampling from the logit distribution we take the K most probable tokens, isolate them and normalise their probabilities to sum to one. The final sampling is done from this distribution.
• Temperature: this parameter alters the extent to which we truly sample from the distribution or just take the most probable token.
Playing with these two parameters gives varying results. A higher (or no) top k affords more improvisation, or encourages less likely generations. Temperature dictates to what extent the generated rhythms are consistent over time.
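One common way to combine these two parameters is sketched below. It is an assumption that the logits are divided by the temperature before the Top K cut is applied; the function name and argument conventions are illustrative and are not taken from the project code or from [32].

```python
import numpy as np

def sample_next_token(logits, top_k=None, temperature=1.0, rng=None):
    """Sample a token index from logits using temperature scaling and a Top K filter."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None and top_k < scaled.size:
        # Keep the K largest logits (ties at the cutoff are all kept).
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()  # renormalise the survivors to sum to one
    return int(rng.choice(scaled.size, p=probs))

rng = np.random.default_rng(0)
logits = [0.1, 5.0, 0.2]
greedy_like = sample_next_token(logits, top_k=1, rng=rng)
draws = {sample_next_token(logits, top_k=2, rng=rng) for _ in range(50)}
```

Setting `top_k=1` or a very small temperature makes the choice effectively deterministic, while a larger K and a temperature near 1 leave room for less likely tokens.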
The output of the generation is a sequence identical (in format) to that introduced in 2.7. We de-tokenise this sequence by creating a MIDI file in which each note takes the pitch, and a velocity randomly sampled from within the bucket, corresponding to its pv value.
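The velocity part of de-tokenisation might look like the sketch below, which assumes equal-width buckets indexed from zero and uniform sampling inside the bucket; neither detail is specified in the text, and `sample_velocity` is a hypothetical name.

```python
import random

def sample_velocity(bucket, n_buckets=4, rng=random):
    """Draw a MIDI velocity uniformly from inside one (assumed equal-width) bucket."""
    if not 0 <= bucket < n_buckets:
        raise ValueError(f"bucket must be in [0, {n_buckets}), got {bucket}")
    width = 128 // n_buckets
    low = bucket * width
    return rng.randint(low, low + width - 1)  # randint is inclusive at both ends

rng = random.Random(0)
velocities = [sample_velocity(b, rng=rng) for b in range(4)]
```

Sampling within the bucket, rather than always emitting its midpoint, restores some of the velocity variation that bucketing removed.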

Task 2: Continuation
Generation by continuation functions exactly the same as generation introduced in the previous sub-section except that before generating the model is primed with an existing input sequence. That is to say that an existing input sequence is passed to the model, updating the internal memory and context before any sampling is done.
Temperature and Top K are both parameters of continuation. Another parameter specific to continuation is the prime length: the number of tokens from the priming sequence to pass to the model before asking it to generate. A higher value for prime length results in a much more stable output, truer to the original form; however, this comes at the cost of improvisation and of exciting or interesting results.

Evaluation in Development
The framework used in this experiment is highly parameterized. The selection of these parameters is based on a mixture of subjective evaluation of the output (for data representation parameters) and perplexity of the predictions at training time (for model hyperparameters).

Perplexity
Perplexity is a typical metric for evaluating auto-regressive language models offline, defined in Equation 2.9 [33]:
$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{T}\sum_{i=1}^{T} \log p_\theta(x_i \mid x_{<i})\right)$$
where log p_θ(x_i | x_{<i}) is the log-likelihood of the i-th token conditioned on the antecedent context, x_{<i}. The exponent can be conceptualized as the average log-probability of a token, given the previous context, across all tokens in a sequence. In our case, this sequence is the one-dimensional stream generated from the test/validation split of the data. A high probability is an indicator of a performant model, and hence so is a low perplexity (owing to the negative sign).
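As a small worked example, perplexity in its standard form (the exponential of the negative mean log-likelihood per token, using natural logarithms) can be computed from a list of per-token log-probabilities. The function name and the toy inputs are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log likelihood over the tokens of a sequence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.5 to every observed token.
coin_flip = perplexity([math.log(0.5)] * 4)
# A model that is certain of every observed token.
certain = perplexity([0.0, 0.0, 0.0])
```

A model that assigns probability 0.5 to each observed token is, on average, as uncertain as a fair choice between two options, while a model that is always certain reaches the minimum possible perplexity of 1.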
Model training is stopped when validation perplexity ceases to decrease. The model from this point is taken forward for experimentation.

Chapter 3 Evaluation, Experimentation and Analysis
We subject our best model and data representation to a number of tests and analyses to understand to what extent we have learnt a musically valuable representation of rhythm.

Offline Evaluation -Model Perplexity
Perplexity (Section 2.5.1) was used as our guiding metric for offline evaluation during training and model tuning. This serves as an informative and easy to calculate measure of how well our model predicts the data in our training set and as such was used to justify decision making in development (e.g. in the representation of the data and the hyperparameters of the model).
The perplexity of our final model on our test dataset is 1.552. Perplexity is useful because it is easy to obtain and tells us something directly about how our model understands the problem. It is, however, widely accepted not to be strongly correlated with human perception of musicality. In fact, almost all key publications in the field of creative machine generation use evaluation methods that encode a human preconception of what we consider valuable. In the specific case of musical creativity this often involves listening tests to measure, for example, musicality [27], pleasantness [36] or, as in our benchmark paper Learning to Groove, the ability to pass as human to the listener [31]. It is the latter that we are concerned with here.

Test Samples
Listening experiments were carried out using 500 samples generated by the continuation and generation methods. These samples were not cherry-picked and every generation was made available for the test.
The model and generation methods are parameterised so as to generate a token sequence of pre-specified length (more detail in Section 2.4). As such, sequences of 3000 tokens were generated and the first 8 bars were extracted manually. This manipulation alongside the alignment of the first beat to coincide with time=0 is
Given the imbalance in genre in the dataset and finite sampling for our test, some of the less common genres were not present.

Experiment Setup
The experiment was facilitated by the Amazon Mechanical Turk platform, on which workers were asked to listen to two 8-bar samples: one from our generated dataset of 500 and one from our original Groove MIDI dataset. Workers were aware that one of the two samples was generated by a machine and one by a human. They were asked to select which one they believed was generated by a human; they also had the option of answering "Not sure".
Inspired by [32], to ensure that we only count responses where the worker genuinely listened to both samples we included 4 instances in which randomly generated noise samples replaced our machine-generated ones. Responses from workers who failed to identify the correct sample in any one of these 4 instances were removed from the test.
In total, 640 individual listening tests were carried out. After filtering out the responses of workers who failed the random noise test, we were left with 548 responses for analysis. Figure 7 illustrates our results. Error bars are calculated using a 95% binomial proportion confidence interval.

Accuracy in Identifying Human Generated Rhythm
Figures 7a and 7c show the accuracy with which experiment participants identified which of the pair of samples they were presented with was human-generated. An accuracy of 60% indicates that 60% of the time our model was not able to convince a human listener that it is itself human; hence a lower value in these charts indicates a more performant model.
These two charts are split across the little metadata we had about the samples: genre and generation type. It is important to note that there is no ground-truth genre annotation for the samples generated by the generation method (i.e. completely sampled from the model), and as such our sample size for experiments tagged with this information is roughly halved, hence the larger error.
Finally and most importantly, we have included in Figure 7a the results of an almost identical listening experiment presented in Learning to Groove by Gillick et al. [31], in which their generations were put to listeners in a blind test in an effort to determine their model's ability to pass as human. Though none of the three methods presented by Gillick et al. matches exactly the work achieved in this paper, we believe that the tasks are sufficiently similar to merit comparison: both papers are concerned with learning a model of expressive performance on the Groove MIDI Dataset, both with the intention of using this model to predict and generate new and bespoke rhythms that equal or better human performance in musical creativity tasks.

Sureness in Annotation
In total, 77 out of 548 (14.1%) tests resulted in the listener not being able to identify which of the two samples was human (answering with "Not Sure"). Figure 7b shows this proportion over all tests and for each of our generation methods separately.
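A 95% binomial proportion confidence interval of the kind used for the error bars in Figure 7 can be sketched as follows. The text does not say which interval method was used; the normal-approximation (Wald) interval below is an assumption, and `binomial_ci` is a hypothetical helper name. It is applied here to the 77 "Not sure" responses out of 548.

```python
import math

def binomial_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion; z=1.96 gives ~95%."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

proportion, lower, upper = binomial_ci(77, 548)
```

The half-width shrinks with the square root of the sample size, which is why the genre-tagged subsets, with roughly half as many responses, carry visibly larger error bars.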

Offline Analysis -Velocity Distribution
It is interesting to observe how the distribution of velocity across measures compares between our original dataset and our generated samples. This is naturally best achieved aurally (and for that reason we encourage the reader to spend time listening to the samples produced and provided alongside this document); however, we also see value in visualising and qualifying them here.

Velocity Analysis Method
We work from the same pool of samples introduced in The visual continuity of the graphs has been chosen for aesthetic reasons: to visualise more easily how variance (the thickness of the coloured area) compares between neighbouring steps, and to match intuition around the continuity of music and how the intensity of sound naturally decays over time. It does not reflect actual values in the raw dataset: the values plotted in the velocity distributions exist discretely at 1/16th timesteps, and these discrete values are joined into a continuous curve by fitting a spline of second-order polynomials.
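A minimal sketch of how such a curve could be produced, assuming each bar is available as a list of velocities with one entry per 1/16th step. The function names, the zero initial slope of the spline, and the toy input are illustrative assumptions, not the implementation used for the figures. The per-step mean and spread are computed across bars, and the discrete means are then joined by a piecewise-quadratic spline.

```python
from statistics import mean, pstdev


def per_step_velocity_stats(bars):
    """Return (means, spreads) of velocity at each step across all bars."""
    steps = list(zip(*bars))  # regroup: one tuple of velocities per step
    return [mean(s) for s in steps], [pstdev(s) for s in steps]


def quadratic_spline(xs, ys, initial_slope=0.0):
    """Build a piecewise-quadratic curve passing through every (x, y) point.

    Slopes are matched at the knots so the curve has a continuous first
    derivative. The initial slope is a free choice (assumed 0 here).
    """
    slopes = [initial_slope]
    for i in range(len(xs) - 1):
        h = xs[i + 1] - xs[i]
        slopes.append(2 * (ys[i + 1] - ys[i]) / h - slopes[-1])

    def evaluate(x):
        # Locate the segment containing x (last segment if x is beyond it).
        i = len(xs) - 2
        for k in range(len(xs) - 1):
            if x <= xs[k + 1]:
                i = k
                break
        h = xs[i + 1] - xs[i]
        dx = x - xs[i]
        return ys[i] + slopes[i] * dx + (slopes[i + 1] - slopes[i]) / (2 * h) * dx * dx

    return evaluate


# Hypothetical toy input: two bars of four steps each (real bars have 16).
toy_bars = [[100, 0, 60, 0], [80, 0, 60, 0]]
means, spreads = per_step_velocity_stats(toy_bars)
curve = quadratic_spline(list(range(len(means))), means)
print(means, spreads, curve(0.5))
```

The spread at each step is what would be drawn as the thickness of the coloured area; the spline only affects how the discrete means are joined visually.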
The nature of the music as heard is one of a steady dance beat with a pulsing kick on the 0 and half step, with rides of uniform intensity on the 1/8ths (present throughout the entire sample, hence the consistent peaks at 1/8ths). Each bar is carried by two quadruplets of snares, each of reducing intensity: the first is intertwined at odd time steps (3,5,7,9)/16ths, creating the characteristic syncopation associated with afrobeat; the second begins hard at 12/16ths (notice the higher peak) and continues for 4 consecutive 1/16ths until 15/16ths. The syncopation of the first quadruplet and subsequent snare hits populate otherwise vacant areas in our plot/sequence.
This description is an attempt to aggregate and describe the essence of the music over its measures, but naturally each bar in the sample varies; this variance is captured in the thickness of the coloured area at a given time step. It is not expected that the reader can deduce this textual interpretation of the rhythm from the graph alone. What is intended here is that the reader can quickly visualise some key aspects of the music. For example, in our afrobeat sample certain key characteristics are present: the syncopation is evident in the wavy, up-and-down nature of the curve; the intensity peaks are on the 0 and 12 (typical in many West African and Latin musical traditions); and there is a large(ish) variance in intensity over the whole bar (indicative of variation across the piece). These aspects differ across styles and traditions. The idea of these visualisations is to provide a method of identifying whether our model has captured or preserved these characteristics.

Continuations
Regarding our samples created by continuation, our goal is to (1) understand to what extent the model has learnt the input rhythm and maintained the structure effectively and (2) understand to what extent the model has added its own flavour/interpretation to the input rhythms, creating new rhythms of its own.
We will exhibit three pairs of samples broadly indicative of the continuation dataset as a whole. where the model lost some aspect of rhythmic musicality that would give it away as being machine made (for example losing time, missing a beat or unusual velocity progressions); the same cannot be said for the samples produced by the generation method, which we discuss in the next section.
The reason for this is evidently the model's ability to mimic the input pattern in the long term. The continuations, though musically impressive, do not differ much (if at all) from the samples which they succeed. More complex rhythms such as afrocuban or latin feature less "improvisation" in the continuation than simpler ones such as dance, rock or punk. This becomes obvious when we look at the velocity distributions.  one not present in the rhythm so far) is forced to essentially zero in the distribution from which the next token is sampled; this makes intuitive sense, since the input structure is a lot less common and the model has learnt fewer paths out of it. This trade-off between originality and maintaining the original structure is controlled (to the extent that it can be) by the temperature parameter (2.4), an increase of which makes improvisation more likely (0.92 was used for this experiment).
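The effect of temperature on an unlikely token can be illustrated with a temperature-scaled softmax. This is a generic sketch, not the model's sampling code: the logits are hypothetical, with the last entry standing in for a token not yet present in the rhythm, and the function name is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical next-token logits: a dominant pattern token, an occasional
# variation, and a rarely seen token (the "improvisation").
logits = [6.0, 2.0, 0.0]
for temperature in (0.5, 0.92, 1.5):
    probabilities = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probabilities])
```

Raising the temperature flattens the distribution, so the rare token receives a larger share of the probability mass; lowering it concentrates the mass on the dominant token and so favours reproduction of the input.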

Generations
With no genre annotations our goal here is to see The Good Samples, however, do make up a majority of all those that were generated; together with the Ugly Samples, they make up at least 75% of all generations.
The Good are defined as such because, by our own judgement, they are musically decent, consistent (they keep and remain in time), occasionally exciting, maintain long-term structure (over 8 or 16-bar loops) and could reasonably pass as human generated. However, there is not much variation in style across the samples: largely they tend to be variations around rock, soul or dance beats, with more complex rhythmic patterns such as those found in latin or afrocuban not appearing to any measurable degree. This last point is unsurprising given the distribution across genres in the Groove dataset we trained on (see Table 2).
The Bad are exactly that: they are poorly timed, the velocity is monotonous, accents are incorrectly placed and they have little musical merit. Often these samples were found to be the result of the model getting stuck in a bad loop, which then feeds back into the model by updating the internal state for the next generation, creating a poor structure in the long term.
The Ugly are interesting and make up a non-negligible part of our generations.
These are samples deemed to exhibit some degree of musicality, but which a trained ear could identify as not having been played by musicians. For example, they keep bad time, or the periodicity of some of the sub-rhythms does not match what is customary, expected or consonant. It is possible that these samples could fool a listener with no interest or experience in music into believing they were made by a human, or feasibly that they were played by an inexperienced drummer; this is an important point to bear in mind, given that the listeners in our listening tests did not necessarily have experience in music. However in our bad and ugly plots (fig. 16 & 17), where the machine did not play in time and velocities were more monotonous, we see a much more even distribution of velocities across the bar, with large variance. It is quite obvious from the plots which of the three is more musical and which is closer to random noise. It would be interesting in future work to incorporate this velocity information into the model generation, so as to prevent or dissuade the model from pursuing undesirable forms at generation time.
What follows are some interesting reflections worth touching on in conclusion.

Data Representation
The representation of the Groove MIDI Dataset, specifically the unique choices taken for the tokenization, was a major contribution of this paper.

Quality of the Results
The generations are varied in quality and limited in genre. It has also not been shown that the model adds any significant layer of improvisation to the existing samples in the raw dataset. We argue that this is not a negative point and that reproducing the input is impressive, since it demonstrates an ability to learn in the long term, something identified as difficult or expensive in previous algorithms (generally samples were consistent over periods of minutes).
The genre distribution of our output samples reflects the distribution in our raw dataset; this is expected, albeit slightly disappointing (as some of the more rhythmically interesting genres were less common). A fine-tuning technique such as that proposed in [32] could aid in controlling these distributions: for example, we could train on a larger, more general dataset in future and then fine-tune our model on specific genres to produce models that are experts in those genres. In any case, our generations in each genre were subjectively and analytically comparable to the samples of the same genre in the original dataset.
Ultimately, upon listening to the samples individually and reviewing the results of the listening experiment, it is not presumptuous to conclude that, on balance, the ability of our model to generate is decent. On the time scale observed, our generations outperform any state of the art we have seen to date for this type of task (admittedly there have not been many).

Model Parameters
The selection of model and generation parameters has a huge impact on the quality and character of results. Some observations follow.
A lower memory length in the generations from scratch helped the model avoid getting stuck in bad loops (i.e. musically undesirable loops). This is presumably because the model does not feed back into itself as much as with a longer memory length, and hence does not internalize its bad learnings.
Top K is to be tuned relative to the number of tokens and dictates to some extent how much improvisation the model is allowed to do. Temperature also balanced this trade-off and was useful in defining the model's ability to find its way out of undesirable loops. As noted in [32], lowering the temperature prevented the generations from getting stuck in loops (both desirable and undesirable), though lowering it beyond a certain degree sacrificed musical quality; this is the trade-off here.
A high enough prime length in continuation ensured a reliable reproduction of the input, but this comes at the cost of less experimentation. This balance was found subjectively on a handful of samples and applied to the whole dataset. There could perhaps be more effective ways of doing this on a per-sample basis, based on the evaluation methods in the previous section.
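To make the role of Top K concrete, the following is a generic sketch of top-k sampling combined with temperature. It is not the generation code used in this work: the function name, the logits and the value of k are hypothetical. Only the k highest-scoring tokens remain eligible, so a smaller k relative to the vocabulary leaves less room for improvisation.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token index from the k highest logits after temperature scaling."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of tokens")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Indices of the k largest logits; every other token is excluded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    largest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - largest) for value in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical logits over a five-token vocabulary.
example_logits = [3.0, 1.0, 0.5, -2.0, -4.0]
generator = random.Random(0)
draws = [sample_top_k(example_logits, k=2, temperature=0.92, rng=generator)
         for _ in range(10)]
print(draws)  # only indices 0 and 1 can ever appear with k=2
```

With k=1 this reduces to greedy decoding, which always picks the most likely token; larger k combined with a higher temperature gives the sampler more ways to leave a loop, at the risk of less coherent output.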

Listening Tests
Finally and most importantly, we reflect on the results of the listening tests. Given the statistical uncertainty on the results in Figure 7, it is impossible to conclude that the model performed better for a specific genre or task. However, we can conclude that our model was consistently able to convince listeners that it was human, and that this feat has not necessarily been achieved on all generation tasks on this dataset to date. Listening to the generated samples corroborates these results in both the short and long term, an achievement that we present for the first time in this domain.

Conclusions and Future Work
We have presented a successful attempt at the statistical modelling of musical rhythm for the purposes of automated rhythm generation in the long term (minute scales) using transformer neural networks. We present, for the first time in this domain, generations of musical quality comparable to human drummers, both in musical character and in how they are perceived. In doing so, we hope to have offered an exciting basis for the future development of percussion-specific automated generation techniques.
There is always more work to do in such a quickly developing field. We hope to have outlined some of the points of improvement in this work throughout the document, but would like to draw attention now to some of the more important lines of investigation for the near future.
• Pre-training on a larger dataset and fine-tuning on groove genres for genre-specific generations and better model quality. Given that transformers usually find their greatest success on very large datasets, this is very likely to improve the quality of the model. The Lakh MIDI dataset would be a good candidate for this [37].
• A deeper investigation into micro-timings and how the model captures and generates these.
• Incorporating our evaluation methods at generation time to ensure high-quality output.
• Implementing the ability to specify beforehand how many bars or beats of a generation are required.
For now though we thank the reader for their attention and welcome any future insight, improvement or feedback on the methods presented in this document.