Constellations

Constellations is a script for the norns sound computer that turns an external audio signal into a playable note palette on a monome grid, as an aid in jam sessions.
👉 The script will be renamed before the release, as the current name is already taken by a stellar constellations script by toomanatees.
I started working on it at the habitus workshop in 2023 and put the research on pause for some time. At the moment it can reliably work with a surgically clean monophonic source but it is not fun to use otherwise.
Below are my notes from the previous research sessions and I hope to pick it up again soon. Thanks to @duellingants and Zack Scholl for the ideas and support.
Current focus #
My goal is to find a real-time tracking algorithm that supports polyphony. I’d like to spend a few evenings exploring and collecting all the links here, and then implement and compare a few of the approaches.
The MVP would be decent quality on a guitar recording, robust enough to survive noise and reverb. The successful result would be being able to play along to Roygbiv by Boards of Canada.
Since I already found mentions of a few fancy neural nets, there’s a temptation to start with those, but I will most likely hit a wall (they’re either too slow or will max out the Raspberry Pi’s resources).
I set aside any expectations of improving the quality of my Lua code for now, as well as the grid UI. It already works well enough to plug in working SuperCollider code.
I also won’t do anything to connect the grid buttons to MIDI or to an internal sound engine – that’s a distraction, and I know it can be done (simply by copying Plonky).
I should consider reimplementing this with Pure Data to iterate faster. There could be readily available PD patches:
- Pitch detection object
- sigmund~ pitch tracking
- helmholtz~
- Detecting chords with adc~
- PD polyphonic pitch tracking with Python ML model
Random notes #
- Grid app. Takes input, detects pitches per voice. Lays them out on the keyboard as a palette.
- How to work with time? When detected - the brightest. Slowly fade away. Follow amplitude.
- Suggestions from Zack
- Use SuperCollider to separate FFT sub-bands
- Dan Tapfer’s SuperCollider experiments
- Use band filters for rough detection of notes within bands (3-8 bands)
- Use 127 filters to detect exactly each MIDI note
- Detect spikes on the spectrogram and track them
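The “one filter per MIDI note” suggestion above can be roughly approximated with a single FFT instead of 127 filters: assign each bin’s magnitude to the nearest MIDI note. A Python sketch of that idea (not part of the script; the function names and frame size are mine, chosen for illustration):

```python
import numpy as np

def midi_to_hz(m):
    """Center frequency of MIDI note m (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((m - 69) / 12)

def note_energies(frame, sr):
    """Crude per-MIDI-note energy: FFT the frame, then pile each
    bin's magnitude onto the nearest MIDI note. This is the cheap
    stand-in for a real 127-band filter bank."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    energies = np.zeros(128)
    for f, mag in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        m = int(round(69 + 12 * np.log2(f / 440.0)))
        if 0 <= m < 128:
            energies[m] += mag
    return energies
```

With a 4096-sample frame at 48 kHz, a pure 440 Hz tone lands its energy on note 69, which is the behaviour the grid palette would key off.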
- Ideas
- Notes that finished sounding should remain on the grid, flickering, barely visible
- Space cluster to the right:
- Memorize “constellations” – save and recall palettes
- Freeze current palette – ignore new notes
- Detect voices and switch between them
- Port the Plonky arpeggiator
- Blink the notes in the order they were first heard
- A “stargaze” mode – constellation looper, random or in the order the notes were recorded
- Generative mode similar to Dan Tapfer
- Listen to MIDI instead of audio (easy and precise)
- Scale estimation on the norns screen
- A mode where brightness shows notes that have higher probability
- This would eliminate the need for a better algorithm and won’t take too much compute
- Estimate BPM as in Shazam SDK
- A “human echo” patch is possible:
- Route the MIDI in and out to the same synth
- Smash the notes that show up
- Should sound like a delay, but more organic
- Perform BPM detection similar to HAPTIK (uses Shazam SDK)
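The “brightest when detected, slowly fade away” behaviour from the notes above could be a simple per-tick decay on LED levels. A hypothetical Python sketch (monome grids use brightness levels 0-15; the function name, decay factor, and floor value are my own assumptions, not the script’s):

```python
def fade_palette(brightness, detected, decay=0.85, floor=1):
    """One grid refresh tick: notes detected this frame jump to full
    brightness (15), everything else decays toward a barely visible
    floor so old notes keep flickering at the edge of visibility.
    `brightness` maps MIDI note -> LED level 0..15."""
    out = {}
    for note, level in brightness.items():
        out[note] = max(floor, int(level * decay))
    for note in detected:
        out[note] = 15
    return out
```

Calling this on every detection poll gives the fade; following amplitude instead would just mean scaling the 15 by the current envelope.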
Research #
- *A Smarter Way to Find Pitch* by Philip McLeod and Geoff Wyvill, and https://github.com/sevagh/pitch-detection/tree/master/misc/mcleod
- Different methods in Max MSP: https://www.youtube.com/watch?v=cO2LOIjMphE
- PureData patch: https://www.youtube.com/watch?v=GwEdOo7iPuA&t=1s
- With some code https://github.com/jaylmiller/polyphonic_track
- https://github.com/spotify/basic-pitch from Spotify (includes pitch bend)
- https://engineering.atspotify.com/2022/06/meet-basic-pitch
- Can’t be used in real time because of the required long frame length
- Summary from NeuralNote (built on top of basic-pitch): “Unfortunately no, and this for a few reasons:
- Basic Pitch uses the Constant-Q transform (CQT) as input feature. The CQT requires really long audio chunks (> 1 s) to get amplitudes for the lowest frequency bins. This makes the latency too high to have real-time transcription.
- The basic pitch CNN has an additional latency of approximately 120 ms.
- The note events creation algorithm processes the posteriorgrams backward (from future to past) and is hence non-causal.”
- Frame-level multipitch estimation (MPE) vs note estimation
- Counter-intuitively, notes cannot simply be inferred from MPE. MPE preserves vibrato and deviations from a base pitch and should not always be quantized to the nearest semitone.
- When using MPE you might want to keep tracking the pitch changes after a note is first detected
- People used transformers, but they are computationally expensive
- Automatic Music Transcription: An Overview
- Signal Processing Methods for Music Transcription
- Wave2Midi2Wave – from audio to MIDI and back
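The quantization caveat above is easy to see if a pitch estimate is split into a MIDI note plus a deviation in cents: the deviation is exactly what naive rounding throws away. A small Python illustration (the function is mine, just to make the point concrete):

```python
import math

def freq_to_midi_cents(freq):
    """Split a pitch estimate (Hz) into the nearest MIDI note plus
    the deviation in cents, so vibrato and bends survive instead of
    being rounded away by note quantization."""
    midi_float = 69 + 12 * math.log2(freq / 440.0)
    note = round(midi_float)
    cents = 100 * (midi_float - note)
    return note, cents
```

A vibrato wobbling around 452 Hz still reads as note 69 (A4), but the ~+47 cents of deviation is information an MPE-aware tracker would want to keep.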
- https://arxiv.org/pdf/2203.09893.pdf
- Resource usage is high (951 MB peak memory, though for a large file): “We find that both methods are comparable in estimated overhead, with NMP using 490 MB peak memory and taking 7 s and MI-AMT using 561 MB and taking 10 s; however on the long file, NMP substantially outperforms MI-AMT, using only 951 MB peak memory and taking 24 s, while MI-AMT used 3.3 GB and took 96 s. It’s interesting to note that the peak memory of the instrument-specific models is even higher, with OF using 5.4 GB and Vocano using 8.5 GB.”
- Datasets
- Post-processing ~fiddle: real-time multi-pitch tracking using harmonic partial subtraction
- Second fiddle is also important: tracking pitches across voices
- Time-to-Frequency transformation algorithms
- FFT
- Constant Q transform
- CWT (continuous wavelet transform)
- CCWT
- fcWT (Fast continuous wavelet transform)
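The practical difference between the FFT and the constant-Q transform in the list above: CQT bins are spaced geometrically, so every octave gets the same number of bins (which is also why the low bins need long windows, as the NeuralNote note earlier points out). A minimal sketch of the CQT centre-frequency ladder:

```python
def cqt_frequencies(f_min, bins_per_octave, n_bins):
    """Center frequencies of a constant-Q transform: geometrically
    spaced as f_k = f_min * 2**(k / bins_per_octave), unlike the
    linearly spaced bins of an FFT."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]
```

With 12 bins per octave starting at 55 Hz, bin 12 lands exactly on 110 Hz and bin 24 on 220 Hz, i.e. one bin per semitone, which is what makes CQT-style features attractive for note detection.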
- https://github.com/corbanbrook/spectrotune and https://github.com/stc/PolyTune. Spectrotune is a Processing application which scans a polyphonic audio source (in wav, mp3, etc. formats), performs pitch detection and outputs to MIDI. Spectrotune offers adjustable options to help improve pitch detection, including:
- Pitch Class Profiling (PCP)
- FFT Bin Distance Weighting
- FFT Windowing - rectangular, hamming, hann, triangular, cosine, and blackman windows.
- FFT Linear Equalization - attenuate low frequencies and amplify high frequencies.
- Harmonic Filter - filters peak harmonics.
- Noise Smoothing - rectangle, triangle, and adjacent average smoothers.
- Parabolic Peak Interpolation.
- Adjustable Peak Threshold.
- Octave toggles - narrow the spectrum to the octaves you are interested in recording.
- MIDI octave channel segmenting - route each octave to its own MIDI channel.
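Of the Spectrotune options above, parabolic peak interpolation is the easiest to show in isolation: fit a parabola through an FFT peak bin and its two neighbours to get a fractional-bin frequency estimate. A sketch of the standard three-point formula in Python (function name is mine; this is the generic technique, not Spectrotune’s code):

```python
def parabolic_interp(mags, i):
    """Refine a spectral peak at bin i by fitting a parabola through
    the bin and its two neighbours. Returns the fractional bin offset
    (in -0.5..0.5) and the interpolated peak height."""
    a, b, c = mags[i - 1], mags[i], mags[i + 1]
    denom = a - 2 * b + c
    if denom == 0:
        return 0.0, b  # flat top: nothing to refine
    offset = 0.5 * (a - c) / denom
    height = b - 0.25 * (a - c) * offset
    return offset, height
```

The refined frequency is then `(i + offset) * sr / fft_size`, which matters because raw FFT bins are far coarser than a semitone at low frequencies.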
- https://github.com/aubio/aubio
- https://essentia.upf.edu/reference/streaming_ChordsDetection.html
- HPCP: harmonic pitch class profiles
- https://github.com/musicmichaelc/PolyPitch
- This is an SC plug-in for multiple fundamental frequency tracking, after Anssi Klapuri’s 2008 paper “Multipitch analysis of polyphonic music and speech signals using an auditory model”
Session 1 #
Fed up with the research, decided to start doing something.
Playing the vanilla Crone implementation, with Digitakt as a synth and simple rhythmic pattern.
I can’t correlate the sound I’m hearing to the buttons on the grid. The pitch detection is late and peaks are missing. From the Pitch UGen description:
The pitch follower executes periodically at the rate specified by execFreq in cps. execFreq is clipped to be between minFreq and maxFreq.
It doesn’t help that there’s no visible rhythmic pulsation on the grid. When the audio content has a sharp transient (i.e. a kick drum), a visual representation of the amplitude could help correlate the sound with the grid.
In general, I see more LEDs light up than there are notes in the audio input.
The vanilla Crone uses the standard Pitch UGen from SuperCollider. The snippet below is taken from the norns GitHub:
SynthDef.new(\pitch, {
    arg in, out,
    initFreq = 440.0, minFreq = 30.0, maxFreq = 10000.0,
    execFreq = 50.0, maxBinsPerOctave = 16, median = 1,
    ampThreshold = 0.01, peakThreshold = 0.5, downSample = 2, clar = 0;
    // Pitch ugen outputs an array of two values:
    // first value is pitch, second is a clarity value in [0, 1]
    // if 'clar' argument is 0 (default) then clarity output is binary
    var pc = Pitch.kr(In.ar(in),
        initFreq, minFreq, maxFreq,
        execFreq, maxBinsPerOctave, median,
        ampThreshold, peakThreshold, downSample, clar
    );
    // pc.poll;
    Out.kr(out, pc);
})
From the Pitch UGen documentation:
This is a better pitch follower than ZeroCrossing, but more costly of CPU. For most purposes the default settings can be used and only in needs to be supplied. Pitch returns two values (via an Array of OutputProxys, see the OutputProxy help file), a freq which is the pitch estimate and hasFreq, which tells whether a pitch was found. Some vowels are still problematic, for instance a wide open mouth sound somewhere between a low pitched short 'a' sound as in 'sat', and long 'i' sound as in 'fire', contains enough overtone energy to confuse the algorithm.
Also lots of useful info in the discussion section:
The pitch follower executes periodically at the rate specified by execFreq [50.0 in Norns] in cps. execFreq is clipped to be between minFreq and maxFreq [between 30 and 1000 in Norns, so that means it probes at 50 Hz]. First it detects whether the input peak to peak amplitude is above the ampThreshold [0.01]. If it is not then no pitch estimation is performed, hasFreq is set to zero and freq is held at its previous value. It performs an autocorrelation on the input and looks for the first peak after the peak around the lag of zero that is above peakThreshold [0.5 in Norns] times the amplitude of the peak at lag zero.
If the clar argument is greater than zero (it is zero by default) then hasFreq is given additional detail. Rather than simply being 1 when a pitch is detected, it is a "clarity" measure in the range between zero and one. (Technically, it's the height of the autocorrelation peak normalised by the height of the zero-lag peak.) It therefore gives a kind of measure of "purity" of the pitched signal.
Using a peakThreshold of one half [exactly that in Norns, i.e. 0.5] does a pretty good job of eliminating overtones, and finding the first peak above that threshold rather than the absolute maximum peak does a good job of eliminating estimates that are actually multiple periods of the wave.
The autocorrelation is done coarsely at first using a maximum of maxBinsPerOctave [16 in Norns] lags until the peak is located. Then a fine resolution search is performed until the peak is found. (Note that maxBinsPerOctave does NOT affect the final pitch resolution; a fine resolution search is always performed. Setting maxBinsPerOctave larger will cause the coarse search to take longer, and setting it smaller will cause the fine search to take longer.) The three values around the peak are used to find a fractional lag value for the pitch. If the pitch frequency is higher than maxFreq [1000], or if no peak is found above minFreq [30], then hasFreq is set to zero and freq is held at its previous value.
It is possible to put a median filter of length median [that’s 1 in Norns] on the output estimation so that outliers and jitter can be eliminated. This will however add latency to the pitch estimation for new pitches [is this where the latency is coming from?], because the median filter will have to become half filled with new values before the new one becomes the median value. If median is set to one then that is equivalent to no filter, which is the default. When an in range [is there a missing word, or is it an “in-range” peak, i.e. between min and max?] peak is found, it is inserted into the median filter, a new pitch is read out of the median filter and output as freq, and hasFreq is set to one.
It is possible to down sample the input signal by an integer factor downSample [2 in Norns] in order to reduce CPU overhead. This will also reduce the pitch resolution. Until Pitch finds a pitch for the first time, it will output initFreq. None of these settings are time variable.
The median filter is interesting. It removes jitter, at the cost of holding the old median value for several frames before switching to a new frequency.
I thought it might have introduced the latency, but both in Norns and in SC the value is 1. I think 1 means that only one measure is taken to calculate median, so effectively the median filter is off, and it always outputs the current detected value.
It is unclear to me which peak is selected – is it the highest peak? Judging from the source code, it goes for the very first detectable peak and then applies some heuristics to find the right one.
I couldn't make sense of the code quickly enough and asked an LLM to summarize the algorithm. Surprisingly, it helped me understand that the peak in question is the amplitude peak in the buffer. So there’s a buffer of the input audio, and it has to be loud enough to even start detection.
This algorithm does autocorrelation instead of going into the frequency domain, i.e. there’s no FFT, and there’s no histogram of frequency peaks. The peak in the algorithm refers to an amplitude peak in the source audio.
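As a sanity check of that reading, here is a toy Python version of the same time-domain logic: gate on peak-to-peak amplitude, autocorrelate, then take the first autocorrelation peak above peakThreshold times the zero-lag value. This is my simplified reconstruction of the description above, not the UGen’s actual code (no coarse/fine search, no median filter, no fractional lag):

```python
import numpy as np

def pitch_acf(frame, sr, amp_threshold=0.01, peak_threshold=0.5):
    """Toy Pitch-UGen-style estimator: amplitude gate, autocorrelation,
    first ACF peak above peak_threshold * zero-lag value."""
    if frame.max() - frame.min() < amp_threshold:
        return None  # too quiet: no estimation (hasFreq would be 0)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    limit = peak_threshold * acf[0]
    # walk down the slope of the zero-lag peak first
    lag = 1
    while lag < len(acf) - 1 and acf[lag] > acf[lag + 1]:
        lag += 1
    # then take the first local maximum that clears the threshold
    for k in range(lag, len(acf) - 1):
        if acf[k] >= limit and acf[k] >= acf[k - 1] and acf[k] > acf[k + 1]:
            return sr / k
    return None
```

On a 220 Hz sine at 44.1 kHz this lands on the ~200-sample lag, i.e. about 220 Hz, and silence returns nothing, which matches the amplitude-gate behaviour described in the docs.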
SuperCollider offers alternatives in the Pitch Analysis section:
- https://doc.sccode.org/Classes/Qitch.html
In technical terms, this UGen calculates an FFT, applying Brown and Puckette’s efficient constant-Q transform on a quartertone scale, base note F3 = 174.6 Hz. Cross-correlation search leads to the best match for a harmonic spectrum grid with falling amplitude components. A further fine tuning takes place based on instantaneous frequency estimation (rate of change of phase) for the winning FFT bin.
- https://doc.sccode.org/Classes/Tartini.html
This alternative pitch follower uses autocorrelation like Pitch, but with an adapted method, and calculated via FFT. There are some parameters for you to choose the window size and other aspects of the calculation, but a user who doesn’t want to worry too much about this kind of stuff can just use the defaults. In technical terms, this UGen calculates a modified autocorrelation function following the method used in the Tartini open source (GNU GPL) pitch following software (http://miracle.otago.ac.nz/postgrads/tartini/).
- https://doc.sccode.org/Classes/ZeroCrossing.html