grains

Constellations

A gif showing the monome grid blinking lights next to a norns running an early version of Constellations. A Make Noise Strega is in the background.

Constellations is a script for the norns sound computer that turns an external audio signal into a playable note palette on a monome grid, as an aid in jam sessions.

👉 The script will be renamed before the release, as the current name is already taken by a stellar constellations script by toomanatees.

I started working on it at the habitus workshop in 2023 and then put the research on pause for some time. At the moment it works reliably with a surgically clean monophonic source, but it is not fun to use otherwise.

Below are my notes from the previous research sessions and I hope to pick it up again soon. Thanks to @duellingants and Zack Scholl for the ideas and support.

Current focus #

My goal is to find a real-time tracking algorithm that supports polyphony. I’d like to spend a few evenings exploring and collecting all the links here, and then implement and compare a few of the approaches.

The MVP would be decent quality on a guitar recording, robust enough to survive noise and reverb. A successful result would be playing along to Roygbiv by Boards of Canada.

Since I have already found mentions of a few fancy neural nets, there’s a temptation to start with those, but I will most likely hit a wall (they are either slow or will max out the Raspberry Pi’s resources).

I’ve set aside any expectations of improving the quality of my Lua code for now, as well as the grid UI. It already works well enough to plug in working SuperCollider code.

I also won’t do anything to connect the grid buttons to MIDI or to an internal sound engine – that’s a distraction, and I know it can be done (simply by copying Plonky).

I should consider reimplementing this in Pure Data to iterate faster; there could be readily available Pd patches.

Random notes #

Research #

Session 1 #

Fed up with the research, decided to start doing something.

Playing the vanilla Crone implementation, with a Digitakt as a synth and a simple rhythmic pattern.

I can’t correlate the sound I’m hearing with the buttons on the grid. The pitch detection is late and peaks are missing. From the Pitch UGen description:

The pitch follower executes periodically at the rate specified by execFreq in cps. execFreq is clipped to be between minFreq and maxFreq.

It doesn't help that there’s no visible rhythmic pulsation on the grid. When the audio content has a sharp transient (e.g. a kick drum), a visual representation of the amplitude could help correlate the sound with the grid.

In general, I see more LEDs light up than there are notes in the audio input.


The vanilla Crone uses the standard Pitch UGen from SuperCollider. The snippet below is taken from the norns GitHub repo:

SynthDef.new(\pitch, {
    arg in, out,
    initFreq = 440.0, minFreq = 30.0, maxFreq = 10000.0,
    execFreq = 50.0, maxBinsPerOctave = 16, median = 1,
    ampThreshold = 0.01, peakThreshold = 0.5, downSample = 2, clar = 0;
    // Pitch ugen outputs an array of two values:
    // first value is pitch, second is a clarity value in [0,1]
    // if 'clar' argument is 0 (default) then clarity output is binary

    var pc = Pitch.kr(In.ar(in),
        initFreq, minFreq, maxFreq,
        execFreq, maxBinsPerOctave, median,
        ampThreshold, peakThreshold, downSample, clar
    );

    //pc.poll;
    Out.kr(out, pc);
})

From the Pitch UGen documentation:

This is a better pitch follower than ZeroCrossing, but more costly of CPU. For most purposes the default settings can be used and only in needs to be supplied. Pitch returns two values (via an Array of OutputProxys, see the OutputProxy help file), a freq which is the pitch estimate and hasFreq, which tells whether a pitch was found. Some vowels are still problematic, for instance a wide open mouth sound somewhere between a low pitched short 'a' sound as in 'sat', and long 'i' sound as in 'fire', contains enough overtone energy to confuse the algorithm.

Also lots of useful info in the discussion section:

The pitch follower executes periodically at the rate specified by execFreq [50.0 in Norns] in cps. execFreq is clipped to be between minFreq and maxFreq [between 30 and 10000 in Norns, so it probes at 50Hz, which might affect how quickly pitch changes are tracked]. First it detects whether the input peak to peak amplitude is above the ampThreshold [0.01]. If it is not then no pitch estimation is performed, hasFreq is set to zero and freq is held at its previous value. It performs an autocorrelation on the input and looks for the first peak after the peak around the lag of zero that is above peakThreshold [0.5 in Norns] times the amplitude of the peak at lag zero.

If the clar argument is greater than zero (it is zero by default) then hasFreq is given additional detail. Rather than simply being 1 when a pitch is detected, it is a "clarity" measure in the range between zero and one. (Technically, it's the height of the autocorrelation peak normalised by the height of the zero-lag peak.) It therefore gives a kind of measure of "purity" of the pitched signal.

Using a peakThreshold of one half [exactly that in Norns, i.e. 0.5] does a pretty good job of eliminating overtones, and finding the first peak above that threshold rather than the absolute maximum peak does a good job of eliminating estimates that are actually multiple periods of the wave.
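That peak-picking rule (first autocorrelation peak above peakThreshold times the lag-zero value, rather than the absolute maximum) is easy to sketch. A toy version in Python, with a made-up ACF array — not the UGen's actual code:

```python
def first_peak_lag(acf, peak_threshold=0.5):
    """Return the lag of the first local ACF peak above
    peak_threshold * acf[0], skipping the peak around lag zero."""
    threshold = peak_threshold * acf[0]
    lag = 1
    # walk down the descending slope of the lag-zero peak
    while lag < len(acf) - 1 and acf[lag] > acf[lag + 1]:
        lag += 1
    for k in range(lag, len(acf) - 1):
        if acf[k] > threshold and acf[k] >= acf[k - 1] and acf[k] >= acf[k + 1]:
            return k
    return None

# toy ACF: lag-zero peak, an overtone bump at lag 5, the true period at lag 10
acf = [1.0, 0.6, 0.2, 0.1, 0.2, 0.4, 0.2, 0.1, 0.3, 0.7, 0.9, 0.7, 0.3]
print(first_peak_lag(acf))  # 10: the lag-5 bump (0.4) is below 0.5 * acf[0]
```

With a lower threshold of 0.3 the same function would latch onto the overtone bump at lag 5, which is exactly the kind of octave error the 0.5 default is meant to suppress.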

The autocorrelation is done coarsely at first using a maximum of maxBinsPerOctave [that is 16 in Norns] lags until the peak is located. Then a fine resolution search is performed until the peak is found. (Note that maxBinsPerOctave does NOT affect the final pitch resolution; a fine resolution search is always performed. Setting maxBinsPerOctave larger will cause the coarse search to take longer, and setting it smaller will cause the fine search to take longer.)

The three values around the peak are used to find a fractional lag value for the pitch. If the pitch frequency is higher than maxFreq [10000], or if no peak is found above minFreq [30], then hasFreq is set to zero and freq is held at its previous value.

It is possible to put a median filter of length median [that’s 1 in Norns] on the output estimation so that outliers and jitter can be eliminated. This will however add latency to the pitch estimation for new pitches [is it where the latency is coming from?], because the median filter will have to become half filled with new values before the new one becomes the median value. If median is set to one then that is equivalent to no filter, which is the default.

When an in range [is there a missing word, or is it an “in-range”, i.e. between min and max?] peak is found, it is inserted into the median filter, a new pitch is read out of the median filter and output as freq, and hasFreq is set to one.

It is possible to down sample the input signal by an integer factor downSample [2 in Norns] in order to reduce CPU overhead. This will also reduce the pitch resolution.
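The resolution cost of downsampling is easy to see if you ignore the fractional-lag interpolation mentioned above: a lag-based estimate is sample_rate / lag, so halving the effective sample rate roughly doubles the frequency step between neighbouring lags. A quick check in Python (lag_resolution is a name I made up for this sketch):

```python
def lag_resolution(sample_rate, freq):
    """Hz difference between the lag nearest to freq and its neighbour,
    i.e. the coarseness of a pure integer-lag pitch estimate."""
    lag = round(sample_rate / freq)
    return abs(sample_rate / lag - sample_rate / (lag + 1))

print(lag_resolution(48000, 440))  # full rate
print(lag_resolution(24000, 440))  # downSample = 2: roughly twice as coarse
```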

Until Pitch finds a pitch for the first time, it will output initFreq.

None of these settings are time variable.

The median filter is interesting. It removes jitter at the cost of latency: the filter has to hold enough new frames before it switches to a new frequency.

I thought it might have introduced the latency, but both in Norns and in SC the value is 1. A median of length 1 means only one measurement is used, so effectively the median filter is off and it always outputs the currently detected value.
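A toy median filter makes the trade-off concrete (Python sketch, not the UGen's code): with a window of 5, a pitch jump only shows up in the output once the window is half-filled with new values, and a window of 1 passes everything through unchanged.

```python
from collections import deque
from statistics import median

def median_filter(values, length):
    """Running median over a sliding window of the given length."""
    window = deque(maxlen=length)
    out = []
    for v in values:
        window.append(v)
        out.append(median(window))
    return out

# pitch jumps from 220 to 440; with length 5 the output only
# switches once three of the five window slots hold new values
est = [220, 220, 220, 220, 440, 440, 440, 440]
print(median_filter(est, 5))
print(median_filter(est, 1))  # length 1 == no filtering, as on norns
```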

It was unclear to me which peak is selected: is it the highest peak? Judging from the source code, it goes for the very first detectable peak and then applies some heuristics to find the right one.

I couldn't make sense of the code quickly enough and asked an LLM to summarize the algorithm. Surprisingly, it helped me understand that the peak in question is a peak in the autocorrelation of the buffered input. So there’s a buffer of the input audio, and it has to be loud enough to even start detection.

This algorithm does autocorrelation instead of going into the frequency domain, i.e. there’s no FFT and no histogram of frequency peaks. The “peak” in the algorithm refers to a peak in the autocorrelation function, not a spectral peak.


SuperCollider offers alternatives in the Pitch Analysis section: