Making Sound Pictures to Identify Bird Songs

Top: Audio signal with three chirps. Bottom: Time-Frequency spectrogram of signal.

A trained musician can look at a musical score and imagine the sound of an entire orchestra. The score is a visual representation of the sounds. In an analogous way, we can represent birdsong by an image, and analysis of the image can tell us the species of bird singing. This is what happens with Merlin Bird ID. In a recent episode of Mooney Goes Wild, Niall Hatch of Birdwatch Ireland interviewed Drew Weber of the Cornell Lab of Ornithology, a developer of Merlin Bird ID. This phone app enables a large number of birds to be identified [TM237 or search for “thatsmaths” at irishtimes.com].

An especially interesting feature is Sound ID: just stand in a park, by the sea or in the hills and press “record”. Sound ID will compare the birdsongs to a large databank of recorded sounds and quickly identify the species. Sound ID is a major advance in sound identification and machine learning. Currently, it can identify about 250 European bird species and numerous more exotic species.

How does Sound ID work?

A recording of birdsong is essentially a graph of air pressure against time, replete with information but difficult to interpret. The idea of Sound ID is to use a computer vision model to identify bird vocalisations. To make this possible, an algorithm called the short-time Fourier transform (STFT) converts the audio signal into an image called a spectrogram.

A spectrogram is a diagram with time on the horizontal axis and frequency on the vertical. It is very like a musical score, which has time on one axis and pitch on the other. Notes sounding at the same time appear as a vertical stack, a chord.
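The conversion from audio to spectrogram can be sketched in a few lines. The code below is a minimal illustration using only the Python standard library, not the Merlin implementation: the function names, the Hann window, the toy two-tone signal and the small window length of 64 samples are all assumptions chosen to keep the example short.

```python
import cmath
import math


def hann(n):
    """Hann window of length n; tapers each frame to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, window_len, hop):
    """Return one list per time frame, holding magnitudes for bins 0..window_len//2."""
    win = hann(window_len)
    n_bins = window_len // 2 + 1
    # Precompute the complex exponentials used by the discrete Fourier transform.
    twiddle = [cmath.exp(-2j * math.pi * k / window_len) for k in range(window_len)]
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        seg = [signal[start + i] * win[i] for i in range(window_len)]
        frame = []
        for k in range(n_bins):
            acc = sum(seg[i] * twiddle[(k * i) % window_len] for i in range(window_len))
            frame.append(abs(acc))
        frames.append(frame)
    return frames


# Toy signal (an assumption): 0.1 s of a 500 Hz tone, then 0.1 s of a 1500 Hz tone.
sample_rate = 8000
half = 800
signal = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(half)]
signal += [math.sin(2 * math.pi * 1500 * t / sample_rate) for t in range(half)]

spec = stft_magnitude(signal, window_len=64, hop=32)
bin_hz = sample_rate / 64  # frequency spacing between spectrogram rows: 125 Hz

# The loudest bin in each frame traces the pitch over time.
peak_hz = [frame.index(max(frame)) * bin_hz for frame in spec]
```

Each inner list of `spec` is one column of the spectrogram. Plotting the columns side by side, with brightness given by magnitude, produces the kind of image described above; here `peak_hz` starts at 500 Hz and ends at 1500 Hz, following the change of pitch.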

Once the audio has been converted into a spectrogram it can be fed into a standard computer vision model, which is trained to identify bird vocalisations based on their visual signature in the spectrogram. Computer image analysis is very advanced, and can be used to break the image into manageable pieces. Each piece can then be compared with a database of birdsongs.

The spectrogram image is processed by a model called a deep convolutional neural network (CNN). This network is tuned by examining a large number of birdsongs. It is also able to recognise extraneous background noises like traffic and human speech, and to eliminate them.
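The basic operation inside a convolutional network is to slide a small filter across the image and record how strongly each patch matches it. The following sketch shows a single hand-set filter applied to a toy 5 x 5 "spectrogram"; the image, the filter values and the function names are illustrative assumptions. A real CNN stacks many layers of such filters and learns their values from data rather than having them chosen by hand.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map.

    As in most deep-learning libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy spectrogram: rows are frequency, columns are time.
# The row of ones is a steady tone at a single frequency.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Hand-set filter that responds to a horizontal ridge (a sustained pitch).
kernel = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]

features = relu(conv2d_valid(image, kernel))
```

The response map `features` is non-zero only in its middle row, where the filter is centred on the tone. Different filters respond to rising sweeps, trills or broadband noise, and later layers combine these responses into a visual signature.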

Training

Merlin’s Sound ID tool is trained on audio recordings in which the moments when each bird is vocalising have been marked. Ornithologists select the precise moments when birds are singing, and tag those sounds with the corresponding bird species. The neural network uses a large number of parameters, called weights, to fit the data. An algorithm called gradient descent adjusts the weights, step by step, so that the model's predictions match the labels assigned by the Sound ID experts.
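Gradient descent can be illustrated on a far smaller problem than a neural network. The sketch below fits just two weights, `w` and `b`, so that `w * x + b` matches a handful of labelled points; the data, the learning rate and the function names are assumptions made for illustration. The principle is the same as in training a network: measure the error, compute how it changes with each weight, and move each weight a little in the direction that reduces the error.

```python
def mse_gradients(w, b, xs, ys):
    """Gradients of the mean squared error of the model w*x + b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Start from zero weights and repeatedly step against the gradient."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = mse_gradients(w, b, xs, ys)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy labelled data generated from y = 2x + 1, standing in for expert labels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = gradient_descent(xs, ys)
```

After the loop, `w` and `b` are very close to 2 and 1, the values used to generate the data. A network has millions of weights rather than two, but each is updated by the same rule.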

Several choices are necessary when constructing a spectrogram: the length of the audio clip, the optimal STFT window length, the vertical scaling, and the spatial dimensions of the spectrogram. Following extensive testing, Sound ID was set to use a window length of 512 samples, with 128 samples for the STFT and an image size of 128 x 512 pixels. This achieves a good balance between speed and model accuracy.
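The way these choices interact can be seen with a little arithmetic. The sketch below computes the raw size of an STFT from the clip length, window length and hop (the number of samples by which the window advances between frames). The sampling rate of 22,050 Hz and the clip length of 3 seconds are assumptions not given in the text, and reading the quoted 128 samples as the hop is also an assumption.

```python
def spectrogram_shape(n_samples, window_len, hop):
    """Return (frequency_bins, time_frames) for an STFT without padding."""
    n_bins = window_len // 2 + 1
    if n_samples < window_len:
        return (n_bins, 0)
    n_frames = (n_samples - window_len) // hop + 1
    return (n_bins, n_frames)


# Assumed values, not stated in the article.
sample_rate = 22050   # samples per second
clip_seconds = 3      # length of the audio clip

window_len = 512      # window length quoted in the article
hop = 128             # assumption: the quoted 128 samples taken as the hop

bins, frames = spectrogram_shape(sample_rate * clip_seconds, window_len, hop)

# Resolution of the resulting image along each axis.
seconds_per_column = hop / sample_rate
hz_per_row = sample_rate / window_len
```

Under these assumptions the raw transform has 257 frequency rows and 513 time columns. A longer window gives finer frequency resolution but blurs events in time, while a smaller hop gives more columns at a greater computational cost; the vertical scaling then determines how the frequency rows are mapped onto the height of the final image.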

There is a wealth of information and technical details on the Cornell website. Merlin Bird ID is available free of charge. Once installed on your phone, it runs offline, without needing a network connection, and lets you record and identify the birds around you.

App available at: Merlin Bird ID.
