
Birds’ chirps, trills, and warbles echo through the air, while whales’ boings, “biotwangs,” and whistles vibrate underwater. Despite the variations in sounds and the medium through which they travel, both birdsong and whale vocalizations can be classified by Perch 2.0, an AI audio model from Google DeepMind.
As a bioacoustics foundation model, Perch 2.0 was trained on millions of recordings of birds and other land-based animals, including amphibians, insects, and mammals. Yet researchers were surprised to learn how strongly the AI model performed when repurposed for whales.
Scientists at Google DeepMind and Google Research have been working on whale bioacoustics for almost a decade. That work includes algorithms that detect humpback whale calls, as well as a more recent multispecies whale model that identifies eight distinct species and multiple call types for two of them. With the release of Perch 2.0, the researchers saw an opportunity to reuse the model and save on computation time and experimentation effort.
“If [Perch 2.0] performs well for our whale use cases, then that means we don’t need to build an entirely separate new whale model—we can just build on top of that,” says Lauren Harrell, a data scientist at Google Research.
That notion is backed by a technique known as transfer learning, where the knowledge gained from a type of task or data can be applied to a different yet related one. In this case, Perch 2.0’s ability to classify bird calls can carry over to classifying whale calls. Transfer learning from a foundation model means you can “recycle all of the training that’s been done and just do a small model at the end for your use cases,” Harrell says. “We’re always making new discoveries about call types. We’re always learning new things about underwater sounds. There’s so many mysterious ocean noises that you can’t just have one fixed model.”
The team evaluated Perch 2.0 on three marine audio datasets containing whale sounds and other aquatic noises. They began by converting each five-second window of audio into a spectrogram, a visual representation of sound intensity across frequencies over time. These spectrograms were fed to the model, which produced embeddings, or feature vectors that preserve the most salient attributes of the data, helping distinguish, for example, the whistle of a humpback whale from that of an orca.
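The audio-to-spectrogram step can be sketched with a plain short-time Fourier transform. This is a minimal illustration, not Perch 2.0's actual front end (which the article doesn't detail); the sample rate, frame length, and hop size here are assumptions chosen for demonstration.

```python
import numpy as np

def spectrogram(audio, frame_len=512, hop=256):
    """Log-magnitude STFT: rows are frequency bins, columns are time frames."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([audio[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT magnitude per windowed frame; log-compress the result,
    # which is the usual input representation for audio models.
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mags).T  # shape: (frame_len // 2 + 1, n_frames)

# A five-second window, as in the evaluation, at an assumed 32 kHz rate.
audio = np.random.default_rng(0).standard_normal(5 * 32_000)
spec = spectrogram(audio)
```

In a real pipeline the spectrogram (often on a mel frequency scale) is what the foundation model consumes to produce its embedding.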
Next, the scientists randomly selected a small number of embeddings (between four and 32) per dataset to train a logistic regression classifier, a type of linear model that predicts a discrete outcome. The results, detailed in a paper presented last December at the NeurIPS workshop on AI for Non-Human Animal Communication, showed that the classifier performed well even with just a handful of embeddings, and performance improved as the number of embeddings increased.
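The few-shot setup described above, training only a small linear classifier on top of frozen embeddings, can be sketched as follows. The embeddings here are synthetic stand-ins (real ones would come from Perch 2.0), and the embedding dimension and class labels are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
dim = 1536  # assumed embedding size, for illustration only

def fake_embeddings(center, n):
    """Stand-in embeddings: a tight cluster around a class center."""
    return center + 0.1 * rng.standard_normal((n, dim))

# Two hypothetical classes, e.g. "target whale call" vs. "other sound".
center_a, center_b = rng.standard_normal((2, dim))

n_train = 16  # within the 4-to-32 range used in the study
X_train = np.vstack([fake_embeddings(center_a, n_train),
                     fake_embeddings(center_b, n_train)])
y_train = np.array([0] * n_train + [1] * n_train)

# The foundation model stays frozen; only this small linear head is trained.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_test = np.vstack([fake_embeddings(center_a, 50),
                    fake_embeddings(center_b, 50)])
y_test = np.array([0] * 50 + [1] * 50)
accuracy = clf.score(X_test, y_test)
```

Because all of the heavy representation learning lives in the frozen foundation model, the trainable part is tiny, which is why so few labeled examples suffice.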
The researchers also compared Perch 2.0 with embeddings from similar bird bioacoustics models, the previously mentioned multispecies whale model, and models trained on other animal vocalizations and noises in coral reefs. Findings pointed to Perch 2.0 as either the best or second-best performing model, with the bird bioacoustics models doing well too.
Evolutionary Parallels in Vocalization
So why do models trained on avian calls work well for cetacean sounds? Harrell and her colleagues suggest a threefold theory.
First, they point to evolutionary parallels: birds and marine mammals may have evolved similar physical mechanisms of vocal production.
Second, they weigh scaling laws, which suggest that large models trained on vast, diverse volumes of data tend to do well even on more specific, out-of-domain tasks.
Finally, classifying avian utterances can be challenging and likely forces the model to recognize fine-grained acoustic characteristics that inform its predictions for related tasks. “We are training this model to find those little features in the soundscapes,” Harrell says. “If those features also are similar in some way to the underwater acoustics, then it can search for those subtle details in animal vocalizations.”
The whistles of killer whale populations, for instance, are in “the same kind of spectrogram range as many of the bird vocalizations,” explains Harrell. “But there are many birds and amphibians and mammals that are also making low-frequency calls, so the model is actually sensitive to a lot of the dynamics, and that apparently does well underwater.”
Much like how Perch 2.0 is assisting bird conservationists, the team at Google hopes that the same bioacoustics model can aid scientists in protecting whale populations through passive acoustic monitoring and help them unveil the wisdom that these ancient oceanic creatures hold.


