Information Theory Counts Size Of Animal Languages

Meanings behind whale, dolphin, and bird vocabularies remain opaque.
Amanda Alvarez, Contributor

New research employed mathematical techniques to estimate how many “words” animals such as birds and whales use in their vocalizations. Using a discipline called information theory, it is possible to illuminate the structure and complexity of animal “language,” though it can’t yet tell us what the animals are conveying. 

Human languages have been deconstructed into bits since the idea of information theory came about in the 1940s. Using the vast printed texts of our species as a database, scientists can treat words and their combinations as a signal that can be analyzed. The frequency and repetition of symbols yield a measure of the information content in human languages. The symbols in English are the 26 letters of the alphabet, plus a space character. For animal modes of communication, however, figuring out the symbols can be a bit trickier, and researchers don’t have the benefit of huge animal language libraries that they can mine.
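As a rough illustration of how symbol frequencies turn into a measure of information, the Python sketch below computes the Shannon entropy of the single-symbol frequencies in a piece of text, using the 27-symbol alphabet described above. The function names and the sample sentence are invented for this example, and it is not the method used in the study.

```python
import math
from collections import Counter

# The 27 symbols mentioned above: 26 letters plus a space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    """Lowercase the text and keep only the 27 symbols in ALPHABET."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def symbol_entropy(text):
    """Shannon entropy, in bits per symbol, of the symbol frequencies."""
    symbols = normalize(text)
    counts = Counter(symbols)
    total = len(symbols)
    # Each symbol contributes -p * log2(p), where p is its relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical sample text, far too short for a serious estimate.
sample = "the quick brown fox jumps over the lazy dog"
print(symbol_entropy(sample))

# If all 27 symbols were equally likely, each would carry log2(27) bits,
# which is about 4.75. Real text falls below this ceiling.
print(math.log2(len(ALPHABET)))
```

A text that repeats one symbol scores zero bits, while a text that uses every symbol equally often reaches the ceiling. Real languages fall in between, which is why frequency and repetition are informative.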

“I would love to be able to translate dolphin speak,” said Reginald Smith. Since there is no translation app for clicks and whistles, he uses information theory to gain insights.

“Some animals use combinations of symbols or sounds with meaning, so I try to avoid calling those ‘words,’” said Smith.

Instead, he uses the term “N-gram.” As an independent investigator affiliated with the Citizen Scientists League, Smith has previously used statistical methods to probe complex linguistic systems like Meroitic, an extinct and undeciphered East African language. With human languages, studying how frequently words occur, and how the symbols within words are combined to make longer words, can tell us about how much information is being transmitted, a quantity that can be measured in bits, the same unit used to count the ones and zeros stored on a computer.

The same principle can be applied to animal communication, which is what Smith has done in a new study posted online on the scientific pre-print server arXiv. The way a letter in a word depends on those that come before it – a property known as the conditional entropy of the symbols in the sequence – can be used to estimate the number of words, or N-grams, in a language, via some complex calculations. Smith used data from previous studies that had recorded the whistles, cries, and songs of bottlenose dolphins, humpback whales, and four species of birds, including robins and European starlings. 

“Dolphins had 27 whistles that they used a lot, though there were 125 different whistles used overall,” said Smith. They used these whistles in a uniform, repetitive fashion, whereas birds tended to use all the songs in their repertoire more liberally.

Starting with animal recordings, Smith first determined how much information is conveyed by a single symbol, and how this changes as a second, third, or fourth symbol, or letter, is added to the sequence. In English, for example, adding a second letter after a first conveys 4.14 bits of information, while a third has 3.56 bits, and a fourth 3.30 bits. These are called the first-, second-, and third-order entropies, and describe how long symbol combinations can become while still carrying information and not becoming redundant. All the bird songs he studied appeared to be limited to the first order, indicating a lower level of complexity.
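The idea of measuring how much a new symbol adds, given the symbols before it, can be sketched in Python. The function below is a simple "plug-in" estimate of conditional entropy made by counting contexts and what follows them. It is an illustrative assumption, not the procedure from Smith's study, and the name `conditional_entropy`, the parameter `context_len`, and the whistle labels are all made up for the example. How `context_len` maps onto the "first-order" and "second-order" labels depends on the convention in the study, which is not spelled out here.

```python
import math
from collections import Counter

def conditional_entropy(sequence, context_len):
    """Estimate the bits carried by a symbol given the previous
    context_len symbols, using raw counts from the sequence.

    With context_len = 0 this is the plain single-symbol entropy.
    """
    joint = Counter()    # counts of (context, next symbol) pairs
    context = Counter()  # counts of each context that has a successor
    for i in range(context_len, len(sequence)):
        ctx = tuple(sequence[i - context_len:i])
        joint[(ctx, sequence[i])] += 1
        context[ctx] += 1
    total = sum(joint.values())
    # Sum over pairs of -p(context, next) * log2 p(next | context).
    return -sum(
        (count / total) * math.log2(count / context[ctx])
        for (ctx, _), count in joint.items()
    )

# Hypothetical sequence of labeled whistles that strictly alternates.
whistles = ["A", "B", "A", "B", "A", "B", "A", "B"]
for k in range(3):
    print(k, conditional_entropy(whistles, k))
```

In this toy sequence a single whistle taken alone carries one bit, because "A" and "B" are equally common. Once the previous whistle is known, the next one is fully predictable and carries nothing new. Real recordings are noisier, and estimates for longer contexts need far more data, because the number of possible contexts grows quickly.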

Smith then extrapolated from the estimate of a language’s complexity to its total vocabulary. For example, the dolphin's vocabulary has approximately 36 “words,” while the figure for whales is about 23; the starling song repertoire is estimated at 119 to 202 songs. The precision of the size estimate falls as the amount of original data Smith had to work with decreases; at each increasing order of entropy, more samples of the language are needed to create a good estimate of the number of one-, two-, and three-letter sequences, or N-grams. For whales, for example, not enough data existed to go beyond second-order entropy, so Smith can’t be sure how many longer sequences there might be. He also suspects that animals in captivity, like the dolphins of SeaWorld whose whistle data he used, might have less complex communication, but verifying this would require many more samples and comparative studies.

Going beyond just measuring the structure and complexity of animal languages would be a logical next step, which Smith says he will leave to animal researchers. And, he says, extracting a song or cry from a pattern is about as illuminating as plucking a word out of a sentence: “The takeaway is that we need more research into how valuable the second or third order complexity is to animals’ communication.”

Amanda Alvarez has written about science for the Milwaukee Journal Sentinel, Yale Medicine, and GigaOM. She received her PhD in Vision Science from the University of California, Berkeley, and tweets at @sci3a.