Speech vs. Music APIs — Paul Murphy

Originally published on Tumblr.

People often ask me if OP3Nvoice’s API, a speech API, can process music. I used to think that was an absurd question, but one day I really thought about it and realized that, in fact, it wasn’t.

It was only an absurd question to me, because I know too much. I know that the processing used to extract information from speech is completely different from the processing used to extract information from music. I also know people use speech and music APIs for different reasons.

But when I took a step back, and forgot about how different the underlying processing is, I realized that a lot of the information we extract from music and speech is the same. If that’s true, why shouldn’t the APIs to that data be similar, if not identical?

To be clear, when I talk about music APIs, I’m not talking about music repository or music recommendation APIs. I’m talking about APIs that extract content from music. Likewise, when I talk about speech APIs, I’m talking about APIs that extract content from speech.

For this post I’ll be using the Echo Nest API for music and the OP3Nvoice API for speech.

At the highest level, what do these APIs do? They:

Ingest audio,
Ingest related data,
Extract data,
Process data, and they
Expose data.

Simple enough.

Ingesting Audio

Both APIs ingest audio by either accepting a public URL from which the media can be pulled or by accepting the media via a POST command.
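
As a rough sketch of those two ingestion styles, the Python below builds (but does not send) one request of each kind. The endpoint URL, the "media_url" field name, and the function names are assumptions made up for illustration; they are not taken from the OP3Nvoice or Echo Nest documentation.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; neither API's real URL scheme is implied.
INGEST_ENDPOINT = "https://api.example.com/v1/media"


def build_url_ingest_request(media_url: str) -> Request:
    """Ask the service to pull the audio from a public URL."""
    body = json.dumps({"media_url": media_url}).encode("utf-8")
    return Request(
        INGEST_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_upload_request(audio: bytes, content_type: str = "audio/mpeg") -> Request:
    """Send the audio bytes themselves in the body of a POST."""
    return Request(
        INGEST_ENDPOINT,
        data=audio,
        headers={"Content-Type": content_type},
        method="POST",
    )


# The requests are only constructed here, never sent over the network.
by_url = build_url_ingest_request("https://example.com/interview.mp3")
by_upload = build_upload_request(b"\x00\x01\x02")
print(by_url.get_method(), by_url.data)
print(by_upload.get_method(), len(by_upload.data), "bytes")
```

Either way the client ends up with the same thing: a piece of media the service now knows about, whether it is a song or a conversation.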

Ingesting Related Data

Both APIs allow users to associate metadata and free-form text with media. The OP3Nvoice API does this explicitly: free-form text can currently be passed in a metadata block (although a size restriction applies). The Echo Nest API doesn’t accept such text directly, but it implicitly pulls files from across the Net that it believes are related to the media (songs).

The next two actions, extracting and processing data, are not part of the API. They belong to the service behind the API, but their output is reflected in the data that can ultimately be returned to the API’s client.

Extracting Data

The OP3Nvoice API extracts words. The Echo Nest API extracts loudness, pitch, and timbre.

Processing Data

Data processing is divided into two parts: organizing and interpreting.

Organizing

OP3Nvoice organizes the words it extracts by indexing them using timestamps. The Echo Nest API organizes loudness, pitch, and timbre triplets by rolling them up into tatums, beats, bars, and sections.
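
To make the two kinds of organizing concrete, here is a minimal Python sketch. It assumes words arrive as (start time, word) pairs and that music segments arrive as (start time, loudness) pairs; real segments would also carry pitch and timbre. The data, names, and grouping rule are illustrative assumptions, not either service's actual implementation.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Speech side: (start_seconds, word) pairs, sorted by time. Illustrative data.
words = [(0.0, "people"), (0.4, "often"), (0.8, "ask"), (1.1, "me")]
word_times = [t for t, _ in words]


def words_between(start: float, end: float) -> list:
    """Return the words whose start time falls in [start, end)."""
    lo = bisect_left(word_times, start)
    hi = bisect_left(word_times, end)
    return [w for _, w in words[lo:hi]]


# Music side: (start_seconds, loudness_db) segments and beat start times.
segments = [(0.00, -12.0), (0.25, -10.0), (0.50, -8.0), (0.75, -9.0), (1.00, -7.0)]
beat_starts = [0.0, 0.5, 1.0]


def roll_up(segments, beat_starts):
    """Group segments under the beat in progress when each segment starts.

    Assumes the first beat starts at or before the first segment.
    """
    beats = defaultdict(list)
    for start, loudness in segments:
        beat_index = bisect_right(beat_starts, start) - 1
        beats[beat_index].append(loudness)
    return dict(beats)


print(words_between(0.4, 1.1))
print(roll_up(segments, beat_starts))
```

Both halves lean on the same idea: a sorted list of timestamps that can be searched quickly. The same roll-up could be repeated to group beats into bars and bars into sections.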

Interpreting

Today OP3Nvoice doesn’t interpret any of the data it extracts and organizes. The Echo Nest interprets its raw data in very sophisticated ways.

It uses the organized data to derive key, mode, and tempo. It further interprets this data and produces energy and danceability. And finally it uses the texts retrieved from the Internet to assign mood and genre to artists and albums.
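
As a toy illustration of what interpreting organized data can mean, the sketch below estimates tempo from a list of beat start times by taking 60 divided by the median gap between beats. This is an assumed, simplified rule chosen for clarity; the source does not describe how the Echo Nest actually derives tempo.

```python
from statistics import median


def estimate_tempo_bpm(beat_starts):
    """Estimate tempo as 60 divided by the median gap between beats."""
    if len(beat_starts) < 2:
        raise ValueError("need at least two beats to measure an interval")
    intervals = [
        later - earlier
        for earlier, later in zip(beat_starts, beat_starts[1:])
    ]
    return 60.0 / median(intervals)


# Illustrative beat times in seconds: one beat every half second.
beat_starts = [0.0, 0.5, 1.0, 1.5, 2.0]
print(estimate_tempo_bpm(beat_starts))
```

Using the median rather than the mean keeps one late or missed beat from skewing the estimate.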

Exposing

And finally, all of the data extracted and interpreted is accessible via API calls.

I’d now like to take a step back and look at all of this data from a more theoretical perspective. Although there’s currently no overlap between the data OP3Nvoice and the Echo Nest extract and derive, there will be as they evolve and become more sophisticated.

Extracted Data

Today, OP3Nvoice extracts and gives its users access to words. Oddly, none of the music APIs do, because words (lyrics) have traditionally been extracted by hand by human transcriptionists. There is no question that music APIs should treat words the same way that speech APIs do.

Emotions can be derived from the way people speak or sing, independently from the words they use. This research is still emerging, but it’s very clear that future speech and music APIs will give their users access to this kind of emotional content.

Media files contain events, such as a change in location, which may be perceived by a change in acoustics. Events may be derived from the media itself, or manually associated with a time in the audio file. SoundCloud’s API allows clients to inject comments into the audio timeline. These comments, for example, are events.

This table summarizes the current state of extracted data:

	Speech	Music
Words	x	x
Emotions	x	x
Events	x	x
Identity	x	x
Loudness	x	x
Pitch	x	x
Timbre	x	x

Derived Data

Emotions, or sentiment, can be derived from words spoken or sung using relatively well understood NLP techniques. No API I’ve run across attempts to do this, but I expect that to change soon.
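
One of the simplest of those techniques is a word-list (lexicon) approach, sketched below in Python. The tiny word lists and the scoring rule are assumptions chosen for illustration; a real system would use a much larger lexicon or a trained model. The point is that the same function works on a speech transcript or on song lyrics.

```python
# Toy lexicons, invented for this example.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}


def sentiment_score(words):
    """Return a score in [-1, 1]; 0.0 when no lexicon word appears."""
    positive = sum(1 for w in words if w.lower() in POSITIVE)
    negative = sum(1 for w in words if w.lower() in NEGATIVE)
    total = positive + negative
    if total == 0:
        return 0.0
    return (positive - negative) / total


transcript = ["I", "love", "this", "great", "idea"]
lyrics = ["so", "sad", "and", "awful", "but", "I", "love", "you"]
print(sentiment_score(transcript))
print(sentiment_score(lyrics))
```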

Topics and summaries are also derivable from transcripts. Today, VoiceBase, for example, extracts topics from speech files.

And of course, attributes like energy, mood, and genre, which are currently only extracted by the Echo Nest, could certainly be extracted by speech APIs.

This table summarizes the current state of derived data:

	Speech	Music
Emotions / Sentiment	x	x
Topics / Summaries	x	x
Key / Mode / Tempo	n/a	x
Energy	x	x
Danceability	n/a	x
Mood	x	x
Genre	x	x

As we’ve just seen, today’s speech and music APIs are used for different reasons and therefore expose different aspects of the media they process. We’ve also seen that there’s a large amount of overlap in this data. There’s no reason that speech APIs shouldn’t expose a lot of the same data as their music counterparts, and vice versa. And at some point it may well make sense to unify music and speech APIs. After all, they’re both giving the world access to the same thing: the content of organized noise.
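
To picture what a unified API might hand back, here is a minimal Python sketch of a single result shape that serves both speech and music. Every class, field, and value in it is hypothetical; it is one possible design under the assumptions of this post, not a description of any existing API.

```python
from dataclasses import dataclass, field


@dataclass
class TimedItem:
    """Anything pinned to a moment in the audio: a word, an event, a beat."""
    start: float  # seconds from the beginning of the file
    kind: str     # e.g. "word", "event", "beat"
    value: str


@dataclass
class AudioAnalysis:
    """One result shape for both speech and music (hypothetical)."""
    media_type: str                               # "speech" or "music"
    timeline: list = field(default_factory=list)  # TimedItem entries
    derived: dict = field(default_factory=dict)   # e.g. {"tempo": 120.0}

    def items(self, kind: str) -> list:
        """Return timeline entries of one kind, in time order."""
        matches = [item for item in self.timeline if item.kind == kind]
        return sorted(matches, key=lambda item: item.start)


talk = AudioAnalysis("speech", [
    TimedItem(0.0, "word", "hello"),
    TimedItem(0.5, "event", "comment: nice intro"),
    TimedItem(0.6, "word", "world"),
])
song = AudioAnalysis("music", [
    TimedItem(0.0, "beat", "1"),
    TimedItem(0.2, "word", "yesterday"),
], derived={"tempo": 120.0})

# The same call retrieves words from a talk and lyrics from a song.
for analysis in (talk, song):
    print(analysis.media_type, [item.value for item in analysis.items("word")])
```

Attributes that only make sense for one kind of media, such as danceability, would simply be absent from the derived data for the other.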