The Coming Data Tsunami

Originally published on the Clarify.io blog.

Have you noticed what people are doing with data these days?

Companies like Textalytics and AlchemyAPI are figuring out how people are feeling based on what they post on Twitter and Facebook. They’re extracting meaning and intent from human communication. PeoplePattern is using public data to figure out not only people’s preferences, but also their personalities and income, so that they can figure out what people might buy. And companies like DigitalReasoning are trawling through massive amounts of human communication to find relationships between people, relationships between people and places, and relationships between those people’s interests and intents.

A few years ago only humans could do this sort of thing, and some of it was physically impossible for humans to do at all. Human communication is full of data, and we finally have the technology to use it.

But humans have only been writing for a little over 5,000 years. Before then, all communication was oral. And today, we estimate that over 200 times more data is communicated orally than in writing. We learn to speak long before we learn to write. Speech is our most natural form of communication.

So what are all these amazing companies doing with all this spoken communication? Nothing.

Why? Because extracting raw data from opaque audio and video files is difficult. It requires signal processing skills that exist only in academia, research labs, and a very small number of military and commercial divisions. It’s too difficult and too time-consuming to master, even for very smart companies.

At Clarify we are changing this landscape altogether. We are bringing audio and video into the conversation. Because we want all of that communication to be processed too, and not just by the likes of PeoplePattern and DigitalReasoning, but by ordinary web and mobile developers. We think it’s about time people were able to use the data trapped in those audio and video files. We think it’s about time for all those feelings, all that meaning, all those intentions, and all those relationships to be discoverable.

The world already has piles of recorded media. In the past few years, people have created more audio and video files than were made in all the years before that, going back to the invention of recording. And the rate of recording is increasing dramatically. Today’s piles are going to look very small from tomorrow’s perspective.

Fifty years ago, only television and movie studios could record video, and the necessary equipment required many people to operate. Ten years ago, a personal movie camera cost thousands of dollars and was so unwieldy that few people used one. Today, everyone can record video using cameras and phones that fit in their pockets. Many of these devices automatically save their recordings to the cloud, right after they’re made. Most of those recordings are never watched again, simply because they can’t be found.

Clarify’s first product is a search engine for audio and video. It’s a search engine that developers can embed into their applications with a few lines of code. That’s right, any audio or video library can be searchable with a few lines of code. This used to be impossible, so we’ve just come to accept that media files were opaque blobs. Replaying them was magic enough. But our technology is making that magic seem primitive. By giving developers access to the data previously trapped in media files, we’re allowing them to do all the clever things they can do with text. Search is the first indication of what’s coming.

There’s a lot of data around today, a lot. There’s a lot more that’s been trapped until now, and there’s a lot more coming. All those audio and video files aren’t another wave of data. They’re a tsunami.