
Categories and Keywords and Topics, have I

Originally published on the Clarify.io blog. View archived copy.

When you get a piece of media, there are a ton of things you can learn about it. In honor of next Friday, let’s consider “Star Wars: A New Hope.”

There are descriptive aspects such as the cast, release date, director, and producer.

There are technical aspects such as duration, media encoding, color palette, and bitrate.

While that’s all useful information, it doesn’t get to the most important information of all:

What is Star Wars about?

This one is more complex. To understand the story, we need to know who said what and when. More importantly, we need to understand how those words fit together with everything else in context.

First, let’s start with the easiest aspect: keywords.

At the simplest level, keywords are the most frequently said words, excluding certain “stop words” like articles (“a,” “an,” “the”). The stop words are excluded because they’re so common that they appear everywhere, and they rarely add understanding or context to the data. From these most frequently said words, you can often determine what the media is about. At minimum, they give you useful things to search for.
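As a rough illustration of that idea (not the algorithm our API actually uses), here is a minimal Python sketch that counts words and drops stop words. The `STOP_WORDS` set, the `extract_keywords` function, and the sample transcript are all made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
# Real systems use much larger lists.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}


def extract_keywords(transcript, top_n=5):
    """Return the top_n most frequent words that are not stop words."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


# Invented sample text, not an actual transcript.
sample = (
    "The rebels sent the transmissions to the ship. "
    "The imperial ship followed the rebels to the station. "
    "The station tracked the ship."
)

print(extract_keywords(sample, top_n=3))
```

Because every returned keyword was literally spoken in the media, each one is guaranteed to be findable with a plain text search of the transcript.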

For Star Wars, our keywords include:

kenobi, jedi, ship, imperial, station, rebels, transmissions

These words are descriptive but don’t tell us anything useful about the story. Unfortunately, that makes sense. In movies, the characters are having conversations and moving the story forward, not necessarily explaining everything to the viewers. News broadcasts, on the other hand, are great for keywords.

Next, let’s dig deeper with topics.

Topics are a little more complex. While keywords are what you say, topics are what you’re talking about or how things fit together into larger subjects. For example, if your media mentions tires, windshields, and turn signals, the overarching topic is likely “cars.”
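To make the tires-and-windshields example concrete, here is a toy Python sketch that maps spoken terms to broader topics. It assumes a hand-built lookup table (`TOPIC_LEXICON`) and a helper named `infer_topics`, both hypothetical. Real topic extraction relies on far richer models, so treat this only as a picture of the idea.

```python
import re
from collections import Counter

# Hypothetical lexicon mapping spoken terms to a broader topic.
TOPIC_LEXICON = {
    "tires": "cars",
    "windshields": "cars",
    "turn signals": "cars",
    "senate": "politics",
    "emperor": "politics",
}


def infer_topics(transcript):
    """Return topics ordered by how many of their terms were mentioned."""
    text = transcript.lower()
    votes = Counter()
    for term, topic in TOPIC_LEXICON.items():
        # Match whole words or phrases only, so "tires" does not match "retires".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            votes[topic] += 1
    return [topic for topic, _ in votes.most_common()]


# Invented sample text for illustration.
sample = "Check the tires, clean the windshields, and test the turn signals."

print(infer_topics(sample))
```

Note that the word “cars” never appears in the sample text. The topic is inferred from what was said rather than copied from it.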

For Star Wars, our topics include:

political, leader, force, attack, government, killed

Wow. That’s dark.

At first glance, are we still talking about a movie with a Wookiee, lightsabers, blasters, and... oh, a rebellion. And the Emperor dissolving the Imperial Senate. And a princess directing an attack against a battle station. It’s all political. And it’s called Star Wars for a reason.

In that context, the topics start to make sense and be more useful. At a practical level, we can use this information to start grouping movies into specific themes and similar concepts.

Caveat: Where it includes “force,” we don’t identify if that is The Force or the use of force.

Finally, let’s look at categories.

Categories are the most abstract but can be the most descriptive of all. Generally, you would use these for high-level grouping in a more formal sense. Where topics might be tags for your content, categories would be a more scientific or literary classification: think Dewey Decimal instead of tag cloud.

For Star Wars, our categories are:

Dispute Resolution > Warfare

Completely accurate but very dry and clinical. While these may not be useful for user-generated content or your favorite podcast, they are vitally important to researchers and archivists who need to group things together, organize them in general, or fit them into a larger context.

Categories, Topics, and Keywords are important but fundamentally different.

The most important thing about these three concepts is that they’re not the same. Occasionally people try to use Topics and Keywords interchangeably, but that breaks down in search: topic terms describe the subject and are rarely spoken aloud, so searching the media for them usually finds nothing. Keywords, by definition, are the words said in the media, so you’ll find those easily. When you’re working to sort, organize, and understand your media, know which ones are important, which ones you can effectively ignore, and which will form the backbone of your system.

My tickets are for 9am on the 18th. How about you? 😉

Update: We’ve released an update to our keywords algorithms since the initial publication of this post. As a result, the new list is more relevant and descriptive. Ever-improving algorithms are one of the benefits of a machine learning API.

If you want to plug your own media into our system to identify keywords, topics, and categories, check out our Quickstart Guides to get started and explore on your own.