Search for Keywords in Audio

One of the concerns with audio content on the web, especially voice podcasts, is that it isn’t easy to search for specific content, such as keywords. We want to be able to search inside audio and video content the way we search HTML. Without this keyword search ability, spoken content is locked up, which makes it nearly impossible to create metadata from what is said inside audio files. One way around this problem is to manually create transcripts. Once you have the transcript text, machines can search it. Doing this for all the audio and video content on the web could take a while. You can imagine the magnitude of the problem.
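To make the transcript idea concrete, here is a minimal sketch of keyword search over a transcript. It assumes the transcript has already been split into time-stamped segments. The sample data and the names `TRANSCRIPT` and `find_keyword` are made up for illustration and are not taken from any real service.

```python
import re

# Hypothetical transcript: (start time in seconds, text spoken) pairs.
TRANSCRIPT = [
    (0, "Welcome to the show, today we talk about data mining."),
    (42, "Unstructured data includes audio and video."),
    (95, "Metadata makes audio searchable."),
]


def find_keyword(transcript, keyword):
    """Return the (start_seconds, text) segments that contain the keyword.

    The match is case-insensitive and whole-word only, so searching
    for "data" does not match "Metadata".
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [(start, text) for start, text in transcript if pattern.search(text)]


if __name__ == "__main__":
    for start, text in find_keyword(TRANSCRIPT, "audio"):
        print(f"{start:>4}s  {text}")
```

Because each hit keeps its start time, a player could jump straight to the spot in the audio where the word was spoken. The audio itself offers no such handle until someone (or something) produces the text.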

People who research this problem call this kind of data on the Internet unstructured data.

Unstructured data refers to masses of (usually) computerized information which do not have a data structure which is easily readable by a machine. Examples of unstructured data may include audio, video and unstructured text such as the body of an email or word processor document.


The Internet has become a treasure trove of FREE data waiting to be mined and turned into metadata. Once you have the metadata you can make predictions. These predictions can turn into actions. Those actions can create change. Groups using this type of data mining include banks, governments, media, advertisers, and yes… even activists.

Now we have a free search service called Podzinger. It’s the first tool, to my knowledge, that actually makes it easy to successfully search for keywords in spoken audio. (Other services like this just didn’t work for me.) I’ve put the search box on the right side of this site. Try it. It works quite well. Let me know if you were able to successfully search for something. Unfortunately, they’ve only indexed the last nine podcasts, not all of my audio. If it isn’t useful I might take it down.

Something to keep in mind with these services is that you aren’t really getting them for free. You’re just not paying up front. You’re “leasing” your audio content in exchange for the service. By signing up you’re giving Podzinger permission to analyze the audio, create text transcripts, and then create metadata from your content (and who knows what else). They’re mining it and supposedly acting on what they find. In return you get to use the service and provide it to the visitors to your website. It’s possible that one reason they let you put the search box on your site is for advertising and collecting keywords. The more keywords they collect from the search boxes, the more accurate their algorithms become at searching audio data. At least this is how I think they are doing it.

It is very important for everyone using the Internet to know about data mining. The best-known kinds are the invasive spy bots and cookies. But there are other kinds that aren’t so annoying. This is really an ethics issue. What is more important: your data privacy or corporate financial profit? Is there an ethical compromise between the two? Many Internet companies would say yes. I’m not so sure…

Many people are more than happy to trade their data, and the potential for metadata to be created from it, for valuable services. I’m one of those people. But the more I learn about how metadata is mined, and the more I watch billion-dollar businesses like Google rise to power, the more concerned I become. What all Internet surfers need is leverage to negotiate better payment for our data: data created from our actions, email, blogs, podcasts, forums, etc. Google and others are taking all of us for a serious ride. So far I’m OK with the ride, but for how long?