Richard Askew
Dissertation Explained – Keyword Extraction How To

Last week I went back to my university in a professional capacity. Talk turned to my dissertation and, I must be honest, I haven’t really looked at it since my time there. At the time it got a bit of attention: it was featured on the BBC Backstage website and awarded ‘Mash-up of the day’ by Programmable Web. Let me summarise below:

Infused News was created as part of an Internet Computing degree at the University of Hull, Scarborough Campus, for a dissertation titled “An investigation into the need for user-submitted, multimedia content when delivering news”. The aim was to integrate user-submitted multimedia elements into existing news stories and to evaluate two things: whether this augmented version makes a story more compelling, action-provoking and understandable for the user, and whether drawing on multiple sources gives a more balanced, honest and up-to-date view of the story.

You can still visit it if you want, but unfortunately the feeds aren’t what they once were. Nonetheless, I realise I have linked to it from this site, so I really should talk about a few technical aspects of the prototype in case anyone else finds them useful. So first up, keyword extraction...

This obviously played a major part in my dissertation, as the keywords were used to search for media. You could of course write a PHP script that simply looks at a body of text and counts the occurrences of each word. This wasn’t enough for me: single-word terms were unlikely to be specific enough to return useful search results, and the most frequent words would likely be generic ones.
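To show why simple counting falls short, here is a minimal sketch of that approach. The function name `count_words` and the sample sentence are made up for illustration; they are not part of the original prototype.

```php
<?php
// Hypothetical helper: count how often each word appears in a body of text.
function count_words($text)
{
    // Lower-case first so "The" and "the" are counted together
    $words = str_word_count(strtolower($text), 1);

    // Map each word to its number of occurrences
    $counts = array_count_values($words);

    // Sort by count, highest first, keeping the words as keys
    arsort($counts);

    return $counts;
}

// Made-up sample text
$sample = 'The floods hit the town and the river burst its banks near the town centre';
$counts = count_words($sample);

// Show the three most frequent words
print_r(array_slice($counts, 0, 3, true));
```

On this sample the most frequent word is “the”, which is exactly the kind of generic term that is no use as a search query.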

The solution I employed was the Yahoo Term Extractor. This uses Yahoo Search technology to gather keywords from a body of text. To start with you need an App ID, so follow the steps on the site to get one.

Ready to go?

So at this point you have an App ID and a body of text from which you want to extract keywords. Start by cleaning up the string and removing any text that you don’t want analysed. For me this meant removing the generic links that the BBC used in its articles:


// Remove the "Send us your comments" link text - this is BBC specific
// (str_replace is enough here because we are matching a fixed string)
$description = str_replace("Send us your comments", " ", $description);

// Remove the "Back to link" text - again BBC specific
$description = str_replace("Back to link", " ", $description);

To talk to Yahoo you use the following URL:

http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction?appid=[YOUR-APPID]&output=xml&context=[YOUR-TEXT]

This returns a list of keywords as an XML document. Visit the URL and view the source to see what it returns. Obviously we need to get at the data, and I did this using the PHP function simplexml_load_file.


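// URL-encode the text so spaces and punctuation don't break the query string
$url = 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction?appid=[YOUR-APPID]&output=xml&context=' . urlencode($description);

if (!$keywordxml = simplexml_load_file($url)) {
    echo 'Error reading the XML file';
}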
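The request URL can also be assembled from its parts, which keeps the encoding in one place. This is only a sketch: the function name `build_term_extraction_url` and the sample text are hypothetical, while the endpoint and the `appid`, `output` and `context` parameters are the ones shown in the URL above.

```php
<?php
// Hypothetical helper: assemble the Term Extraction request URL.
function build_term_extraction_url($appId, $text)
{
    $params = array(
        'appid'   => $appId,
        'output'  => 'xml',
        'context' => $text,
    );

    // http_build_query() URL-encodes each value and joins the pairs with "&"
    return 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction?'
        . http_build_query($params, '', '&');
}

// Made-up text containing characters that must be encoded
echo build_term_extraction_url('[YOUR-APPID]', 'Floods hit the town & river');
```

Building the query this way means a stray ampersand or space in an article cannot cut the `context` value short.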

Now to see what is in the XML object:

print_r($keywordxml);

Now that you can see the keywords, you can access each one individually. For example, the first one is:


echo $keywordxml->Result[0];

Of course, you may need to use the keywords in a more dynamic way, and accessing them by a fixed index doesn’t give you that. You could put them in an array like this:


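$keyword_array = array();

// Each <Result> element holds one keyword; cast it to a plain string
foreach ($keywordxml->Result as $result) {
    $keyword_array[] = (string) $result;
}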
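If you want to try the loop without calling the service, something like the following should work. The XML here is a simplified, made-up response: I am assuming only the `ResultSet` and `Result` element names used above, and the real response may carry extra attributes on the root element.

```php
<?php
// Simplified, made-up response for illustration only
$sampleXml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<ResultSet>'
    . '<Result>flood defences</Result>'
    . '<Result>river banks</Result>'
    . '<Result>town centre</Result>'
    . '</ResultSet>';

// Parse the string the same way simplexml_load_file() parses a URL
$keywordxml = simplexml_load_string($sampleXml);

$keyword_array = array();
if ($keywordxml !== false) {
    foreach ($keywordxml->Result as $result) {
        // Cast so the array holds plain strings, not SimpleXMLElement objects
        $keyword_array[] = (string) $result;
    }
}

print_r($keyword_array);
```

Casting each element to a string makes the array easier to pass on to other code, such as a search query builder.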

There you have it, hopefully someone finds it handy!

Next up, Searching YouTube.