I am unsure if this is the right forum but since I plan to use JSP/Servlet in this, I am posting it here.
I am sure most of you have heard about Tagcloud. It is a way to gauge the most talked-about keywords in blog conversations. Unfortunately, Tagcloud does not support languages other than English. For example, this cloud doesn't show any Hindi keywords.
Now to the problem. I want to implement this on my own. I have a group RSS Feed (thanks to Blogdigger) for all Hindi blogs and I would like to generate another XML from this, picking up the most frequently used words in the posts, and ignoring very common words like "is","the" etc.
What I am unsure of is how to code this. I could parse the XML and store each word that is not in my "ignore" list in, say, a Map, keeping a count of how many times each word occurs, and then generate an XML similar to this.
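For illustration, here is a minimal sketch of the Map-based counting described above. It assumes the post texts have already been pulled out of the RSS feed as plain strings; the class name TagCounter, the methods countWords and top, and the tiny English ignore list are all hypothetical placeholders, not part of Tagcloud or Blogdigger. Words are split on anything that is not a Unicode letter or combining mark, which is an assumption meant to keep Devanagari words in one piece.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TagCounter {

    // Hypothetical ignore list; a real one would hold very common Hindi words.
    static final Set<String> IGNORE =
            new HashSet<>(Arrays.asList("is", "the", "a", "of"));

    // Counts every word that is not on the ignore list (word -> occurrences).
    static Map<String, Integer> countWords(List<String> posts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String post : posts) {
            // Split on runs of characters that are neither letters nor combining marks.
            String[] words = post.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{M}]+");
            for (String word : words) {
                if (word.isEmpty() || IGNORE.contains(word)) {
                    continue;
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to n entries, most frequent first; ties are ordered by word.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList("the cat is a cat", "Cat and dog");
        for (Map.Entry<String, Integer> e : top(countWords(posts), 3)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Counting this way is a single pass over the text, so the work grows with the size of the feed rather than with the number of distinct words. Since accuracy is not critical, one option would be to run it only when the feed is refreshed and keep the resulting top-N list in memory, so the JSP/Servlet only has to render it.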
The main criterion would be performance. Since I would have to generate HTML (as shown on the TagCloud homepage) from this tagcloud XML, it all has to be done in a jiffy. My solution seems to be too cumbersome.
May I solicit your ideas on how this can be implemented keeping web-performance in mind (accuracy is probably not so important)?
Thanks for your time. [ August 05, 2005: Message edited by: Debashish Chakrabarty ]