
RegExp performance for returning contextual search results

 
Matt Smithinson
Greenhorn
Posts: 1
Hello,

I have a search results page on which I want to display search hits in context: the search term plus 15 words on either side.

I've written a function that works (pasted below). Essentially, it receives a string, and I then use regular expressions plus an ArrayList to determine the substring to return. Performance is OK, but I'm wondering if there's a better way to tackle this problem.

Does anyone have a suggestion?

Thanks

--
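// Needs: import java.util.ArrayList; import java.util.regex.Matcher; import java.util.regex.Pattern;
public static String returnSnippet(String xmlText, int numberOfWords)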
{
Pattern myPattern;
Matcher myMatcher;
// Move this out so it's passed in since this is a utility class?
String[] patternArray = {
"<chapter-title(.*?)</chapter-title>",
"<subject-title(.*?)</subject-title>",
"<resource-metadata>(.*?)</resource-metadata>",
"<children>(.*?)</children>",
"<child(.*?)</child>",
"<a-head(.*?)</a-head>",
"<b-head(.*?)</b-head>",
"<lrh>(.*?)</lrh>",
"<rrh>(.*?)</rrh>",
"<head(.*?)</head>",
"<titlegroup(.*?)</titlegroup>",
"<para role(.*?)</para>",
"<contributor(.*?)</contributor>",
"<author(.*?)</author>",
"<fmname(.*?)</fmname>",
"<lname(.*?)</lname>",
"<(.|\n)+?>"
};

// Do these two first
myPattern = Pattern.compile("<hit (.*?)>");
myMatcher = myPattern.matcher(xmlText);
xmlText = myMatcher.replaceAll("[hit]");

myPattern = Pattern.compile("</hit>");
myMatcher = myPattern.matcher(xmlText);
xmlText = myMatcher.replaceAll("[/hit]");

for (int i = 0; i < patternArray.length; i++)
{
myPattern = Pattern.compile(patternArray[i]);
myMatcher = myPattern.matcher(xmlText);
xmlText = myMatcher.replaceAll("");
}

// Add the logic to count words before and after here
// See RegexTestHarness.java in C:\j2sdk1.4.2_11\lib on my machine for notes / test version
myPattern = Pattern.compile("\\[hit\\].*?\\[/hit\\]");
myMatcher = myPattern.matcher(xmlText);
if(myMatcher.find()) // Using if captures the first instance only; using while will loop through them all
{
int hitStart = myMatcher.start();
int hitEnd = myMatcher.end();

myPattern = Pattern.compile("\\s");
myMatcher = myPattern.matcher(xmlText);
ArrayList spaceArray = new ArrayList(100);
while(myMatcher.find())
{
spaceArray.add(new Integer(myMatcher.start())); // ArrayList.add() expects an Object, but int is a primitive type, so wrap it in an Integer object
}

arrayforloop: for(int i=0; i < spaceArray.size(); i++)
{
/*
When the index of a whitespace match is greater than or equal to hitStart (the index where "[hit]" begins),
count backwards and forwards in spaceArray to get the index values for splitting the string.
*/
if( ((Integer)spaceArray.get(i)).intValue() >= hitStart ) // Going from Object -> Integer -> int
{
int wordIndexStart = ( ( i - numberOfWords ) <= 0 ) ? 0 : i - numberOfWords; // These two get the locations in the array
int wordIndexEnd = i + numberOfWords;
int substringStart = (wordIndexStart == 0) ? 0 : ((Integer)spaceArray.get(wordIndexStart)).intValue(); // These two get the values of the locations in the array
// If the window runs past the last space, cut at the end of the text; get(spaceArray.size()) would throw IndexOutOfBoundsException
int substringEnd = ( wordIndexEnd >= spaceArray.size() ) ? xmlText.length() : ( (Integer)spaceArray.get(wordIndexEnd) ).intValue();

xmlText = xmlText.substring(substringStart, substringEnd);
break arrayforloop;
}
}
}
else
{
xmlText = "";
}

// Do these two last
myPattern = Pattern.compile("\\[hit\\]");
myMatcher = myPattern.matcher(xmlText);
xmlText = myMatcher.replaceAll("<span class=\"sr-hit\">");

myPattern = Pattern.compile("\\[/hit\\]");
myMatcher = myPattern.matcher(xmlText);
xmlText = myMatcher.replaceAll("</span>");

return xmlText;
}
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
If I have followed your logic, you could compile all those patterns once and hold them in static variables, instead of creating new Pattern objects every time the method is called. Compiled Pattern objects are safe to share between threads, but Matchers are not.
Bill
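
A minimal sketch of that suggestion, assuming the patterns never change between calls. The class name SnippetPatterns and the method stripMarkup are hypothetical, and only three of the original patterns are listed to keep it short. Each Pattern is compiled once when the class loads, and a fresh Matcher is created per call so no Matcher is ever shared between threads.

```java
import java.util.regex.Pattern;

// Hypothetical holder class; the name is an assumption, not from the original post.
final class SnippetPatterns {
    // Compiled once at class load. Pattern instances are immutable,
    // so they can be shared safely between threads.
    private static final Pattern HIT_OPEN = Pattern.compile("<hit (.*?)>");
    private static final Pattern HIT_CLOSE = Pattern.compile("</hit>");

    // A shortened stand-in for patternArray; order matters, as in the original.
    private static final Pattern[] STRIP = {
        Pattern.compile("<chapter-title(.*?)</chapter-title>"),
        Pattern.compile("<head(.*?)</head>"),
        Pattern.compile("<(.|\n)+?>")
    };

    private SnippetPatterns() {}

    // Replaces <hit ...> markers with [hit]/[/hit] and strips the remaining markup.
    static String stripMarkup(String xmlText) {
        // matcher() returns a new Matcher each time, so nothing mutable is shared.
        String text = HIT_OPEN.matcher(xmlText).replaceAll("[hit]");
        text = HIT_CLOSE.matcher(text).replaceAll("[/hit]");
        for (int i = 0; i < STRIP.length; i++) {
            text = STRIP[i].matcher(text).replaceAll("");
        }
        return text;
    }
}
```

The word-counting part of returnSnippet could then work on the string that stripMarkup returns, with its "\\s" and "[hit]" patterns moved to static fields in the same way.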
 