RegExp performance for returning contextual search results

 
Greenhorn
Posts: 1
Hello,

I have a search results page on which I want to display each search term in context: the search term plus 15 words on either side.

I've written a function that is working (pasted below). Essentially, it receives a string, and then I use regular expressions plus an ArrayList to determine the substring to return. Performance is OK, but I'm wondering if there's a better way to tackle this problem.

Anyone have a suggestion?

Thanks

--
// Requires java.util.ArrayList, java.util.regex.Matcher and java.util.regex.Pattern
public static String returnSnippet(String xmlText, int numberOfWords)
{
    Pattern myPattern;
    Matcher myMatcher;
    // Move this out so it's passed in since this is a utility class?
    String[] patternArray = {
        "<chapter-title(.*?)</chapter-title>",
        "<subject-title(.*?)</subject-title>",
        "<resource-metadata>(.*?)</resource-metadata>",
        "<children>(.*?)</children>",
        "<child(.*?)</child>",
        "<a-head(.*?)</a-head>",
        "<b-head(.*?)</b-head>",
        "<lrh>(.*?)</lrh>",
        "<rrh>(.*?)</rrh>",
        "<head(.*?)</head>",
        "<titlegroup(.*?)</titlegroup>",
        "<para role(.*?)</para>",
        "<contributor(.*?)</contributor>",
        "<author(.*?)</author>",
        "<fmname(.*?)</fmname>",
        "<lname(.*?)</lname>",
        "<(.|\n)+?>"
    };

    // Do these two first
    myPattern = Pattern.compile("<hit (.*?)>");
    myMatcher = myPattern.matcher(xmlText);
    xmlText = myMatcher.replaceAll("[hit]");

    myPattern = Pattern.compile("</hit>");
    myMatcher = myPattern.matcher(xmlText);
    xmlText = myMatcher.replaceAll("[/hit]");

    for (int i = 0; i < patternArray.length; i++)
    {
        myPattern = Pattern.compile(patternArray[i]);
        myMatcher = myPattern.matcher(xmlText);
        xmlText = myMatcher.replaceAll("");
    }

    // Add the logic to count words before and after here
    // See RegexTestHarness.java in C:\j2sdk1.4.2_11\lib on my machine for notes / test version
    myPattern = Pattern.compile("\\[hit\\].*?\\[/hit\\]");
    myMatcher = myPattern.matcher(xmlText);
    if (myMatcher.find()) // Using if captures the first instance only; using while will loop through them all
    {
        int hitStart = myMatcher.start();
        int hitEnd = myMatcher.end();

        myPattern = Pattern.compile("\\s");
        myMatcher = myPattern.matcher(xmlText);
        ArrayList spaceArray = new ArrayList(100);
        while (myMatcher.find())
        {
            spaceArray.add(new Integer(myMatcher.start())); // ArrayList.add() expects an Object, but int is a primitive type, so create an Integer object
        }

        arrayforloop: for (int i = 0; i < spaceArray.size(); i++)
        {
            /*
            When the value of the number (the index of the "space" hit) is greater than or equal to the value of hitStart
            (the index of the "[hit]" start), count backwards and forwards to get the index values for splitting the string.
            */
            if ( ((Integer)spaceArray.get(i)).intValue() >= hitStart ) // Going from Object -> Integer -> int
            {
                int wordIndexStart = ( ( i - numberOfWords ) <= 0 ) ? 0 : i - numberOfWords; // These two get the locations in the array
                int wordIndexEnd = ( ( i + numberOfWords ) >= spaceArray.size() ) ? spaceArray.size() : i + numberOfWords;
                int substringStart = (wordIndexStart == 0) ? 0 : ((Integer)spaceArray.get(wordIndexStart)).intValue(); // These two get the values of the locations in the array
                // If the window runs past the last space, cut at the end of the text; spaceArray.get(spaceArray.size()) would throw IndexOutOfBoundsException
                int substringEnd = (wordIndexEnd == spaceArray.size()) ? xmlText.length() : ((Integer)spaceArray.get(wordIndexEnd)).intValue();

                xmlText = xmlText.substring(substringStart, substringEnd);
                break arrayforloop;
            }
        }
    }
    else
    {
        xmlText = "";
    }

    // Do these two last
    myPattern = Pattern.compile("\\[hit\\]");
    myMatcher = myPattern.matcher(xmlText);
    xmlText = myMatcher.replaceAll("<span class=\"sr-hit\">");

    myPattern = Pattern.compile("\\[/hit\\]");
    myMatcher = myPattern.matcher(xmlText);
    xmlText = myMatcher.replaceAll("</span>");

    return xmlText;
}
 
Author and all-around good cowpoke
Posts: 13078
If I have followed your logic, you could have all those patterns compiled as static variables, instead of as new pattern objects every time the method is called. Compiled Pattern objects are safe for multithreading but Matchers are not.
Bill
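
For illustration, here is a minimal sketch of the suggestion above: the patterns are compiled once into static final fields, and a fresh Matcher is created on every call. The class name SnippetPatterns, the method name stripMarkup, and the field names are hypothetical, the pattern list is shortened to two entries from the original array, and the for-each loop assumes Java 5 or later. It covers only the tag-stripping step, not the word counting.

```java
import java.util.regex.Pattern;

// Hypothetical holder class; all names here are illustrative, not from the original post.
final class SnippetPatterns {

    // Compiled once when the class is loaded. Pattern instances are immutable
    // and safe to share between threads.
    private static final Pattern HIT_OPEN = Pattern.compile("<hit (.*?)>");
    private static final Pattern HIT_CLOSE = Pattern.compile("</hit>");

    // Shortened list for the sketch; the full array from the post would go here.
    private static final Pattern[] STRIP_PATTERNS = {
        Pattern.compile("<chapter-title(.*?)</chapter-title>"),
        Pattern.compile("<(.|\n)+?>")
    };

    // Replaces the hit tags with placeholders and strips the remaining markup.
    // Each call to matcher() creates a new Matcher, so no Matcher is shared
    // between threads.
    static String stripMarkup(String xmlText) {
        xmlText = HIT_OPEN.matcher(xmlText).replaceAll("[hit]");
        xmlText = HIT_CLOSE.matcher(xmlText).replaceAll("[/hit]");
        for (Pattern pattern : STRIP_PATTERNS) {
            xmlText = pattern.matcher(xmlText).replaceAll("");
        }
        return xmlText;
    }

    public static void main(String[] args) {
        String input = "<chapter-title>Intro</chapter-title>"
                + "<para>find the <hit id=\"1\">term</hit> here</para>";
        System.out.println(stripMarkup(input)); // prints: find the [hit]term[/hit] here
    }
}
```

The only per-call cost left is creating the Matcher objects, which is cheap compared with recompiling every expression on each invocation.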
 