
in-container search engine

I'm trying to think of a way to set up a search engine that "stays" inside the J2EE container, more specifically one that spiders its own web application without going through a firewall set up "in front of" the server.
All our HTML goes through one servlet (older J2EE though, no Filters...), so I was thinking of going straight for the HttpServletRequest and HttpServletResponse.
- create a request inside the container (no actual web client involved)
- pass it on to doGet
- "capture" the response (again in-container) and run it through Lucene
- rip the URL links from the output
- create a new request from each link
- rinse, repeat
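To make the loop above concrete, here is a rough sketch of it in plain Java. It is only an illustration under assumptions: the `render` function is a hypothetical stand-in for "build request, call doGet, capture response", the class and method names are made up, and the regex link extraction is deliberately naive (a real HTML parser would be more robust). The Lucene call is left as a comment.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InContainerCrawler {
    // Naive href matcher, good enough for a sketch only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns every href value found in the given HTML, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Breadth-first crawl starting at 'start'. The 'render' function is a
     * hypothetical hook that returns the HTML for a path, or null if there is none.
     * Returns the paths that were requested, in visiting order.
     */
    static Set<String> crawl(String start, Function<String, String> render) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String path = queue.poll();
            if (!visited.add(path)) {
                continue; // already requested
            }
            String html = render.apply(path);
            if (html == null) {
                continue; // nothing rendered for this path
            }
            // Indexing (for example with Lucene) would happen here.
            for (String link : extractLinks(html)) {
                // Only follow links that stay inside the web application.
                if (link.startsWith("/") && !visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> site = new HashMap<>();
        site.put("/", "<a href=\"/a\">A</a>");
        site.put("/a", "<a href=\"/\">home</a>");
        System.out.println(crawl("/", site::get));
    }
}
```

In the real thing, `render` would wrap the servlet call; keeping it as a plain function makes the loop testable outside the container.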
- How to create/spoof a proper Request and/or capture a Response?
For now, I was thinking of subclassing these and just semi-implementing what I need.
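One way to "semi-implement" a big interface without writing every method is a dynamic proxy (java.lang.reflect.Proxy). The sketch below is an assumption-laden illustration: `MiniResponse` is a hypothetical three-method stand-in for HttpServletResponse so the example compiles without the servlet API, and `ResponseCapture` is a made-up name. Only the methods the servlet is expected to call are handled; everything else returns a harmless default.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class ResponseCapture {
    /** Hypothetical stand-in for the few HttpServletResponse methods we care about. */
    interface MiniResponse {
        void setContentType(String type);
        PrintWriter getWriter();
        boolean isCommitted();
    }

    private final StringWriter buffer = new StringWriter();
    private final PrintWriter writer = new PrintWriter(buffer);
    String contentType;

    /** Builds a proxy that records output in memory instead of sending it to a client. */
    MiniResponse asResponse() {
        InvocationHandler handler = (proxy, method, args) -> {
            switch (method.getName()) {
                case "getWriter":
                    return writer;
                case "setContentType":
                    contentType = (String) args[0];
                    return null;
                default:
                    // Not implemented in this sketch: answer with a neutral default.
                    return defaultFor(method.getReturnType());
            }
        };
        return (MiniResponse) Proxy.newProxyInstance(
                MiniResponse.class.getClassLoader(),
                new Class<?>[] {MiniResponse.class},
                handler);
    }

    /** Neutral return values; other primitive types are not covered here. */
    private static Object defaultFor(Class<?> type) {
        if (type == boolean.class) return false;
        if (type == int.class) return 0;
        if (type == long.class) return 0L;
        return null; // object types and void
    }

    /** Everything the servlet wrote so far. */
    String getContent() {
        writer.flush();
        return buffer.toString();
    }
}
```

With the servlet API on the classpath, the same handler could be pointed at HttpServletResponse.class instead, with getOutputStream handled in the same way if the servlet writes bytes. A hand-written subclass works too; the proxy just saves typing the stubs.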
- How to create a reasonable representation of the output?
Ideal would be what HttpUnit does: break up the page into title, body text, links and forms (all as nice Java objects). That way, I could get a very nice classification scheme going for Lucene (e.g. your own metadata indexed separately). The problem is that HttpUnit acts as a client, so it uses network functionality (URLConnections), which is exactly what I don't want.
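As a possible starting point, the HTML parser that ships with the JDK (javax.swing.text.html) can split a captured page into pieces without touching the network, since it reads from any Reader. This is a minimal sketch, not HttpUnit's model: `ParsedPage` and its fields are hypothetical names, forms are left out, and the whitespace handling of the body text is crude.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParsedPage {
    String title = "";
    final StringBuilder bodyText = new StringBuilder();
    final List<String> links = new ArrayList<>();

    /** Parses an HTML string that is already in memory (for example a captured response). */
    static ParsedPage parse(String html) {
        final ParsedPage page = new ParsedPage();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                } else if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        page.links.add(href.toString());
                    }
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (text.isEmpty()) {
                    return;
                }
                if (inTitle) {
                    page.title = text;
                } else {
                    page.bodyText.append(text).append(' ');
                }
            }
        };
        try {
            // 'true' tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page;
    }
}
```

The title, body text and links could then go into separate Lucene fields, which is the "metadata indexed separately" idea.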
- How to convert (HttpUnit/?) links back to meaningful Requests?
Most HTML parsers I've seen have their own native link representation, but I need it in J2EE Request form, preferably with state info as well (cookies, etc.).
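For the link-to-request step, a rough sketch of the data a spoofed request would need: the link resolved against the current page, the query string split into parameters, and the cookies carried over from the previous response. `LinkToRequest` and its fields are hypothetical names; a real version would feed these values into the getRequestURI, getParameter and getCookies methods of whatever request class ends up being used. Repeated parameter names are not handled here (the last value wins).

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class LinkToRequest {
    final String path;
    final Map<String, String> params = new LinkedHashMap<>();
    final Map<String, String> cookies;

    /**
     * @param currentPage path of the page the link was found on, e.g. "/shop/list.jsp"
     * @param href        the raw link value, relative or absolute within the application
     * @param cookies     state carried over from earlier responses
     */
    LinkToRequest(String currentPage, String href, Map<String, String> cookies) {
        // Resolve relative links ("item.jsp", "../index.jsp") against the current page.
        URI target = URI.create(currentPage).resolve(href);
        this.path = target.getPath();
        this.cookies = cookies;

        String query = target.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String name = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(decode(name), decode(value));
            }
        }
    }

    /** URL-decodes one query component, assuming UTF-8. */
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

For example, resolving "item.jsp?id=7" against "/shop/list.jsp" gives the path "/shop/item.jsp" with the parameter id set to "7".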
Advantages of this approach?
- security: most servers don't need port 80 connectivity from the outside, so it can be firewalled (I've actually seen this with one of our clients). Since a conventional spider uses (and should use) the actual server names, its requests automatically pass through that firewall; an in-container spider never leaves the server.
- performance: since you're so close to the web-application, I expect a huge performance gain
- the spider doesn't show up in the logs, so usage profiling is more realistic
Ideas? Suggestions? Anybody that wants to join in?