search engine to search both at database and web application level

 
sapna rana
Greenhorn
Please suggest a search engine which can search at both the database and web application level.
 
Paul Sturrock
Bartender
Lucene is good.
 
Ulf Dittmer
Rancher
Lucene rocks, but it needs an indexer for each data source that you want to search. I'm not aware that one for databases exists, although it wouldn't be hard to write one. Just wanted to give a heads-up that it's not a simple plug-and-play solution.
 
Sheriff

Originally posted by Ulf Dittmer:
I'm not aware that one for databases exists, although it wouldn't be hard to write one.


It's not. I wrote a simple tool a while ago that searches any database for any String. DatabaseMetaData will help you retrieve the tables from a connection, and for the rest it's just "SELECT * FROM <table>": iterate through the result set and the columns (using ResultSetMetaData), and voilà.
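For illustration, a minimal sketch of that approach (error handling trimmed; a real tool would want to skip binary columns, since getString() behavior on non-text types varies by driver):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

public class DatabaseGrep {

    // Prints every table.column value that contains the search term.
    public static void search(Connection con, String term) throws SQLException {
        DatabaseMetaData meta = con.getMetaData();
        // List all tables visible through this connection
        ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"});
        while (tables.next()) {
            String table = tables.getString("TABLE_NAME");
            Statement stmt = con.createStatement();
            ResultSet rows = stmt.executeQuery("SELECT * FROM " + table);
            ResultSetMetaData rsMeta = rows.getMetaData();
            while (rows.next()) {
                for (int col = 1; col <= rsMeta.getColumnCount(); col++) {
                    String value = rows.getString(col);
                    if (value != null && value.contains(term)) {
                        System.out.println(table + "." + rsMeta.getColumnName(col) + ": " + value);
                    }
                }
            }
            rows.close();
            stmt.close();
        }
        tables.close();
    }
}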
 
Paul Sturrock
Bartender
Some databases, SQL Server for example, provide this sort of service out of the box. It includes a free-text search service, so some sort of Lucene/SQL Server mix would be a possibility. Other databases presumably have competing offerings.
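For example, SQL Server's CONTAINS predicate can be queried through plain JDBC. This is only a sketch: the table and column names are made up, and it assumes a full-text index has already been created on the column.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class FullTextSearchExample {

    // Assumes a SQL Server full-text index exists on documents.body;
    // the table and column names here are hypothetical.
    public static void search(Connection con, String term) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                "SELECT id, title FROM documents WHERE CONTAINS(body, ?)");
        ps.setString(1, term);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("id") + ": " + rs.getString("title"));
        }
        rs.close();
        ps.close();
    }
}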
 
sapna rana
Greenhorn
Hi,

I have tried Nutch, but it only does web search and does not include any search at the database level.

The following error appeared while using it:

/**********************************/

Nutch search engine (nutch-0.7.2).

After installing Nutch and Tomcat, I tried to crawl three URLs, one of which is my web application on JBoss,

using the command:

nutch crawl urls -dir crawl -depth 3 >& crawl.log

where urls is a file under the Nutch directory that contains three URLs:
"http://localhost:8080/vinweb"
"http://www.orkut.co.in"
"http://apache.com"


But after crawling, I checked crawl.log, and it seems it didn't fetch anything:

080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.

The following is my crawl.log file:
*****************************************
run java in C:\Program Files\Java\jdk1.5.0_12
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-default.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/crawl-tool.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-site.xml
080901 193120 No FS indicated, using default:local
080901 193120 crawl started in: crawl
080901 193120 rootUrlFile = urls
080901 193120 threads = 10
080901 193120 depth = 3
080901 193120 Created webdb at LocalFS,E:\SearchTools\nutch-0.7.2\crawl\db
080901 193120 Starting URL processing
080901 193120 Plugins: looking in: E:\SearchTools\nutch-0.7.2\plugins
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\clustering-carrot2
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\creativecommons
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\index-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\index-more
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\language-identifier
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\ontology
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-ext
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-html\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-js
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-msword
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-pdf
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-rss
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-text\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-file
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-ftp
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\protocol-http\plugin.xml
080901 193120 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-httpclient
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\query-more
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-site\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-url\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-prefix
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
080901 193120 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
080901 193120 found resource crawl-urlfilter.txt at file:/E:/SearchTools/nutch-0.7.2/conf/crawl-urlfilter.txt
.080901 193120 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
080901 193120 bad url: "http://localhost:8080/vinweb"
.080901 193120 bad url: "http://www.orkut.co.in"
....080901 193120 Added 0 pages
080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193121 Overall processing: Sorted NaN entries/second
080901 193121 FetchListTool completed
080901 193121 logging at INFO
080901 193122 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193122 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193122 Finishing update
080901 193122 Update finished
080901 193122 FetchListTool started
080901 193122 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193122 Overall processing: Sorted NaN entries/second
080901 193122 FetchListTool completed
080901 193122 logging at INFO
080901 193123 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193123 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193123 Finishing update
080901 193123 Update finished
080901 193123 FetchListTool started
080901 193123 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193123 Overall processing: Sorted NaN entries/second
080901 193124 FetchListTool completed
080901 193124 logging at INFO
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Finishing update
080901 193125 Update finished
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Sorting pages by url...
080901 193125 Getting updated scores and anchors from db...
080901 193125 Sorting updates by segment...
080901 193125 Updating segments...
080901 193125 Done updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 * Opening segment 20080901193120
080901 193125 * Indexing segment 20080901193120
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193120: total 0 records in 0.047 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 * Opening segment 20080901193122
080901 193125 * Indexing segment 20080901193122
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193122: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 * Opening segment 20080901193123
080901 193125 * Indexing segment 20080901193123
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193123: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 Reading url hashes...
080901 193125 Sorting url hashes...
080901 193125 Deleting url duplicates...
080901 193125 Deleted 0 url duplicates.
080901 193125 Reading content hashes...
080901 193125 Sorting content hashes...
080901 193125 Deleting content duplicates...
080901 193125 Deleted 0 content duplicates.
080901 193125 Duplicate deletion complete locally. Now returning to NFS...
080901 193125 DeleteDuplicates complete
080901 193125 Merging segment indexes...
080901 193125 crawl finished: crawl

*******************************************

and the following entries are in my crawl-urlfilter.txt:

*******************************************

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME

# synapse is my domain name
+^http://([a-z0-9]*\.)*synapse.com

+^http://([a-z0-9]*\.)*apache.org

+^http://([a-z0-9]*\.)*localhost:8080/vinweb

+^http://([a-z0-9]*\.)*orkut.co.in


# skip everything else
-.

*************************************************

And the search returns no results (NULL) in the web UI.

Any suggestion would be very helpful.

/**********************************/

At the same time, please suggest which one is better to use, Lucene or Nutch. I have to implement it in my application on Struts/JBoss.

Thanks in advance.
 
Ulf Dittmer
Rancher
I have no hands-on experience with Nutch, but it's just a web crawling engine. I'm fairly certain that it uses Lucene underneath to do the indexing and searching.
 
Ranch Hand
If you are looking for database-level search, then you need to store all the required data in a search engine index.

Fetch all searchable data from the database and store it in (say) a Lucene index.

Direct search on the database may not be that effective if you have to run SQL for each search. A sketch of the indexing step follows.
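In outline, using the Lucene 2.x-era API (the SQL, field names, and index location are only examples):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UserIndexer {

    // Sketch: pull the searchable columns out of the database and index them.
    // Table and column names are hypothetical.
    public static void index(Connection con, String indexDir) throws Exception {
        // true = create a fresh index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, name, email FROM users");
        while (rs.next()) {
            Document doc = new Document();
            // Store the primary key so a hit can be traced back to its row
            doc.add(new Field("id", rs.getString("id"),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Tokenize and index the text we want to be able to search on
            doc.add(new Field("contents", rs.getString("name") + " " + rs.getString("email"),
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        rs.close();
        stmt.close();
        writer.optimize();
        writer.close();
    }
}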
 
sapna rana
Greenhorn
Hi,

I have implemented Lucene in my application and am able to index and search PDF, Word, text, and HTML documents.
Please provide me a reference for how to parse and index Excel and XML files.

Thanks and regards
 
Paul Sturrock
Bartender
Doesn't Lucene's own documentation have links for that? For Excel, you just need to use POI. For XML, you can either parse it first or just treat it as plain text.
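A rough sketch with POI's HSSF usermodel (cell.toString() is a quick-and-dirty way to get text out of a cell; a real indexer would switch on the cell type):

import java.io.FileInputStream;
import java.util.Iterator;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public class ExcelTextExtractor {

    // Sketch: collect all cell text from an .xls file so it can be fed to Lucene.
    public static String extract(String fileName) throws Exception {
        StringBuilder text = new StringBuilder();
        FileInputStream in = new FileInputStream(fileName);
        HSSFWorkbook workbook = new HSSFWorkbook(in);
        for (int i = 0; i < workbook.getNumberOfSheets(); i++) {
            HSSFSheet sheet = workbook.getSheetAt(i);
            for (Iterator rows = sheet.rowIterator(); rows.hasNext(); ) {
                HSSFRow row = (HSSFRow) rows.next();
                for (Iterator cells = row.cellIterator(); cells.hasNext(); ) {
                    HSSFCell cell = (HSSFCell) cells.next();
                    text.append(cell.toString()).append(' ');
                }
            }
        }
        in.close();
        return text.toString();
    }
}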
 
Ulf Dittmer
Rancher
You'd need to write code that reads XLS files and extracts the text from them; then you can feed the text to Lucene. Apache POI is a library that allows you to access the text in an XLS file.

For XML it's probably easiest to use the SAX API; the characters method of the document handler provides you with the text contained in the file.
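A minimal sketch of that, using the SAX support built into the JDK:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTextExtractor {

    // Sketch: gather all character data from an XML file via SAX,
    // so the result can be handed to Lucene as plain text.
    public static String extract(File file) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            public void characters(char[] ch, int start, int length) {
                // Called for each run of text between tags
                text.append(ch, start, length).append(' ');
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(file, handler);
        return text.toString();
    }
}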
 
sapna rana
Greenhorn
How can we index a database using Lucene?
I have tried to write a DBIndex to retrieve some records from the database and then write an index file. But when I search for the same value, no results are found.

Please provide me details on how we can search a database, as that is our main focus.
 
Marshal
I think if you are searching databases, this thread is no longer a "beginner's" thread. I shall have to move it.
 
Paul Sturrock
Bartender

Originally posted by sapna rana:
How can we index a database using Lucene?
I have tried to write a DBIndex to retrieve some records from the database and then write an index file. But when I search for the same value, no results are found.

Please provide me details on how we can search a database, as that is our main focus.



There is really no more detail to add. You need to index your source of data; if it's a database, your indexer will need to connect via JDBC to do this. Can you show us your code and the query you expect to return results?

Also, there is a tool called Luke that will let you examine the index and run ad-hoc queries. Sometimes the issue is nothing more than a mistake in your query syntax. The search side is sketched below for reference.
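This is only a sketch using the same era's Lucene API; the index path and the "contents" and "id" field names are assumptions and must match what your indexer actually wrote.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DBSearch {

    // Sketch: run a query against the index written by the DB indexer.
    // The field being searched must have been indexed (not just stored),
    // or nothing will ever match.
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("c:\\dbindex");
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse(args[0]);
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hit(s)");
        for (int i = 0; i < hits.length(); i++) {
            System.out.println("id = " + hits.doc(i).get("id"));
        }
        searcher.close();
    }
}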
 
sapna rana
Greenhorn
Please find my code as follows:
*****************************************************

package com.knowledgebooks.utils;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

/**
 * Indexes a few columns of the vinusers table into a Lucene index on disk.
 */
public class DBIndex {

    private Connection con;
    private String dbDriver, connectionURL, user, password;

    public DBIndex() {
        con = null;
        dbDriver = "com.mysql.jdbc.Driver";
        connectionURL = "jdbc:mysql://172.16.80.214:3306/vinprocure1";
        user = "root";
        password = "root";
    }

    public void setDBDriver(String driver) {
        this.dbDriver = driver;
    }

    public void setConnectionURL(String connectionURL) {
        this.connectionURL = connectionURL;
    }

    public void setAuthentication(String user, String password) {
        this.user = user;
        this.password = password;
    }

    public Connection getConnection() {
        try {
            Class.forName(dbDriver);
            con = DriverManager.getConnection(connectionURL, user, password);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return con;
    }

    private boolean isIndexExist(String indexPath) {
        boolean exist = false;
        try {
            IndexReader ir = IndexReader.open(indexPath);
            exist = true;
            ir.close();
        } catch (IOException e) {
            System.out.println("ioexception:-> " + e);
        } catch (Exception e) {
            System.out.println("exception:-> " + e);
        }
        return exist;
    }

    public static void main(String[] args) {
        DBIndex dbi = new DBIndex();
        try {
            Connection connection = dbi.getConnection();
            String query = "select user_id,login_name,first_name,last_name,email_address from vinusers";
            Statement statement = connection.createStatement();
            ResultSet contentResultSet = statement.executeQuery(query);

            // true = create a fresh index, overwriting any existing one
            IndexWriter writer = new IndexWriter(new File("c:\\dbindex"), new StandardAnalyzer(), true);

            while (contentResultSet.next()) {
                // Add all the fields' contents to a single string for indexing
                String contents = contentResultSet.getString(2) + " "
                        + contentResultSet.getString(3) + " " + contentResultSet.getString(4);

                System.out.println("Indexing Content no.(ID) " + contentResultSet.getShort(1) + "\n" + contents);

                // Create an index document for a single record of the vinusers table.
                // Note: the boolean arguments of this old Field constructor are store, index, tokenize.
                Document doc = new Document();
                doc.add(new Field("contents", contents, false, false, true));
                doc.add(new Field("id", contentResultSet.getString(2), true, false, false));
                writer.addDocument(doc);
            }
            writer.close();
            contentResultSet.close();
            statement.close();
            connection.close();
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
****************************************************************

It creates an index in C:\dbindex and writes the index values into it, containing the results of the query fetched above.

When I search for the same name, 0 results are found.
 
Ulf Dittmer
Rancher
Please go back and edit your post to UseCodeTags. It's unnecessarily hard to read as it is.

How are you searching the index?
 