
Flat File Search

 
Greg Kearn
Greenhorn
Posts: 20
Hi All
What's the fastest way to search a flat CSV file (27,000 rows, 5 fields)? I'm currently reading it with a BufferedReader and breaking the fields apart with a StringTokenizer. How can I make searching this file faster?
Thanks
 
Ilja Preuss
author
Sheriff
Posts: 14112
Depends on what specifically you are searching for. Possibly regular expressions could help.
 
Mark Herschberg
Sheriff
Posts: 6037
Can you be more specific? What are you looking for? What do you know, if anything, about its location in the file? What do you know about the other data?
--Mark
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The StringTokenizer is creating 5 String objects for each line read. This is probably unnecessary overhead, especially if most lines don't have what you're looking for anyway. StringTokenizer is more for ease of use than efficiency. The same may well be true for regular expressions, but at least they're much more flexible. Some alternatives are: use String's indexOf() method to find a particular character, if it exists. Or, read a group of chars into a char[], and loop through the array to see what's in it. Avoid creating objects unless there's a good reason.
As Ilja says though, the best solution will depend on what you're searching for. Give some more details, and we may have better recommendations.
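A minimal sketch of the char[] idea above (the class and method names here are invented for illustration): scan a buffer of characters for a target string directly, so that non-matching lines cost nothing beyond the comparison itself and no per-line String objects are created.

```java
// Sketch of scanning a char[] buffer for a target string without creating
// intermediate String objects. All names are made up for illustration.
public class CharScan {
    // Return true if target occurs anywhere in buf[0..len).
    static boolean contains(char[] buf, int len, String target) {
        int tlen = target.length();
        outer:
        for (int i = 0; i + tlen <= len; i++) {
            for (int j = 0; j < tlen; j++) {
                if (buf[i + j] != target.charAt(j)) continue outer;
            }
            return true;   // every character of target matched at offset i
        }
        return false;
    }

    public static void main(String[] args) {
        char[] buf = "Miami,FL,33341,305,E".toCharArray();
        System.out.println(contains(buf, buf.length, "Miami"));
        System.out.println(contains(buf, buf.length, "Tampa"));
    }
}
```

In practice you would fill the buffer from a Reader in large chunks rather than line by line, which is where the savings over StringTokenizer come from.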
 
Greg Kearn
Greenhorn
Posts: 20
I give the user the ability to search on one of the three: City, Area Code, Zip.
The results that are displayed are "City, State, Zip, Area Code, Time Zone".
Thanks
Greg
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I'd try using String's indexOf(String) to see if a given line contains the substring you're looking for. Most of the time this will be no, so no need to spend any further time on that line. If it's a yes, then you can use a StringTokenizer to work through the commas and make sure the string is in the right place.
There are faster ways, I think, but they'll be more complex to implement. A couple of things to consider: should case be ignored? Does the user's search pattern need to allow wildcards? If either answer is yes, indexOf() will be hard to use effectively, and either a regex package or a hand-built parser based around a char[] array read would be the way I'd go. If performance is still too poor with indexOf(), then it probably will be for regex too; work on reading a char[] array and finding the pattern you want in it.
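The prefilter-then-verify approach can be sketched roughly like this (the class and method names, and the City,State,Zip,AreaCode,TimeZone column order, are assumptions for illustration; split() stands in for the StringTokenizer step):

```java
import java.io.*;
import java.util.*;

// Sketch: use indexOf() as a cheap prefilter, and only split lines that
// might actually match. File layout and all names here are assumptions.
public class CsvSearch {
    static List<String> search(BufferedReader in, String query, int field)
            throws IOException {
        List<String> hits = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            // Cheap prefilter: most lines fail here and cost almost nothing.
            if (line.indexOf(query) < 0) continue;
            // Confirm the query really sits in the expected field.
            String[] parts = line.split(",");
            if (field < parts.length && parts[field].trim().equals(query)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        String data = "Miami,FL,33341,305,E\nAcme,LA,71316,318,C\n";
        BufferedReader in = new BufferedReader(new StringReader(data));
        System.out.println(search(in, "Miami", 0));
    }
}
```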
 
Greg Kearn
Greenhorn
Posts: 20
What about a HashMap? Would that be any use for this problem?
Thanks
Greg
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
It depends how many times you expect to search for data, and whether you can afford to keep all the data in memory. You could read the whole file line by line and, for each line, put a new entry in each of four hashtables. Each table would use a different field as its key, and would map to an object holding all the (other) data for that line. This would make the initial read-and-load of the file somewhat slower, but would allow you to look up data very quickly afterwards. So it's worthwhile if you've got the memory, and if searching will be done often enough that the benefit of fast searches outweighs the slow load time. Also, the hashtable will be no use if you want to search using wildcards or substrings within a key; only the exact value of the key will do. If any of this is a problem, you're probably better off with one of the other approaches mentioned.
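The load-once, look-up-many idea might look like this (one map per searchable field, each mapping an exact key to the full record; the class name and the City,State,Zip,AreaCode,TimeZone column order are assumptions):

```java
import java.io.*;
import java.util.*;

// Sketch: build one HashMap per searchable field during a single pass over
// the file. Lookups afterwards are effectively constant time, but only for
// exact keys -- no wildcards or substrings.
public class CsvIndex {
    final Map<String, String> byCity = new HashMap<>();
    final Map<String, String> byZip = new HashMap<>();
    final Map<String, String> byAreaCode = new HashMap<>();

    void load(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split(",");
            if (f.length < 5) continue;   // skip malformed rows
            byCity.put(f[0], line);
            byZip.put(f[2], line);
            byAreaCode.put(f[3], line);
        }
    }

    public static void main(String[] args) throws IOException {
        CsvIndex idx = new CsvIndex();
        idx.load(new BufferedReader(new StringReader("Miami,FL,33341,305,E\n")));
        System.out.println(idx.byZip.get("33341"));
    }
}
```

With 27,000 short rows the memory cost is modest by today's standards, so the trade-off mostly comes down to how often the file changes.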
 
Greg Kearn
Greenhorn
Posts: 20
Jim,
Thanks for your input. I'm going to try the indexOf(String) idea; I hadn't realized you can search for values without first breaking the line into tokens. I'll see if it's any faster. Currently a search for Miami takes about 45 seconds.
--Greg
 
Mark Herschberg
Sheriff
Posts: 6037
Jim always beats me to the wise answers :-)
He's right that you need to consider how often it'll be searched. If often, it should be loaded into memory.
Another option is sorting (and possibly indexing as well). Is your data sorted? If so, it can be searched much faster. Of course, you have to balance how often the data changes against how often it's searched. If it is sorted, you may want to create an index to speed searching.
--Mark
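The sorting point can be sketched with a binary search: if each record's key is its first field, plain lexicographic order on the lines lets Collections.binarySearch locate a key in O(log n) comparisons instead of a full scan. The record layout and class name below are assumptions.

```java
import java.util.*;

// Sketch: binary search over lines sorted by their first (key) field.
// Appending "," to the key makes the probe sort just before the matching
// record, so the insertion point tells us where to look.
public class SortedSearch {
    // Return the record whose first field equals key, or null.
    static String find(List<String> sortedLines, String key) {
        int i = Collections.binarySearch(sortedLines, key + ",");
        if (i < 0) i = -i - 1;   // convert "not found" to insertion point
        if (i < sortedLines.size() && sortedLines.get(i).startsWith(key + ",")) {
            return sortedLines.get(i);
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList(
            "Acme,318,71316,C", "Miami,305,33341,E", "Zwolle,318,71486,C"));
        Collections.sort(lines);
        System.out.println(find(lines, "Miami"));
    }
}
```

For 27,000 rows that is at most about 15 comparisons per lookup, versus thousands for a linear scan.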
 
Greg Kearn
Greenhorn
Posts: 20
Mark/Jim
To speed the search, the data is pre-sorted into 3 files: citydata, areadata and zipdata. The key field is the first field in the row. E.g. citydata: Miami, 305, 33341, E. zipdata: 33341, Miami, 305, E. areadata: 305, Miami, 33341, E.
Is it possible to index this file?
Thanks
 
Thomas Smets
Ranch Hand
Posts: 111
I haven't got much experience with that, but maybe you could take a look at the Jakarta ORO matcher, or Steve's pat regular expression matcher (Regular Expressions in Java).
I posted a question directly to Steven (the author of the javaregex library), and he gave me the following reply:

I don't have the time to make a login and reply right now, however...
The String.indexOf is probably the best thing suggested on the board.
A better answer would probably be to read the whole file into memory, then search it via a Boyer-Moore-Horspool algorithm. Package pat uses this if the pattern is a string literal. To avoid some of the overhead of regexes you can use the SkipBMH class contained within it. The BMH algorithm can find things faster than indexOf().
One key to speed is making fewer Strings. That is why I recommend reading the whole file into memory (if possible).
Once you've found the item you're looking for, you can backtrack to find the beginning and end of the line, and then maybe use things like substring or StringTokenizer.
Of course, this assumes you have a fairly normal CSV file. In principle, CSV allows you to put newlines in quotes, making them part of a field, and so on. This kind of processing can get quite complex.
Cheers,
Steve

As I mentioned earlier, this is the reply from Steven R Brandt. You can contact him at javaregex.com.
Hope this helps,

[ May 04, 2002: Message edited by: Thomas SMETS ]
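The Boyer-Moore-Horspool search Steve describes can be hand-rolled in a few lines. This sketch is an independent illustration, not the SkipBMH class from the pat package: on a mismatch it consults a skip table to jump ahead by up to the pattern's length, so on long texts it examines far fewer characters than a naive scan.

```java
// Independent sketch of Boyer-Moore-Horspool string search over a char[].
public class Horspool {
    // Return the index of the first occurrence of pattern in text, or -1.
    static int indexOf(char[] text, String pattern) {
        int m = pattern.length(), n = text.length;
        if (m == 0) return 0;
        if (m > n) return -1;
        // Skip table: on a mismatch, shift by the distance from the last
        // occurrence of the aligned text character to the pattern's end.
        int[] skip = new int[65536];
        java.util.Arrays.fill(skip, m);
        for (int i = 0; i < m - 1; i++) {
            skip[pattern.charAt(i)] = m - 1 - i;
        }
        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern.charAt(j)) j--;
            if (j < 0) return pos;                 // full match at pos
            pos += skip[text[pos + m - 1]];        // skip ahead
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] text = "Acme,318,71316,C\nMiami,305,33341,E\n".toCharArray();
        System.out.println(indexOf(text, "Miami"));
    }
}
```

Whether this beats String.indexOf() in practice depends on pattern length and the JVM, so it is worth timing both on the real file.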
 
Mark Herschberg
Sheriff
Posts: 6037
Originally posted by Greg Kearn:
Mark/Jim
To speed the search, the data is pre-sorted into 3 files: citydata, areadata and zipdata. The key field is the first field in the row. E.g. citydata: Miami, 305, 33341, E. zipdata: 33341, Miami, 305, E. areadata: 305, Miami, 33341, E.
Is it possible to index this file?
Thanks

Well, if the data is sorted, you can keep track of indexes, like, say where area codes starting with each digit are located; then you can skip the earlier bytes of data. There are probably more complex alternatives, but this is the basic idea.
--Mark
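A minimal sketch of that digit index (all names and the record layout are assumptions): with records sorted by area code, record where each leading digit's block begins, so a search can jump straight to the right block instead of reading every row before it.

```java
import java.util.*;

// Sketch: for rows sorted on a numeric key field, offsets[d] is the index
// of the first row whose key starts with digit d. A search for "914" can
// begin at offsets[9] and stop at the start of the next digit's block.
public class DigitIndex {
    static int[] build(List<String> sortedByAreaCode) {
        int[] offsets = new int[10];
        // Default: a digit with no rows points past the end of the data.
        Arrays.fill(offsets, sortedByAreaCode.size());
        // Walk backwards so each digit ends up with its FIRST row's index.
        for (int i = sortedByAreaCode.size() - 1; i >= 0; i--) {
            int d = sortedByAreaCode.get(i).charAt(0) - '0';
            offsets[d] = i;
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "305,Miami,33341,E", "318,Acme,71316,C", "914,Yonkers,10701,E");
        int[] offsets = build(rows);
        System.out.println(offsets[9]); // 2: the "9xx" block starts at row 2
    }
}
```

The same idea works on the file itself by storing byte offsets instead of row indices, which is essentially what a database index does for you.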
 
Greg Kearn
Greenhorn
Posts: 20
How can an index be kept? The file has to be read every time, starting at the top and working its way down, looking for the matching value. Is there another way to search a 27,000-row CSV file?
I did try the indexOf search and it did save about 5 seconds off a long search.
Thanks
Greg
 
Mark Herschberg
Sheriff
Posts: 6037
It really depends on your constraints. You could create a second file to hold the index, or you could put the index at the top of this file. Even if it's a sorted file without an index, you can speed your searching by checking for a single-digit match first, making the comparisons faster (although some JVMs may do this automatically).
In cases like this you must balance where you put the costs. If you do more searching, this makes sense; if you do more data modifications, it may be more trouble than it's worth.
--Mark
 
Frank Carver
Sheriff
Posts: 6920
Also, if the data doesn't change very often, you might consider loading it into a database: take that 45-second hit once when the data changes, then get much faster searching through JDBC.
Installing and using a database doesn't need to be a big, complex undertaking; something like Hypersonic SQL is basically just a jar file and a config to tell it where to store its data.
Using a database would mean that you can configure things like indexing at a high level without worrying about how to code them yourself.
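A rough sketch of that setup, assuming the hsqldb jar is on the classpath (the table name, column names, and in-memory URL below are illustrative choices, not anything from this thread):

```java
import java.sql.*;

// Sketch: load the CSV once into an in-memory Hypersonic SQL (HSQLDB)
// database, let the database maintain the index, and query through JDBC.
// Requires the hsqldb jar on the classpath; all names are assumptions.
public class DbLoad {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:geo", "SA", "");
        Statement st = con.createStatement();
        st.execute("CREATE TABLE places (city VARCHAR(50), state VARCHAR(2), "
                 + "zip VARCHAR(10), area VARCHAR(3), tz VARCHAR(1))");
        // The database handles the indexing that was being hand-rolled above.
        st.execute("CREATE INDEX idx_city ON places (city)");
        st.execute("INSERT INTO places VALUES ('Miami','FL','33341','305','E')");

        PreparedStatement q = con.prepareStatement(
            "SELECT * FROM places WHERE city = ?");
        q.setString(1, "Miami");
        ResultSet rs = q.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("city") + ", " + rs.getString("zip"));
        }
        con.close();
    }
}
```

In a real program the INSERTs would come from a loop over the CSV lines, ideally batched inside one transaction.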
 
Greg Kearn
Greenhorn
Posts: 20
I gave Hypersonic SQL a try without success, so I tried PointBase and that did work. Timing started when the button was pressed and ended when the data stopped scrolling. The results are as follows:
City   | Top-to-bottom search | PointBase search
Acme   | 0:03                 | 0:09
Miami  | 1:10                 | 0:43
Zwolle | 1:25                 | 0:09
Thanks for all your input!
--Greg