
How to search for a pattern in a large file using Java

 
palanisamy subramani
Greenhorn
Posts: 29
I can search for a pattern in a small file and get results, but when I search a 1GB file I get a heap error.

What is the best way to search for a pattern in big files using Java?

Thanks
 
Rob Spoor
Sheriff
Pie
Posts: 20613
63
Chrome Eclipse IDE Java Windows
If each occurrence of the pattern fits on a single line, you shouldn't store every line, only the current one. In pseudo code:
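
A minimal Java sketch of this line-at-a-time idea, not necessarily the snippet originally posted here. The class name `LineSearch` and the method `matchingLineNumbers` are hypothetical, and the sketch assumes every occurrence of the pattern fits within one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: the names are illustrative only.
final class LineSearch {

    /** Returns the 1-based numbers of the lines in which the pattern occurs. */
    static List<Long> matchingLineNumbers(Reader source, Pattern pattern) throws IOException {
        List<Long> hits = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            long lineNumber = 0;
            // Only the current line is referenced; earlier lines can be garbage collected.
            while ((line = in.readLine()) != null) {
                lineNumber++;
                if (pattern.matcher(line).find()) {
                    hits.add(lineNumber);
                }
            }
        }
        return hits;
    }
}
```

For a real file, the reader could come from `Files.newBufferedReader(Paths.get(...))`. Memory use then depends on the longest line and the number of hits, not on the file size.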
 
palanisamy subramani
Greenhorn
Posts: 29
Thanks, Rob, for your quick reply.
The pattern spans more than one line, so I cannot go line by line.

To search for a multi-line pattern I have to load the entire file into memory and then search it, and that is what causes the heap memory error.

What would be the best way to search for the pattern in this scenario?

Thanks
 
John de Michele
Rancher
Posts: 600
Palanisamy:

That is almost certainly the wrong way to go about it. What precisely are you searching for, and what have you tried so far?

John.
 
Winston Gutkowski
Bartender
Pie
Posts: 10509
64
Eclipse IDE Hibernate Ubuntu
palanisamy subramani wrote: Thanks, Rob, for your quick reply.
The pattern spans more than one line, so I cannot go line by line.

To search for a multi-line pattern I have to load the entire file into memory and then search it

Like John, I suspect that's not correct.

Are you saying that the pattern can be 1GB long? It seems unlikely.

So the usual solution is to read in only as much as you need to guarantee a match.
Alternatively, break up the pattern into logical pieces that can be searched for procedurally.
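
One way to read in "as much as you need" is a sliding window over fixed-size chunks. A minimal sketch, assuming the pattern is a literal string of known length: after each chunk, keep the last `needle.length() - 1` characters so that a match straddling two chunks is still found. `ChunkedSearch`, `findAll` and `chunkSize` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assumes a literal (non-regex) search string.
final class ChunkedSearch {

    /** Returns the character offsets of all non-overlapping occurrences of needle. */
    static List<Long> findAll(Reader source, String needle, int chunkSize) throws IOException {
        if (needle.isEmpty() || chunkSize <= 0) {
            throw new IllegalArgumentException("needle must be non-empty and chunkSize positive");
        }
        List<Long> offsets = new ArrayList<>();
        StringBuilder window = new StringBuilder();
        char[] chunk = new char[chunkSize];
        long windowStart = 0; // offset of window.charAt(0) within the whole input
        int read;
        while ((read = source.read(chunk)) != -1) {
            window.append(chunk, 0, read);
            int from = 0;
            int hit;
            while ((hit = window.indexOf(needle, from)) >= 0) {
                offsets.add(windowStart + hit);
                from = hit + needle.length(); // continue after this match
            }
            // Keep the unmatched tail: a match may begin there and finish in the next chunk.
            int keepFrom = Math.max(from, window.length() - (needle.length() - 1));
            window.delete(0, keepFrom);
            windowStart += keepFrom;
        }
        return offsets;
    }
}
```

The window never holds much more than `chunkSize + needle.length()` characters, whatever the file size. Offsets are counted in Java `char` units, not bytes. A regex with no upper bound on match length would need a different stopping rule.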


Winston
 
Rob Spoor
Sheriff
Pie
Posts: 20613
63
Chrome Eclipse IDE Java Windows
I am right now going through a 3GB log file, matching a specific pattern on each line, and moving that line to a file depending on the value of that pattern. (To be more precise, I'm splitting a single 3GB Apache HTTPD log file of several months into one log file per day.) No problem with that.
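
A rough sketch of that kind of per-day split, not the actual code used. It assumes each line carries an Apache-style timestamp such as `[10/Oct/2000:13:55:36 -0700]`. The class `LogSplitter`, the output file naming, and the choice to skip lines without a timestamp are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and file layout are illustrative only.
final class LogSplitter {

    // Captures day, month and year from a timestamp like [10/Oct/2000:13:55:36 -0700].
    private static final Pattern DAY = Pattern.compile("\\[(\\d{2})/(\\w{3})/(\\d{4}):");

    /** Returns a key such as "2000-Oct-10", or null if the line has no timestamp. */
    static String dayOf(String line) {
        Matcher m = DAY.matcher(line);
        return m.find() ? m.group(3) + "-" + m.group(2) + "-" + m.group(1) : null;
    }

    /** Copies each line to outDir/access-<day>.log, one line in memory at a time. */
    static void split(BufferedReader in, Path outDir) throws IOException {
        String currentDay = null;
        BufferedWriter out = null;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String day = dayOf(line);
                if (day == null) {
                    continue; // assumed policy: ignore lines without a timestamp
                }
                if (!day.equals(currentDay)) {
                    if (out != null) {
                        out.close();
                    }
                    // Append mode, so a day that reappears later is not overwritten.
                    out = Files.newBufferedWriter(outDir.resolve("access-" + day + ".log"),
                            StandardCharsets.UTF_8,
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    currentDay = day;
                }
                out.write(line);
                out.newLine();
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
    }
}
```

Only one output file is open at a time, which suits a log whose lines are mostly in date order.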
 
palanisamy subramani
Greenhorn
Posts: 29
John,

I have tried with files smaller than 1MB and got results. When I tried a 1GB file, I got a heap error.
If this is the wrong way, then what is the best way to do it?

Thanks
 
palanisamy subramani
Greenhorn
Posts: 29
Here is some more information.

Notes:
Each day produces one file, which may end up larger than 1GB and contains log data with XML data inside.
The pattern is about 5 lines of XML. The pattern may repeat many times.

Summarising the options you have suggested:
1) Split the file into smaller files and read from those. -- A multi-line pattern may be split between two files, so a match may be missed.
2) Split the pattern into line-by-line patterns. -- Complex logic is required to filter the pattern.


All your comments are appreciated.

 
Mike Simmons
Ranch Hand
Posts: 3090
14
[composed without seeing the last comment above]

Can you tell us about the pattern? What is it? Can you identify an initial part of the pattern that only takes up one line, and search for that first? Is there any size limit for how much text can be inside the pattern?
 
Mike Simmons
Ranch Hand
Posts: 3090
14
palanisamy subramani wrote: The pattern is about 5 lines of XML. The pattern may repeat many times.

Is the whole file XML? It may well be easier to use an XML parser, one that doesn't load the whole DOM into memory. In the early days of Java XML processing, that would have meant using a SAX parser; I'm not sure what the best choices are now.
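
One streaming option that ships with the JDK is StAX (`javax.xml.stream`), a pull parser that never builds a DOM. A minimal sketch, assuming the whole input is a single well-formed XML document, which may not hold for a log file with XML fragments embedded in it. The class `ElementCounter` and the idea of simply counting elements are illustrative assumptions.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: only valid if the input is one well-formed XML document.
final class ElementCounter {

    /** Counts start tags with the given local name without loading the document. */
    static int count(Reader source, String elementName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTD processing is not needed for this sketch, so switch it off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader xml = factory.createXMLStreamReader(source);
        int count = 0;
        try {
            // The parser hands over one event at a time; nothing is accumulated.
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(xml.getLocalName())) {
                    count++;
                }
            }
        } finally {
            xml.close();
        }
        return count;
    }
}
```

Reacting to `START_ELEMENT` and `END_ELEMENT` events in the same loop would allow the content between a particular start and end tag to be collected instead of counted.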

Is there a particular start and end tag that you're looking for? Do you want all instances of that start and end tag? Or is the pattern more complex than that?
 
John de Michele
Rancher
Posts: 600
Palanisamy:

The problem with reading large files whole into memory is exactly what you describe: you run out of memory. It is also horribly inefficient and wastes resources. If that five-line XML pattern is consistent, then what you probably want to do is check for the first line, and if that matches, check whether the next four lines match. That way, your file can be 1MB, or 1GB, or 1TB, and you don't have the problem of accidentally splitting files in the middle of the pattern you're looking for.
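
A minimal sketch of this approach, written as a small variant: it keeps only the last N lines in a queue and compares them with the N per-line patterns, which also covers the case where a failed partial match contains the start of a real one. The class `MultiLineSearch`, the method `findBlocks`, and the use of one regex per line are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: one regex per line of the multi-line pattern.
final class MultiLineSearch {

    /** Returns the 1-based line number at which each multi-line match starts. */
    static List<Long> findBlocks(BufferedReader in, List<Pattern> linePatterns) throws IOException {
        int n = linePatterns.size();
        if (n == 0) {
            throw new IllegalArgumentException("at least one line pattern is required");
        }
        List<Long> starts = new ArrayList<>();
        Deque<String> window = new ArrayDeque<>(n); // never holds more than n lines
        String line;
        long lineNumber = 0;
        while ((line = in.readLine()) != null) {
            lineNumber++;
            window.addLast(line);
            if (window.size() > n) {
                window.removeFirst();
            }
            if (window.size() == n && matches(window, linePatterns)) {
                starts.add(lineNumber - n + 1);
            }
        }
        return starts;
    }

    // Compares the buffered lines, oldest first, with the patterns in order.
    private static boolean matches(Deque<String> window, List<Pattern> linePatterns) {
        Iterator<String> lines = window.iterator();
        for (Pattern p : linePatterns) {
            if (!p.matcher(lines.next()).find()) {
                return false;
            }
        }
        return true;
    }
}
```

For a five-line XML pattern, `linePatterns` would hold five entries, and memory use stays at five lines regardless of file size.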

John.
 
palanisamy subramani
Greenhorn
Posts: 29

I broke the multi-line pattern into single-line patterns and am now able to search the huge file without any issue.

Thanks to all for your valuable comments!!!
 