How to search for a pattern in a large file using Java

 
palanisamy subramani
Greenhorn
Posts: 29
I tried searching for a pattern in a small file and got results, but when I search a 1 GB file I get a heap error.

What is the best way to search for a pattern in big files using Java?

Thanks
 
Rob Spoor
Sheriff
Posts: 22787
If the pattern never spans more than one line, you shouldn't store every line, only the current one. In pseudo code:
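
A minimal Java sketch of that idea, assuming a UTF-8 text file and a regular expression that never needs to see more than one line at a time. The names `LineSearch` and `matchingLineNumbers` are made up for illustration and do not come from the original post.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LineSearch {

    // Returns the 1-based numbers of the lines that contain a match.
    // Only the current line is held in memory; the result list grows
    // with the number of matches, not with the size of the file.
    static List<Long> matchingLineNumbers(BufferedReader reader, Pattern pattern)
            throws IOException {
        List<Long> hits = new ArrayList<>();
        long lineNumber = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (pattern.matcher(line).find()) {
                hits.add(lineNumber);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: LineSearch <file> <regex>");
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(matchingLineNumbers(reader, Pattern.compile(args[1])));
        }
    }
}
```

Because the reader is buffered and each line is discarded after it is checked, heap use should stay roughly constant whether the file is 1 MB or several GB, as long as no single line is enormous.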
 
palanisamy subramani
Greenhorn
Posts: 29
Thanks Rob for your quick reply.
Here the pattern spans more than one line, so I cannot go line by line.

To search for the pattern (more than one line) I have to load the entire file into memory and then search it, and that is what causes the heap memory error.

What would be the best way to search for the pattern in this scenario?

Thanks
 
John de Michele
Rancher
Posts: 600
Palanisamy:

That is almost certainly the wrong way to go about it. What precisely are you searching for, and what have you tried so far?

John.
 
Bartender
Posts: 10780

palanisamy subramani wrote:Thanks Rob for your quick reply,
Here pattern is in more than one line So i cannot go line by line.

To search the pattern(more than a line) i have to load entire file in memory and have to do search for the pattern


Like John, I suspect that's not correct.

Are you saying that the pattern can be 1 GB long? It seems unlikely.

So the usual solution is to read in only as much as you need to guarantee a match.
Alternatively, break up the pattern into logical pieces that can be searched for procedurally.
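
One way to read "only as much as you need" is to scan the input in fixed-size chunks and carry the last few characters of each chunk over to the next, so a match that straddles a chunk boundary is still seen. A minimal sketch, assuming the pattern is a literal string (a regular expression would additionally need a known upper bound on its match length). `ChunkedSearch`, `findAll` and the chunk size are illustrative choices, not something given in this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ChunkedSearch {

    // Returns the character offsets at which `needle` starts in the stream.
    // Memory use is bounded by chunkSize + needle.length(), not by file size.
    static List<Long> findAll(Reader in, String needle, int chunkSize) throws IOException {
        if (needle.isEmpty() || chunkSize <= 0) {
            throw new IllegalArgumentException("needle must be non-empty, chunkSize positive");
        }
        List<Long> offsets = new ArrayList<>();
        int overlap = needle.length() - 1;   // longest prefix that may sit at a chunk end
        StringBuilder window = new StringBuilder();
        char[] buf = new char[chunkSize];
        long windowStart = 0;                // stream offset of window.charAt(0)
        int n;
        while ((n = in.read(buf)) != -1) {
            window.append(buf, 0, n);
            int from = 0;
            int idx;
            while ((idx = window.indexOf(needle, from)) >= 0) {
                offsets.add(windowStart + idx);
                from = idx + 1;
            }
            // Keep only the tail that could still begin a match completed
            // by the next chunk; everything before it has been fully searched.
            int keepFrom = Math.max(0, window.length() - overlap);
            window.delete(0, keepFrom);
            windowStart += keepFrom;
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input with a two-line needle.
        String text = "xx<a>\n<b>yy<a>\n<b>";
        System.out.println(findAll(new StringReader(text), "<a>\n<b>", 4));
    }
}
```

The retained tail is one character shorter than the needle, so it can never contain a complete match by itself. That is what prevents the same occurrence from being reported twice.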


Winston
 
Rob Spoor
Sheriff
Posts: 22787
I am right now going through a 3GB log file, matching a specific pattern on each line, and moving that line to a file depending on the value of that pattern. (To be more precise, I'm splitting a single 3GB Apache HTTPD log file of several months into one log file per day.) No problem with that.
 
palanisamy subramani
Greenhorn
Posts: 29
John,

I have tried with files smaller than 1 MB and got results. When I tried a 1 GB file, I got a heap error.
If this is the wrong way, then what is the best way to do it?

Thanks
 
palanisamy subramani
Greenhorn
Posts: 29
Some more information on this.

Notes:
One file is produced per day, and it may end up larger than 1 GB, with log data and XML data inside.
The pattern is about 5 lines of XML. That pattern may repeat many times.

Summarising the options you have suggested:
1) Split the file into smaller files and read from those. -- A multiline pattern may be split across two files, so a match may be missed.
2) Split the pattern into line-by-line patterns. -- Complex logic is required to filter the pattern.


All your comments are appreciated.

 
Mike Simmons
Master Rancher
Posts: 4919
[composed without seeing the last comment above]

Can you tell us about the pattern? What is it? Can you identify an initial part of the pattern that only takes up one line, and search for that first? Is there any size limit for how much text can be inside the pattern?
 
Mike Simmons
Master Rancher
Posts: 4919

palanisamy subramani wrote:Pattern is like 5 lines of XML .That pattern may repeat many times.


Is the whole file XML? It may well be easier to use an XML parser, one that doesn't load the whole DOM into memory. In the early days of Java XML processing, that would have meant using a SAX parser; I'm not sure what the best choices are now.
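
As one possible illustration of a parser that does not build a DOM, here is a minimal sketch using StAX (`javax.xml.stream`, part of the JDK). It only applies if the whole input really is well-formed XML, which has not been confirmed in this thread. The class name `StaxCount` and the sample tag names are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxCount {

    // Counts start tags with the given local name while streaming through
    // the document; the parser only keeps the current event in memory.
    static int countElements(Reader in, String tagName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Defensive settings: do not process DTDs or external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader xml = factory.createXMLStreamReader(in);
        int count = 0;
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && tagName.equals(xml.getLocalName())) {
                    count++;
                }
            }
        } finally {
            xml.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample document.
        String sample = "<log><order id='1'><item/></order><order id='2'/></log>";
        System.out.println(countElements(new StringReader(sample), "order"));
    }
}
```

If the file mixes plain log lines with XML fragments, a parser like this would reject it, and a line-oriented approach is likely the better fit.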

Is there a particular start and end tag that you're looking for? Do you want all instances of that start and end tag? Or is the pattern more complex than that?
 
John de Michele
Rancher
Posts: 600
Palanisamy:

The problem with reading large files whole into memory is exactly what you describe: you run out of memory, it's horribly inefficient, it wastes resources, etc. If that five-line XML pattern is consistent, then what you probably want to do is check for the first line, and if that matches, check to see if the next four lines match. That way, your file can be 1 MB, or 1 GB, or 1 TB, and you don't have the problem of accidentally splitting files in the middle of the pattern you're looking for.
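
A minimal sketch of this, written as a small variant: instead of looking ahead after the first line matches, it keeps a sliding window of the last few lines and tests the whole window, which also covers the case where a failed attempt contains the start of a real match. It assumes the multiline pattern can be expressed as one regular expression per line. `BlockSearch`, `findBlocks` and the sample tags are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Pattern;

class BlockSearch {

    // Returns the 1-based line numbers where a block starts whose consecutive
    // lines match linePatterns in order. Only linePatterns.size() lines are
    // held in memory at any time.
    static List<Long> findBlocks(BufferedReader reader, List<Pattern> linePatterns)
            throws IOException {
        int k = linePatterns.size();
        if (k == 0) {
            throw new IllegalArgumentException("at least one line pattern is required");
        }
        Deque<String> window = new ArrayDeque<>(k + 1);
        List<Long> starts = new ArrayList<>();
        long lineNumber = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            window.addLast(line);
            if (window.size() > k) {
                window.removeFirst();        // forget lines that can no longer matter
            }
            if (window.size() == k && windowMatches(window, linePatterns)) {
                starts.add(lineNumber - k + 1);
            }
        }
        return starts;
    }

    // True if the i-th line in the window contains a match for the i-th pattern.
    static boolean windowMatches(Deque<String> window, List<Pattern> patterns) {
        int i = 0;
        for (String l : window) {
            if (!patterns.get(i++).matcher(l).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical three-line pattern and sample log.
        List<Pattern> patterns = Arrays.asList(
            Pattern.compile("<order>"),
            Pattern.compile("<id>\\d+</id>"),
            Pattern.compile("</order>"));
        String log = "log line\n<order>\n<id>42</id>\n</order>\nnoise\n";
        System.out.println(findBlocks(new BufferedReader(new StringReader(log)), patterns));
    }
}
```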

John.
 
palanisamy subramani
Greenhorn
Posts: 29

I broke the multiline pattern into single-line patterns and am now able to search the huge file without any issue.

Thanks to all for your valuable comments!!!
 