
How to pull data from a very large file

 
Alvin Cardona
Greenhorn
Posts: 15
This thread has been very helpful in trying to solve a similar problem I am working on in my research. In my case, however, the file is really huge (around 16 GB), and I need to carve out the data between a particular header and footer. These headers and footers can occur more than once in the file, and every time the content in between (including the header and footer themselves) needs to be saved into a new file, so the result could be some hundreds of small files. I am working with data dumps which I have already converted into hex format. Each line has 16 hex values, each separated by a space. To add more complexity, in the worst case part of a header or footer is split across two lines.

Any tips on how best to approach this would be really appreciated.

thanks
 
Winston Gutkowski
Bartender
Posts: 10573
65
Eclipse IDE Hibernate Ubuntu
Alvin Cardona wrote: Any tips on how best to approach this would be really appreciated

Well, firstly, you need to understand EXACTLY how the file and all its "headers" are formatted, so you can write a procedure to find the data you need in ANY situation, and also know when it ends.

Case in point: Why would a header be split over two lines? It seems an unnecessary complication to me, but it may be beyond your control.
And if it is split, what tells you that it is? Many text file formats include a "continuation" marker (usually a backslash (\) or a tilde (~)) placed as the last character on a line to indicate that it continues on the next record.

Also: Why are your values in hex? Small, memory-constrained devices sometimes do this to save space, but there are other alternatives, such as outputting the data in human-readable ASCII form and then ploughing it through a compression pipe (ASCII text usually compresses extremely well). In Unix, you could do this with compress and zcat, and I suspect compression might be even better than outputting "raw" data.

And finally: Why is the file so huge? Repeatedly pulling a few hundred lines out of a 16 gig "serial lump" is basically about as slow as you could possibly make things, so it might be worth re-thinking your approach.
For example: What about a database? This sort of stuff is precisely what they're designed for.

No answers, I'm afraid. Only questions...

HIH

Winston
 
Alvin Cardona
Greenhorn
Posts: 15
Hi Winston,

First of all, thanks for the feedback. Here are some answers to your questions:

1. The header or footer may be split across lines because a file can reside anywhere in the dump, so I am considering this worst-case scenario. Imagine saving an image or any other file on your portable device: you have no control over where it will be saved.
2. I am using hex because I need to scan the file for particular signatures which are known in hex (although I could work on a simple text file with just letters to try and simplify the situation).
3. The file is huge because it is a complete image of a smartphone's memory (a complete memory dump). Even a basic model nowadays starts from 8 GB. If the device is intentionally cut off completely (no means to connect to it), the contents cannot be retrieved unless it is read directly by means of a hardware interface. I built a small file (the size of a floppy) to test on. I know the real thing, IF I manage to get it working, is going to take ages unless I focus on the user data area only.


Thanks once again

 
Carey Brown
Bartender
Posts: 2700
41
Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
Winston has some very good points. Do you have any control over the format of this file?
I ran into a similar project a couple of years ago where I had to analyze a log file about the same size as you are mentioning.

  • Post an example of a header and two hex lines below it.
  • Post an example of a split header with the two hex lines that follow it.
  • Post an example of a footer with the two hex lines that precede it.
  • Post an example of a hex pattern that you might search for.


  • Can your search pattern span multiple lines?
     
    Winston Gutkowski
    Bartender
    Posts: 10573
    65
    Eclipse IDE Hibernate Ubuntu
    Alvin Cardona wrote:1. The header or footer may be split up on different lines as a file can be residing anywhere, so am considering this extreme case scenario. Imagine saving an image or any file on your portable device, you have no control where it will be saved.

    Hmmm. Still sounds like an unnecessary complication to me.

    2. Am using HEX as need to scan the file using particular signatures which are known in HEX. (yet could work on a simple text file with just letters so as to try and simplify the situation).

    Fair enough. I presume you understand your "format" better than me.

    3. The file is huge as it is related to smartphones and a complete image of the device's memory is being done (a complete memory dump). Considering a basic model nowadays it start from  8GB.

    There are still lossless binary compression methods that may help - including "blocked" Base64 - particularly as "memory" often has vast swathes of '0's in it.

    If the device is completely cut off intentionally (no means to connect to it) unless it is read directly by means of a hardware interface, the contents cannot be retrieved. I built up a small file (the size of a floppy) so as to test upon. I know the real thing IF I manage to get it working is going to take ages unless I focus on the user data area only.

    I think it would be good to know why you're doing this. Saving entire memory images seems like an odd pastime to me.

    Winston
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Hi Carey

    No, there is no control over the original file. To cut the story short, the idea here is that an application's or game's resources could be used to hide information (steganography), which in turn could be used for malicious purposes. If the user suspects he has been spotted, he might easily try to destroy the information or uninstall the application, or even disable access to the device. My intention is to bypass all the obstacles, acquire an image of the device, and see if it is possible to recover all the associated resources and even spot the tampered resource(s). It might seem a bit like a Hollywood movie plot, I know, but believe it or not it has been used for corporate espionage or a lot worse.

    If for instance JPG is the resource used -

    header - FF D8 FF
    footer - FF D9

    So FF could sit at the end of one line with D8 FF at the start of the next, and similarly for the footer. There is not just one instance of such a file either; it could be present many times. I was thinking of using an array list, but I am having doubts about the size and whether I am on the right track.

    I could reduce the data by limiting the scan to the area the user is bound to (the userdata partition), but that is also going to be quite huge.

    thanks
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    Alvin Cardona wrote:header - FF D8 FF

    Just taking this header example, I might (incorrectly) infer this rule:

  • Header begins with hex FF and ends with hex FF.
  • The header identifier consists of 1 or more hex values that are not FF.
  • No other data uses hex FF.
  • The beginning and ending FF may appear on different lines.

    I'm assuming that this is grossly incorrect, but you can see that without being able to describe precisely what the pattern is, it would be impossible to write code to automate this.
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    My apologies, Carey, I had a typo: a value was missing. The header is FF D8 FF E0.
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    Alvin Cardona wrote:my apologies Carey .... had a typo...value missing .... FF D8 FF E0
    How do you know that this is a header and not data?
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    According to the research I have done (unless I missed something), such sequences in that order are only allowed at the beginning of a file, with a corresponding footer at the end. There is a whole set of signatures, one for each file type, by which each type is recognized.
     
    Norm Radder
    Rancher
    Posts: 2041
    26
    How is the data file created that it has "lines"?  My idea of a line is some text ended by line-end characters.  A memory dump wouldn't have lines.
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Hello Norm,

    I am converting it with a function I developed. I compared the output with that of a hex editor such as WinHex, and the outputs are exactly the same.

    To cater for a header/footer spanning two lines, I was thinking of using some form of "padding", if it can be called that. Its size would be two, so in other words the checking would be done across 18 characters. Still, I am not grasping how to copy the content from start to end, and then into what appropriate structure.

    In theory it looks easy... in practice I am finding it quite the opposite.
     
    Norm Radder
    Rancher
    Posts: 2041
    26
    Can the original program that converts it do the analysis?  Why read it more than once? 
    Why are there "lines" and data that spans "lines"? 
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    Copying is relatively simple. Defining what constitutes the 'start' and 'end' is the tricky part.
    It seems that you are able to do this with a hex-editor. Can you describe what you see that lets you know when you've found the 'start'?
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    Norm Radder wrote:Can the original program that converts it do the analysis?  Why read it more than once? 
    Why are there "lines" and data that spans "lines"? 

    Very true. This gets back to the original question posted to the OP about whether or not he has control over the format, implying also that the OP would have access to the raw, unformatted, data.
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    I built the converter myself but have not implemented the scanning feature yet, as I am trying to find out which are the better ways to approach it.

    Theoretically the converted file could be left as one single line of characters; however, this seems impractical, mostly because I need to get the position in order to copy the needed areas. I cannot imagine how this would work with massive files of, say, 8 GB. My idea (maybe I am wrong) was to split these into chunks of around 2 bytes and work on those.

    I could be missing something here, but so far I cannot think of an alternative.

     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Carey, using the original format would lead nowhere since it is in binary format. I am relying on those hex signatures, which is why I converted the output. Using a hex editor, hidden files can be found, but I need to do it automatically over massive files... hoping it can be done.
     
    Norm Radder
    Rancher
    Posts: 2041
    26
    The original scan program could find places of interest in the huge file and save their locations somewhere as long values that could be used by the skip() method to position a FileInputStream to the place for reading the data.
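A minimal sketch of the idea above, assuming the scan runs over the raw binary dump rather than the hex text: it streams the data once and records the offset of every occurrence of a byte signature as a long. The names SignatureScanner and findOffsets are made up for illustration; the signature FF D8 FF E0 is the one quoted earlier in the thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class SignatureScanner {

    /** Returns the byte offset of every occurrence of pattern in the stream. */
    static List<Long> findOffsets(InputStream in, byte[] pattern) throws IOException {
        if (pattern.length == 0) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        List<Long> offsets = new ArrayList<>();
        int n = pattern.length;
        byte[] window = new byte[n]; // circular buffer holding the last n bytes read
        long count = 0;              // total number of bytes consumed so far
        int b;
        while ((b = in.read()) != -1) {
            // The byte at absolute offset k is always stored at index k % n.
            window[(int) (count % n)] = (byte) b;
            count++;
            if (count >= n && windowMatches(window, count, pattern)) {
                offsets.add(count - n); // offset where this occurrence starts
            }
        }
        return offsets;
    }

    /** Compares the last pattern.length bytes read against the pattern. */
    private static boolean windowMatches(byte[] window, long count, byte[] pattern) {
        int n = pattern.length;
        long start = count - n; // absolute offset of the oldest byte in the window
        for (int i = 0; i < n; i++) {
            if (window[(int) ((start + i) % n)] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the sketch reads one byte at a time, wrap the FileInputStream in a BufferedInputStream before passing it in. When going back to a saved location with skip(), check the return value, since InputStream.skip() may skip fewer bytes than requested.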

    i converted the output

    Not sure what that means.  Is the memory dump  the contents of the bytes in memory?  What would those bytes be converted to?  Why would they be converted?
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    Alvin Cardona wrote:I built it up myself the converter but did not implement the scanning feature yet as am trying to find out which are the better ways to approach it.

    Theoretically the converted file could be left in 1 single line of characters however this does seem impractical mostly because need to get the position to copy certain needed areas. I cannot imagine how this would work with massive files of say 8GB. My idea (maybe I am wrong) was to have these chunks into around 2 bytes and work on that.

    I could be missing something here but till now I cannot think otherwise.

    How does the original, un-converted data, mark the headers and footers?

    Seems like formatting/converting may actually be complicating things.
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    You could dump the raw binary into a file and then read the file using a RandomAccessFile.
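As a rough illustration of this suggestion, the sketch below copies a byte range out of a RandomAccessFile once the start and end offsets of a header/footer pair are known. RangeCopier and copyRange are hypothetical names, and the offsets are assumed to come from an earlier scan of the file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

final class RangeCopier {

    /** Copies the bytes in [start, endExclusive) from raf to out. */
    static void copyRange(RandomAccessFile raf, long start, long endExclusive, OutputStream out)
            throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = endExclusive - start;
        raf.seek(start); // jump straight to the header without reading what precedes it
        while (remaining > 0) {
            int toRead = (int) Math.min(buffer.length, remaining);
            int read = raf.read(buffer, 0, toRead);
            if (read == -1) {
                break; // the file ended before endExclusive was reached
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
    }
}
```

Since only one buffer's worth of data is held at a time, the size of the dump does not matter to this step; each carved range can go to its own FileOutputStream.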
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Carey ... the original file is a binary file and is useless for scanning it using the mentioned signatures. The snapshot below is a sample of the converted file together with the ASCII (if there is the need). The HEX section is enough to work with. This is a converted floppy disc image which I am using due to size considerations.



    [Image: Capture.PNG]
     
    Norm Radder
    Rancher
    Posts: 2041
    26
      a binary file and is useless for scanning
    No disk file is useless for scanning. It's a question of interpreting the data as needed.
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Yes, agreed, but in this case I was unable to conduct any form of scanning on the original raw data. In addition, the only information I could rely on revolved around hex values. That is why I am hoping to solve it this way.
     
    Norm Radder
    Rancher
    Posts: 2041
    26
    unable to conduct any form of scanning on the original raw data.

    I would guess that the converted file is at least twice as large as the original file if each byte of the original is converted into two characters.
    If the analysis tools you have can  only read text files, that would be an argument for doing the conversion.  If you are writing all the analysis tools, I'm not sure a conversion would be needed.
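To put a number on that estimate for the layout described at the start of the thread (16 hex values per line, separated by single spaces), here is a small sketch. HexLineSize and toHexLine are hypothetical names; this is not the converter Alvin wrote.

```java
final class HexLineSize {

    /** Formats a chunk of bytes as two-digit hex values separated by single spaces. */
    static String toHexLine(byte[] chunk) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunk.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative byte values print as FF rather than FFFFFFFF.
            sb.append(String.format("%02X", chunk[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

A full 16-byte chunk becomes 47 characters plus a line terminator, so under these assumptions the text form is roughly three times the size of the raw bytes.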
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    I am building it up from scratch, Norm. I just need to work out how to go about extracting parts of this file into other files and it should be sorted... and hopefully prove something.
     
    Norm Radder
    Rancher
    Posts: 2041
    26
    I am building it up from scratch

    Sorry, I have no idea what "from scratch" means in this context.  Normally that would mean to me that there was no input data and that the program was building the output entirely from self generated data.
     
    Carey Brown
    Bartender
    Posts: 2700
    41
    Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
    Alvin Cardona wrote:Carey ... the original file is a binary file and is useless for scanning it using the mentioned signatures. The snapshot below is a sample of the converted file together with the ASCII (if there is the need). The HEX section is enough to work with.

    I would dispute that binary files are useless for scanning. That might be true if you intend to use some off-the-shelf scanning software, but if you are writing your own then you can have it do whatever you want.

    Regardless, let's say for a moment that your hex file is what we have to work with. You still haven't described clearly how you would detect headers and footers. I'm assuming that whatever delineates the data in the non-converted binary data is also being converted to hex at which point it may no longer be distinguishable from the other hex data. Would it be possible to break the data into separate files during the conversion process?
     
    Stefan Evans
    Bartender
    Posts: 1836
    10
    Hmmm. 

    So you have a binary image of someone's phone (possibly, from your explanation, without their consent/knowledge).
    And you want to scan this image for certain chunks of data identified by a header/footer.

    Maybe  <password>GrabThisBit</password>  ?
    Or <bankAccountNumber>LetsGoToTown</bankAccountNumber> ?

    something like that?
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Yes Stefan... that is it.... In my case, for image file resources... particularly those which allow data to be compressed in them for steganographic purposes.
     
    Stephan van Hulst
    Saloon Keeper
    Posts: 7476
    134
    Alvin, before we continue this discussion, maybe you can give us more details about what exactly these purposes are?
     
    Dave Tolls
    Ranch Hand
    Posts: 2721
    30
    One thing.
    This is a disk image?
    How are you identifying the start (or end) of a file?

    FF D8 FF E0, for example, is simply a header.  It can't be used to identify the beginning of the file as that data could easily exist within the file.

    The same goes for the footer.

    There's also the question of possible fragmentation, though that won't be much of an issue with images as there's not going to be much editing going on with them I suppose.

    And finally, don't many phones have some sort of encryption on them?
    If these are people who hide nefarious data inside images, I would also expect them to have this stuff encrypted, so a straight disk dump would be nigh on useless.
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Hi Dave, I am using a particular device to interface with the motherboard. To make it easier and set the real scenario aside, the question can be put simply as: if there is a file full of letters, how do I take out the content from, say, a sequence A B C up to and including X Y Z, given that this repeats multiple times?
     
    Dave Tolls
    Ranch Hand
    Posts: 2721
    30
    OK.



    Neither size of input nor output file(s) will be an issue.
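A minimal sketch of one way to approach the simplified question (not necessarily what Dave posted), assuming the input holds only whitespace-separated hex values with no offset or ASCII columns. Reading token by token rather than line by line means a header or footer split across two lines needs no special handling. HexCarver and carve are hypothetical names.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class HexCarver {

    /** Returns each header..footer run (inclusive) as a space-separated string. */
    static List<String> carve(Reader in, List<String> header, List<String> footer) {
        List<String> results = new ArrayList<>();
        List<String> current = null;             // tokens of the run in progress, null when outside a run
        List<String> recent = new ArrayList<>(); // the last header.size() tokens seen outside a run
        Scanner tokens = new Scanner(in);        // splits on any whitespace, so line breaks disappear
        while (tokens.hasNext()) {
            String t = tokens.next().toUpperCase();
            if (current == null) {
                recent.add(t);
                if (recent.size() > header.size()) {
                    recent.remove(0); // keep the window exactly as long as the header
                }
                if (recent.equals(header)) {
                    current = new ArrayList<>(header);
                    recent.clear();
                }
            } else {
                current.add(t);
                // The size check stops the footer from overlapping the header itself.
                if (current.size() >= header.size() + footer.size() && endsWith(current, footer)) {
                    results.add(String.join(" ", current));
                    current = null;
                }
            }
        }
        return results;
    }

    private static boolean endsWith(List<String> list, List<String> suffix) {
        int offset = list.size() - suffix.size();
        return offset >= 0 && list.subList(offset, list.size()).equals(suffix);
    }
}
```

For example, carve(new StringReader("00 FF D8\nFF E0 01 FF\nD9"), List.of("FF", "D8", "FF", "E0"), List.of("FF", "D9")) finds the run even though both signatures straddle a line break. The sketch keeps each run in memory and stops at the first footer it sees; for a multi-gigabyte dump each run would be written to its own output file as tokens arrive, and the caveat raised above still applies: a footer sequence can also occur inside the data.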
     
    Alvin Cardona
    Greenhorn
    Posts: 15
    Dave, I was thinking of reading each value and checking the ones after it, up to 3 places, to make it more efficient. So if the sequence is A B C D, I search for A; if I find A, I check the ones after it until the sequence is completely matched. The same goes for the footer.

    thanks
     
    Dave Tolls
    Ranch Hand
    Posts: 2721
    30
    Not sure I see the difference.

    Reading in the hex values you have to check for the first one in the sequence you are hunting for.
     