Forum:

Beginning Java

How to pull data from a very large file

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

This thread has been very helpful in trying to solve a similar problem in my research. However, in my case the file is really huge (around 16 GB), and I need to carve out the data between a particular header and footer. These headers and footers can occur more than once in such a file, and every time the content in between (including the header and footer themselves) needs to be saved to a new file, so the result could be some hundred small files. I am working with data dumps which I have already converted into hex format. Each line has 16 hex values separated by spaces. To add more complexity, the worst case is when part of a header or footer is split over two lines.

Any tips on how best to approach this would be really appreciated.

thanks

Winston Gutkowski

Bartender

Posts: 10780

posted 7 years ago

Alvin Cardona wrote: Any tips on how best to approach this would be really appreciated.

Well, firstly, you need to understand EXACTLY how the file and all its "headers" are formatted, so you can write a procedure to find the data you need in ANY situation, and also know when it ends.

Case in point: Why would a header be split over two lines? It seems an unnecessary complication to me, but it may be beyond your control.
And if it is split, what tells you that it is? Many text file formats include a "continuation" marker (usually a backslash (\) or a tilde (~)) placed at the last character on a line to indicate that it continues on the next record.

Also: Why are your values in hex? Small, memory-constrained devices sometimes do this to save space, but there are other alternatives; such as outputting the data in human-readable ascii form and then ploughing it through a compression pipe (ascii text usually compresses extremely well). In Unix, you could do this with compress and zcat, and I suspect compression might be even better than outputting "raw" data.

And finally: Why is the file so huge? Repeatedly pulling a few hundred lines out of a 16 gig "serial lump" is basically about as slow as you could possibly make things, so it might be worth re-thinking your approach.
For example: What about a database? This sort of stuff is precisely what they're designed for.

No answers, I'm afraid. Only questions...

HIH

Winston

"Leadership is nature's way of removing morons from the productive flow" - Dogbert
Articles by Winston can be found here

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Hi Winston,

thanks for the feedback first of all. Here are some answers for the questions:

1. The header or footer may be split across lines because a file can reside anywhere in the dump, so I am considering this extreme case. Imagine saving an image or any other file on your portable device: you have no control over where it will be stored.
2. I am using hex because I need to scan the file for particular signatures which are known in hex (yet it could work on a simple text file with just letters, to simplify the situation).
3. The file is huge because it is a complete image of a smartphone's memory (a complete memory dump). Even a basic model nowadays starts at 8 GB. If the device is intentionally cut off completely (no means to connect to it), the contents cannot be retrieved unless it is read directly through a hardware interface. I built a small file (the size of a floppy) to test on. I know the real thing, IF I manage to get it working, is going to take ages unless I focus on the user data area only.

Thanks once again

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Winston has some very good points. Do you have any control over the format of this file?
I ran into a similar project a couple of years ago where I had to analyze a log file about the same size as you are mentioning.

Post an example of a header and two hex lines below it.

Post an example of a split header with the two hex lines that follow it.

Post an example of a footer with the two hex lines that precede it.

Post an example of a hex pattern that you might search for.

Can your search pattern span multiple lines?

JavaRanch-FAQ HowToAskQuestionsOnJavaRanch UseCodeTags DontWriteLongLines ItDoesntWorkIsUseLess FormatCode JavaIndenter SSCCE API-17 JLS JavaLanguageSpecification MainIsAPain KeyboardUtility

Winston Gutkowski

Bartender

Posts: 10780

posted 7 years ago

Alvin Cardona wrote: 1. The header or footer may be split across lines because a file can reside anywhere in the dump, so I am considering this extreme case. Imagine saving an image or any other file on your portable device: you have no control over where it will be stored.

Hmmm. Still sounds like an unnecessary complication to me.

2. I am using hex because I need to scan the file for particular signatures which are known in hex (yet it could work on a simple text file with just letters, to simplify the situation).

Fair enough. I presume you understand your "format" better than me.

3. The file is huge because it is a complete image of a smartphone's memory (a complete memory dump). Even a basic model nowadays starts at 8 GB.

There are still lossless binary compression methods that may help - including "blocked" Base64 - particularly as "memory" often has vast swathes of '0's in it.

If the device is intentionally cut off completely (no means to connect to it), the contents cannot be retrieved unless it is read directly through a hardware interface. I built a small file (the size of a floppy) to test on. I know the real thing, IF I manage to get it working, is going to take ages unless I focus on the user data area only.

I think it would be good to know why you're doing this. Saving entire memory images seems like an odd pastime to me.

Winston

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Hi Carey

No, there is no control over the original file. To cut the story short, the idea here is that an application's or game's resources could be used to hide information (steganography), which in turn could be used for malicious purposes. If the user suspects he has been spotted, he might easily try to destroy the information or uninstall the application, or even disable access to the device. My intention is to bypass all the obstacles, acquire an image of the device, and see whether it is possible to recover all the associated resources and even spot the tampered resource(s). It might seem a bit like a Hollywood movie plot, I know, but believe it or not it has been used for corporate espionage or a lot worse.

If for instance JPG is the resource used -

header - FF D8 FF
footer - FF D9

So FF could sit at the end of one line with D8 FF at the start of the next, and similarly for the footer. There is not just one instance of such a file; it could be present many times. I was thinking of using an ArrayList, but I have doubts about the size and about whether I am on the right track.

I could reduce the data by limiting the scan to the userdata partition, since that is where the user is confined, but that is also going to be quite huge.

thanks

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Alvin Cardona wrote:header - FF D8 FF

Just taking this header example, I might (incorrectly) infer this rule:

Header begins with hex FF and ends with hex FF.

The header identifier consists of 1 or more hex values that are not FF.

No other data uses hex FF.

The beginning and ending FF may appear on different lines.

I'm assuming that this is grossly incorrect but you can see that without being able to describe precisely what the pattern is it would be impossible to write code to automate this.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

my apologies Carey .... had a typo...value missing .... FF D8 FF E0

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Alvin Cardona wrote:my apologies Carey .... had a typo...value missing .... FF D8 FF E0

How do you know that this is a header and not data?

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

According to the research I have done (unless I missed something), such sequences in that order are only allowed at the beginning of a file, with a corresponding footer at the end. There is a whole set of signatures by which each file type is recognized.

Norm Radder

Rancher

Posts: 5008

posted 7 years ago

How is the data file created that it has "lines"? My idea of a line is some text ended by line-end characters. A memory dump wouldn't have lines.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Hello Norm,

I am converting it with a function I developed. I compared the output with that of a hex editor such as WinHex, and the outputs are exactly the same.

To cater for the header/footer being on two lines, I was thinking of using some form of "padding", if it can be called that. Its size would be two, so in other words the checking would be done across 18 characters. I still do not grasp how to copy the content from start to end, or what structure would be appropriate to hold it.

In theory it looks easy... in practice I am finding it quite the opposite.
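
One way to picture the split-signature worry: if the hex text is read as a stream of whitespace-separated values rather than line by line, a line break is just another separator and no padding is needed. The sketch below is only an illustration under assumptions that are not confirmed in the thread: each line holds nothing but two-digit hex values (no offset or ASCII column), and the class and method names are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

public class HexTokenScan {

    /**
     * Returns the zero-based byte offsets at which the signature starts,
     * treating the hex text as one continuous run of byte values.
     */
    public static List<Long> findSignature(Reader hexText, int[] signature) {
        List<Long> hits = new ArrayList<>();
        int[] window = new int[signature.length]; // the last N values seen
        long count = 0;                           // values consumed so far
        try (Scanner in = new Scanner(hexText)) {
            // Scanner splits on any whitespace, so spaces and line breaks look the same.
            while (in.hasNext()) {
                int value = Integer.parseInt(in.next(), 16);
                // Slide the window one place to the left and append the new value.
                System.arraycopy(window, 1, window, 0, window.length - 1);
                window[window.length - 1] = value;
                count++;
                if (count >= signature.length && Arrays.equals(window, signature)) {
                    hits.add(count - signature.length);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two 16-value lines; the header FF D8 FF E0 straddles the line break.
        String dump =
            "00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D FF D8\n" +
            "FF E0 00 10 4A 46 49 46 00 01 01 00 00 01 00 01\n";
        int[] header = {0xFF, 0xD8, 0xFF, 0xE0};
        System.out.println(findSignature(new StringReader(dump), header)); // prints [14]
    }
}
```

Scanner is convenient but not fast, so for a multi-gigabyte dump this is better read as a statement of the idea than as something to run on the real file.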

Norm Radder

Rancher

Posts: 5008

posted 7 years ago

Can the original program that converts it do the analysis? Why read it more than once?
Why are there "lines" and data that spans "lines"?

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Copying is relatively simple. Defining what constitutes the 'start' and 'end' is the tricky part.
It seems that you are able to do this with a hex-editor. Can you describe what you see that lets you know when you've found the 'start'?

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Norm Radder wrote:Can the original program that converts it do the analysis? Why read it more than once?
Why are there "lines" and data that spans "lines"?

Very true. This gets back to the original question posted to the OP about whether or not he has control over the format, implying also that the OP would have access to the raw, unformatted, data.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

I built the converter myself but have not implemented the scanning feature yet, as I am trying to find out which are the better ways to approach it.

Theoretically the converted file could be left as one single line of characters, but this seems impractical, mostly because I need positions in order to copy the needed areas. I cannot imagine how this would work with massive files of, say, 8 GB. My idea (maybe I am wrong) was to break these into chunks of around 2 bytes and work on that.

I could be missing something here but till now I cannot think otherwise.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Carey, using the original format would lead nowhere since it is in binary format. I am relying on those hex signatures, which is why I converted the output. Using a hex editor, hidden files can be found, but I need to do it automatically over massive files... hoping it can be done.

Norm Radder

Rancher

Posts: 5008

posted 7 years ago

The original scan program could find places of interest in the huge file and save their locations somewhere as long values that could be used by the skip() method to position a FileInputStream to the place for reading the data.
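
A minimal sketch of that idea, assuming the scan has already saved an offset as a long (the names `skipFully` and `readAt` are made up for illustration). One detail worth knowing is that `InputStream.skip()` is allowed to skip fewer bytes than requested, so it is called in a loop here.

```java
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class SkipToOffset {

    /** Skips n bytes, looping because skip() may skip fewer than asked. */
    public static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                // No progress: consume one byte by reading, or give up at end of stream.
                if (in.read() < 0) {
                    throw new EOFException("offset is past the end of the stream");
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Reads up to length bytes starting at a previously saved offset. */
    public static byte[] readAt(String path, long offset, int length) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            skipFully(in, offset);
            byte[] buf = new byte[length];
            int total = 0;
            while (total < length) {
                int n = in.read(buf, total, length - total);
                if (n < 0) {
                    break; // the file ended before length bytes were available
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SkipToOffset <file> <offset> <length>
        byte[] data = readAt(args[0], Long.parseLong(args[1]), Integer.parseInt(args[2]));
        System.out.println("read " + data.length + " byte(s)");
    }
}
```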

i converted the output

Not sure what that means. Is the memory dump the contents of the bytes in memory? What would those bytes be converted to? Why would they be converted?

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Alvin Cardona wrote: I built the converter myself but have not implemented the scanning feature yet, as I am trying to find out which are the better ways to approach it.

Theoretically the converted file could be left as one single line of characters, but this seems impractical, mostly because I need positions in order to copy the needed areas. I cannot imagine how this would work with massive files of, say, 8 GB. My idea (maybe I am wrong) was to break these into chunks of around 2 bytes and work on that.

I could be missing something here but till now I cannot think otherwise.

How does the original, un-converted data, mark the headers and footers?

Seems like formatting/converting may actually be complicating things.

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

You could dump the raw binary into a file and then read the file using a RandomAccessFile.
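
For illustration, a short sketch of what that could look like: once a start and end position are known, `RandomAccessFile.seek()` jumps straight to the start and the range is copied to a new file in blocks. The class name, method name, and buffer size are assumptions, not anything from this thread.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

public class RangeCopy {

    /** Copies the bytes in [start, end) of the source file into a new target file. */
    public static void copyRange(String source, long start, long end, String target)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(source, "r");
             OutputStream out = new FileOutputStream(target)) {
            raf.seek(start);                 // jump directly to the start position
            byte[] buf = new byte[64 * 1024];
            long remaining = end - start;
            while (remaining > 0) {
                int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break;                   // the source ended before 'end'
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RangeCopy <source> <start> <end> <target>
        copyRange(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]), args[3]);
    }
}
```

Because only one buffer is held in memory at a time, the size of the dump does not matter for the copy itself.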

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Carey... the original file is a binary file and is useless for scanning with the mentioned signatures. The snapshot below is a sample of the converted file together with the ASCII (if there is the need). The hex section is enough to work with. This is a converted floppy disc image, which I am using due to size considerations.

Capture.PNG

Norm Radder

Rancher

Posts: 5008

posted 7 years ago

a binary file and is useless for scanning

No disk file is useless for scanning. It's a question of interpreting the data as needed.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Yes agreed but in this case I was unable to conduct any form of scanning on the original raw data. In addition the only information I could base myself upon revolved around HEX values. That is why I am hoping to solve it like that.

Norm Radder

Rancher

Posts: 5008

posted 7 years ago

unable to conduct any form of scanning on the original raw data.

I would guess that the converted file is at least twice as large as the original file if each byte of the original is converted into two characters.
If the analysis tools you have can only read text files, that would be an argument for doing the conversion. If you are writing all the analysis tools, I'm not sure a conversion would be needed.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

I am building it up from scratch, Norm. I just need to sort out how to go about extracting parts of this file into other files, and then it should be sorted... and hopefully prove something.

Norm Radder

Rancher

Posts: 5008

posted 7 years ago

I am building it up from scratch

Sorry, I have no idea what "from scratch" means in this context. Normally that would mean to me that there was no input data and that the program was building the output entirely from self generated data.

Carey Brown

Saloon Keeper

Posts: 10732

posted 7 years ago

Alvin Cardona wrote: Carey... the original file is a binary file and is useless for scanning with the mentioned signatures. The snapshot below is a sample of the converted file together with the ASCII (if there is the need). The hex section is enough to work with.

I would argue with binary files being useless for scanning. Perhaps if you intend to use some off the shelf scanning software that might be true. If you are writing your own then you can have it do whatever you want.

Regardless, let's say for a moment that your hex file is what we have to work with. You still haven't described clearly how you would detect headers and footers. I'm assuming that whatever delineates the data in the non-converted binary data is also being converted to hex at which point it may no longer be distinguishable from the other hex data. Would it be possible to break the data into separate files during the conversion process?

Stefan Evans

Bartender

Posts: 1845

posted 7 years ago

Hmmm.

So you have a binary image of someone's phone (possibly, from your explanation, without their consent/knowledge).
And you want to scan this image for certain chunks of data identified by a header/footer.

Maybe <password>GrabThisBit</password> ?
Or <bankAccountNumber>LetsGoToTown</bankAccountNumber> ?

something like that?

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Yes Stefan... that is it.... In my case, for image file resources... particularly those which allow data to be compressed in them for steganographic purposes.

Stephan van Hulst

Saloon Keeper

Posts: 15524

posted 7 years ago

Alvin, before we continue this discussion, maybe you can give us more details about what exactly these purposes are?

Dave Tolls

Rancher

Posts: 4801

posted 7 years ago

One thing.
This is a disk image?
How are you identifying the start (or end) of a file?

FF D8 FF E0, for example, is simply a header. It can't be used to identify the beginning of the file as that data could easily exist within the file.

The same goes for the footer.

There's also the question of possible fragmentation, though that won't be much of an issue with images as there's not going to be much editing going on with them I suppose.

And finally, don't many phones have some sort of encryption on them?
If these are people who hide nefarious data inside images, I would also expect them to have this stuff encrypted, so a straight disk dump would be nigh on useless.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Hi Dave, I am using a particular device to interface with the motherboard. To make it easier and set the real scenario aside, the question can be put simply: if there is a file full of letters, how do I take out the content from, say, the sequence A B C up to and including X Y Z, when this repeats multiple times?
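
The simplified question can be sketched as a single pass over a byte stream: watch for the start sequence, then copy everything to a new file until the end sequence has been written, and repeat. This is only a sketch under assumptions: the input is raw bytes rather than hex text, markers do not nest, and the `Carver` class, its `Tail` helper, and the `carved-N.bin` file names are invented for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class Carver {

    /** Remembers the most recent bytes and reports when they equal the pattern. */
    static final class Tail {
        private final byte[] pattern;
        private final byte[] recent;
        private long seen; // bytes pushed since the last reset

        Tail(byte[] pattern) {
            this.pattern = pattern.clone();
            this.recent = new byte[pattern.length];
        }

        void reset() {
            seen = 0;
        }

        boolean push(int b) {
            System.arraycopy(recent, 1, recent, 0, recent.length - 1);
            recent[recent.length - 1] = (byte) b;
            seen++;
            return seen >= pattern.length && Arrays.equals(recent, pattern);
        }
    }

    /** Writes each header..footer run (inclusive) to outDir; returns how many were completed. */
    public static int carve(InputStream in, byte[] header, byte[] footer, Path outDir)
            throws IOException {
        Tail head = new Tail(header);
        Tail foot = new Tail(footer);
        OutputStream out = null; // non-null while inside a header..footer run
        int files = 0;
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (out == null) {
                    if (head.push(b)) {
                        Path target = outDir.resolve("carved-" + files + ".bin");
                        out = new BufferedOutputStream(Files.newOutputStream(target));
                        out.write(header); // the header itself belongs to the output
                        foot.reset();
                    }
                } else {
                    out.write(b);
                    if (foot.push(b)) {
                        out.close();
                        out = null;
                        files++;
                        head.reset();
                    }
                }
            }
        } finally {
            if (out != null) {
                out.close(); // header without a footer: the last file is left incomplete
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Carver <dump file> <existing output directory>
        byte[] header = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] footer = {(byte) 0xFF, (byte) 0xD9};
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(carve(in, header, footer, Paths.get(args[1])) + " file(s) carved");
        }
    }
}
```

Memory use stays constant however large the input is, which is why no list of the whole content is needed. The caveat raised earlier in the thread still applies: the same byte sequences can occur inside file data, so a run carved this way is a candidate, not a guaranteed complete file.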

Dave Tolls

Rancher

Posts: 4801

posted 7 years ago

OK.

Neither size of input nor output file(s) will be an issue.

Alvin Cardona

Greenhorn

Posts: 15

posted 7 years ago

Dave, I was thinking of reading each value and checking up to the 3 values after it, to make it more efficient. So if the sequence is A B C D, I search for A; if A is found, I check the values after it until the sequence is completely matched. The same goes for the footer.

thanks
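
One subtlety with the "find A, then check the ones after" approach on a stream: when a partial match fails, the values already consumed may themselves begin a new match. With FF D8 FF E0 as the pattern, the input FF D8 FF D8 FF E0 is missed if matching simply restarts from scratch after the mismatch. The sketch below uses the Knuth-Morris-Pratt failure table to handle this; that is a technique swapped in for illustration, not something proposed in the thread, and the class name is hypothetical.

```java
public class StreamMatcher {
    private final int[] pattern;
    private final int[] fail; // fail[i]: length of the longest proper prefix of
                              // pattern[0..i] that is also a suffix of it
    private int matched;      // how many pattern values are currently matched

    /** The pattern must contain at least one value. */
    public StreamMatcher(int[] pattern) {
        this.pattern = pattern.clone();
        this.fail = new int[pattern.length];
        for (int i = 1, k = 0; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) {
                k = fail[k - 1];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
            fail[i] = k;
        }
    }

    /** Feeds one value; returns true when the pattern has just been completed. */
    public boolean feed(int value) {
        // On a mismatch, fall back to the longest prefix that can still match.
        while (matched > 0 && value != pattern[matched]) {
            matched = fail[matched - 1];
        }
        if (value == pattern[matched]) {
            matched++;
        }
        if (matched == pattern.length) {
            matched = fail[matched - 1]; // keep any overlap for the next match
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        StreamMatcher header = new StreamMatcher(new int[]{0xFF, 0xD8, 0xFF, 0xE0});
        int[] data = {0xFF, 0xD8, 0xFF, 0xD8, 0xFF, 0xE0};
        for (int i = 0; i < data.length; i++) {
            if (header.feed(data[i])) {
                System.out.println("header ends at index " + i); // prints: header ends at index 5
            }
        }
    }
}
```

Each input value is examined once on average, so there is no need to look ahead or to back up in the file.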

Dave Tolls

Rancher

Posts: 4801

posted 7 years ago

Not sure I see the difference.

Reading in the hex values you have to check for the first one in the sequence you are hunting for.