I want to read a file using DataInputStream and display or write content between two known "signal keywords".
Here is what I want to do, but I don't know how to make it work.
(1) Start from 0 index (very first byte)
(2) read 9 bytes and append to a StringBuilder //9, because of the length of <WELCOME>, which is my signal keyword
(3) check if the StringBuilder's content is "<WELCOME>"
(4) if yes, start reading all bytes until </WELCOME> is found.
(5) if no, shift 1 index, that is, read from the second byte and read 9 bytes...
(6) repeat all of the above, until all bytes are read
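The steps above could be sketched roughly as follows. This is only a minimal sketch, not the code from the original post: instead of re-reading 9 bytes at every offset, it reads each byte once and checks whether the bytes read so far end with the keyword. The class and method names (MarkerScanner, readUntil, between) are made up for illustration, and it assumes the keywords are plain ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class MarkerScanner {

    /**
     * Consumes bytes from the stream up to and including the first
     * occurrence of marker. Returns the bytes that came before the
     * marker, or null if the stream ended without a match.
     */
    static byte[] readUntil(InputStream in, byte[] marker) throws IOException {
        byte[] buf = new byte[1024];
        int len = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (len == buf.length) {
                buf = Arrays.copyOf(buf, len * 2); // grow when full
            }
            buf[len++] = (byte) b;
            if (len >= marker.length && tailMatches(buf, len, marker)) {
                return Arrays.copyOf(buf, len - marker.length);
            }
        }
        return null;
    }

    // True if the last marker.length bytes of buf[0..len) equal marker.
    private static boolean tailMatches(byte[] buf, int len, byte[] marker) {
        int start = len - marker.length;
        for (int i = 0; i < marker.length; i++) {
            if (buf[start + i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the text between open and close, or null if either is missing. */
    static String between(InputStream in, String open, String close) throws IOException {
        if (readUntil(in, open.getBytes(StandardCharsets.US_ASCII)) == null) {
            return null;
        }
        byte[] body = readUntil(in, close.getBytes(StandardCharsets.US_ASCII));
        return body == null ? null : new String(body, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "junk<WELCOME>hello</WELCOME>more".getBytes(StandardCharsets.US_ASCII);
        System.out.println(between(new ByteArrayInputStream(sample), "<WELCOME>", "</WELCOME>"));
    }
}
```

When reading from a real file, wrapping the FileInputStream in a BufferedInputStream (or the DataInputStream around a BufferedInputStream) keeps the single-byte reads from going to the disk each time.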
Here are some snippets I managed to cobble together, but I know I am nowhere near a solution.
Please give me some idea.
The file I am trying to read contains both an XML part (header and subheaders) and a binary (float) part.
The header and subheader sections contain information about the binary part, like binary data length.
The signal words are known, like <header>, <name>, etc.
Here is a sample file:
My idea was that, if I could read the stream byte by byte, I should come to a point where the next 9 bytes (the length of "<DataLen>") are "<DataLen>".
Similarly, I would find where </DataLen> is, so with both "ends" I could find the length of the binary data.
I wanted to read the 123456 bytes (the value of DataLen), starting from the string "data.bin".
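A rough sketch of that idea, working on a byte array that already holds the whole file. It is only an illustration under assumptions: the names (DataLenReader, readDataLen, readPayload) are invented, the DataLen value is assumed to be a plain decimal number, and the `gap` parameter stands for however many bytes (line breaks, for example) sit between "data.bin" and the first data byte, which the posts do not say.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class DataLenReader {

    /** Index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Parses the decimal number between <DataLen> and </DataLen>. */
    static int readDataLen(byte[] bs) {
        byte[] open = ascii("<DataLen>");
        int start = indexOf(bs, open, 0);
        if (start < 0) {
            throw new IllegalArgumentException("<DataLen> not found");
        }
        start += open.length;
        int end = indexOf(bs, ascii("</DataLen>"), start);
        if (end < 0) {
            throw new IllegalArgumentException("</DataLen> not found");
        }
        String digits = new String(bs, start, end - start, StandardCharsets.US_ASCII).trim();
        return Integer.parseInt(digits);
    }

    /**
     * Copies dataLen bytes that follow the first "data.bin" at or after
     * searchFrom, skipping gap bytes in between (gap is an assumption).
     */
    static byte[] readPayload(byte[] bs, int searchFrom, int dataLen, int gap) {
        byte[] key = ascii("data.bin");
        int at = indexOf(bs, key, searchFrom);
        if (at < 0) {
            throw new IllegalArgumentException("data.bin not found");
        }
        int start = at + key.length + gap;
        if (start + dataLen > bs.length) {
            throw new IllegalArgumentException("file is shorter than DataLen says");
        }
        return Arrays.copyOfRange(bs, start, start + dataLen);
    }
}
```

If "data.bin" can also appear inside the XML header (as a file name, say), passing the index just past the header as `searchFrom` would avoid matching it too early.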
Also for line 14, why use checking for "equals", shouldn't it be contains or startsWith?
Since at the end of the loop, the StringBuilder contains (or does not contain) the exact keyword ("<WELCOME>"), I thought the equals method would be okay, no?
No, no, this discussion is not going in the direction (that) I wanted it to go.
As I wrote earlier, let me make something that works. I will worry about performance later.
After all, I am not reading terabytes of data.
Maybe the way I phrased my question was confusing. Apologies.
I will rephrase and report what I have been doing later.
Joined: Oct 08, 2004
Luan Cestari :
I admit that I have come to my senses
What is happening is that it takes about 50 seconds to search for a predefined keyword. Not acceptable.
I have not even begun to read the real stuff. I need to read a large chunk of binary float data ("large binary float data here", from my previous post), and that would take some time.
Then there will be 50-60 files like that. No, no.
I can only say that the way I have done it is not right. There must be a smarter way.
What I did:
I read the file into a byte array. Then, at each position, I appended the appropriate number of bytes to a string and compared it with the keyword. It works, but as mentioned above, it is super slow.
There may be a way to read MIME/multipart file, reading 'Content-Type: text/xml; charset="UTF-8"' and extracting header information and reading 'Content-Type: application/octet-stream' and extracting the binary(float) data.
Here are the code snippets:
Once I had the byte array (bs), I could get a string like this:
Since the header portion of the multilayered XML starts with "<?xml" and ends with "</Header>", I found the appropriate start and end indices and copied the content between those indices to a file, using a file output stream, for further processing. (For example, I need the binary float data length, which is recorded inside the header info.)
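One possible way to find those two indices without building a string at every offset is sketched below. It is an assumption-laden illustration, not the code from the post: the class and method names are invented. The trick is that decoding with ISO-8859-1 maps every byte to exactly one char, so an index found with String.indexOf is also a valid index into the byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class HeaderExtractor {

    /**
     * Returns the bytes from "<?xml" through "</Header>" inclusive,
     * or null if either keyword is missing.
     */
    static byte[] extractHeader(byte[] bs) {
        // ISO-8859-1 is a 1:1 byte-to-char mapping, so char indices
        // in text are the same as byte indices in bs.
        String text = new String(bs, StandardCharsets.ISO_8859_1);
        int start = text.indexOf("<?xml");
        if (start < 0) {
            return null;
        }
        String endTag = "</Header>";
        int end = text.indexOf(endTag, start);
        if (end < 0) {
            return null;
        }
        return Arrays.copyOfRange(bs, start, end + endTag.length());
    }
}
```

The byte array itself can come from `java.nio.file.Files.readAllBytes`, and the returned header bytes can be written out with `java.nio.file.Files.write`. The decoded string is built once for the whole file, which costs extra memory but avoids repeated string creation while searching.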
Let me repeat a portion of the data file that is being streamed in:
There may be several blocks of binary float data: wherever there is "subheader info", a block of binary float data follows.
The binary float data is always preceded by the keyword "data.bin", which I used to calculate the start index of each block of the binary float.
Right now I am struggling to read float values from a given index and store them in an array for further processing.
From the byte array ("bs"), I can get ASCII text easily as I wrote above, reading each byte.
For a float, I need to read 4 bytes. I read 4 bytes (using ByteBuffer and getFloat), but I can't find a way to convert that into a correct long or long int value. (Data-wise, they are 12- to 14-digit numbers, readings from some instrument.)
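For reading a run of 4-byte floats from a known index, something like the sketch below may help. It is a guess at the setup, not a confirmed fix: the class and method names are invented, and the byte order of the file is not stated anywhere in the posts. ByteBuffer reads big-endian by default, so if the instrument wrote little-endian data, getFloat returns nonsense until the order is switched; trying both orders is a cheap check. Note also that a 4-byte float holds only about 7 significant decimal digits, so if the values really have 12 to 14 digits, the data may be 8-byte doubles or integers rather than floats.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
final class FloatBlock {

    /**
     * Decodes byteLength bytes of bs, starting at offset, as 4-byte
     * IEEE 754 floats in the given byte order. Trailing bytes that do
     * not fill a whole float are ignored.
     */
    static float[] readFloats(byte[] bs, int offset, int byteLength, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(bs, offset, byteLength).order(order);
        float[] out = new float[byteLength / Float.BYTES];
        buf.asFloatBuffer().get(out); // bulk copy into the array
        return out;
    }
}
```

A call would look like `FloatBlock.readFloats(bs, startIndex, dataLen, ByteOrder.LITTLE_ENDIAN)`, where `startIndex` and `dataLen` are whatever the header search produced.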
When you (or other gurus!), get a chance, please comment.
I read your post, but I was in a hurry today. I started a project on GitHub to solve your scenario in as simple a way as possible. I didn't finish it for lack of time, but I think tomorrow we can already discuss whether it achieves the desired results =D The URL: https://github.com/luan-cestari/SimpleReadFiles
Hi Luan Cestari,
Looking forward to reading your GitHub entries.
Sorry for the delay, I got very many things to do these days.
I just finished the example that I would like to show you. On average it took 50 milliseconds to read a 10 MB file and compute a checksum over those bytes (you only need to replace the checksum with your business rules). I'm using an old notebook with a 5400 rpm disk, so you may get better results. I also left some settings to be tuned, like the number of files to process in parallel. I created four 10 MB files with some data, and processing them took only 110 milliseconds using 2 threads.
Now we can use this project to try some solution to your problem together =)
You went that extra mile for this.. thank you so much.
Honestly, I will need some time to understand what is going on.. weekend is near, I will read your code and see how I can use it for this task I have at hand.
My girlfriend and I finished the project that might help you =D We tested some IO options in Java, and also regex, to get the highest throughput =) The code is in the same repository on GitHub, and we also wrote a post with more details -> http://www.ourdailycodes.com/2013/09/regex-parsing-and-some-benchmarks-in.html
Let me know if you need any help =)
Hi! Sorry for a late reply.
I read your blog post as well as the source code. Thanks a lot for the time and effort you (and your girlfriend) put into this.
It's something I can use in some way or the other.
The next phase of my task involves using a parser to read MIME-formatted content. I guess I should close this thread as solved and open a new one.