Reading from DataInputStream

Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
I want to read a file using DataInputStream and display or write content between two known "signal keywords".

Here is what I want to do, but I don't know how to make it work.
(1) Start from 0 index (very first byte)
(2) read 9 bytes and append to a StringBuilder //9, because of the length of <WELCOME>, which is my signal keyword
(3) check if the StringBuilder's content is "<WELCOME>"
(4) if yes, start reading all bytes until </WELCOME> is found.
(5) if no, shift 1 index, that is, read from the second byte and read 9 bytes...
(6) repeat all of the above, until all bytes are read
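
For illustration only (this is not the poster's code), steps (1) to (6) above can be sketched with a rolling window that always holds the last few bytes read, so no re-reading from a shifted index is needed. The class and method names (`KeywordScanner`, `readUntil`, `extractBetween`) are hypothetical, and the sketch assumes non-empty ASCII keywords and a caller that wraps the file in a BufferedInputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: scans a stream for two known "signal keywords".
final class KeywordScanner {

    // Reads bytes until the stream has just delivered `marker` (must be non-empty).
    // Returns the bytes seen before the marker, or null if the stream ends first.
    static byte[] readUntil(InputStream in, byte[] marker) throws IOException {
        ByteArrayOutputStream consumed = new ByteArrayOutputStream();
        byte[] window = new byte[marker.length]; // the last marker.length bytes read
        long filled = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed.write(b);
            // Slide the window left by one byte and append the new byte.
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (byte) b;
            filled++;
            if (filled >= marker.length && Arrays.equals(window, marker)) {
                byte[] all = consumed.toByteArray();
                return Arrays.copyOf(all, all.length - marker.length);
            }
        }
        return null;
    }

    // Returns the content between the first `open` and the following `close`,
    // or null if either keyword is missing.
    static String extractBetween(InputStream in, String open, String close) throws IOException {
        if (readUntil(in, open.getBytes(StandardCharsets.US_ASCII)) == null) {
            return null;
        }
        byte[] body = readUntil(in, close.getBytes(StandardCharsets.US_ASCII));
        // ISO-8859-1 maps every byte to one char, so no byte is lost in the conversion.
        return body == null ? null : new String(body, StandardCharsets.ISO_8859_1);
    }
}
```

A call might look like `KeywordScanner.extractBetween(new BufferedInputStream(new FileInputStream(path)), "<WELCOME>", "</WELCOME>")`, where `path` is whatever file is being read.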

Here are some snippets I managed to cobble together, but I know I am nowhere near.
Please give me some idea.
K. Tsang
Bartender

Joined: Sep 13, 2007
Posts: 2584
    

First off, is your data file an XML file? If so, why not use the XML API?

If you are going to read it in as a string, you may want to use a regular expression to determine the content between <WELCOME> and </WELCOME>.
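
A minimal sketch of the regular-expression idea, assuming the relevant text has already been read into a String. The class name `WelcomeRegex` and the method `firstWelcome` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls out the text between <WELCOME> and </WELCOME>.
final class WelcomeRegex {

    // DOTALL lets '.' match line breaks; '.*?' is reluctant, so the match
    // stops at the first closing tag instead of the last one.
    private static final Pattern WELCOME =
            Pattern.compile("<WELCOME>(.*?)</WELCOME>", Pattern.DOTALL);

    // Returns the content of the first <WELCOME> element, or null if there is none.
    static String firstWelcome(String text) {
        Matcher m = WELCOME.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Calling `m.find()` repeatedly on the same Matcher would walk through later occurrences as well.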

Also, for line 14, why check with "equals"? Shouldn't it be contains or startsWith?

K. Tsang JavaRanch SCJP5 SCJD/OCM-JD OCPJP7 OCPWCD5 OCPBCD5
Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
K. Tsang:
Thanks for the comments.

The file I am trying to read contains both xml part (header and subheaders) and binary (float) part.
The header and subheader sections contain information about the binary part, like binary data length.
The signal words are known, like <header>, <name>, etc.

Here is a sample file:


My idea was that if I could read the stream byte by byte, I should come to a point where the next 9 bytes (the length of "<DataLen>") are "<DataLen>".
Similarly, I would find where </DataLen> is, so with those two "ends" I could find the length of the binary data.
I wanted to read the 123456 (the value of DataLen) bytes, starting from the string "data.bin".
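
A rough sketch of that idea, working on a byte array that already holds the whole file. The sample file is not shown here, so the layout is an assumption: the header is ASCII, DataLen is a decimal byte count, and the binary block begins immediately after the keyword "data.bin". The names `DataLenReader`, `readDataLen` and `readBlock` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for a file with an ASCII header followed by binary data.
final class DataLenReader {

    // Returns the first index >= from at which pattern occurs in data, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Parses the decimal number between <DataLen> and </DataLen>.
    static int readDataLen(byte[] file) {
        byte[] open = ascii("<DataLen>");
        int start = indexOf(file, open, 0);
        if (start < 0) {
            throw new IllegalArgumentException("no <DataLen> tag");
        }
        int valueStart = start + open.length;
        int end = indexOf(file, ascii("</DataLen>"), valueStart);
        if (end < 0) {
            throw new IllegalArgumentException("no </DataLen> tag");
        }
        String digits = new String(file, valueStart, end - valueStart, StandardCharsets.US_ASCII);
        return Integer.parseInt(digits.trim());
    }

    // Copies `length` bytes that follow the first "data.bin" keyword.
    // Assumption: the payload starts right after the keyword.
    static byte[] readBlock(byte[] file, int length) {
        byte[] key = ascii("data.bin");
        int at = indexOf(file, key, 0);
        if (at < 0) {
            throw new IllegalArgumentException("no data.bin keyword");
        }
        int from = at + key.length;
        if (from + length > file.length) {
            throw new IllegalArgumentException("block runs past end of file");
        }
        return Arrays.copyOfRange(file, from, from + length);
    }
}
```

If the real format puts separator bytes (a line break, for instance) between "data.bin" and the payload, `from` would need to be moved forward accordingly.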

Also, for line 14, why check with "equals"? Shouldn't it be contains or startsWith?

Since at the end of the loop, the StringBuilder contains (or does not contain) the exact keyword ("<WELCOME>"), I thought the equals method would be okay, no?
Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

I think you should use NIO. It seems better suited to this type of operation, and you would get better performance. Here is a link with some snippets:
http://stackoverflow.com/questions/9046820/fastest-way-to-incrementally-read-a-large-file


Please, visit me for some cool tech post at www.ourdailycodes.com
Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
Luan Cestari:
Thank you for the suggestion.
I didn't understand why or how NIO would help. In any case, I want to make it work first and worry about performance issues later.

By the way, I have about 60 files, 10 MB each, that need to be read and loaded into memory. I don't know if NIO would be a better choice in that case.
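
To make the NIO suggestion concrete, here is a small sketch of two standard options: loading a whole file with Files.readAllBytes, and mapping only a byte range with FileChannel.map. The class name `NioRead` and its method names are invented for this example, and whether either approach is faster for 60 files of 10 MB would have to be measured.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper showing two NIO ways to get file bytes into memory.
final class NioRead {

    // Loads the entire file; reasonable for files of about 10 MB.
    static byte[] readAll(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Maps only the requested range and copies it out, which avoids
    // reading the rest of the file when just one block is needed.
    static byte[] readRange(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] out = new byte[length];
            mapped.get(out);
            return out;
        }
    }
}
```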

Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

I suggested NIO because you talked about getting a specific part of the file.

Since you are using BufferedInputStream, the data is already being buffered in memory. Converting that data into a String and searching it (with a regex, for example) might make the solution much simpler and maybe faster. I found this thread interesting -> http://www.coderanch.com/t/278043//java/Searching-large-text-file What do you think?

--edit
these links can be useful too:
http://stackoverflow.com/questions/737318/should-i-use-datainputstream-or-bufferedinputstream
http://stackoverflow.com/questions/18111264/reading-mime-formatted-file-containing-header-xml-and-binary-float-data-sets
Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
No, no, this discussion is not going in the direction (that) I wanted it to go.
As I wrote earlier, let me make something that works. I will worry about performance later.
After all, I am not reading terabytes of data.

Maybe the way I phrased my question was confusing. Apologies.
I will rephrase and report what I have been doing later.



Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
Luan Cestari :
I admit that I have come to my senses.

What is happening is that it takes about 50 seconds to search for a predefined keyword. Not acceptable.
I have not even begun to read the real stuff: I need to read a large chunk of binary float data ("large binary float data here" from my previous post), and that would take some time.
Then there will be 50-60 files like that. No, no.

I can only say that the way I have done it is not right. There must be a smarter way.

What I did:
I read the file into a byte array. Then I appended the appropriate number of bytes and looked for the keyword. It works, but as mentioned above, it is super slow.
There may be a way to read a MIME/multipart file: reading 'Content-Type: text/xml; charset="UTF-8"' and extracting the header information, then reading 'Content-Type: application/octet-stream' and extracting the binary (float) data.

Here are the code snippets:



Once I had the byte array (bs), I could get a string like this:


Since the header portion of the multilayered XML starts with "<?xml" and ends with "</Header>", I found the appropriate start and end indices and copied the content between those indices to a file, using a file output stream, for further processing (for example, I need the binary float data length, which is recorded inside the header info).

Let me repeat a portion of the data file that is being streamed in:


There may be several blocks of binary float data; wherever there is "subheader info", a block of binary float data follows.
The binary float data is always preceded by the keyword "data.bin", which I used to calculate the start index of each block of binary floats.

Right now I am struggling to read a float value from a given index and store it in an array for further processing.
From the byte array ("bs"), I can get ASCII text easily, as I wrote above, by reading each byte.
For a float, I need to read 4 bytes. I read 4 bytes (using ByteBuffer and getFloat), but I can't find a way to convert that into a correct long or long int value. (Datawise, they are 12 to 14 digit numbers, readings from some instrument.)
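
One possible way to decode a run of values from the byte array, sketched with ByteBuffer. Two things are assumptions here: the byte order (ByteBuffer defaults to big-endian, and a wrong order produces meaningless numbers) and the element width. A 4-byte float carries only about 7 significant decimal digits, so if the values really have 12 to 14 digits they may be stored as 8-byte doubles or integers instead. `FloatBlock`, `readFloats` and `readDoubles` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: decodes consecutive binary values from a byte array.
final class FloatBlock {

    // Reads `count` 4-byte floats starting at `offset`, using the given byte order.
    static float[] readFloats(byte[] bs, int offset, int count, ByteOrder order) {
        float[] out = new float[count];
        ByteBuffer.wrap(bs, offset, count * Float.BYTES)
                .order(order)        // must be set before creating the float view
                .asFloatBuffer()
                .get(out);
        return out;
    }

    // Same idea for 8-byte doubles, in case the data is wider than a float.
    static double[] readDoubles(byte[] bs, int offset, int count, ByteOrder order) {
        double[] out = new double[count];
        ByteBuffer.wrap(bs, offset, count * Double.BYTES)
                .order(order)
                .asDoubleBuffer()
                .get(out);
        return out;
    }
}
```

For example, `FloatBlock.readFloats(bs, start, n, ByteOrder.LITTLE_ENDIAN)` would decode `n` floats beginning at index `start`; trying both byte orders on a block whose values are known is a quick way to find out which one the instrument uses.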

Again, snippets:

When you (or other gurus!) get a chance, please comment.

Thank you.




Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

Hi =D

I read your post, but I was in a hurry today. I started a project on GitHub to solve your scenario as simply as possible. I didn't finish it for lack of time, but I think tomorrow we can already discuss whether it achieves the desired results =D The URL: https://github.com/luan-cestari/SimpleReadFiles

Regards,
Luan
Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
Hi Luan Cestari,
Looking forward to reading your GitHub entries.
Thank you!!
Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

Hi Tien

Sorry for the delay, I have had a lot of things to do these days.

I just finished the example that I wanted to show you. On average it took 50 milliseconds to read a 10 MB file and compute a checksum over those bytes (the checksum just needs to be replaced with your business rules). I'm using an old notebook with a 5400 rpm disk, so you may get better results. I also left some configuration to be tuned, such as the number of files to process in parallel. I created four 10 MB files with some data, and it took only 110 milliseconds to finish using 2 threads.
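
The general shape of such an experiment might look like the sketch below. It is not taken from the linked repository: it simply reads each file fully, computes a CRC32 checksum as a stand-in for the real processing, and uses a fixed-size thread pool whose size is the tunable setting. The names `ParallelChecksum`, `checksum` and `checksumAll` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;

// Hypothetical sketch: checksum several files in parallel.
final class ParallelChecksum {

    // Reads the whole file and returns its CRC32 value.
    static long checksum(Path file) {
        try {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(file));
            return crc.getValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Processes the files on `threads` worker threads and returns the
    // checksums in the same order as the input list.
    static List<Long> checksumAll(List<Path> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> checksum(file);
                pending.add(pool.submit(task));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : pending) {
                results.add(future.get()); // waits for that file to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Timing a call such as `checksumAll(files, 2)` with System.nanoTime() before and after would give a figure comparable in spirit to the numbers quoted above, though results depend heavily on the disk and on file-system caching.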

Now we can use this project to try some solution to your problem together =)

Best Regards!

Luan
Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

Note: I updated the repository with those changes I just cited -> https://github.com/luan-cestari/SimpleReadFiles
Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
Hi Luan,
You went the extra mile for this. Thank you so much.
50 ms?! That's impressive.
Honestly, I will need some time to understand what is going on. The weekend is near, so I will read your code and see how I can use it for the task I have at hand.

Again, thanks a lot, doumou arigatou.
Tien
Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

Welcome =)

I pushed some changes, ran some more tests, and also made a branch with modifications that look more like what you described -> https://github.com/luan-cestari/SimpleReadFiles/tree/another_approach

I think the main goal now is to create a regex to extract the data from the text =)

Bye
Luan
Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

I made a post about this case =D => http://www.ourdailycodes.com/2013/09/simple-project-analysing-throughput-of.html

I will finish some code here and I will post here =)
Luan Cestari
Ranch Hand

Joined: Feb 07, 2010
Posts: 163

My girlfriend and I finished the project that might help you =D We tested some I/O options in Java and also regex to get the highest throughput =) The code is in the same repository on GitHub, and we also wrote a post giving more details -> http://www.ourdailycodes.com/2013/09/regex-parsing-and-some-benchmarks-in.html

Let me know if you need any help =)
Tien Shan
Ranch Hand

Joined: Oct 08, 2004
Posts: 38
Luan:
Hi! Sorry for a late reply.
I read your blog post as well as the source code. Thanks a lot for the time and effort you (and your girlfriend) put into this.
It's something I can use in some way or the other.

Next phase of my task involves using a parser to read mime formatted content. I guess I should close this thread as solved and open a new one.

Cheers!
 