
Reading a large text file, modifying it and writing it to another file

 
megha joshi
Ranch Hand
Posts: 206
Hi ,

I am trying to read a 1.77 GB text file line by line, run a replace on each line, and then write it to another text file.
From previous threads in this forum I learned that BufferedReader/BufferedWriter and plain String methods would be best, since it's a line-by-line read of a text file.
Now I have two issues:

1) The output file I get is bigger than the input file, and I don't know what the error is.
2) I am trying to use threads to improve performance, so one thread can read the file and another can write it. I am using a queue between the two threads, plus the wait()/notify() mechanism for synchronization and to make sure the queue doesn't grow too large. But somehow I am getting the same performance as the version without threads.

My code with threads and without threads is posted below. Any help would be greatly appreciated...

Code with threads...
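(The posted code did not survive; as a stand-in, here is a minimal sketch of the reader/writer pair with a bounded queue and wait()/notify(), in modern Java syntax. The bracket replacement is a placeholder for the real substitutions, and the sentinel is an assumption:)

```java
import java.io.*;
import java.util.ArrayDeque;
import java.util.Queue;

public class ThreadedFilter {
    private static final int MAX_QUEUE = 1000;          // bound so memory stays flat
    private static final String POISON = new String("EOF"); // unique sentinel, compared by ==

    static String transform(String line) {
        // placeholder for the real substitutions
        return line.replace('(', '[').replace(')', ']');
    }

    public static void run(BufferedReader in, BufferedWriter out) throws Exception {
        final Queue<String> queue = new ArrayDeque<>();
        final Object lock = new Object();

        Thread reader = new Thread(() -> {
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    synchronized (lock) {
                        while (queue.size() >= MAX_QUEUE) lock.wait(); // back off when full
                        queue.add(transform(line));
                        lock.notifyAll();
                    }
                }
                synchronized (lock) { queue.add(POISON); lock.notifyAll(); }
            } catch (Exception e) { e.printStackTrace(); }
        });

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String line;
                    synchronized (lock) {
                        while (queue.isEmpty()) lock.wait(); // wait for data
                        line = queue.remove();
                        lock.notifyAll();
                    }
                    if (line == POISON) break;  // sentinel object: identity check is safe
                    out.write(line);
                    out.newLine();
                }
                out.flush();
            } catch (Exception e) { e.printStackTrace(); }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```

Wiring it to files is then just `run(new BufferedReader(new FileReader("in.txt")), new BufferedWriter(new FileWriter("out.txt")))`.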



Code without threads:
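(Again the listing was lost; a minimal sketch of the sequential version, with the same placeholder transform and assumed names:)

```java
import java.io.*;

public class SimpleFilter {
    static String transform(String line) {
        // placeholder for the real replace logic
        return line.replace('(', '[').replace(')', ']');
    }

    public static long filter(BufferedReader in, BufferedWriter out) throws IOException {
        long count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            out.write(transform(line));
            // note: newLine() writes the *platform* separator; if the input used
            // bare '\n' and this runs on Windows, the output file grows
            out.newLine();
            count++;
        }
        out.flush();
        return count;
    }
}
```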


Regards,
Megha

[ June 06, 2007: Message edited by: megha joshi ]
[ June 07, 2007: Corrected Thread Code]
[ June 07, 2007: Message edited by: megha joshi ]
 
steve souza
Ranch Hand
Posts: 862
Before writing a multithreaded app I would do the simpler sequential process of reading a line, processing it, and writing it. Profile this and tune accordingly.

Also, I'm no scripting guru, but would it be easier to do this in a scripting tool such as AWK?
[ June 06, 2007: Message edited by: steve souza ]
 
megha joshi
Ranch Hand
Posts: 206
Hi,

I used the simplest core-level functions without threads (the second code listing) to fine-tune performance, and then was trying to get threads to work to reduce the time by splitting the work..

I have a Perl program doing the same thing in less time, but once it's combined with a series of modifications beyond the bracket replacements, Perl takes more time. I don't know if it's the I/O or the replacements.

I was looking for a Java alternative to reduce the time...

Thanks,
Megha
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Thank you for using code tags, however the indentation in your code jumps around rather randomly; it's hard to read sometimes. I think this may be because you're using a mixture of tabs and spaces, and your tabs are set differently than the web browser displays them. I recommend that you either use no tabs for indentation, or only tabs. I prefer no tabs, since by default tabs take 8 spaces, and that's much more than you really need. And not everyone has a really big display; some people use laptops, so really wide lines aren't a good idea.

I see several empty catch blocks in your code. These are almost always a bad idea, as they make it needlessly difficult to find errors when they occur. A simple alternative is to use printStackTrace():
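For example (a sketch; the surrounding method and file name are hypothetical):

```java
import java.io.*;

public class CatchDemo {
    // returns true if the file could be opened; never swallows the exception
    public static boolean tryOpen(String name) {
        try {
            BufferedReader in = new BufferedReader(new FileReader(name));
            in.close();
            return true;
        } catch (IOException e) {
            e.printStackTrace();  // at minimum, report the error instead of an empty catch block
            return false;
        }
    }
}
```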


[Megha]: 1) The output file I get is bigger than the input file, and I don't know what the error is.

Is this true for both your programs, or just the threaded version? Have you tried looking at the contents of the files, to see how they compare? I realize you don't want to look at all of a 1.77 GB file, but try looking at the beginning, at least, and see how similar they look. You may get important clues about what's going on.

It looks to me like your threaded code is reading the first line of the file over and over. I would think that you want to read each line once, instead.

As Steve says, you should really get the non-threaded version working correctly first, then see if you can improve its speed, before trying the threaded version. The non-threaded version should be a lot simpler.

To improve speed: a profiler would be a very useful tool here. But even if you don't have it, you can run a few simple tests by just commenting out a few lines of code:

1. How long does it take to read every line in the input file, and do nothing with them?
2. How long does it take to read every line in the input file and write it to the output file, with no character replacement?
3. How long does it take to do the same thing with character replacement?

In this way, you can discover which parts of this process are important to speed up, and which are not. I can imagine some ways to speed up the character replacement, but in all likelihood they are unimportant, because you're spending almost all the time reading and writing.

For the threaded version: if you want to limit the amount of data that can be read in at once before it's written out, try using a LinkedBlockingQueue instead. You can experiment with the size of the queue, to see what size gives you the best performance, or if it matters at all.
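A sketch of that approach (bounded queue, with a unique sentinel object to signal end of input; the names are illustrative):

```java
import java.io.*;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockingQueueCopy {
    private static final String EOF = new String("EOF");  // unique sentinel, compared by ==

    public static void copy(BufferedReader in, BufferedWriter out, int queueSize)
            throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(queueSize);

        Thread reader = new Thread(() -> {
            try {
                String line;
                while ((line = in.readLine()) != null) queue.put(line); // blocks when full
                queue.put(EOF);
            } catch (Exception e) { e.printStackTrace(); }
        });
        reader.start();

        // the current thread plays the writer
        String line;
        while ((line = queue.take()) != EOF) {  // take() blocks when empty
            out.write(line);
            out.newLine();
        }
        out.flush();
        reader.join();
    }
}
```

The queue capacity (`queueSize`) is exactly the knob Jim describes experimenting with.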

If you print "reading from queue" and "writing from queue" every time you read or write a line, that's going to slow things down quite a bit. That's fine for debugging while you're trying to get this to work, but I hope you comment those lines out later when you want it to be fast.
[ June 06, 2007: Message edited by: Jim Yingst ]
 
megha joshi
Ranch Hand
Posts: 206
Thanks all,

I didn't realize it's reading the same line again and again; I will compare it against the non-threaded version and correct it. I don't have a way to look at the contents of the file, as such a big file doesn't load, but I guess I should start with a smaller file first.

I will also try the debugging tricks and then post the code and results if I have more queries.

Thanks for helping. Any more suggestions would also be great.
Megha
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala
How long does it take?
How fast is a plain hard-drive copy, by comparison?

My first ideas are sed and tr.
They're a standard part of most Unix/Linux installations, and available for the Win32 platform too, here:

http://unxutils.sourceforge.net/

here are the commands:
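(The commands themselves were not preserved; assuming the job is the bracket replacement discussed above, the sed version would look something like:)

```shell
sed 's/(/[/g; s/)/]/g' input.txt > output.txt
```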

even shorter and much faster on my machine using tr:
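(Presumably along the lines of:)

```shell
cat input.txt | tr '()' '[]' > output.txt
```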

cat might be type on your platform.

sed: 4s/ 10 MB
tr: 0.3s/ 10 MB

(btw: Using only tabs, and a laptop, and 8 spaces)
 
megha joshi
Ranch Hand
Posts: 206
Hi all,


This program is part of a bigger program which does the following steps:
1) Reads entries from file1 and builds a hash table.
2) Reads each line of file2 and substitutes values for any of the keys it finds, based on the hash table built from file1.
3) Does additional regex-based replacements on each line.
4) Writes the modified string line by line to file3.

I feel I/O is the bottleneck for now and want to find an efficient way of doing it... later I will focus on the hash lookups and substitutions...

With this simple program, can anyone guide me on how to improve the I/O performance for reading file2, which has up to 3 million lines?

Currently the file-reading part takes around 2 minutes and the file-writing part takes 2 minutes on Solaris; the whole program takes 4 minutes total.
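(The listing is gone; a minimal stand-in for the buffered read-and-write loop with a simple timer, so the read and write costs can be measured separately as Jim suggests — the file names are assumptions:)

```java
import java.io.*;

public class TimedCopy {
    // copy input lines to the output, returning how many lines were seen
    public static long copyLines(BufferedReader in, BufferedWriter out) throws IOException {
        long lines = 0;
        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);
            out.newLine();
            lines++;
        }
        out.flush();
        return lines;
    }

    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        BufferedReader in = new BufferedReader(new FileReader("file2.txt"));   // assumed name
        BufferedWriter out = new BufferedWriter(new FileWriter("file3.txt"));  // assumed name
        long n = copyLines(in, out);
        in.close();
        out.close();
        System.out.println(n + " lines in " + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Commenting out the two `out` lines in the loop gives the read-only timing; restoring them gives read-plus-write; adding the replacement call gives the full cost.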



Thanks for helping,
Megha
[ June 06, 2007: Message edited by: megha joshi ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Since you're using JDK 5 or 6, it's worth a try to use a Scanner instead of the Readers, and for output create a PrintWriter using a constructor that takes a File. These may be faster than other combinations; I'm not sure. Try it and see.
 
megha joshi
Ranch Hand
Posts: 206
Thanks Jim,

I tried it, but there was no significant improvement... would there be a reason why, for reading and writing text files line by line, Java I/O wouldn't be faster than Perl I/O?

My modified code is as below...
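(The modified code was lost; following Jim's suggestion it would presumably look something like this sketch, with Scanner for input and a PrintWriter built from a File for output — names assumed:)

```java
import java.io.*;
import java.util.Scanner;

public class ScannerCopy {
    public static long pipe(Scanner in, PrintWriter out) {
        long lines = 0;
        while (in.hasNextLine()) {
            out.println(in.nextLine());
            lines++;
        }
        out.flush();
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Scanner for input, PrintWriter constructed from a File for output
        Scanner in = new Scanner(new File("file2.txt"));
        PrintWriter out = new PrintWriter(new File("file3.txt"));
        System.out.println(pipe(in, out) + " lines copied");
        in.close();
        out.close();
    }
}
```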



Thanks,
Megha
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
When I worked on parsing large text files I used the following Threads:

1. A thread reading bufferloads of text and putting them in a queue, sleeping for a while when the queue reaches an arbitrary size.

2. A thread taking bufferloads of text from thread 1 and processing them, writing bufferloads of results to a queue for thread 3.

3. A thread writing the output bufferloads, sleeping when none are available.

This managed to keep a dual processor system pretty busy.
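(A sketch of that three-stage architecture, using blocking queues of line batches in place of the hand-rolled sleep logic; the batch size, queue bounds, and bracket transform are all illustrative:)

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ThreeStagePipeline {
    private static final List<String> EOF = new ArrayList<>();  // unique end marker, compared by ==
    private static final int BATCH = 1000;                      // lines per batch (arbitrary)

    static String transform(String line) {
        return line.replace('(', '[').replace(')', ']');        // placeholder processing
    }

    public static void run(BufferedReader in, BufferedWriter out) throws Exception {
        BlockingQueue<List<String>> raw = new LinkedBlockingQueue<>(8);
        BlockingQueue<List<String>> done = new LinkedBlockingQueue<>(8);

        Thread reader = new Thread(() -> {      // stage 1: read batches of lines
            try {
                List<String> batch = new ArrayList<>(BATCH);
                String line;
                while ((line = in.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == BATCH) { raw.put(batch); batch = new ArrayList<>(BATCH); }
                }
                if (!batch.isEmpty()) raw.put(batch);
                raw.put(EOF);
            } catch (Exception e) { e.printStackTrace(); }
        });

        Thread processor = new Thread(() -> {   // stage 2: transform batches
            try {
                while (true) {
                    List<String> batch = raw.take();
                    if (batch == EOF) { done.put(EOF); break; }
                    List<String> result = new ArrayList<>(batch.size());
                    for (String s : batch) result.add(transform(s));
                    done.put(result);
                }
            } catch (Exception e) { e.printStackTrace(); }
        });

        Thread writer = new Thread(() -> {      // stage 3: write batches
            try {
                while (true) {
                    List<String> batch = done.take();
                    if (batch == EOF) break;
                    for (String s : batch) { out.write(s); out.newLine(); }
                }
                out.flush();
            } catch (Exception e) { e.printStackTrace(); }
        });

        reader.start(); processor.start(); writer.start();
        reader.join(); processor.join(); writer.join();
    }
}
```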

Bill
 
megha joshi
Ranch Hand
Posts: 206
Thanks William,

If you see my first post... I have used threads and a queue in the same way you describe... but I guess my I/O methods are not efficient... what methods did you use for I/O?

Thanks,
Megha
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
Since I knew the input text was all ASCII, I read bytes directly into buffers. When you read characters, Java has to individually convert each byte to a char - this burns CPU time like crazy.

As near as I can tell from your description, you are not actually using the architecture I described - you have left out the processing Thread that works while Threads 1 and 3 wait on IO.

Be aware that a Thread doing Java IO spends a lot of time waiting for the operating system - especially if there is a network involved. I bet that if you look at the actual CPU utilization, you will find it is not very high.

Bill
 
megha joshi
Ranch Hand
Posts: 206

Since I knew the input text was all ASCII, I read bytes directly into buffers. When you read characters, Java has to individually convert each byte to a char - this burns CPU time like crazy.


Ultimately I want to do string lookups and string conversions on each line of the file... can you please elaborate on how I can do this while reading bytes... and not converting to chars... What combination of core API functions do you use?
I read somewhere that using FileInputStream directly would be more overhead, so a Reader or Scanner would be better for reading strings from a text file...

thanks,
Megha
 
megha joshi
Ranch Hand
Posts: 206
After some more digging I think I can't use FileInputStream with my own buffer, or BufferedInputStream, because I want each line of the file one by one so I can process it... and I don't see how to do that with low-level byte or byte[] I/O.

So I am still wondering how to increase I/O performance...
[ June 07, 2007: Message edited by: megha joshi ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
[Megha]: would there be a reason why for reading and writing text files line by line, Java I/O wouldnt be faster than perl I/O...

Well, Perl is generally pretty good at that sort of thing to begin with; there's no particular reason to assume you can do it faster in Java. Maybe you can, or maybe you can't, but don't underestimate Perl's speed at processing text files. That's what it was originally designed for.

Multiple threads may still help you in Java even if the task isn't CPU-bound. If you read from one file and write to another file, you may spend some time waiting for the hard drive to move back and forth from one location to another. By using a LinkedBlockingQueue and letting one thread read many lines at once, then letting another thread write many lines at once, it's possible for the application to spend less time waiting for the hard drive to reposition, and more time just reading and writing. So multiple threads may still help you once you have them set up correctly.

[Megha]: After some more digging I think I can't use FileInputStream with my own buffer, or BufferedInputStream, because I want each line of the file one by one so I can process it... and I don't see how to do that with low-level byte or byte[] I/O.


Well, it's possible, but you have to write more code yourself. Loop through the byte[] array and look for '\n' and '\r' characters indicating a new line. In some cases you will have to cast a char to byte:

byte b = (byte) ch;

which should be pretty easy, as long as the chars are all simple ASCII characters. (Or more properly, ISO 8859-1 characters.) If you have any characters outside the range 0-255, this will mutilate them badly.
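A minimal sketch of that byte-level line splitting (the class and method names are made up; it decodes each line as ISO 8859-1, per the caveat above):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteLines {
    // split a buffer of ISO 8859-1 bytes into lines, handling \n, \r, and \r\n
    public static List<String> splitLines(byte[] buf, int len) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == '\n' || buf[i] == '\r') {
                lines.add(new String(buf, start, i - start, StandardCharsets.ISO_8859_1));
                if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n') i++; // swallow the \n of \r\n
                start = i + 1;
            }
        }
        if (start < len) {  // trailing text with no final newline
            lines.add(new String(buf, start, len - start, StandardCharsets.ISO_8859_1));
        }
        return lines;
    }
}
```

In a real streaming loop you would also have to carry a partial line over from the end of one buffer to the start of the next.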
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
Ultimately I want to do string lookups and string conversions on each line of the file... can you please elaborate on how I can do this while reading bytes... and not converting to chars..


If each line must be treated as a String, then you might as well use one of Java's line-to-String reading methods. Some fiddling with the buffer size may marginally improve performance.

As Jim said, your program may spend a lot of time waiting for the operating system to write to the disk. That is why I used a separate thread for each of the processes - processing a collection of lines can proceed in its own thread while the IO threads are waiting on the operating system.

Your input Thread might read - for example - 1000 lines, creating a String[] that can be placed in a Queue for a processing Thread.

I am assuming we are talking about a Windows system here - have you been using the Task Manager to monitor your program's actual CPU utilization? If your program is I/O bound, you may be surprised at how low the utilization really is.

Bill
 