
2 Very big files

 
Erik Pragt
Ranch Hand
Posts: 125
Hello everyone.
I'm currently creating an application for checking the availability of ADSL in a person's zipcode area. All the zipcodes are stored in 2 files (e.g. aol.txt and compuserve.txt), and the intention is to read through each file (they're a sort of CSV file, but the layout differs per file), save the zipcode, and save the corresponding result (for example: available, not available, available in June, etc).
Now the problem is that when I read the files, everything is very slow! (I must read 500,000 lines, or approximately 25 MB.) Therefore I want to put all the read lines in memory. I tried putting them in a Hashtable, something like this:
Key = <zipcode>
Object = (String) aol | not available
(the ADSL provider and the result separated by a |)
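Roughly, the loading part looks something like this (simplified a lot - the separator and the parsing differ per file, and the real code has more error handling; needs java.io.* and java.util.*):

// Simplified loading loop for one provider file
void loadProvider(Map zips, String fileName, String provider) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(fileName));
    String line;
    while ((line = in.readLine()) != null) {
        int sep = line.indexOf(';');                   // separator is just an example here
        if (sep < 0) continue;                         // skip lines that don't match
        String zip = line.substring(0, sep).trim();
        String result = line.substring(sep + 1).trim();
        zips.put(zip, provider + "|" + result);        // e.g. "aol|not available"
    }
    in.close();
}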
But when I try this, I keep getting the message: Out of Memory. And I only create 3 objects! This problem has been keeping me busy for the last 4 weeks, so I hope someone can please help me... I'm getting a bit desperate!
Greetings,
Erik

(ps, I can include the full code if you want, or show you the files I have to go through... just let me know!)
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Well, you have a number of choices, depending on how much time you want to spend, and whether memory or speed is more important to you. A few simple options: try increasing the allowed heap size using the -mx option on the java command. Make sure that for the provider string, you don't store a new string each time; you should be able to use a reference to an already-existing string, so you don't have 500,000 copies of "aol" in memory (just 500,000 references to it). You can also probably reduce memory usage by storing the zip code as an Integer rather than a String (assuming you're operating only in the US or other countries that don't use letters within the zip). You may also benefit from replacing your Hashtable with a HashMap, but offhand I don't think this will make a big difference for memory usage.
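Just to illustrate the reuse idea (zip, provider and result below stand for whatever your parsing already produces, and the Integer key only works for purely numeric zips):

// Set up once:
Map valueCache = new HashMap();           // "aol|not available" -> one shared instance
Map zips = new HashMap();

// For each record (zip, provider, result come from your own parsing):
String value = provider + "|" + result;
String shared = (String) valueCache.get(value);
if (shared == null) {
    valueCache.put(value, value);         // first time we've seen this combination
    shared = value;
}
zips.put(Integer.valueOf(zip), shared);   // shared reference, not a fresh String per line

There are only a handful of distinct "provider|result" combinations, so the cache stays tiny while saving a String object per record.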
Now for 500,000 records you may well find that the memory costs are still too high to put everything in RAM. If you're writing a program that will run on a limited number of machines where you have some control over the environment, you may find the best solution is to set up a database and store all your data in a table. Alternately, if you don't have access to a good database, or you're distributing this program to other systems that won't have a DB, you can implement your own scheme for storing the data in files but with quicker access.
One simple scheme: split the big files into many small files with names based on the first 2 or 3 digits of the zip. E.g. info for zip code 95050-6233 might go in file zips950.lst (along with all other zip codes starting with 950). On startup you can read through the big files and copy each entry into the appropriate small file. To look up a particular zip code, figure out the corresponding file name it would be in, and then read each line of that file to see if it's there. The more zip digits you use in the filename, the more files you'll have, and the fewer entries you'll have to read through in each one.
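Something along these lines for the one-time splitting pass (parseZip() is a stand-in for your own parsing, and for speed you'd probably keep a Map of open writers rather than reopening a file for every line):

// Append each record to a small file named after the first 3 characters of its zip.
void splitBigFile(String bigFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(bigFile));
    String line;
    while ((line = in.readLine()) != null) {
        String zip = parseZip(line);                   // stand-in for your own parsing
        FileWriter out = new FileWriter("zips" + zip.substring(0, 3) + ".lst", true);  // true = append
        out.write(line);
        out.write('\n');
        out.close();
    }
    in.close();
}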
Another more complex option is to develop an indexing system. First you need all your data in one big file, ordered by zip code. Then go through it and for every Nth record figure out the offset in bytes from the beginning of the file, and store this in either an index file, or something like a TreeMap. Later when you want to look up a particular zip, you find the offset of the nearest indexed zip before your target. Open a new FileInputStream / BufferedInputStream combo and use skip() to get to the desired offset. (Read the API carefully - you need to call skip() in a loop to make sure you've skipped the desired number of bytes.) Once you're at the desired offset, start reading lines from that point, and stop when you get to the zip you want. (Or if you find a higher zip value, you'll know your target wasn't there, since the file was sorted.) Be sure to close the streams when you're done, so you'll be able to open new ones the next time you need to do this search.
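The skip() loop in particular is easy to get wrong; roughly like this, inside whatever method handles the IOException (offset comes from your index and is assumed to be the byte position of the start of a line):

// skip() may skip fewer bytes than requested, hence the loop.
BufferedInputStream in = new BufferedInputStream(new FileInputStream("zips-sorted.txt"));
long remaining = offset;                  // looked up from the index file or TreeMap
while (remaining > 0) {
    long skipped = in.skip(remaining);
    if (skipped <= 0) break;              // end of file, or stream can't skip further
    remaining -= skipped;
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
// ...read lines from here until you reach the target zip, or pass where it should have been...
reader.close();                           // also closes the underlying streams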
It may seem tempting to use RandomAccessFile in the above example, since that has the ability to look back at an earlier record, and you won't have to keep opening and closing streams. But I don't recommend it. My experience with RAF is that it's horribly slow in general, and it's ultimately quicker to open new streams, seek, read, and close the streams. Unless they've substantially improved RAF since 1.2.2. Don't hold your breath though.
Numerous variations are possible for each of the above ideas - it's up to you to determine which best fits your needs. Good luck.
 
Erik Pragt
Ranch Hand
Posts: 125
Jim Yingst,
Thank you for your complete answer to my question. I will try to respond to your solution, although my English isn't that good at the moment.
The idea of the application is to distribute it on a few laptops, all very slow and very limited in memory (Windows NT 4.0, PII 300, 64 MB of RAM). The machines do not have a database, and sometimes the files have to be updated by someone else, because the machines will be distributed throughout the country (I live in the Netherlands, and yes, we do have zipcodes with letters in them, like 1311RA). The machines are not connected to the internet, unfortunately, else it wouldn't be much of a problem.
I created a Flash frontend (a keyboard, with some input and output fields), and the Java app is a Servlet running on JRun 3.0 with JDK 1.4.0.
I tried to run it on the laptop, but after a few seconds the OS complained about the memory usage, tried to increase the virtual memory, and the machine stopped working. This happened while I was trying out my most optimized dummy application, just storing a lot of data in memory without creating a lot of objects (maybe one or two per zipcode). Btw, Runtime.getRuntime().freeMemory() gave the impression I still had enough memory, although I was only checking once every 2000 records. Still the program crashed with an OutOfMemoryError. So I don't think putting everything in memory (although very fast on my Solaris machine) is a very good idea.
I also thought about splitting everything up into files, maybe in files of 500-1000 zipcodes (e.g. zipcodes 1000AA-1999ZZ in file 1, 2000AA-2999ZZ in file 2, etc). Personally, I think this is the best solution. The only problem I have (I think) is handling the difference in provider, unless I split the files up per provider. I can't explain it very well, so I'll give you an example.
Let's say I have 2 big files, kpn.txt, and versatel.txt. Btw, both files are NOT sorted.
I can split them up, and merge them back together:
I can run through kpn.txt and split all the zipcodes into 10 files, each with a range of 1000 zipcodes. After that, I can process versatel.txt, merge it into the split kpn files, and have my program look through the merged files. But then I don't know which zipcode line belongs to which provider, so I have to add an identifier to each row.
Another option is to process kpn.txt, split it into 10 files, and do the same for versatel.txt. This way I don't have to merge the files, but the disadvantage is that I have to search through 2 files when looking up a zipcode instead of 1, though I'm not sure if that's really a problem. I think not. What do you think, or do you maybe have another option?
I don't know if your last option is a really good idea, because my files are not sorted. The kpn file starts with 6811AA and e.g. ends with 2024TR. There's no real logic in it, and I don't know if sorting the file in my Java app (e.g. using quicksort or some other algorithm) is the best way to do it.
To conclude, I thank you again for your 'uitgebreide' (extensive) explanation, and if I don't hear from you again, I think I will go for the split files option.
Thanks,
Erik Pragt
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Erik- I wouldn't feel any need to apologize for your English skills. You were quite clear and articulate. Well, except for 'uitgebreide' of course.
If the files are not sorted to begin with, then file splitting does seem like the best option. Especially if you've been working on this for a while now and may be nearing some sort of deadline - the simplest solution that meets your needs is probably best at this point. (This is usually true in general as well, I think.)
Btw, the Runtime.getRuntime().freeMemory() gave the impression I still had enough memory, although I was only checking one every 2000 records.
If you just look at freeMemory() it can be misleading. You won't necessarily see it steadily decreasing as you load records, because the JVM may instead keep requesting more memory from the OS. That is, until you reach the maximum memory allowed to your application; then you will start seeing freeMemory() steadily decrease. But if you only got reports every 2000 records, you may not have seen enough data points in this phase to see the trend clearly. An easy way to get ongoing reports of memory usage is to run java using the -verbose:gc option. Alternately, try looking at totalMemory() - freeMemory() instead - you should see this steadily increasing as records are loaded.
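Something like this every couple of thousand records makes the trend much easier to see (count being whatever record counter you already keep):

// Used memory (total - free) should climb steadily as records are loaded.
Runtime rt = Runtime.getRuntime();
long usedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
long totalKb = rt.totalMemory() / 1024;
System.out.println("records=" + count + "  used=" + usedKb + " KB  total=" + totalKb + " KB");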
Regarding the file splitting, I understand your dilemma as to where to specify the provider. I think it depends on whether the user is more likely to be looking up one zip to find all providers for that zip, or just wants to know whether versatel is available at a given zip, not caring about the others. My own gut instinct is to put all providers in the same set of 10 (or 100, or whatever) files, and add an attribute or attributes to specify the providers for each zip. If the files are unsorted it probably doesn't matter much right now - would you rather search one file with 2000 entries, or two files with 1000 entries? Coding the former sounds simpler, but you just shift a bit of the complexity over to the creation of the split files, and the interpretation of the extra attribute. I think these probably work out about the same in terms of complexity and performance.
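Just to make the attribute idea concrete - the line layout below is made up, and the file name would come from something like the ZIP_DIVISOR calculation further down:

// Each line in a split file might look like:
//   1311RA|kpn=available|versatel=available in June
// Returns the provider attributes for a zip, or null if the zip isn't in its file.
String lookup(String fileName, String zip) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(fileName));
    try {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(zip + "|")) {
                return line.substring(zip.length() + 1);   // "kpn=available|versatel=..."
            }
        }
        return null;
    } finally {
        in.close();
    }
}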
Incidentally, I'd suggest you make the "split ratio" for the zip codes configurable. 10 files of 1000 zips may work well, but you may later find that 100 files of 100 zips is much better. Or even 50 of 200. You could define a constant ZIP_DIVISOR, and calculate
String fileName = baseFileName + (numericPart(zipCode) / ZIP_DIVISOR) + ".lst";
to identify which file a given zip would go in. This gives you something you can tune later.
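Since your zips have a letter part, numericPart() would just take the leading digits - a sketch, assuming the Dutch format ("1311RA"):

static final int ZIP_DIVISOR = 100;        // 100 numeric values per file - tune this later

static int numericPart(String zipCode) {
    return Integer.parseInt(zipCode.substring(0, 4));   // "1311RA" -> 1311
}

// then, as above:
// String fileName = baseFileName + (numericPart(zipCode) / ZIP_DIVISOR) + ".lst";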
Enjoy...
 