
Read Some 10,000 Files

 
JiaPei Jen
Ranch Hand
Posts: 1309
I have to read some 10,000 files (the exact number is uncertain) and retrieve information from each of them. I am going to do the reading in a loop.
Since I do not know exactly how many files there are, what should I do? If I use a Vector, will the Vector get too long?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
It really depends on how much information you want to save for each of these files. 10,000 elements in a Vector (or better yet, an ArrayList) is no problem. But if, for example, you're saving the complete contents of each file as a String and putting that in the Vector, then you'll use a lot more memory. (As an estimate, take the total size of all the files on disk and double it, for the ASCII-to-Unicode conversion.) That's still possible, especially if the files are small, but questionable. You might discover it's not really necessary to save all that info anyway. It depends on what data you need and what you're doing with it...
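As a rough illustration of that estimate, here is a minimal Java sketch. The class name `MemoryEstimate`, its method names, and the 20 KB average file size are assumptions for illustration, not figures from this thread. The doubling assumes two bytes per character; newer JVMs can store Latin-1-only strings more compactly, so treat the result as an upper bound.

```java
import java.io.File;

public class MemoryEstimate {
    /** Doubles the on-disk size, assuming each ASCII byte becomes a two-byte Java char. */
    static long estimateStringBytes(long totalBytesOnDisk) {
        return totalBytesOnDisk * 2;
    }

    /** Sums the sizes of the regular files directly inside dir (0 if dir cannot be listed). */
    static long totalBytesOnDisk(File dir) {
        File[] entries = dir.listFiles(); // null if dir is missing or not a directory
        long total = 0;
        if (entries != null) {
            for (File entry : entries) {
                if (entry.isFile()) {
                    total += entry.length();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical figures: 10,000 files averaging 20 KB each.
        long assumedTotal = 10_000L * 20 * 1024;
        System.out.println("Assumed total on disk: " + assumedTotal + " bytes");
        System.out.println("Rough size as Strings: " + estimateStringBytes(assumedTotal) + " bytes");

        // Optionally measure a real directory passed on the command line.
        if (args.length > 0) {
            long measured = totalBytesOnDisk(new File(args[0]));
            System.out.println("Measured total on disk: " + measured + " bytes");
            System.out.println("Rough size as Strings: " + estimateStringBytes(measured) + " bytes");
        }
    }
}
```

With the assumed figures this comes to roughly 400 MB of character data, which is why holding every resume body in memory at once is questionable.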
 
JiaPei Jen
Ranch Hand
Posts: 1309
Those are resume files. I have to retrieve
1. first name, 2. last name, 3. telephone number, 4. e-mail address and 5. work experience (This is where the problem is. I have to dump the whole resume body in this field)
and create a .txt file. This .txt file will later on be uploaded to a relational database.
By the way, given that the .txt file will be loaded into a relational database, how should I arrange the fields of each record in the .txt file I am creating?
For example, what should be used to separate the name and the telephone number? Space(s), or something else? Should I use a special character at the end of each record?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Well, that does sound like it will eat up a lot of memory if you put it all into one Vector (thanks to the work experience field, mostly). But if you're going to write a text file from this which will later be used to upload to a database - couldn't you just add a new record to the text file for each file you read? Once you do that, is there really any need to also save the data in the Vector as well? (All of it?)
For the text file, you have many choices possible. A common choice nowadays would be to use some sort of XML. Alternately you can just pick some simple delimiter(s), assuming you can find characters that will never be found inside any of the other fields. Good choices here are '|' or maybe '~' or '^'. If it were my program, I'd probably separate fields with '|' and use '~' plus a newline to separate records. I assume a newline by itself isn't enough, since the work experience may contain newlines. If there are no newlines in the work experience, then you can use newlines as record separators.
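A minimal Java sketch of that delimiter scheme, writing one record at a time so nothing needs to accumulate in a Vector. The class name `RecordWriter`, the method `formatRecord`, and the sample values are hypothetical. Rejecting any field that contains '|' or '~' is a deliberately conservative assumption, since the scheme only works if those characters never occur in the data.

```java
public class RecordWriter {
    static final char FIELD_SEPARATOR = '|';
    static final String RECORD_TERMINATOR = "~\n"; // '~' plus a newline ends each record

    /**
     * Joins the fields with '|' and appends "~\n".
     * Throws if a field contains either delimiter character.
     */
    static String formatRecord(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            String field = fields[i];
            if (field.indexOf('|') >= 0 || field.indexOf('~') >= 0) {
                throw new IllegalArgumentException(
                        "Field " + i + " contains a delimiter character: " + field);
            }
            if (i > 0) {
                sb.append(FIELD_SEPARATOR);
            }
            sb.append(field);
        }
        return sb.append(RECORD_TERMINATOR).toString();
    }

    public static void main(String[] args) {
        // Made-up sample data; the last field may span several lines.
        String record = formatRecord(
                "Ann", "Lee", "555-0100", "ann@example.com",
                "Acme Corp, 1998-2001\nBuilt reporting tools");
        // In the real program this string would be written to the output
        // file (for example through a BufferedWriter) right after each
        // resume is parsed, instead of being kept in memory.
        System.out.print(record);
    }
}
```

Because each record is written out as soon as its resume has been parsed, memory use stays at about one resume at a time regardless of how many files there are.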
 
JiaPei Jen
Ranch Hand
Posts: 1309
Thanks for the ideas.
 
Leslie Chaim
Ranch Hand
Posts: 336
Not exactly help, but...
This problem can probably be solved in Perl with the Perl DBI module. Text processing of this nature comes so naturally to Perl that I'd say your Java program will be at least three times bigger than the Perl script.
 
JiaPei Jen
Ranch Hand
Posts: 1309
Jim, while I read those 10,000 or so files in a loop, how do I know when there are no more files to read? I mean, how do I end the loop?
Leslie, I am going to see if people agree on using Perl.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
"Jim, while I read those 10,000 or so files in a loop, how do I know when there are no more files to read? I mean, how do I end the loop?"
Well, how do you know what files to read in the first place? I assume you've got some sort of array or Collection containing file names or File objects. Use whatever end condition is appropriate to the data structure you're using.
A likely scenario: if you've got all the files to be read in a single directory, create a File object representing the directory. Then use the list() method (or listFiles()) to get an array of all the files in the directory, and loop through the array:
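A minimal sketch of that approach (not the code originally posted; the class name `ResumeLister` and the method `listResumeFiles` are hypothetical). The loop ends when the array returned by `listFiles()` is exhausted, so the number of files never has to be known in advance.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResumeLister {
    /** Returns the regular files directly inside dir, sorted by path. */
    static List<File> listResumeFiles(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            // listFiles() returns null if dir is not a directory or cannot be read.
            throw new IllegalArgumentException("Not a readable directory: " + dir);
        }
        Arrays.sort(entries); // listFiles() makes no ordering guarantee
        List<File> files = new ArrayList<>();
        for (File entry : entries) {
            if (entry.isFile()) { // skip subdirectories
                files.add(entry);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        List<File> files = listResumeFiles(dir);
        // The for-each loop stops by itself after the last file.
        for (File file : files) {
            System.out.println("Would read: " + file.getName());
        }
        System.out.println(files.size() + " file(s) found");
    }
}
```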

"Leslie, I am going to see if people agree on using Perl."
Leslie's probably right in general. All other things being equal, Perl would probably be simpler. But this depends on a lot of issues, like: How well do you know Perl? How well do you know Java? How much time and motivation do you have to improve your skills in a new language? Are there other people who will need to work with your code as well? Will they be more comfortable with Perl or Java? How important is code maintenance for your project?
Generally, I think Perl is more suited to small projects by one person (or a limited number of skilled programmers who work together well), while Java is better for larger projects with more people. A Perl solution will probably be shorter and simpler, but harder to read, debug, and maintain when looked at by someone who didn't write it. (Or even by the person who did write it, after a few weeks have gone by.) But if you & your co-workers are much more knowledgeable of one language than the other, that's probably the language to use unless you have time & desire to improve your skills in a new language.
[ December 13, 2002: Message edited by: Jim Yingst ]
 