
Text files - setting my own record separator

 
Leslie Chaim
Ranch Hand
Posts: 336
Is there any Java API where I can set the record separator to an arbitrary String or, heavens, a regex?
I have a case where:

would read a line of 7532656 bytes and I would like to read them in bits and say something like:

Thanks,
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I don't know of a simple way to do this with the standard library, unless you're willing to load the whole line into memory. (Once you've got the complete String you can just use split(String pattern), or use java.util.regex.Pattern and Matcher.) However, it shouldn't be too complex to make a RecordReader class which operates similarly to a BufferedReader, but with a configurable pattern as separator.
The main difficulty is probably in dealing with the end of each read. When you do read(char[]), for example, you don't know in advance whether the last chars read are the end of a record or the first part of a record separator. So after scanning the chunk for record separators, you'll also need to retain the last n chars and put them at the beginning of the next read(), so that you can rejoin the two halves of any record separator which was split between reads.
I'm not sure what n should be - maybe use a default like 10 or 20, and make it configurable. If you use a simple fixed-length String as record separator, then n can simply be the length of the String. But if you use a variable-length Pattern, n may be much harder to derive.
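The carry-over idea described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the class name RecordReader is borrowed from the post, but the constructor, the readRecord() and splitAll() methods, and the chunk size are all hypothetical. It assumes a fixed-length String separator, keeps the unfinished record in a buffer, and resumes the separator search n - 1 chars before the end of the old data (where n is the separator length), so that a separator split across two reads is still found.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: reads records delimited by a fixed String separator. */
class RecordReader {
    private final Reader in;
    private final String separator;
    private final char[] chunk;
    // Chars read but not yet returned; may end with the first part of a separator.
    private final StringBuilder pending = new StringBuilder();
    private int scanFrom = 0;     // where the next separator search starts
    private boolean eof = false;

    RecordReader(Reader in, String separator, int chunkSize) {
        if (separator.isEmpty() || chunkSize < 1) {
            throw new IllegalArgumentException("need a non-empty separator and chunkSize >= 1");
        }
        this.in = in;
        this.separator = separator;
        this.chunk = new char[chunkSize];
    }

    /** Returns the next record, or null when the input is exhausted. */
    String readRecord() throws IOException {
        while (true) {
            int idx = pending.indexOf(separator, scanFrom);
            if (idx >= 0) {                       // complete record available
                String record = pending.substring(0, idx);
                pending.delete(0, idx + separator.length());
                scanFrom = 0;
                return record;
            }
            if (eof) {                            // flush whatever is left
                if (pending.length() == 0) return null;
                String last = pending.toString();
                pending.setLength(0);
                scanFrom = 0;
                return last;
            }
            // The last (n - 1) chars may be the start of a separator,
            // so the next search resumes there instead of at the very end.
            scanFrom = Math.max(0, pending.length() - (separator.length() - 1));
            int n = in.read(chunk);
            if (n < 0) eof = true; else pending.append(chunk, 0, n);
        }
    }

    /** Demo helper: pushes an in-memory String through the streaming path. */
    static List<String> splitAll(String input, String separator, int chunkSize) {
        List<String> out = new ArrayList<>();
        try {
            RecordReader rr = new RecordReader(new StringReader(input), separator, chunkSize);
            for (String r; (r = rr.readRecord()) != null; ) out.add(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // A chunk size of 4 forces the "--" separator to straddle two reads.
        System.out.println(splitAll("one--two--three", "--", 4));
    }
}
```

With this approach only one record (plus one chunk) is held in memory at a time, rather than the whole line. Supporting a variable-length Pattern would need a different strategy for deciding how much to carry over, which this sketch does not attempt.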
It seems quite possible that someone has already put an open-source version of something like this under jakarta commons, sourceforge, or somewhere like that. I didn't find it in a brief search, but you may want to spend more time at it.
[ February 24, 2003: Message edited by: Jim Yingst ]
 
Michael Morris
Ranch Hand
Posts: 3451
Another possibility, since you are using a Reader, would be to use a StringTokenizer. Instead of using a regex, you just set the delimiter(s) to whatever ends the record.
Michael Morris
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The biggest problem with StringTokenizer is that it can't work with multi-character delimiters - which seems to be what Leslie is asking for. E.g. if the separator is "xxx_foo_xxx" and you have the input in a String str, you can split up the records with
String[] records = str.split("xxx_foo_xxx");
This works great - individual records are now stored in each element of the array. But if you try
StringTokenizer st = new StringTokenizer(str, "xxx_foo_xxx");
you will get a tokenizer that considers any single 'x', '_', 'f', or 'o' to be a delimiter. Not what was wanted in this case. There's really no way to get StringTokenizer to look at larger patterns as delimiters - it only thinks in terms of single chars.
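To make the difference concrete, here is a small sketch that runs both calls on the same input. The class name DelimiterDemo, the helper tokenize(), and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterDemo {
    /** Collects StringTokenizer tokens; every char in delims is its own delimiter. */
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) tokens.add(st.nextToken());
        return tokens;
    }

    public static void main(String[] args) {
        String str = "fox 1xxx_foo_xxxfox 2";   // two records: "fox 1" and "fox 2"

        // split() treats the whole argument as one pattern.
        System.out.println(Arrays.asList(str.split("xxx_foo_xxx")));  // [fox 1, fox 2]

        // StringTokenizer treats 'x', '_', 'f' and 'o' as separate delimiters,
        // so the "fox" inside each record is eaten as well.
        System.out.println(tokenize(str, "xxx_foo_xxx"));             // [ 1,  2]
    }
}
```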
Note that when using split(), you need to watch out for regex special characters, and escape them with \\ - e.g. for delimiter "[foo.bar]" use
String[] records = str.split("\\[foo\\.bar\\]");
Annoying, but it works.
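For readers on Java 5 or later (newer than this thread), Pattern.quote() can do the escaping, so the delimiter can be passed as a plain literal. A minimal sketch, with the class name QuoteDemo and the helper splitLiteral() invented for the example:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteDemo {
    /** Splits on the delimiter taken literally, with no regex interpretation. */
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String str = "a[foo.bar]b[foo.bar]c";
        System.out.println(Arrays.asList(splitLiteral(str, "[foo.bar]")));  // [a, b, c]
    }
}
```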
And none of this really addresses the problem of dealing with a line seven million characters long, if you don't want to clobber your memory usage for no good reason - that's going to take some more customized parsing such as I described in my previous post, I think.
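On Java 5 or later, java.util.Scanner is another option worth knowing about: it reads from a Reader incrementally and accepts a regex as its delimiter. It did not exist when this thread was written, so the following is a sketch of a later alternative, not of what anyone here proposed. The class name ScannerRecords, the method readRecords(), and the sample delimiters are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerRecords {
    /** Reads records from the source, using a regex as the record separator. */
    static List<String> readRecords(Reader source, String delimiterRegex) {
        List<String> records = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter(Pattern.compile(delimiterRegex));
            while (sc.hasNext()) records.add(sc.next());
        }
        return records;
    }

    public static void main(String[] args) {
        // Either "::" or ";;" ends a record in this made-up format.
        Reader in = new StringReader("alpha::beta;;gamma");
        System.out.println(readRecords(in, "::|;;"));  // [alpha, beta, gamma]
    }
}
```

In a real program the StringReader would be replaced by a FileReader or similar, and each record could be processed as it is read instead of being collected into a list.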
 
Michael Morris
Ranch Hand
Posts: 3451

The biggest problem with StringTokenizer is that it can't work with multi-character delimiters - which seems to be what Leslie is asking for.

Guess I didn't read the post completely, huh? It's nice to have regexes in Java now. I never have to (guess the expletive here) with PERL again! The dudes who came up with that language must've been smoking crack. It's been several years since we were all Solaris around here and I'm having to recall how to construct regexes. I used to amaze myself at what I could do with grep, sed and awk!
Michael Morris
 
Leslie Chaim
Ranch Hand
Posts: 336
Yes Jim, you have read my post completely. Ideally, I would like the ability to use a regex. For my particular study (which I will show later), I merely split on one character.
However, I would like to understand what Mike is suggesting. Yes, I am using a Reader (a BufferedReader). How would a StringTokenizer help me? Once I said:
line = rdr.readLine();
I have already hogged the memory, no? Is there a way to "hook up" a StringTokenizer to a stream? Please explain.
BTW, congratulations Mike on your recent promotion to bartender status.
Now whatever follows should probably be in the Object Oriented Scripting forum but we all get sidetracked just like Michael did. Furthermore, I am still left with some questions from my original topic.
Warning! War zone ahead!
Recently, David Weitzman asked if I want to revive an old language war. David also mentioned the author of a certain other Java certification who has been known to talk smack about Perl. I have read that entire post (which BTW is no MD to me, but back then I was not a rancher and I believe there was no Scripting forum yet). There was one thing missing from the post: real examples!
Michael, there is a Perl thingy behind here, and I have never seen crack. I was just trying to do something in Java (practicing on regex) that I did in one line of Perl:

Compare that with:

not to mention the compile-and-run phase.
BTW, Jim my HP box has no problems with loading 7 million bytes in memory.
Originally posted by Michael Morris:

Guess I didn't read the post completely, huh? It's nice to have regexes in Java now. I never have to (guess the expletive here) with PERL again! The dudes who came up with that language must've been smoking crack. It's been several years since we were all Solaris around here and I'm having to recall how to construct regexes. I used to amaze myself at what I could do with grep, sed and awk!

Sure, that Perl one-liner is crammed and it takes time to master. I repeat, it takes time to master. One more time: it takes time to master, it takes time to master.
Perhaps I should have said it once more: it takes time to master. Period.
Nevertheless, I hope you can appreciate the power of it! Compare it to the 28 lines in Java. (Oh, only 24 if you follow the JLS guide.) No, these folks were not smoking crack, you are just too lazy to learn the thing!!! Why don't you do what I did:
  • Read Learning Perl twice.
  • Read Programming Perl 2½ times.
  • Read Mastering Regular Expressions 4 times.
  • Read the chapter on Perl from Mastering Regular Expressions 6 times. BTW, there is also a chapter on Java, the most comprehensive you will ever get. It's really marvelous how Jeff handles the details without being a bore.
  • Subscribe to The Perl Journal and read all of its archives.
  • Do all the above with one question in mind: I know that Perl is supposed to be great at text processing, how can I master this chore? Then get into some Object Oriented Perl. (Unfortunately, I have not read that book. Just like everybody else, I jumped on the Java bandwagon, and hey, I do not regret it; it's just that there is something special about the Perl way.)

Finally, if you have done all of the above and you still don't like it, that's when we can square off into a real battle, but not before you have done the above.
Again, I hated Perl just as you did. Luckily, I did not give in, since I had this burning question in my mind, and now I have come to absolutely love it. I will argue with anyone: Perl's text processing is second-to-none. Perhaps, Michael, you should first start by learning Perl's culture.
     
    Michael Morris
    Ranch Hand
    Posts: 3451
    Ok Leslie,
    I will try not to disparage PERL again, while you are around. My problem with languages like PERL is that I am a purist at heart. Why should you have 37 ways to do the same damn thing? In my humble opinion there is something inherently wrong with a language that can cause a stack overflow just by getting one character out of order in an argument. I'll give you that PERL's regexes are powerful but few ever totally grasp the full breadth of them anyway and it is too easy to do something you never intended. Everybody can learn how to shoot a rifle, but few ever master the use of tactical nuclear artillery or ever need to.
    Michael Morris
     
    Leslie Chaim
    Ranch Hand
    Posts: 336

    Can we get back to the topic please?
    I am just learning here (and having fun).
    Thanks
     
    Jim Yingst
    Wanderer
    Sheriff
    Posts: 18671
    Hey, YOU started the Perl war here. Didn't even have the courtesy to post it as a follow-up to one of the previous Perl discussions. Now we've got a distributed Perl war on our hands, and it's all your fault. Thanks a lot.
    What the hell, let's blame David Weitzman too. OK, now it's both your faults.

    BTW, Jim my HP box has no problems with loading 7 million bytes in memory.
    If you don't object to hogging the memory, then this problem is reasonably simple:

    Aside from memory used, the only problem I see is if newlines are allowed within records - the readLine() will break on the newline. You can get around this by reading everything into memory. (How's your HP box holding up now?)
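The memory-hungry approach discussed here can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the code from the post: the class name LineSplitter, the method readRecords(), and the "|" separator are all made up. It reads each line whole with readLine() and then splits it, which also shows the newline caveat: a record containing a newline is cut in two.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineSplitter {
    /** Loads each line fully into memory, then splits it on the separator regex. */
    static List<String> readRecords(Reader source, String separatorRegex) {
        List<String> records = new ArrayList<>();
        try (BufferedReader rdr = new BufferedReader(source)) {
            String line;
            while ((line = rdr.readLine()) != null) {   // the whole line is in memory here
                for (String record : line.split(separatorRegex)) {
                    records.add(record);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // The newline after "c" ends the line, so it also ends that record.
        Reader in = new StringReader("a|b|c\nd|e");
        System.out.println(readRecords(in, "\\|"));  // [a, b, c, d, e]
    }
}
```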

    I may yet get around to putting together something that doesn't require the whole line to be read into memory. But for now, gotta move on to other stuff. Cheers...
    [ February 25, 2003: Message edited by: Jim Yingst ]
     
    David Weitzman
    Ranch Hand
    Posts: 1365

    What the hell, let's blame David Weitzman too

    I don't start language wars. I just perpetuate them.
     
    Leslie Chaim
    Ranch Hand
    Posts: 336
    Originally posted by David Weitzman:

    I don't start language wars. I just perpetuate them.

    And I did mention, David, that you asked me if I want to revive an old language war. BTW, I do have something to say on that other thread, but I need time.
    Cheers,
    Leslie
     