
java.util.regex and parsing repetitions

 
Andreas Schildbach
Ranch Hand
Posts: 34
Hello,
I'd like to parse ACN chess history files, which look like:
1. e4 e5 2. Nf3 {this is a comment} Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1
Since the format is not too complicated, I am trying to avoid a fully fledged parser such as an ANTLR-generated one.
However, StringTokenizer is not enough because of the comments (which can contain spaces).
I was hoping to be able to use a regular expression (java.util.regex).
I was starting to formulate:
((\\d+\\.) (\\S+) (\\S+) )*
The problem is that if a pattern loops (using + or *) and the loop contains capture groups, I can only read the last matches of the capture groups. In this case, I can only read the last (\\d+\\.) (\\S+) (\\S+) match.
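For illustration, a minimal sketch of the behavior described above. The class and method names are hypothetical, and the input is a shortened, comment-free version of the sample with a trailing space so that the pattern matches as written:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastGroupDemo {
    // Matches the whole text with the looping pattern and returns what
    // groups 2-4 hold afterwards (hypothetical helper for illustration).
    static String lastGroups(String text) {
        Matcher m = Pattern.compile("((\\d+\\.) (\\S+) (\\S+) )*").matcher(text);
        if (!m.matches()) {
            return null;
        }
        // Only the values from the final iteration of the loop survive.
        return m.group(2) + " " + m.group(3) + " " + m.group(4);
    }

    public static void main(String[] args) {
        // Move 1 is matched, but its captures are overwritten by move 2.
        System.out.println(lastGroups("1. e4 e5 2. Nf3 Nc6 "));
    }
}
```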
Is it possible somehow to get all capture groups (maybe as a series of events/method calls)?
How would you parse such a format?
Regards,
Andreas
 
Dmitry Melnik
Ranch Hand
Posts: 328
Is your task just to check the syntax, or to interpret the data somehow?
 
Andreas Schildbach
Ranch Hand
Posts: 34
I want to interpret the data, not only check the syntax.
 
Dmitry Melnik
Ranch Hand
Posts: 328
I would write my own tokenizers: one for chopping the sequence into individual moves (and handling comments correctly), another for enumerating the terminal tokens (including comments, in case you need them). Then, using regexps, I'd select the appropriate parser/interpreter for each terminal being parsed.
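A rough sketch of what such a hand-written tokenizer might look like, collapsed into a single pass for brevity. The class name, the method names, and the three token categories are assumptions for illustration. It treats everything between { and } as one token and otherwise splits on whitespace:

```java
import java.util.ArrayList;
import java.util.List;

class AcnTokenizer {
    // Splits the text on whitespace, but keeps a {...} comment as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '{') {
                int close = text.indexOf('}', i);
                // Assumption: an unterminated comment runs to the end of the text.
                i = (close < 0) ? n : close + 1;
            } else {
                while (i < n && !Character.isWhitespace(text.charAt(i))
                        && text.charAt(i) != '{') {
                    i++;
                }
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    // Uses simple regexps to decide which kind of terminal a token is.
    static String classify(String token) {
        if (token.matches("\\d+\\.")) {
            return "MOVE_NUMBER";
        }
        if (token.startsWith("{")) {
            return "COMMENT";
        }
        return "MOVE";
    }

    public static void main(String[] args) {
        for (String t : tokenize("1. e4 e5 2. Nf3 {this is a comment} Nc6")) {
            System.out.println(classify(t) + " " + t);
        }
    }
}
```

The classify step is where a dedicated parser or interpreter per terminal type could be plugged in.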
Do you need any help with implementation?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The problem is that if a pattern loops (using + or *) and the loop contains capture groups, I can only read the last matches of the capture groups.
Right. If you write an expression that tries to match the entire file at once, you won't be able to retrieve info from most of the individual moves, because that outermost * means groups 1, 2, 3 will take many different values during the course of a single find() evaluation, and you will only be able to recover data from the last move. No good. Instead, try to write a regex which matches exactly one move. Then write a while loop which applies that pattern repeatedly to find all moves:
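A minimal sketch of such a loop. The pattern, the class name, and the method name are assumptions for illustration, based only on the sample line in the question, and are not a complete grammar for the format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MoveFinder {
    // Hypothetical pattern for ONE numbered move, derived from the sample only.
    private static final Pattern MOVE = Pattern.compile(
        "(\\d+)\\.\\s+(\\S+)"                  // group 1: move number, group 2: White's move
        + "(?:\\s+\\{([^}]*)\\})?"             // group 3: optional comment after White's move
        + "(?:\\s+(?!\\d+\\.)([^\\s{]\\S*))?"  // group 4: optional Black's move (not the next move number)
        + "(?:\\s+\\{([^}]*)\\})?");           // group 5: optional comment after Black's move

    // Applies the one-move pattern repeatedly; each find() call yields the
    // groups for a single move, so nothing is overwritten by later moves.
    static List<String> findMoves(String game) {
        List<String> moves = new ArrayList<>();
        Matcher m = MOVE.matcher(game);
        while (m.find()) {
            String black = (m.group(4) == null) ? "-" : m.group(4);
            moves.add(m.group(1) + ": " + m.group(2) + " / " + black);
        }
        return moves;
    }

    public static void main(String[] args) {
        String game = "1. e4 e5 2. Nf3 {this is a comment} Nc6 3. Bb5 a6 "
                    + "4. Ba4 Nf6 5. O-O Be7 6. Re1";
        for (String move : findMoves(game)) {
            System.out.println(move);
        }
    }
}
```

Comments are available through groups 3 and 5 when present. Optional groups that did not take part in a match return null from group().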

You can modify the pattern as needed, but that's the basic idea.
BTW - how are the following indicated:
castling
en passant capture
check
checkmate
resign
I'm too lazy to look up the format myself, but I suspect at least some of these will be special cases for you to consider...
 