Okay, maybe this isn't an advanced question, but I'm not quite sure where to put it. In fact, my questions don't have to do with Java directly, but more on that later.
I'm working on a project that reads in a decent-sized RTF file and parses the content for information. I already figured out how to use the javax.swing.text package (and related subpackages) to obtain the content as a plain String. Now I'm trying to figure out how to parse this String to obtain the information that I want. Basically, the text contains multiple records that are all in a fairly uniform format:
Let me describe my meta-language before I go any further:
< > encloses a description of the text
[ ] encloses optional text
... means that the pattern can repeat
My first thought is to write a lexer and parser similar to what I learned about in my compilers class. I think the above description would work fairly well as a grammar, and I can even convert it to BNF if I need to. My first question: would it be worth my time to download JavaCC or something similar to help write the lexer and parser? Ultimately, the program I am writing should be able to parse multiple documents with varying formats. Does JavaCC support multiple grammars?
Whether I use a tool like JavaCC or roll my own parser, there are a few complications:
1) In some situations, newlines are significant. This is especially true for addresses. The only way I know to tell where the "City, State ZIP" line starts is by looking for a newline character. Usually a parser ignores whitespace, though, so I'm not sure how to deal with this. I would like to obtain the first two address lines, the city, the state, and the ZIP separately, but I'm not entirely sure how. The optional second line in the address also complicates things. Does anyone have a suggestion here?
2) Should I view labels like "Debtor Address:" as a single token or as two tokens? I'm not sure which would be best/easiest to implement. It might not matter if I'm using a compiler-compiler, but I would like to know if anyone has suggestions here.
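To make complication 1 concrete, here is a rough sketch of the hand-rolled approach I have in mind if I don't use a generator. It skips tokenizing entirely and treats the address as a short block of lines, which is only one possible way to keep newlines significant. The names AddressSketch, Address, and parse are made up for illustration, and the regular expression assumes the last line always looks like "City, ST ZIP" with a two-letter state and a 5-digit or ZIP+4 code, which may not hold for every document I need to handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the class name, field names, and address layout are assumptions.
class AddressSketch {

    // Assumed shape of the final address line, e.g. "Springfield, IL 62704" or "Austin, TX 78701-1234".
    private static final Pattern CITY_STATE_ZIP =
            Pattern.compile("^(.+?),\\s*([A-Z]{2})\\s+(\\d{5}(?:-\\d{4})?)$");

    static final class Address {
        final String line1;
        final String line2; // null when the optional second line is absent
        final String city;
        final String state;
        final String zip;

        Address(String line1, String line2, String city, String state, String zip) {
            this.line1 = line1;
            this.line2 = line2;
            this.city = city;
            this.state = state;
            this.zip = zip;
        }
    }

    // Expects only the text that follows a label such as "Debtor Address:",
    // with the address lines separated by \n or \r\n and no blank lines in between.
    static Address parse(String block) {
        String[] lines = block.trim().split("\\r?\\n");
        if (lines.length < 2 || lines.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 address lines, got " + lines.length);
        }

        // The newline tells us which line is last; that line must be "City, ST ZIP".
        Matcher m = CITY_STATE_ZIP.matcher(lines[lines.length - 1].trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Last line is not in 'City, ST ZIP' form");
        }

        String line1 = lines[0].trim();
        String line2 = (lines.length == 3) ? lines[1].trim() : null;
        return new Address(line1, line2, m.group(1), m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        Address a = parse("123 Main St\nSuite 4\nSpringfield, IL 62704");
        System.out.println(a.line1 + " | " + a.line2 + " | " + a.city + " | " + a.state + " | " + a.zip);
    }
}
```

In this sketch the label is consumed as a single unit before parse is called, which is one answer to question 2, but I don't know whether that carries over cleanly to a generated lexer.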
I will greatly appreciate any input anyone has.
Thanks,
Layne