Parsing text

 
Layne Lund
Ranch Hand
Posts: 3061
Okay, maybe this isn't an advanced question, but I'm not quite sure where to put it. In fact, my questions don't have to do with Java directly, but more on that later.

I'm working on a project that reads in a decent-sized RTF file and parses the content for information. I already figured out how to use the javax.swing.text package (and related subpackages) to obtain the content as a plain String. Now I'm trying to figure out how to parse this String to obtain the information that I want. Basically, the text contains multiple records that are all in a fairly uniform format:


Let me describe my meta-language before I go any further:
< > encloses a description of the text
[ ] encloses optional text
... means that the pattern can repeat

My first thought is to write a lexer and parser similar to what I learned about in my compilers class. The above description would work fairly well as a grammar, I think. If I need to, I can change it into BNF even. My first question is would it be worth my time to download JavaCC or something similar to help write the lexer and parser? Ultimately, the program I am writing should be able to parse multiple documents with varying formats. Does JavaCC support multiple grammars?

Whether I use a tool like JavaCC or roll my own parser, there are a few complications:

1) In some situations, newlines have some significance. This is especially true for addresses. The only way I know to tell where the City, State ZIP line starts is by looking for a newline character. Usually a parser ignores whitespace, though, so I'm not sure how to deal with this. I would like to obtain the first two address lines, city, state, and zip separately, but I'm not entirely sure how. The optional second line in the address also complicates things. Does anyone have a suggestion here?

2) Should I view labels like "Debtor Address:" as a single token or as two tokens? I'm not sure which would be best/easiest to implement. It might not matter if I'm using a compiler-compiler, but I would like to know if anyone has suggestions here.
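One way to keep the newlines meaningful for point 1 is to split the address block into lines first and only then look inside each line. The following is a minimal sketch, not a definitive answer: the class name AddressSketch, the method parseAddress, and the assumed "City, ST 12345" shape of the last line are all hypothetical, since the actual record format is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressSketch {
    // Assumed shape of the last line: "City, ST 12345" or "City, ST 12345-6789".
    private static final Pattern CITY_STATE_ZIP =
            Pattern.compile("^(.+?),\\s*([A-Z]{2})\\s+(\\d{5}(?:-\\d{4})?)$");

    /**
     * Returns {line1, line2, city, state, zip}; line2 is "" when the optional
     * second address line is absent. Returns null if the block does not fit
     * the assumed two- or three-line layout.
     */
    static String[] parseAddress(String block) {
        List<String> lines = new ArrayList<>();
        for (String raw : block.split("\\R")) {   // \R matches any line break
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        if (lines.size() < 2 || lines.size() > 3) {
            return null;
        }
        // The last line is the one the newline tells us is "City, State ZIP".
        Matcher m = CITY_STATE_ZIP.matcher(lines.get(lines.size() - 1));
        if (!m.matches()) {
            return null;
        }
        String line2 = lines.size() == 3 ? lines.get(1) : "";
        return new String[] { lines.get(0), line2, m.group(1), m.group(2), m.group(3) };
    }
}
```

Because the line count decides whether the second address line is present, the optional line needs no special grammar rule in this sketch.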

I will greatly appreciate any input anyone has.

Thanks,

Layne
 
Stan James
(instanceof Sidekick)
Posts: 8791
I learned text parsing (and programming) in a line-oriented world so I'd be tempted to read a line at a time, test for the easy tags like "Creditor:" or "Creditor Address:" and rely on position in a sequence for the others like the first few lines. That would probably be lots of code and fragile relative to changes or variations in format. I'll look forward to more sophisticated ideas!
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
I think I would read the whole thing into a String[] - then locate the groups of 1 or more lines that go with a particular marker such as "Debtor Address:" and send them to a method specific to each data type. In other words, separate the tasks of locating the data from interpreting the data.
The specific methods would not have to worry about detecting the end of the data type.
Bill
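The "locate first, interpret later" idea could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the labels are the ones mentioned in this thread, and the code handles a single record (a repeated label would overwrite the earlier block).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlockLocatorSketch {
    // Labels mentioned in this thread; a real file likely has more.
    static final String[] MARKERS = { "Creditor:", "Creditor Address:", "Debtor Address:" };

    /** Groups the lines that follow each marker, without interpreting them. */
    static Map<String, List<String>> locateBlocks(String[] lines) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null;   // lines before the first marker are ignored
        for (String raw : lines) {
            String line = raw.trim();
            String marker = markerOf(line);
            if (marker != null) {
                current = new ArrayList<>();
                blocks.put(marker, current);
                // Keep any data that shares the line with the label.
                line = line.substring(marker.length()).trim();
            }
            if (current != null && !line.isEmpty()) {
                current.add(line);
            }
        }
        return blocks;
    }

    private static String markerOf(String line) {
        for (String marker : MARKERS) {
            if (line.startsWith(marker)) {
                return marker;
            }
        }
        return null;
    }
}
```

Each list in the returned map can then be handed to a method written for that data type, so those methods never have to detect where their own data ends.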
 
Layne Lund
Ranch Hand
Posts: 3061

Originally posted by William Brogden:
I think I would read the whole thing into a String[] - then locate the groups of 1 or more lines that go with a particular marker such as "Debtor Address:" and send them to a method specific to each data type. In other words, separate the tasks of locating the data from interpreting the data.
The specific methods would not have to worry about detecting the end of the data type.
Bill



That's actually very similar to what I've decided to try next. At the moment, I'm putting all the text into a single String rather than using a String[]. Then I'm trying to use regular expressions and the java.util.regex package to locate the data. I plan on writing individual methods to parse the individual blocks of data from there.

But now I'm running into trouble with the regular expressions stuff. I am starting by just trying to match the "<###> OF <###> DOCUMENTS" that occurs at the beginning of each record. Here is the method that does most of the work:

The readRTFFile() method uses the javax.swing.text.rtf package to read the RTF file and return the contents as a String. I could easily split() this into a String[] if I want to. This method is actually in an abstract base class because there are at least two slightly different file formats. (This might end up being unnecessary because I may be able to deal with the slight differences if I write my regular expression just the right way. But I'll figure that out later after I figure out what's wrong with my current regex.) At the moment, the subclass that I'm writing for testing returns the following regular expression:

The two SOPs before the while loop print out (including the regex as I expect it). However, m.find() must be returning false because the SOP at the beginning of the loop doesn't print. This is where I'm stumped. I also printed the content String and it looks fine. So why am I not getting a match for my regex? Any ideas?

Thanks for your time to read my questions. I sure hope someone has some suggestions that can help me fix this latest problem.

Regards,

Layne
 
Layne Lund
Ranch Hand
Posts: 3061
Okay, I'm just retarded. After some experimenting, I found that the document contains the word "of" in the first line, but my regex was looking for "OF" instead. Once I fixed that, I was able to incrementally build a regex that matches the whole record!
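For anyone hitting the same "OF" versus "of" mismatch, a case-insensitive pattern avoids depending on the exact capitalization. This is a minimal sketch, not the regex actually used in the project: the class name, the method findHeaders, and the assumption that the header sits on a line of its own are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordHeaderSketch {
    // Matches lines such as "3 of 25 DOCUMENTS" regardless of letter case.
    // MULTILINE makes ^ and $ match at line boundaries inside the content.
    static final Pattern HEADER = Pattern.compile(
            "^[ \\t]*(\\d+)[ \\t]+of[ \\t]+(\\d+)[ \\t]+documents[ \\t]*$",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Returns one {index, total} pair per record header found in the content. */
    static List<int[]> findHeaders(String content) {
        List<int[]> found = new ArrayList<>();
        Matcher m = HEADER.matcher(content);
        while (m.find()) {
            found.add(new int[] {
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2))
            });
        }
        return found;
    }
}
```

m.start() of each match could also be recorded to cut the content into one substring per record before parsing the individual blocks.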

Thanks for your comments Stan and William. Even though I didn't use them directly, they helped me start thinking about other ways to do it.

Regards,

Layne
 
Stan James
(instanceof Sidekick)
Posts: 8791
For grins, look at how Fitnesse parses Wiki markup. I'll see if I can say this so it makes sense:

I copied this scheme for my Wiki and it works pretty slick. Since each REGEX finds the smallest possible match it handles nested tags nicely from the inside out. You're not replacing (tho you could delete text) so this might not work, but it's worth an evening to read Fitnesse no matter what.
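To illustrate the inside-out idea in general terms, here is a rough sketch. It is not Fitnesse's code and not the scheme from my Wiki: the bracket markup, the class name, and the HTML-like output are invented for the example. The only point is that a pattern whose body cannot contain another tag will always match the innermost tag first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsideOutSketch {
    // Hypothetical markup: "[b text]" for bold, "[i text]" for italic.
    // The body excludes brackets, so only tags with no nested tag can match.
    private static final Pattern INNERMOST =
            Pattern.compile("\\[(b|i) ([^\\[\\]]*)\\]");

    /** Repeatedly replaces innermost tags until none are left. */
    static String render(String markup) {
        String text = markup;
        while (true) {
            Matcher m = INNERMOST.matcher(text);
            if (!m.find()) {
                return text;
            }
            // Each pass removes one nesting level; outer tags match next time.
            text = m.replaceAll("<$1>$2</$1>");
        }
    }
}
```

For example, render("[b bold [i both] more]") first rewrites the inner italic tag and then the outer bold tag on the next pass.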
 