
Getting tagged content (headings) from rich text files

 
marc weber
I have rich text files (Word docs saved as RTF) that are structured using heading styles for a table of contents. I need code to get the text that's tagged as the first "heading 1" in each file.

I've downloaded a description of RTF from wotsit.org, but haven't really dug into it yet.

I took a quick pass at some Java code that finds the second occurrence of the literal "s1\ql" (the first occurrence is in the definition of the heading style, and the second is where that style is actually applied), then finds the first left brace after it. That point usually marks the beginning of the first heading 1 text, and the end of the text is usually marked by the literal "\par". This works about 90% of the time, but I haven't found a consistent pattern in the remaining 10%.
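
For anyone following along, the search described above might look roughly like the sketch below. This is only an illustration of that description, not the actual code: the class and method names are made up, it assumes the whole file has already been read into a String, and it sticks to String.indexOf so it should compile on Java 1.3. The returned span may still contain RTF control words or braces.

```java
// Hypothetical sketch of the search described in the post; names are made up.
// Uses only String.indexOf/substring/trim, which are available in Java 1.3.
class FirstHeadingSketch {

    // Literals taken from the post (backslashes escaped for Java).
    private static final String STYLE_MARKER = "s1\\ql";
    private static final String PARAGRAPH_END = "\\par";

    // Returns the raw text between the first left brace after the second
    // "s1\ql" and the next "\par", or null if any marker is missing.
    static String firstHeading(String rtf) {
        // First occurrence: assumed to be the heading style definition.
        int definition = rtf.indexOf(STYLE_MARKER);
        if (definition < 0) {
            return null;
        }
        // Second occurrence: assumed to be where the style is applied.
        int application = rtf.indexOf(STYLE_MARKER, definition + STYLE_MARKER.length());
        if (application < 0) {
            return null;
        }
        // The heading text is assumed to start after the next left brace.
        int brace = rtf.indexOf('{', application);
        if (brace < 0) {
            return null;
        }
        // Caveat: "\par" is also a prefix of other control words such as
        // "\pard", so this match can stop earlier than intended.
        int end = rtf.indexOf(PARAGRAPH_END, brace);
        if (end < 0) {
            return null;
        }
        return rtf.substring(brace + 1, end).trim();
    }
}
```

Calling FirstHeadingSketch.firstHeading(contents) on a file's contents returns null when the markers aren't found, so the files that don't fit the pattern can at least be listed and inspected by hand.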

So if anyone has done this before, maybe you can offer some clues on how to work with headings in rich text.
[ May 14, 2007: Message edited by: marc weber ]
 
marc weber
I think I have a solution. It's designed for a rather specific need, but if anyone's interested, here's the quick and dirty logic. (Note: An additional requirement is that it must work using Java 1.3, since it will run as a Lotus Notes agent. So, among other things, java.util.regex Patterns, which were added in Java 1.4, can't be used.)
 