
Parsing html text.

 
Tad Dicks
Ranch Hand
Posts: 264
I have to write a class that parses some programmatically generated HTML input and puts items that logically belong in lists into lists. For example, if something starts with one of:
A. or 1. or (1) or (a) or a.

the class should assume the start of a list and then try to pick up lists nested inside other lists. The main difficulty I see is ending a list, since these lists sit inside larger documents, so the class is going to have to make some best guesses. I'm wondering whether something similar already exists and, if not, what the best way to tackle the problem is.
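
One possible way to frame the "best guess" for starting, nesting, and ending lists is a stack of marker styles. The sketch below is only an illustration under stated assumptions: the class name `ListNestingGuesser`, the helper `styleOf`, and the three rules (an unseen marker style opens a nested list, a style already on the stack returns to that outer list, a line without a marker ends every open list) are hypothetical choices, not an established algorithm. It also assumes the input has already been reduced to plain-text lines.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class ListNestingGuesser {
    /** Hypothetical classifier: returns a style label for the leading marker, or null if there is none. */
    static String styleOf(String line) {
        String s = line.trim();
        if (s.matches("\\(\\d+\\)\\s.*")) return "(1)";
        if (s.matches("\\([a-z]\\)\\s.*")) return "(a)";
        if (s.matches("\\d+\\.\\s.*")) return "1.";
        if (s.matches("[A-Z]\\.\\s.*")) return "A.";
        if (s.matches("[a-z]\\.\\s.*")) return "a.";
        return null;
    }

    /** Returns the guessed nesting depth of each line; 0 means "not inside a list". */
    static List<Integer> depths(List<String> lines) {
        Deque<String> open = new ArrayDeque<>(); // styles of the open lists, innermost on top
        List<Integer> result = new ArrayList<>();
        for (String line : lines) {
            String style = styleOf(line);
            if (style == null) {
                open.clear(); // guess: plain text ends every open list
            } else if (open.contains(style)) {
                // Style seen before: close inner lists until we are back at it.
                while (!open.peek().equals(style)) {
                    open.pop();
                }
            } else {
                open.push(style); // unseen style: guess that a nested list starts
            }
            result.add(open.size());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Intro", "A. one", "1. sub", "2. sub", "B. two", "Closing");
        System.out.println(depths(lines)); // prints [0, 1, 2, 2, 1, 0]
    }
}
```

The rule that any unmarked line ends all lists is deliberately crude; a list item that wraps over several paragraphs would be cut short, so a real implementation would likely need a more forgiving end-of-list guess.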

-Tad
 
Jody Brown
Ranch Hand
Posts: 43
Without seeing a sample of the HTML you are trying to parse, my suggestions are limited, but off the top of my head you could consider the following. If your lists are stored in regular HTML structures (dropdown boxes, HTML lists, etc.), you could tokenise on the tags (the <option> and </option> tags in a select box, for example) and extract the strings between the two tokens for storage in your Java data structure. Alternatively, you could write a utility class that searches for common list identifiers, such as the ones you gave as examples, using the String.indexOf() method to find the opening and closing brackets, and then extracts the rest of the string from that point onwards for storage.
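
As a rough illustration of the tokenising idea, the sketch below pulls out the text between each pair of opening and closing tokens using only `String.indexOf()` and `substring()`. The class name `TagTextExtractor` and the method `textBetween` are made up for this example, and it assumes the tags appear exactly as given (no attributes, same letter case), which may not hold for real input.

```java
import java.util.ArrayList;
import java.util.List;

class TagTextExtractor {
    /** Returns the trimmed text found between each open/close pair, in document order. */
    static List<String> textBetween(String html, String open, String close) {
        List<String> result = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(open, from);
            if (start < 0) {
                break; // no more opening tokens
            }
            int contentStart = start + open.length();
            int end = html.indexOf(close, contentStart);
            if (end < 0) {
                break; // unclosed tag: stop rather than guess
            }
            result.add(html.substring(contentStart, end).trim());
            from = end + close.length();
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<select><option>Red</option><option>Green</option></select>";
        System.out.println(textBetween(html, "<option>", "</option>")); // prints [Red, Green]
    }
}
```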

Hope this helps.
 
Tad Dicks
Ranch Hand
Posts: 264
Unfortunately the HTML doesn't include things like dropdown boxes. Most of it is wrapped in p, span, and div tags (and table tags). I was thinking along the same lines: using indexOf to find the character sequences that start a list.

-Tad
 
Jody Brown
Ranch Hand
Posts: 43
Well, tokenising might still be worth considering. You can tokenise at any level and use, for example, the <p> and </p> tags to grab everything inside a paragraph. The same goes for the <table> and </table> tags. This might cut out a lot of the fluff before you get down to the dirty job of parsing the strings using indexOf. It might also be useful if you have nested lists: you are liable to run into some processing overhead if your lists are nested fairly deeply, especially if you use something like recursion to dig down into them automatically.
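
A minimal sketch of paragraph-level tokenising might look like the following. It walks the input with `indexOf`, tolerates attributes on the opening `<p>` tag, and strips any inner tags such as `<span>` before returning the text. The names `ParagraphTokenizer` and `paragraphs` are hypothetical, and the code assumes lowercase tag names, no nested `<p>` elements, and no `>` characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphTokenizer {
    /** Returns the tag-free text of each <p>...</p> block, in document order. */
    static List<String> paragraphs(String html) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int open = html.indexOf("<p", from);
            if (open < 0) {
                break;
            }
            int afterName = open + 2;
            // Skip tags that merely start with "p", such as <pre> or <param>.
            if (afterName >= html.length()
                    || (html.charAt(afterName) != '>'
                        && !Character.isWhitespace(html.charAt(afterName)))) {
                from = afterName;
                continue;
            }
            int tagEnd = html.indexOf('>', afterName);
            int close = (tagEnd < 0) ? -1 : html.indexOf("</p>", tagEnd + 1);
            if (close < 0) {
                break; // malformed or unclosed paragraph: stop rather than guess
            }
            String inner = html.substring(tagEnd + 1, close);
            out.add(inner.replaceAll("<[^>]*>", "").trim()); // drop inner tags like <span>
            from = close + "</p>".length();
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div><p class=\"x\">A. <span>First</span> item</p>"
                    + "<pre>skip</pre><p>B. Second</p></div>";
        System.out.println(paragraphs(html)); // prints [A. First item, B. Second]
    }
}
```

Because this is a loop rather than a recursive descent, deeply nested markup does not add stack depth, though it also means nested structures are flattened instead of preserved.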
 
Tad Dicks
Ranch Hand
Posts: 264
I think I'm going to delve into using the Pattern/Matcher classes to do it. The span, p, and similar tags show up everywhere in the text, splitting things in some odd places.
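
For illustration, a Pattern/Matcher approach could strip the tags first and then test the remaining text for a leading marker, so that a marker split off by a span still matches. This is only a sketch: the class `ListMarkerMatcher`, the style labels it returns, and the exact regular expression are assumptions made for the example, covering just the marker forms mentioned earlier in the thread (A. 1. (1) (a) a.).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListMarkerMatcher {
    // Group 1: marker inside parentheses, e.g. (1) or (a).
    // Group 2: marker followed by a period, e.g. A. or 1. or a.
    // Group 3: the item text after the marker.
    private static final Pattern MARKER = Pattern.compile(
        "^\\s*(?:\\((\\d+|[A-Za-z])\\)|(\\d+|[A-Za-z])\\.)\\s+(.*)$",
        Pattern.DOTALL);

    /** Removes anything that looks like a tag, so markers split by tags become visible. */
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", "");
    }

    /** Returns a label such as "dot-upper" or "paren-digit", or null if no marker leads the text. */
    static String markerStyle(String fragment) {
        Matcher m = MARKER.matcher(stripTags(fragment));
        if (!m.matches()) {
            return null;
        }
        boolean paren = m.group(1) != null;
        String marker = paren ? m.group(1) : m.group(2);
        char c = marker.charAt(0);
        String kind;
        if (Character.isDigit(c)) {
            kind = "digit";
        } else if (Character.isUpperCase(c)) {
            kind = "upper";
        } else {
            kind = "lower";
        }
        return (paren ? "paren-" : "dot-") + kind;
    }

    public static void main(String[] args) {
        System.out.println(markerStyle("<span>A.</span> First item")); // prints dot-upper
        System.out.println(markerStyle("(1) Nested item"));            // prints paren-digit
        System.out.println(markerStyle("Plain text."));                // prints null
    }
}
```

A pattern this loose will also treat a sentence such as "I. think" as a list item, so it only gives a first guess that surrounding context would have to confirm.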


-Tad
 