I have to write a class that parses programmatically generated HTML and puts items that logically belong in lists into lists. For example, if something starts with A. or 1. or (1) or (a) or a., assume the start of a list, and then try to pick up lists inside of lists.
The main difficulty I see is ending a list, since these lists sit inside larger documents. The class is going to have to make some best guesses. I'm just wondering whether something similar already exists and, if not, what's the best way to tackle the problem.
Without seeing a sample of the HTML you are trying to parse, my suggestions are limited, but off the top of my head you could consider the following. If your lists are stored in regular HTML structures (drop-down boxes, HTML lists, etc.), you could tokenise on the tags (the <option> and </option> tags in a select box, for example) and extract the strings between the two tokens for storage in your Java data structure. Alternatively, you could write a utility class that searches for the common list identifiers you gave as examples, using the String.indexOf() method to find the opening and closing brackets, and then extracts the rest of the string from that point onwards for storage.
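As a rough illustration of the tokenising idea, here is a minimal sketch that uses String.indexOf() to pull out the text between an opening and closing tag. The class name TagTokenizer and the method between are made up for this example. It assumes lowercase tag names and that tags of the same name are not nested inside each other, which may not hold for your documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class TagTokenizer {

    // Returns the trimmed text between each <tag ...> and the next </tag>.
    // Assumption: lowercase tag names, no nesting of the same tag.
    static List<String> between(String html, String tag) {
        List<String> out = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) break;
            int afterName = start + open.length();
            if (afterName >= html.length()) break;
            char c = html.charAt(afterName);
            // Skip longer tag names that merely share the prefix, e.g. <pre> when looking for <p>.
            if (c != '>' && !Character.isWhitespace(c)) {
                pos = afterName;
                continue;
            }
            int openEnd = html.indexOf('>', start);      // end of the opening tag
            if (openEnd < 0) break;
            int end = html.indexOf(close, openEnd + 1);  // start of the closing tag
            if (end < 0) break;
            out.add(html.substring(openEnd + 1, end).trim());
            pos = end + close.length();
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>A. First</p><div><p class=\"x\">B. Second</p></div>";
        System.out.println(between(html, "p")); // prints [A. First, B. Second]
    }
}
```

Any markup inside the extracted text (a span inside a p, say) is left in place, so it would still need stripping before you look for list markers.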
Unfortunately, the HTML doesn't include things like drop-down boxes. Most of it is wrapped in p, span, and div tags (and table tags). I was thinking along the same lines: using indexOf to find the character sequences that start a list.
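For spotting the list starts, a regular expression may be easier to maintain than a chain of indexOf calls. The sketch below is only one possible approach: the class ListMarker and the method styleOf are invented for this example, and it recognises just the five marker forms mentioned above (A. a. 1. (1) (a)). It will happily misfire on text such as "A. Smith said", so treat its result as a guess.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ListMarker {

    // Either "(x)" or "x." at the start of the text, followed by whitespace,
    // where x is a single letter or a run of digits.
    private static final Pattern MARKER =
        Pattern.compile("^\\s*(?:\\(([A-Za-z]|\\d+)\\)|([A-Za-z]|\\d+)\\.)\\s+");

    // Returns a style key: "A.", "a.", "1.", "(A)", "(a)" or "(1)"; null if no marker is found.
    static String styleOf(String text) {
        Matcher m = MARKER.matcher(text);
        if (!m.find()) return null;
        boolean paren = m.group(1) != null;
        String label = paren ? m.group(1) : m.group(2);
        char c = label.charAt(0);
        String kind = Character.isDigit(c) ? "1" : Character.isUpperCase(c) ? "A" : "a";
        return paren ? "(" + kind + ")" : kind + ".";
    }

    public static void main(String[] args) {
        System.out.println(styleOf("A. First item"));  // prints A.
        System.out.println(styleOf("(12) Twelfth"));   // prints (1)
        System.out.println(styleOf("Plain text"));     // prints null
    }
}
```

Grouping markers by style rather than by exact label means "1." and "2." count as the same list, which is what you need later when deciding whether an item continues a list or starts a new one.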
Well, tokenising might still be worth considering. You can tokenise at any level and use, for example, the <p> and </p> tags to grab everything inside a paragraph. The same goes for the <table> and </table> tags. This might cut out a lot of the fluff before you get down to the dirty job of parsing the strings with indexOf. It could also help if you have nested lists: you are liable to run into some processing overhead if your lists are nested fairly deeply, especially if you use recursion to dig down into them automatically.
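If recursion is a worry, nesting can also be tracked iteratively with a stack of the marker styles currently open. The following is a sketch under stated assumptions, not a finished solution: ListNester, styleOf and depths are invented names, the input is assumed to be the paragraph texts already extracted from the tags, and the rule for ending a list (any paragraph without a marker closes every open list) is just one possible best guess.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ListNester {

    private static final Pattern MARKER =
        Pattern.compile("^\\s*(?:\\(([A-Za-z]|\\d+)\\)|([A-Za-z]|\\d+)\\.)\\s+");

    // Style key such as "A.", "1." or "(a)"; null when the text has no list marker.
    static String styleOf(String text) {
        Matcher m = MARKER.matcher(text);
        if (!m.find()) return null;
        boolean paren = m.group(1) != null;
        char c = (paren ? m.group(1) : m.group(2)).charAt(0);
        String kind = Character.isDigit(c) ? "1" : Character.isUpperCase(c) ? "A" : "a";
        return paren ? "(" + kind + ")" : kind + ".";
    }

    // Depth per paragraph: 0 = not a list item, 1 = outermost list, 2 = nested once, ...
    static List<Integer> depths(List<String> paragraphs) {
        Deque<String> open = new ArrayDeque<>();   // styles of the lists currently open
        List<Integer> result = new ArrayList<>();
        for (String p : paragraphs) {
            String style = styleOf(p);
            if (style == null) {
                open.clear();                      // guess: plain text ends all open lists
                result.add(0);
                continue;
            }
            if (open.contains(style)) {
                // Same style as an enclosing list: close the inner lists and continue it.
                while (!open.peek().equals(style)) open.pop();
            } else {
                open.push(style);                  // unseen style: start a nested list
            }
            result.add(open.size());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paras = Arrays.asList(
            "Intro", "A. one", "1. sub", "2. sub", "(a) deep", "B. two", "Outro");
        System.out.println(depths(paras)); // prints [0, 1, 2, 2, 3, 1, 0]
    }
}
```

The stack only ever holds one entry per nesting level, so deep nesting costs very little here. The weak point is the ending rule, which you would probably need to tune against real documents.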