Greetings. I'm probably going to have to eventually end up hiring somebody to do this, but I thought I would try and throw it around here first to see if anybody would like to tackle it. I'm a complete Java greenhorn/newbie, so I'm in no position to try and do this on my own yet. I've tried going about it in PHP, but I'm still coming up short.

I have a MySQL database that I'm trying to update with product availability information. The only way I can receive this information is through an HTML form that is filled out on my wholesaler's website. The information I receive is in HTML table format. I try to check this information on a nightly basis, but updating 2000+ products every night by hand, one by one, is virtually impossible.

I'm trying to parse the information out of the HTML "source" that I get back into a comma-delimited file so I can easily update the database. I only need the information in the first <TD> field (product ID) and the eighth <TD> field (YES or NO availability). If I could get this information into a text file formatted as:

12345, yes
23456, no
34567, no
45678, yes
etc...

it would make updating my product database 1000 times easier. Here is an example of the source HTML I receive back when trying to get the status:
This is what I get, except it is about 2000 table rows of this. There is no deviation in the table format. There are always 10 <TD>s per table row. The line spacing and everything is always exactly the same in every <TR>. I would need to trim off the extra blank spaces at the end of the number between the <TD> and </TD>. I know it is not a question of whether or not it can be done, but more a question of whether or not I can find someone willing to help. Since I'm only 5-6 weeks into my Java journey, I know I would drown if I tried to swim at this alone. I really do appreciate any help/assistance you guys offer. I will forever be in your debt (and maybe learn a thing or two also). Thank you kindly, -Brock
Nonsense. There are tons of HTML parsers out there without having to get involved with Swing. In fact, for something this simple, I'd venture to say that a few moderately clever regular expressions could do the trick. hth, bear
Perl seems like a great choice for this. I would suggest running a Perl script, and Java could work on the output. (When you get a little more advanced, you could have the Java program run the Perl program.) Another option is Jython, which is a combination of Java and Python. I don't know Python very well, but my friend loves it (although he is now a Jython fan :-) and from what he's told me, it may be a good option. Supposedly it can run your Java code as well as do the parsing. --Mark
Re: Bear's comment - don't be put off too much by the "swing" part of the package name. You can use javax.swing.text.html.parser.ParserDelegator to get a SAX-like parser that runs independently of any GUI stuff. It's not too bad to work with if you've used SAX, but if not, beware - it's very poorly documented. I gather it's mostly there for internal use by other Swing components, so they didn't feel the need to explain it to anyone else. But it works just fine once you get past the documentation issue. However, it does seem that regular expressions would be much simpler. And as it happens I'm in a mood to play some with the regex package in JDK 1.4, so what the heck...
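A minimal sketch of the regex approach described above, not necessarily the code as originally posted. It is written for a current JDK rather than 1.4, and it assumes the format described in the first post: each <TR> holds exactly 10 <TD>...</TD> cells, with the product ID in the first and YES/NO in the eighth. The class name AvailabilityExtractor, the file encoding, and the choice to skip (rather than abort on) a malformed row are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AvailabilityExtractor {
    // DOTALL lets '.' span line breaks, since each row covers many lines.
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern ROW = Pattern.compile("<TR\\b[^>]*>(.*?)</TR>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<TD\\b[^>]*>(.*?)</TD>", FLAGS);

    /** Returns one "id, yes|no" line per well-formed table row. */
    static List<String> extract(String html) {
        List<String> lines = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            // Collect the trimmed text of every cell in this row.
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                cells.add(cell.group(1).trim());
            }
            // Anything other than 10 cells means the format is not what we expect.
            if (cells.size() != 10) {
                System.err.println("Skipping row with " + cells.size() + " cells");
                continue;
            }
            lines.add(cells.get(0) + ", " + cells.get(7).toLowerCase(Locale.ROOT));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the saved HTML is passed as the first argument, in Latin-1.
        String html = new String(Files.readAllBytes(Paths.get(args[0])),
                                 StandardCharsets.ISO_8859_1);
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

Two small patterns (one for rows, one for cells) are used instead of a single large one, which keeps each piece easy to test on its own.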
It's tempting to do the whole thing with one big pattern, but that would probably be much harder to debug. For my purposes it was easiest to put the sample HTML in a local file and write results to standard output; adapt as necessary. If you've got 2000 products in this format, the StringBuffer could take upwards of 2MB to store this info all at once. Probably not too big a problem, but this might not scale very well if the product count grows (or if other products have significantly more data). The algorithm could be adapted to split the incoming file into chunks of data - the catch is that any <TR> data left unclosed at the end of one chunk would have to be prepended to the start of the next chunk.

"I'm probably going to have to eventually end up hiring somebody to do this"

D'oh! Well, feel free to send money if it comes to that. [ August 19, 2002: Message edited by: Jim Yingst ]
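The chunking idea mentioned above could look roughly like the following. This is only a sketch under assumptions: the class name ChunkedRowReader and the chunkSize parameter are made up for illustration, and rows are assumed to be closed with </TR>. Complete rows are pulled out of a pending buffer after each read, and whatever is left (possibly an unclosed <TR>) stays in the buffer so the next chunk is appended to it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkedRowReader {
    private static final Pattern ROW = Pattern.compile(
        "<TR\\b[^>]*>.*?</TR>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Reads the input chunkSize characters at a time and returns each complete row. */
    static List<String> readRows(Reader in, int chunkSize) {
        List<String> rows = new ArrayList<>();
        StringBuilder pending = new StringBuilder(); // unprocessed tail of earlier chunks
        char[] chunk = new char[chunkSize];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                pending.append(chunk, 0, n);
                Matcher m = ROW.matcher(pending);
                int consumed = 0;
                while (m.find()) {
                    rows.add(m.group());
                    consumed = m.end();
                }
                // Drop everything up to the last complete row; an unclosed
                // <TR> at the end is kept and completed by the next chunk.
                pending.delete(0, consumed);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

In a real run each returned row would be handed straight to the cell-extraction step instead of being collected in a list, so memory use stays near the size of one chunk plus one row.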
Wow, you guys are amazing. I really appreciate the help. Thanks for the example, Jim. You whipped that thing out there pretty quick. I was thinking it was going to have to be done by counting <TD>s and outputting the first and eighth, kinda like what you did there, Jim, and then I realized something. I went through the whole 2000 product file last night and found there was no deviation in the number of lines between each. Once you start with the product number (<TD> 1), it is exactly 15 lines down to the availability (<TD> 8). Then if you go exactly 10 lines down again, you get back to the next product number and so on: 10, 15, 10, 15... I think your example there, Jim, will work brilliantly. I'm going to give it a go this afternoon when I get in from work. For the sake of my own knowledge (on that long, LONG road to Java mastery I'm walking down), in this situation, would it be better to count <TD>s and parse out the first and eighth, or count lines in the file and parse out the 10th and 15th? Thanks again for all your help. I'm in awe of all the support and knowledge you guys lend people in this forum.
"in this situation, would it be better to count <TD>s and parse out the first and eighth, or count lines in the file and parse out the 10th and 15th?"

Well, I'm thinking the <TD> solution is more robust. Is there any possibility that someone might somehow insert an extra line somewhere? E.g. in the freeform description part - probably there aren't supposed to be any newlines in there, and maybe other parts of the system actually guarantee this somehow. But if just one extra newline does get in somewhere, the count gets screwed up, and all subsequent records are hosed. Conversely, using the TD counting approach, (a) newlines don't really matter; (b) if something else does go wrong, you have a better chance of detecting it by checking things like whether there are exactly 10 TDs per TR, and thus being able to log an error; and (c) even if one TR is somehow badly screwed up, there's a good chance that will only affect the one record, and you can probably still parse the next TR all right.

The main reason I can imagine to maybe not use the TD counting is if you're not familiar with regular expressions. You don't want a solution that you couldn't modify if you need to. But regexes are very useful in general, so it's probably worth your time to study them if you haven't. I believe there is an upcoming JavaRanch newsletter series on them from Dirk Schreckmann which you can keep an eye out for. In the meantime you can see the javadoc on the java.util.regex classes, or google "java regex" (maybe "java regex tutorial"). Beware that there are also several other Java regex packages developed by other parties (generally before 1.4 came out). Several of these are quite good, and could work as well for you - just pay attention to which package a given article is talking about, so you're not caught unaware when something is different from java.util.regex.
Jim, First thing I did after getting home from work this afternoon was compile your example and test it out. It worked brilliantly! Thanks a million. Please tell me you have a PayPal email... I know I can't pay you what you are really worth, but I feel like I owe you something after that. At least let me buy ya dinner! Since I don't have telnet access to my MySQL server online (we have that dang phpMyAdmin tool), I went ahead and tweaked that last part so it would write out the SQL queries to a text file that I could upload to phpMyAdmin and have it automatically update the database.
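The tweak described above might be sketched like this. It is not Brock's actual code: the table name products, the columns available and product_id, the output file update.sql, and the assumption that product IDs are purely numeric are all hypothetical. Each "id, yes|no" line is validated before being turned into an UPDATE statement, so unexpected input fails loudly instead of ending up in the SQL file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SqlUpdateWriter {
    /** Turns a line such as "12345, yes" into one UPDATE statement. */
    static String toUpdate(String csvLine) {
        String[] parts = csvLine.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 'id, yes|no' but got: " + csvLine);
        }
        String id = parts[0].trim();
        String available = parts[1].trim().toLowerCase(Locale.ROOT);
        // Assumption: IDs are digits only, so they can be embedded without quoting.
        if (!id.matches("\\d+")) {
            throw new IllegalArgumentException("Unexpected product ID: " + id);
        }
        if (!available.equals("yes") && !available.equals("no")) {
            throw new IllegalArgumentException("Unexpected availability: " + available);
        }
        // Hypothetical table and column names.
        return "UPDATE products SET available = '" + available
             + "' WHERE product_id = " + id + ";";
    }

    public static void main(String[] args) throws IOException {
        List<String> statements = new ArrayList<>();
        for (String line : Arrays.asList("12345, yes", "23456, no")) {
            statements.add(toUpdate(line));
        }
        // Hypothetical output file, one statement per line.
        Files.write(Paths.get("update.sql"), statements);
    }
}
```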
There is no way I could have gotten to that point without your help though. I hope to be as good as you some day! Thanks again for the expert advice! -Brock [ August 19, 2002: Message edited by: Brock Barnes ]
I know this topic was resolved a while ago but I found it an interesting read and wanted to make one observation and ask one question.

OBSERVATION: Many people do not realize it, but </td> and </tr> are not required tags. The following table will work just fine...

<table>
<tr />
<td />Element 1
<td />Element 2
<td />and so forth...
<tr />
<td />Element 1
<td />Element 2
<td />and so forth...
</table>

Note the / in the tag. This is proper format for a tag with no ending tag. So <br> should be coded as <br />. As I understand it this is all W3C stuff. The only reason I bring this up is that the expression posted earlier would no longer work on input like this, and since you cannot control the other company's HTML you need to be aware of it.

QUESTION: For conversation alone, I must ask "Isn't this what XML is designed to do?" I don't know much about it but this scenario seemed ideal for XML.
Wait a minute, I'm trying to think of something clever to say...

Joel
Well, sure, </td> and </tr> are not required in traditional HTML. Nor is it a requirement that a table row have 10 cells, or that the first cell of a row is populated with a product ID, or the eighth with a yes or no indicating availability. Those are all characteristics of this problem, based on the sample HTML shown. Personally my feeling is that if there are any changes to the base HTML tag structure of the document, I'd want to show an error message, forcing a human to look at the new format and evaluate what to do about it. Note the check to see that there are exactly 10 cells, and an error if not. We could have happily parsed a row with 11 cells - but I wouldn't trust such a table under the circumstances. We weren't given documentation about what is and is not allowed in the HTML file here (according to the client who produces it, not W3C); we're just guessing based on examples. If any of our guesses are incorrect, I'd want to hear about it ASAP.

But this sort of thing depends on the situation I suppose. I can imagine a situation where we have some indication that the end tags may be missing where HTML rules allow it, but other structural features of the file will remain as seen. In this case I could modify the patterns:
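One way such modified patterns might look, as a sketch rather than the patterns actually posted. It relies on the assumption that TABLE, TR, and TD are the only tags present: a row then ends at </TR>, at the next <TR>, at </TABLE>, or at end of input, and a cell's text is simply everything up to the next "<". The class name LenientRowParser is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientRowParser {
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;

    // The lookahead ends a row without consuming the tag that terminates it,
    // so a following <TR> is still available to start the next match.
    private static final Pattern ROW = Pattern.compile(
        "<TR\\b[^>]*>(.*?)(?=</TR>|<TR\\b|</TABLE>|\\z)", FLAGS);

    // Cell text runs up to the next tag of any kind, so </TD> is optional.
    private static final Pattern CELL = Pattern.compile("<TD\\b[^>]*>([^<]*)", FLAGS);

    /** Returns the trimmed cell texts of every row, in document order. */
    static List<List<String>> parse(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                cells.add(cell.group(1).trim());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

The caller would still check that each row has exactly 10 cells before trusting it. Any other tag inside a cell would cut that cell's text short, which is the point where a real HTML parser becomes the better tool.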
(I added case insensitivity as well.) I assume here that the only tags potentially present are TABLE, TR, and TD. If we allow too many hypothetical situations here, the regex will become rather unmanageable and we should probably use (or even build) a more general HTML parser. But that would be a different problem...

"Isn't this what XML is designed to do?"

Ummm, maybe - depends what you mean. The file format shown is almost well-formed XML (as well as HTML), so we could use an XML parser instead if we wished - if one thing were fixed. There would need to be a single root element wrapping everything: <table> and </table> would do, or we could simply add <html> and </html> to the beginning and end, before passing the input on to the parser. I could envision a SAX parser-based solution that would be about the same complexity as the regex solution I provided. I suppose it's a question of whether the programmer is more comfortable with (or interested in learning about) XML parsing, or regular expressions. Either one could be applied here. However, note that XML parsing can be more brittle than HTML parsing. If your client suddenly decides to stop providing </TD> and </TR> tags as you suggested, you no longer have well-formed XML. I don't think there's any way to "fix" this within an XML-parsing-based solution. Not nearly as easily as I was able to modify the regex solution, anyway.

Alternately, perhaps you were suggesting that the source file should have been in [a more useful form of] XML rather than the unlabeled HTML which forced us to count cells. I'd agree that it might be nice if we could ask the client to give us something like this instead:
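The sample XML from the original post is not shown here, so the following is only an illustration of what a labeled format and a SAX-based reader for it could look like. The element names products, product, id, and available are hypothetical, as is the class name ProductXmlReader.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ProductXmlReader {
    /** Returns one "id, availability" line per product element. */
    static List<String> read(String xml) {
        List<String> lines = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String id;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if (qName.equals("id")) {
                    id = value;
                } else if (qName.equals("available")) {
                    lines.add(id + ", " + value);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Input that is not well-formed ends up here; there is no partial recovery.
            throw new IllegalStateException("Could not parse product XML", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<products>"
            + "<product><id>12345</id><available>yes</available></product>"
            + "<product><id>23456</id><available>no</available></product>"
            + "</products>";
        for (String line : read(xml)) {
            System.out.println(line);
        }
    }
}
```

With named elements the reader looks fields up by name instead of by position, which is what makes this kind of format less ambiguous.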
That would certainly be less ambiguous than the format provided. But that presupposes that we have some say in how the input is presented to us. Maybe we do. But how much work will it be to get our client to change the input format? And is there any real benefit, considering we seem to have a viable solution to the HTML parsing already? Maybe there is; maybe changing the format will also help solve some other problems elsewhere in the system. More likely though, it would just create extra work - if there are other systems depending on this input, we'd have to change them too. If we were present during the design (or redesign) of the systems in question, moving to XML might be a useful alternative. But I doubt that's the case now. [ August 27, 2002: Message edited by: Jim Yingst ]
Just as an aside, there are a few products available which take "loose" HTML as input, and somehow convert it to valid XML which may be processed using a regular XML parser. I have used JTidy in the past, and although it has a few quirks, it can be a real lifesaver.
Yeah Jim, using an XML source file was what I was thinking about. I realize that this is probably out of our control for this project, which is why I said "For conversation alone". I just wanted to make sure I recognized a legitimate XML scenario when I saw one. This is actually a pretty tough scenario to deal with, especially as time goes on: new developers, new methods, different styles... all the problems that come with maintaining code get amplified by the reliance on an already ambiguous standard (HTML). Good topic! Cheers!