I'm currently having a problem with a screen-scraping project. Here's my dilemma: my program executes perfectly and I can scrape all the HTML off of a website; however,
I'm attempting to somehow sort through the HTML and pull out JUST the game scores. I'm thinking that I need to use some type of String class method? Here's my program so far:
The website I'm attempting to scrape is http://www.scores.com
The HTML Tags that I want look something like this:
The above piece represents a single unit: team names, winner, games played and total. Notice that the team and winner cells are grouped using ids which differ in the starting letter (as1-ncaab-201111220287 and hs1-ncaab-201111220287). This way you can map a team to its winner.
When you read a line such as <td class="teams">UCLA Bruins</td>, you can do a string match for "<td class="teams">" and store UCLA Bruins as a team, and when you encounter a match for "<td class="teams winner">" with the corresponding id (hs1-ncaab-201111220287), you can store that team as the winner.
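A minimal sketch of that kind of string matching, in case it helps. The two marker strings are the ones quoted above; the class name LineClassifier, the Kind enum and the sample lines other than the UCLA one are made up for illustration, and the real page may carry extra attributes (such as the ids) that would need a looser match.

```java
import java.util.List;

// Sketch only: LineClassifier and Kind are hypothetical names, not part of any library.
class LineClassifier {
    enum Kind { TEAM, WINNER, OTHER }

    // Marker strings as quoted in the post above.
    static final String TEAM_MARKER = "<td class=\"teams\">";
    static final String WINNER_MARKER = "<td class=\"teams winner\">";

    // Decide what a line of HTML holds using plain substring matching.
    static Kind classify(String line) {
        if (line.contains(WINNER_MARKER)) {
            return Kind.WINNER;
        }
        if (line.contains(TEAM_MARKER)) {
            return Kind.TEAM;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<td class=\"teams\">UCLA Bruins</td>",
                "<td class=\"teams winner\">Hypothetical Winners</td>", // invented sample line
                "<td>42</td>");                                          // invented sample line
        for (String line : lines) {
            System.out.println(classify(line) + " : " + line);
        }
    }
}
```

Once a line is classified, pulling the team name out of it is a separate step, so the matching logic stays easy to test on its own.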
I think the learning curve on XSLT/Xpath may be a little steep, so if this is all you are going to do and HtmlUnit meets your needs, stick with it. If you plan on doing some more intensive scraping or maybe even do some scripting from page to page, etc then spending some time in the XSLT arena may be worth your time.
I like the way you are going with this, however, I am very unfamiliar with using the String class to search like you mentioned.
I've been trying to do some research and I came across Regular Expressions, but to no avail. Thanks again.
** I'm supposed to do this manually, without the extra HTML parser programs and what not. **
Anyway, the code below might help to extract the data between a start tag and an end tag, say the team name "UCLA Bruins" in the line <td class="teams">UCLA Bruins</td>. You have to do a String comparison to check which information is present in the line currently being parsed.
Suppose the following are the contents of the text file test.txt:
The code below will parse it and give you the information between the tags. Note that a simple regex is used.
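A rough sketch of a simple regex for this, assuming every cell opens and closes on the same line. The class name TagTextRegex and the pattern are illustrative choices, and the sample lines are fed in from a list rather than read from test.txt so the example stands alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: TagTextRegex is a hypothetical name; the pattern assumes one-line cells.
class TagTextRegex {
    // Matches <td ...>text</td>; group 1 captures the text between the tags.
    private static final Pattern TD_CELL = Pattern.compile("<td[^>]*>(.*?)</td>");

    // Returns the text of every td cell found on the given line.
    static List<String> cellTexts(String line) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = TD_CELL.matcher(line);
        while (matcher.find()) {
            texts.add(matcher.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<td class=\"teams\">UCLA Bruins</td>",
                "<td>3</td><td>71</td>"); // invented sample line
        for (String line : lines) {
            System.out.println(cellTexts(line));
        }
    }
}
```

The reluctant `(.*?)` stops at the first closing tag, which matters when several cells share a line.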
Sooner or later, when you have your solution, you will have to store the team information. Try to use a separate class for that purpose instead of multiple strings. Below is a sample information storage class (add getters/setters). Create different objects and store the information in them.
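One possible shape for such a storage class, as a sketch. The class name TeamResult and its field names are assumptions based on the data mentioned earlier in the thread (team name, winner, games played, total); adjust the types to whatever the page really holds.

```java
// Sketch only: TeamResult and its fields are hypothetical, chosen to mirror the data in the thread.
class TeamResult {
    private String teamName;
    private boolean winner;
    private int gamesPlayed;
    private int total;

    public String getTeamName() { return teamName; }
    public void setTeamName(String teamName) { this.teamName = teamName; }

    public boolean isWinner() { return winner; }
    public void setWinner(boolean winner) { this.winner = winner; }

    public int getGamesPlayed() { return gamesPlayed; }
    public void setGamesPlayed(int gamesPlayed) { this.gamesPlayed = gamesPlayed; }

    public int getTotal() { return total; }
    public void setTotal(int total) { this.total = total; }

    @Override
    public String toString() {
        return teamName + (winner ? " (winner)" : "")
                + ", games played: " + gamesPlayed + ", total: " + total;
    }

    public static void main(String[] args) {
        // One object per team; the numbers here are invented.
        TeamResult result = new TeamResult();
        result.setTeamName("UCLA Bruins");
        result.setGamesPlayed(3);
        result.setTotal(71);
        System.out.println(result);
    }
}
```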
John Jai wrote:Converting HTML to XML might be safer.
I think that's the way I would go too. Regexes are powerful, but not generally the best choice for hierarchical parsing.
@t_day: You might want to have a look at JTidy, which is a Java port of the old chestnut HtmlTidy, that converts HTML to XHTML (well-formed, and therefore suitable for most parsers). In fact I believe it has its own parser built in, although I haven't used it myself.
Similarly, when you parse a winner, you can set a boolean to note that a winner is being parsed. And when you hit a <td> beneath a winner, you store it in the winner's games-played data.
It will be like flipping the booleans corresponding to what information is being read.
That will be tedious, so it may be worth taking some time to convert the HTML into XML first.
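For what it's worth, the boolean-flipping idea could be sketched roughly as follows. Everything here is an assumption about the page layout: that each cell is on its own line, that the games-played cell is a plain <td> directly beneath the winner cell, and the names FlagParser and winnersGamesPlayed are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: FlagParser is a hypothetical name and the line layout is assumed, not taken from the real page.
class FlagParser {
    static final String TEAM_MARKER = "<td class=\"teams\">";
    static final String WINNER_MARKER = "<td class=\"teams winner\">";

    // Text between the first '>' and the next '<'; assumes a well-formed one-line cell.
    static String textOf(String line) {
        int start = line.indexOf('>') + 1;
        int end = line.indexOf('<', start);
        return line.substring(start, end).trim();
    }

    // Maps each winner's name to the plain <td> value found directly beneath it.
    static Map<String, String> winnersGamesPlayed(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        boolean readingWinner = false; // flipped on when a winner cell is seen
        String currentWinner = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.contains(WINNER_MARKER)) {
                currentWinner = textOf(line);
                readingWinner = true;
            } else if (line.contains(TEAM_MARKER)) {
                readingWinner = false; // a non-winning team: ignore the cells beneath it
            } else if (readingWinner && line.startsWith("<td>")) {
                result.put(currentWinner, textOf(line));
                readingWinner = false; // flip back once the value is stored
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of( // invented sample lines apart from the UCLA one
                "<td class=\"teams\">UCLA Bruins</td>",
                "<td>3</td>",
                "<td class=\"teams winner\">Hypothetical Winners</td>",
                "<td>5</td>");
        System.out.println(winnersGamesPlayed(lines));
    }
}
```

Each extra piece of data needs another flag and another branch, which is where the tedium comes from.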
t dav wrote:I am not able to figure this one out, I've been messing with it for a while now. Any help?
Well, you haven't given us much to go on (check out the ItDoesntWorkIsUseless page), but your 'lineScanner' implementation, specifically the
bit seems a bit heavyweight to me for what you need (mind you, I loathe Scanner, so I'm not the best to judge).
You already know that your line contains the team name, and you also know that it's between the ">" that ends the 'td' tag and the next "<", so why not just use regular String methods, viz:
I really fear you're getting a bit bogged down in these regexes. Sometimes the simplest is the best.
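A minimal sketch of the plain String-method approach just described, using only indexOf and substring. The class and method names (BetweenTags, textBetweenTags) are made up for the example, and returning null for a line without a ">...<" pair is just one possible choice.

```java
// Sketch only: BetweenTags and textBetweenTags are hypothetical names.
class BetweenTags {
    // Returns the text between the first '>' and the next '<',
    // or null if the line does not contain such a pair.
    static String textBetweenTags(String line) {
        int start = line.indexOf('>');
        if (start < 0) {
            return null;
        }
        int end = line.indexOf('<', start + 1);
        if (end < 0) {
            return null;
        }
        return line.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(textBetweenTags("<td class=\"teams\">UCLA Bruins</td>"));
    }
}
```

The two-argument indexOf starts searching after the ">" that closes the opening tag, so the "<" at the very start of the line is never picked up by mistake.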
complete HTML coding. Specifically, I'm stuck on figuring out how to single out each name and what not and save them.
As for what Winston just posted, could you help explain what that does a bit?
Also, the code is giving this error:
Scraping being the first chart you see,
here's what I have so far:
I'm running into some runtime errors, stating
Exception in thread "main" java.lang.NullPointerException
t dav wrote:So I changed the website that I am scraping off of, the new website is: http://www.vegasinsider.com/top-betting-trends/...
Hunh? You haven't even got your code working and you've already changed the site?
I'm with Tim here. The HTML for these sites is simply too complicated and too varied to be trying to pull out specific pieces of data without some sort of parser (and it's not likely to be very simple even then).
Your previous code at least had a fairly specific string (TEAMS_COLUMN) that you could rely on, but now you're just looking for ">1<". I suspect that's a non-starter, and will give you a ton of false hits.
You're also mixing your screen scraping code with your GUI. DON'T.
Write a program/class that can successfully scrape a site and display results without any Swing code at all. Once you've got that working, then add the GUI stuff.
Also, I think you need to write down a procedure for scraping a screen on paper. Right now, you're just coding like mad, and dealing with errors as you get them.
I call that "gorilla programming" (...problem...code...ugh...) - otherwise known as the Jean-Paul Sartre methodology - and it is NOT the way to become a successful programmer.
So what I thought would be right would be to use a string array and substrings and go down each line and copy exactly what I need.
Seeing as how the HTML code is the same for what I need, I thought this would work.
Overall, using Selenium WebDriver, you are talking about a few lines of code to access the page and pull the elements:
- use the driver class's get method to open the URL you want to work with
- execute the findElements method of the driver and pass it a By.xpath("") expression with the XPath in the quotes.
From there, iterate, inspect, and explore the elements to start mining the data you want.
The XPath syntax does have a learning curve. I have used it along with Selenium to parse pages similar to, and even more complex than, what you are doing.