I'm trying to write a simple program that saves an HTML page given its URL. However, what I'm retrieving is not the same HTML that the browser (in my case Mozilla) uses. To see what I mean:
- run the following code
- open the embedded link (http://www.ranchhouseinn.com/ranch.html) in a browser, then select "save page as..." and save it to c:/good.html (or whatever you wish)
- compare that file with c:/copytest.html, which was generated by the code.
So my questions are:
- Why are these files different?
- How can I get the HTML in good.html using Java?
Sorry, bad example. And it's not a caching problem, either. Here's a better example:
- run the following code
- open the embedded link (http://www.hikinglasvegas.com/peaks_of_the_sierra.htm) in a browser, then select "save page as..." and save it to c:/good.html (or whatever you wish)
- open both c:/copytest.html and c:/good.html IN NOTEPAD
- search for 'Mallory' in both files
In good.html, you'll see there's a fully qualified URL for this link:
- href="http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm"
In copytest.html, you'll see it's been shortened:
- href="Mt_Malloryl_Photo_pg.htm"
This is my problem. I'm trying to parse out individual URLs from the document but when I go the code route they're shortened. Any ideas?
After running your program, I found that the only differences were in the whitespace. I suspect that most browsers apply some kind of formatting to the HTML when they save it. Note, however, that I compared against View | Source rather than File | Save As.
You say that the URLs are "shortened". Do you mean that they don't contain the full domain name and path? I suspect that the browser adds this when you save the file. To see what I mean, click View | Source in the main menu. Notice that all the URLs are shortened in this way, because that is exactly how the HTML is stored on the server and sent to the browser. The web designer probably used relative path names to make it easier to move the whole website to a different location. When saving, the browser is smart enough to add the information needed to turn each relative path into an absolute one, so that the links in the saved file aren't broken. You will need to add logic to your program to perform this same function.
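As a rough illustration of that logic, here is a minimal sketch that pulls href values out of the downloaded HTML and resolves each one against the URL of the page it came from. The class name ResolveLinks, the method names, and the simple regular expression are my own assumptions for illustration, not part of your program; java.net.URI.resolve does the actual relative-to-absolute conversion. A regex is only a crude stand-in for a real HTML parser, so treat this as a starting point.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class ResolveLinks {

    // Naive pattern: matches href="..." with double quotes only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Resolves one href against the URL of the page that contains it.
    // An href that is already absolute comes back unchanged.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Collects every double-quoted href in the HTML as an absolute URL.
    // Note: URI.resolve throws IllegalArgumentException for hrefs that are
    // not valid URIs (for example, ones containing spaces).
    static List<String> absoluteLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(toAbsolute(pageUrl, m.group(1)));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "http://www.hikinglasvegas.com/peaks_of_the_sierra.htm";
        String html = "<a href=\"Mt_Malloryl_Photo_pg.htm\">Mallory</a>";
        for (String link : absoluteLinks(page, html)) {
            System.out.println(link);
        }
    }
}
```

For the Mallory link from your example, this should produce the same fully qualified URL that appears in good.html.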
I also have a few comments about your code:
1) Sun's Java Coding Conventions suggest that class names start with upper case. It would be more appropriate to use CopyTest as the class name instead of copyTest.
2) Using UBB CODE tags will preserve your formatting. This makes your code much easier for us to read, which in turn lets us help you more quickly.
3) The file name "c:\copytest.html" is VERY platform specific. I had to change this in order to run it on Unix. You should do your best to avoid such platform-specific code. In this case, it would probably be good enough to just use "copytest.html" as the file name. This will save it in the "current directory" on any system.
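To illustrate point 3, here is a small sketch that writes the page to a relative file name so it lands in the current directory on any system. The class name SavePage and the save method are made up for this example, and it assumes a JDK with java.nio.file (Java 7 or later); your own program may well write the file differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, named here only for illustration.
class SavePage {

    // Writes the HTML to fileName, resolved against the current working
    // directory when the name is relative, and returns the absolute path.
    static Path save(String fileName, String html) throws IOException {
        Path target = Paths.get(fileName);
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        return target.toAbsolutePath();
    }

    public static void main(String[] args) throws IOException {
        Path saved = save("copytest.html", "<html><body>test</body></html>");
        System.out.println("Saved to " + saved);
    }
}
```

Printing the absolute path makes it obvious where the file ended up, which helps when the current directory isn't what you expected.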