
getting an html page

 
stephen dimitrov
Greenhorn
Posts: 16
I'm trying to write a simple program that saves an html page given its url. However, what I'm retrieving is not the same html that the browser (in my case Mozilla) uses. To see what I mean:
- run the following code
- open the embedded link (http://www.ranchhouseinn.com/ranch.html) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- compare that file with c:/copytest.html that was generated by the code.

So my questions are:
- Why are these files different?
- How can I get the html in good.html using java?

Thanks

>>>

import java.io.*;
import java.net.*;

public class copyTest {

public static void main(String[] args) {

try {
URL url = new URL("http://www.ranchhouseinn.com/ranch.html");
URLConnection connection = url.openConnection();

InputStream is = connection.getInputStream();
FileWriter fs = new FileWriter("c:/copytest.html");

int read=0;
while ((read = is.read()) != -1) {
fs.write(read);
}
fs.flush();
fs.close();

} catch (Exception e) {
e.printStackTrace();
}
}
}
 
Bear Bibeault
Author and ninkuma
Marshal
Posts: 66184
How do they differ?
 
Jeff Bosch
Ranch Hand
Posts: 805
One possibility is that your browser cached an older version. If you refresh the browser you may see the change disappear.
 
stephen dimitrov
Greenhorn
Posts: 16
Sorry, bad example. And it's not a caching problem, either. Here's a better example:

- run the following code
- open the embedded link (http://www.hikinglasvegas.com/peaks_of_the_sierra.htm) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- open both c:/copytest.html and c:/good.html IN NOTEPAD
- search for 'Mallory' in both files

In good.html, you'll see there's a fully qualified url for this link:
- href="http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm"
In copytest.html, you'll see it's been shortened:
- href="Mt_Malloryl_Photo_pg.htm"

This is my problem. I'm trying to parse out individual URLs from the document, but when I fetch the page with my code instead of saving it from the browser, they're shortened. Any ideas?

>>>

New code:

import java.io.*;
import java.net.*;

public class copyTest {

public static void main(String[] args) {

try {
URL url = new URL("http://www.hikinglasvegas.com/peaks_of_the_sierra.htm");
URLConnection connection = url.openConnection();

InputStream is = connection.getInputStream();
FileWriter fs = new FileWriter("c:/copytest.html");

int read=0;
while ((read = is.read()) != -1) {
fs.write(read);
}
fs.flush();
fs.close();

} catch (Exception e) {
e.printStackTrace();
}
}
}
 
Layne Lund
Ranch Hand
Posts: 3061
After running your program, I found that the only differences were in the whitespace. I suspect that most browsers perform some kind of formatting on the HTML when they save it. However, I used View | Source instead of File | Save As.

You say that the URLs are "shortened". Do you mean that they don't contain the full domain name and path? I suspect that the browser adds this when you save the file. To see what I mean, click View | Source in the main menu and notice that all the URLs are shortened in this way. That is exactly how the HTML is stored on the server and sent to the browser. The web designer probably used relative path names in order to make it easier to move the whole website to a different location. When saving, the browser resolves each relative link against the page's own URL to create an absolute path, so that the links in the saved file aren't broken. You will need to add logic to your program to perform this same function.
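
As a rough illustration of that logic, here is a minimal sketch that resolves a relative href against the URL of the page it came from, using java.net.URI from the standard library. The class name LinkResolver and the method toAbsolute are made up for this example; it also assumes the page does not override its base URL (for instance with a base tag) and that the href is a syntactically valid URI.

```java
import java.net.URI;

// Hypothetical helper, not part of the original program.
class LinkResolver {

    // Resolves an href found in a page against the URL the page was fetched from.
    // An href that is already absolute comes back unchanged.
    // URI.resolve(String) throws IllegalArgumentException for malformed hrefs.
    static String toAbsolute(URI pageUrl, String href) {
        return pageUrl.resolve(href).toString();
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.hikinglasvegas.com/peaks_of_the_sierra.htm");
        System.out.println(toAbsolute(page, "Mt_Malloryl_Photo_pg.htm"));
        // prints http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm
    }
}
```

Real-world pages often contain hrefs with spaces or other illegal characters, so a real program would probably want to catch IllegalArgumentException and skip or clean up those links.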

I also have a few comments about your code:

1) Sun's Java Coding Conventions suggest that class names start with upper case. It would be more appropriate to use CopyTest as the class name instead of copyTest.

2) Using UBB CODE tags will preserve your formatting. This will make it much easier for us to read your code, and so help us answer you more quickly.

3) The file name "c:/copytest.html" is VERY platform specific. I had to change this in order to run it on Unix. You should do your best to avoid such platform-specific code. In this case, it would probably be good enough to just use "copytest.html" as the file name. This will save it in the "current directory" on any system.
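
To make point 3 concrete, here is a small sketch of saving a stream under a relative file name so that it lands in the current directory on any system. The class PortableSave and its save method are invented for this example. It uses Files.copy, which copies the bytes unchanged; the FileWriter in the original program writes each byte as a character and re-encodes it with the default character encoding, which can alter non-ASCII bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original program.
class PortableSave {

    // Copies the stream byte for byte into fileName, replacing any existing file.
    // A relative name such as "copytest.html" is resolved against the current directory.
    // Returns the absolute path of the file that was written.
    static Path save(InputStream in, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target.toAbsolutePath();
    }
}
```

In the program above, this would be called as PortableSave.save(connection.getInputStream(), "copytest.html") in place of the FileWriter loop.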

Layne
[ March 30, 2005: Message edited by: Layne Lund ]
 