getting an html page

 
Greenhorn
Posts: 16
I'm trying to write a simple program that saves an HTML page given its URL. However, what I retrieve is not the same HTML that the browser (in my case Mozilla) uses. To see what I mean:
- run the following code
- open the embedded link (http://www.ranchhouseinn.com/ranch.html) in a browser, select "save page as...", and save it to c:/good.html (or whatever you wish)
- compare that file with c:/copytest.html, which was generated by the code.

So my questions are:
- Why are these files different?
- How can I get the HTML in good.html using Java?

Thanks

>>>

import java.io.*;
import java.net.*;

public class copyTest {

    public static void main(String[] args) {

        try {
            URL url = new URL("http://www.ranchhouseinn.com/ranch.html");
            URLConnection connection = url.openConnection();

            InputStream is = connection.getInputStream();
            FileWriter fs = new FileWriter("c:/copytest.html");

            int read = 0;
            while ((read = is.read()) != -1) {
                fs.write(read);
            }
            fs.flush();
            fs.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
Sheriff
Posts: 67682
How do they differ?
 
Ranch Hand
Posts: 805
One possibility is that your browser cached an older version of the page. If you refresh the browser, you may see the difference disappear.
 
stephen dimitrov
Greenhorn
Posts: 16
Sorry, bad example. And it's not a caching problem, either. Here's a better example:

- run the following code
- open the embedded link (http://www.hikinglasvegas.com/peaks_of_the_sierra.htm) in a browser and then select "save page as..." and save it to c:/good.html (or whatever you wish)
- open both c:/copytest.html and c:/good.html IN NOTEPAD
- search for 'Mallory' in both files

In good.html, you'll see there's a fully qualified URL for this link:
- href="http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm"
In copytest.html, you'll see it's been shortened:
- href="Mt_Malloryl_Photo_pg.htm"

This is my problem: I'm trying to parse the individual URLs out of the document, but when I fetch the page in code they're shortened. Any ideas?

>>>

New code:

import java.io.*;
import java.net.*;

public class copyTest {

    public static void main(String[] args) {

        try {
            URL url = new URL("http://www.hikinglasvegas.com/peaks_of_the_sierra.htm");
            URLConnection connection = url.openConnection();

            InputStream is = connection.getInputStream();
            FileWriter fs = new FileWriter("c:/copytest.html");

            int read = 0;
            while ((read = is.read()) != -1) {
                fs.write(read);
            }
            fs.flush();
            fs.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
Ranch Hand
Posts: 3061
After running your program, I found that the only differences were in the whitespace. I suspect that most browsers reformat the HTML in some way when they save it. Note, however, that I compared against View | Source rather than File | Save As.

You say that the URLs are "shortened". Do you mean that they don't contain the full domain name and path? I suspect that the browser adds these when you save the file. To see what I mean, click View | Source in the main menu and notice that all the URLs there are shortened in the same way. That is exactly how the HTML looks on the server when it is sent to the browser. The web designer probably used relative path names to make it easier to move the whole website to a different location. When saving, the browser is smart enough to turn each relative path into an absolute URL so that the links in the saved file aren't broken. You will need to add logic to your program to do the same thing.
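As a rough illustration of that last step, here is a minimal sketch that resolves an href against the URL of the page it was found on, using java.net.URI from the standard library. The class name LinkResolver and the method toAbsolute are made up for this example, and the example.com link is only a placeholder. It assumes the page has no <base href="..."> tag (which would change the base) and that every href is a syntactically valid URI; URI.create throws IllegalArgumentException otherwise.

```java
import java.net.URI;

public class LinkResolver {

    // Hypothetical helper: turns a relative href into an absolute URL by
    // resolving it against the URL of the page that contained it.
    // An href that is already absolute is returned unchanged.
    public static String toAbsolute(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.hikinglasvegas.com/peaks_of_the_sierra.htm";

        // Relative link as it appears in copytest.html.
        // prints http://www.hikinglasvegas.com/Mt_Malloryl_Photo_pg.htm
        System.out.println(toAbsolute(page, "Mt_Malloryl_Photo_pg.htm"));

        // Placeholder absolute link: printed as-is.
        System.out.println(toAbsolute(page, "http://www.example.com/other.htm"));
    }
}
```

You would still need to pull the href values out of the HTML yourself; this only covers turning each one into a full URL once you have it.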

I also have a few comments about your code:

1) Sun's Java Coding Conventions suggest that class names start with an uppercase letter. It would be more appropriate to use CopyTest as the class name instead of copyTest.

2) Using UBB CODE tags will preserve your formatting. This makes your code much easier for us to read, which in turn lets us help you more quickly.

3) The file name "c:/copytest.html" is VERY platform specific. I had to change it in order to run the program on Unix. You should do your best to avoid such platform-specific code. In this case, it would probably be good enough to just use "copytest.html" as the file name. This will save the file in the "current directory" on any system.
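To show what points 1 and 3 might look like together, here is a sketch (not a drop-in replacement) that uses the class name CopyTest and writes to copytest.html in the current directory. The helper names outputPath and save are my own invention. It also swaps FileWriter for a byte-for-byte copy with java.nio.file.Files.copy, since FileWriter re-encodes each byte as a character in the platform's default encoding and could alter non-ASCII bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CopyTest {

    // Hypothetical helper: a bare file name is resolved against the
    // current working directory, whatever the operating system.
    public static Path outputPath(String fileName) {
        return Paths.get(fileName).toAbsolutePath();
    }

    // Hypothetical helper: copies the raw bytes served at the URL into
    // the target file, replacing the file if it already exists.
    public static void save(URL url, Path target) throws IOException {
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("http://www.hikinglasvegas.com/peaks_of_the_sierra.htm").toURL();
        Path target = outputPath("copytest.html");
        save(url, target);
        System.out.println("Saved to " + target);
    }
}
```

The try-with-resources block closes the input stream even if the copy fails, which the original loop did not do.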

Layne
[ March 30, 2005: Message edited by: Layne Lund ]
 