
Problem getting the HTML of a web page

 
Rohan Deshmkh
Ranch Hand
Posts: 127
I would like to know what I am doing wrong. I don't want to use alternative classes. This is my code:


The output I get is many seemingly random integer values, one per line, ending with -1.
I am not sure what InputStream s = u.openStream(); and size=s.read(); do.
I want to know how to print the HTML of a web page without using any other classes.
 
Ulf Dittmer
Rancher
Posts: 42968
The value is certainly not random. As the relevant javadocs explain, it's the next byte of data. For a web page, that's probably the ASCII code of a character of text (or UTF-8, ISO_8859 or whatever the page is encoded in).
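To make this concrete, here is a minimal sketch of what read() hands back. The original code is not shown in the thread, so the class and method names below (ReadByteDemo, readValues) are illustrative assumptions, and a ByteArrayInputStream stands in for the stream returned by u.openStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReadByteDemo {

    // Collects every value returned by read(), including the final -1.
    static List<Integer> readValues(InputStream in) {
        List<Integer> values = new ArrayList<>();
        try {
            int b;
            do {
                b = in.read(); // 0..255 for one byte of data, -1 at end of stream
                values.add(b);
            } while (b != -1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for u.openStream(): the bytes of a tiny HTML fragment.
        InputStream in = new ByteArrayInputStream(
                "<p>Hi</p>".getBytes(StandardCharsets.US_ASCII));
        for (int value : readValues(in)) {
            // Casting to char is only meaningful for single-byte text such as ASCII.
            System.out.println(value
                    + (value == -1 ? " (end of stream)" : " -> '" + (char) value + "'"));
        }
    }
}
```

Each printed integer is one byte of the page, and the trailing -1 is simply the end-of-stream marker rather than part of the content.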

If you expect text to be returned (and not binary data), wrap the InputStream in a BufferedReader (by way of an InputStreamReader), and process the output line by line. That would provide a more human-readable representation of the content.
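A rough sketch of that approach might look like the following. The names PageReader and readText are made up for illustration, UTF-8 is an assumed encoding, and the URL is taken from the command line rather than from anything in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageReader {

    // Reads the whole stream as text, one line at a time.
    static String readText(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        // InputStreamReader turns bytes into chars; BufferedReader adds readLine().
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageReader <url>");
            return;
        }
        URL u = new URL(args[0]);
        // UTF-8 is an assumption here; the page may use a different encoding.
        System.out.print(readText(u.openStream(), StandardCharsets.UTF_8));
    }
}
```

The try-with-resources block closes the reader, which in turn closes the underlying stream.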
 
Rohan Deshmkh
Ranch Hand
Posts: 127
Hey, thanks for the suggestion. But I have the following questions:
1) Why should we wrap the InputStream in a BufferedReader?
2) I did what you said and got the correct output as expected, but the code given in the book does not use a BufferedReader. Instead it uses a byte array, which I am not able to understand.

Here is the code given in the book:


I don't understand why the byte array is used, or in which variable exactly the HTML content is stored.
As you suggested, I wrapped the stream in a BufferedReader and got the correct output, but I want to understand how the above example works.
 
Ulf Dittmer
Rancher
Posts: 42968
There are any number of ways to read an InputStream, and none is the best in all given circumstances. I prefer the BufferedReader approach because then I don't have to deal with byte arrays and creating String objects myself, but it only works if you're certain that you're reading text (which is what a web page is, but web content in general can also be binary, in which case you can't use Readers).

Note that, to be entirely correct, you would also have to handle the character encoding the web page is in. The code above assumes that it's compatible with the platform default encoding of the machine where the code runs - which is often a correct assumption, but definitely not always.
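As a small illustration of why the encoding matters, the sketch below decodes the same two bytes with two different charsets. EncodingDemo and decode are hypothetical names chosen for this example; the same Charset argument can be passed to an InputStreamReader when reading a stream.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Decodes raw bytes using an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset it was written in: one character.
        System.out.println(decode(bytes, StandardCharsets.UTF_8));

        // Decoded with the wrong charset: two unrelated characters.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1));

        // What the platform would pick if no charset were given.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

Plain ASCII text decodes identically under both charsets, which is why relying on the default often appears to work until a page contains accented or non-Latin characters.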
 
Rohan Deshmkh
Ranch Hand
Posts: 127
OK, I understand about InputStream, but would you mind telling me how the code that I posted above works? I am not able to understand it.
After this statement is executed: InputStream s = u.openStream();
does s now contain all the HTML content that we want? It may be in another format, so the only thing we have to do is convert it into text (using a BufferedReader or another method) and then print it. Am I right?
 
Winston Gutkowski
Bartender
Posts: 10490
Rohan Deshmkh wrote: I don't understand why the byte array is used, or in which variable exactly the HTML content is stored.
As you suggested, I wrapped the stream in a BufferedReader and got the correct output, but I want to understand how the above example works.

Basically: because it's doing a String conversion; however, it looks rather tortuous to me, and not as good as Ulf's suggestion.

Simply put, all Files and Streams contain binary data that can be read byte by byte. Only some of those Streams contain TEXT, and text needs to be converted. This is because a Java char is a TWO-byte primitive and streamed text (particularly ASCII text, but other forms too) often contains each character in one byte. The foundation classes provide a Reader which is specifically designed for converting text streams to Java characters, and there may be quite a lot going on behind the scenes that you don't see. Adding buffering (i.e., with a BufferedReader) makes I/O more efficient, and also allows you to read in "lines" of data (which is the normal way of breaking up text) as Strings.
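The book's code is not reproduced in this thread, so the following is only a guess at the general pattern it likely follows, not the book's actual listing. It reads chunks into a byte array and converts the collected bytes to a String at the end. The names ByteArrayRead and readAll, the 1024-byte buffer size, and the UTF-8 charset are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteArrayRead {

    // Reads the stream in chunks, then converts all bytes to a String once.
    static String readAll(InputStream in, Charset charset) {
        byte[] buffer = new byte[1024]; // scratch space reused for each chunk
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try {
            int count;
            // read(buffer) returns how many bytes it stored, or -1 at end of stream.
            while ((count = in.read(buffer)) != -1) {
                collected.write(buffer, 0, count); // keep only the bytes actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Decoding once at the end avoids splitting a multi-byte character
        // across two chunks.
        return new String(collected.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Stand-in for u.openStream().
        InputStream in = new ByteArrayInputStream(
                "<html>Hello</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```

In a version like this, the byte array is only a temporary holding area for each chunk; the HTML ends up in the String returned at the end.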

HIH

Winston
 
Rohan Deshmkh
Ranch Hand
Posts: 127
OK, thanks. I will use the BufferedReader approach from now on.
 