
GET html contents from a web server

 
mj zammit
Ranch Hand
Posts: 49
Hi,
I am building a program to get the HTML contents of a page on an HTTP website.
The code is found below:



But unfortunately I only receive half of the HTML.
Why is that?

When I use the URL class's getContent() I get all the HTML, but I need to use sockets. Can someone please point out where my error is?
 
mj zammit
Ranch Hand
Posts: 49
Why am I only receiving half the data from the server?
Any suggestions will be greatly appreciated.
 
Paul Clapham
Sheriff
Posts: 21416
You don't get the whole result because you don't read the whole result. Instead you stop reading earlier than that because of this:

By the way, that readLine() method (the one in DataInputStream) is deprecated. The API documentation suggests using BufferedReader.readLine() instead.
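As a rough sketch of that suggestion (not the code from this thread): the class and method names below are made up for illustration, ISO-8859-1 is only a placeholder charset, and the in-memory stream stands in for socket.getInputStream(). It assumes the server closes the connection when the response is complete, as is usual with HTTP/1.0. The loop keeps reading until readLine() returns null, which only happens at end of stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class LineReaderSketch {

    // Reads every line until end of stream; readLine() returns null only at EOF.
    static String readAllLines(InputStream in, String charsetName) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so add one back.
            page.append(line).append('\n');
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for socket.getInputStream(), so the sketch runs offline.
        byte[] fake = "<html>\n<body>\n</body>\n</html>\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.print(readAllLines(new ByteArrayInputStream(fake), "ISO-8859-1"));
    }
}
```

Note that with a raw socket the HTTP status line and response headers come through the same stream, so they would show up as the first lines read.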

Also, you said this:
"When I use the URL class's getContent() I get all the HTML, but I need to use sockets."

That doesn't quite make sense to me, as the URL class does use sockets. So if you use that, you are using sockets.
 
mj zammit
Ranch Hand
Posts: 49
Hi Paul,

Thanks for the reply.
What I meant when I said that I want to use sockets and not URL is that I want to use low-level sockets.
I have followed your suggestions but I am still only getting half of the HTML.
This is the code:


This is the output I am getting on the console:
Line 1: <html>
Line 2: <head>
Line 3: <meta NAME="GENERATOR" Content="Microsoft FrontPage 12.0">
Line 4: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Line 5: <title>Nature Net</title>
Line 6: <link REL="stylesheet" HREF="styles/style.css" TYPE="text/css">
Line 7: <script src="include/i_javascript.js" type="text/javascript"></script>
Line 8:
Line 9: <style type="text/css">
Line 10: .style1 {
Line 11: text-align: center;
Line 12: }
Line 13: .style2 {
Line 14: border-width: 0px;
Line 15: }
Line 16: .style5 {
Line 17: color: #E4761F;
Line 18: }
Line 19: </style>
Line 20:
Line 21: </head>
Line 22: <body leftmargin="0" topmargin="0" bgcolor="#FFFFFF">
Line 23: <table border="0" cellpadding="0" cellspacing="0" width="780">
Line 24: <tr>
Line 25: <td width="195" bgcolor="#4346D3" align="center" valign="middle">
Line 26: <img src="images/naturenetlogo2.gif" width="92" height="92" align="middle"></td>
Line 27: <td width="585">
Line 28: <table border="0" cellpadding="0" cellspacing="0">
Line 29: <tr>
Line 30: <td width="443" height="138" bgcolor="#84C55F" align="center" valign="center">
Line 31: <img src="images/headertitle.gif" alt="Naturenet The Environmental Learning Network" width="409" height="96">
Line 32: </td>
Line 33: <td width="142" bgcolor="#84C55F">
Line 34: <img src="images/headerpic1.gif" id="rightuppergraphic" alt="" width="142" height="140">
Line 35: </td>
Line 36: </tr>
Line 37: <tr>
Line 38: <td colspan="2" height="22" bgcolor="#FBE590" align="right"><a href="contact.html" class="navlink">
Line 39: contact us</a> |
Line 40: <a href="sitemap.html" class="navlink">sitemap</a>  </td>
Line 41: </tr>
Line 42: </table>
Line 43: </td>
Line 44: </tr>
Line 45: <tr>
Line 46: <td width="195" height="6" bgcolor="#FBE590"></td><td bgcolor="#84C55F"></td>
Line 47: </tr>
Line 48: <tr>
Line 49: <td width="195" height="500" bgcolor="#FBE590" valign="top">
Line 50: <table border="0" cellpadding="0" cellspacing="0"><tr><td bgcolor="#FBE590" width="5"></td>
Line 51: <td bgcolor="#FBE590">
Line 52: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
That is all the content found in the DataInputStream.

I also got the following output on the console:
The request header : GET /styles/style.css HTTP/1.0
User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Referer:http://127.0.0.1:8080/?getURL=www.naturenet.com
Accept: text/css,*/*;q=0.1
Host: 127.0.0.1:8080
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: __qca=1223192728-64253405-47139338; __utma=96992031.3524520648312145400.1227907344.1227907344.1227907344.1; __utmz=96992031.1227907344.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)

Error getting page java.net.MalformedURLException: no protocol:

What does this mean? Why is it showing me this request header?
 
Ulf Dittmer
Rancher
Posts: 42968
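You're assuming that there's nothing more to read if available() returns 0. That's not the case: available() only reports how many bytes can be read right now without blocking, not whether the stream has ended. See AvailableDoesntDoWhatYouThinkItDoes. Use read() instead, but be aware of ReadDoesntDoWhatYouThinkItDoes.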
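A minimal sketch of such a read() loop, assuming the server closes the connection at the end of the response. The class and method names are hypothetical, and the ByteArrayInputStream in main only stands in for the socket's input stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadUntilEofSketch {

    // Copies the stream until read() reports end of stream (-1).
    // available() is never consulted: 0 only means "nothing buffered right now".
    static byte[] readToEnd(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            // read() may fill only part of the buffer, so write exactly n bytes.
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for socket.getInputStream().
        byte[] fake = new byte[10000];
        byte[] copy = readToEnd(new ByteArrayInputStream(fake));
        System.out.println("Read " + copy.length + " bytes");
    }
}
```

The loop ends only when read() returns -1, so a pause in the network traffic does not cut the page short.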
 
mj zammit
Ranch Hand
Posts: 49
The pages you linked to do everything in bytes, but what I want in the end is the HTML of any URL, so that I can manipulate it on my web server before outputting the results. I can't do that with bytes, right?
 
Ulf Dittmer
Rancher
Posts: 42968
You can't convert the bytes to strings until you know which encoding they're in, and you won't know that until you've inspected the Content-Type response header or the META tag that specifies it.
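One way to sketch this, assuming the charset is declared in a META tag near the top of the page: take a first look at the raw bytes as ISO-8859-1, find the charset value, then decode the whole page with it. The class name, the 2048-byte probe size and the regular expression are all assumptions made for this example, not anything from the code in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffSketch {

    // Looks for charset=... near the top of the page, as found in a META tag
    // such as content="text/html; charset=iso-8859-1".
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)", Pattern.CASE_INSENSITIVE);

    static String decode(byte[] raw, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so it is safe for a first look at ASCII markup.
        String probe = new String(raw, 0, Math.min(raw.length, 2048), StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(probe);
        Charset charset = fallback;
        if (m.find() && Charset.isSupported(m.group(1))) {
            charset = Charset.forName(m.group(1));
        }
        // Decode the complete byte array only once the charset is known.
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = ("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
                + "<title>Nature Net</title>").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

The bytes stay bytes until the very end, so they can be collected in full first and turned into a String for manipulation afterwards.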

If this were my project, I'd use a library like https://sourceforge.net/projects/jwebunit, which lets you retrieve (and work with) web pages on a much higher level.
 
mj zammit
Ranch Hand
Posts: 49
I tried it this way, but I am still getting only half the HTML:


Thank you for your patience. I am new to networking and would really like to manage this with low-level sockets.

I am working with low-level sockets because I understood that the URL class can only do POST and GET requests. Is this true? Can it do other HTTP methods such as DELETE?
 
mj zammit
Ranch Hand
Posts: 49
When you say I have to inspect the META tag, does that mean finding the charset value?
 
mj zammit
Ranch Hand
Posts: 49
I tried to encode using UTF-8, but if I am not mistaken this is for text/html content.
The code is as follows:


But I am still getting only half the HTML.
 
Ulf Dittmer
Rancher
Posts: 42968
"But I am still getting only half the HTML."

Read the ReadDoesntDoWhatYouThinkItDoes page I linked to; it explains the problem.
 