how to extract url from html webpage

 
Greenhorn
Posts: 5
I have the response below, which I got by sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.



I have to write simple Java code that crawls all the URLs present on the current web page (/k/302.html).
Currently I am able to extract the first URL ("/") using a Java regular expression.
But I am not able to get the second URL, which is the one in the <img> tag.

Below is the expanded HTML content that I got from the console, where it clearly shows that "redback.jpg" has a hyperlink.



But the GET response itself does not clearly show that it has a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection and a standard HTTP request, without any external libraries.

image.png
[Thumbnail for image.png]
 
Saloon Keeper
Posts: 5412
Regexps are notoriously difficult to use for extracting content from web pages, not least because they're inflexible with respect to changing HTML. A better approach would be to use an API like HtmlUnit, IMO the best library for programmatic web access.
 
Marshal
Posts: 64166
. . . and, welcome to the Ranch
 
Surekha Gaikwad
Greenhorn
Posts: 5
No, I can't use any external API; that is a constraint of my project.
I am only having trouble extracting the URL that is set inside the <img> tag, as you can see in the snapshot.
 
Tim Moores
Saloon Keeper
Posts: 5412

I cant use any external API that is what constraint I have in my project


Why? Is this a school project? Otherwise, it makes little sense.
 
Campbell Ritchie
Marshal
Posts: 64166
I would think you might try searching for "<a\\s+href=" and "</a>" repeatedly, and take the substrings in between.
 
Sheriff
Posts: 21719
Your regex only matches anchor elements, not image elements. You should either use more regexes (ugh), or make your current regex more flexible (ugh).
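
As a rough illustration of the "more flexible regex" option, the sketch below looks for an href or src attribute inside any tag rather than only inside <a>. The class name LinkRegex, the method extractLinks, and the sample HTML are made up for this example, and the pattern is only an assumption about what is good enough for simple pages: it captures just the first href/src in each tag, skips unquoted attribute values, and can be fooled by attributes such as data-href.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkRegex {
    // Hypothetical pattern: a tag name, any attributes, then href or src with a
    // quoted value. Group 2 is the opening quote, group 3 is the URL itself.
    private static final Pattern LINK = Pattern.compile(
            "<\\w+\\b[^>]*?\\b(href|src)\\s*=\\s*([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the quoted href/src values in document order.
    static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(3));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup invented for the example, not the poster's real response.
        String html = "<a href=\"/\">Home</a> <img alt=\"spider\" src=\"redback.jpg\">";
        System.out.println(extractLinks(html)); // prints [/, redback.jpg]
    }
}
```

This uses only java.util.regex, so it stays within the "no external libraries" constraint, but it inherits all the usual fragility of matching HTML with regexes.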

There is a sneaky workaround that abuses the code used by Swing to render HTML in editor panes. Class HTMLEditorKit.Parser can be used to perform (basic) HTML parsing for you (although it's limited to HTML 3.2...).

To use this, create an instance of ParserDelegator, implement a ParserCallback, and call the parse method. The trick is in writing the ParserCallback: the handleSimpleTag and handleStartTag methods give you access to the elements and their attributes. You should at least check for attributes HTML.Attribute.HREF and HTML.Attribute.SRC.
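
The steps above could be sketched roughly as follows. This is a minimal example, not a tested crawler: the class name SwingLinkExtractor, the helper names extractLinks and collect, and the sample HTML are all invented for illustration. It relies only on the Swing classes that ship with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingLinkExtractor {

    // Parses the HTML and returns every href and src value the callback sees.
    static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // elements with content, e.g. <a href="...">
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // empty elements, e.g. <img src="...">
            }

            // Hypothetical helper: records the HREF and SRC attributes if present.
            private void collect(MutableAttributeSet attrs) {
                HTML.Attribute[] wanted = {HTML.Attribute.HREF, HTML.Attribute.SRC};
                for (HTML.Attribute name : wanted) {
                    Object value = attrs.getAttribute(name);
                    if (value != null) {
                        urls.add(value.toString());
                    }
                }
            }
        };

        try {
            // true = ignore any charset declared in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<html><body><a href=\"/\">Home</a>"
                + "<img src=\"redback.jpg\"></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

When working from a raw socket response, the HTTP status line and headers would need to be stripped off first, so that only the body is passed to the parser.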


(If you want to have better HTML parsing support, you can use DocumentParser instead, if you can create a correct DTD instance.)
 