I have the response below, which I got by sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.
I have to write simple Java code that crawls all the URLs present on this webpage (/k/302.html).
Currently I am able to extract the first URL ("/") using a Java regular expression.
But I am not able to get the second URL, which belongs to the <img> tag.
Below is the expanded HTML content that I got from the console, which clearly shows that "redback.jpg" has a hyperlink.
But the GET response itself does not clearly show that it has a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection and standard HTTP requests, without any external libraries.
Regexps are notoriously difficult to use for extracting content from web pages, not least because they're inflexible with respect to changing HTML. A better approach would be to use an API like HtmlUnit, IMO the best library for programmatic web access.
Your regex only matches anchor elements, not image elements. You should either use more regexes (ugh), or make your current regex more flexible (ugh).
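For what it's worth, a "more flexible" regex could look roughly like the sketch below. It is not your original pattern (that isn't shown here); the class name LinkRegexSketch, the method extractUrls and the sample HTML string are all made up for illustration. It only handles quoted href/src attribute values and will also pick up matches inside comments or scripts, so treat it as a starting point rather than a robust solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class LinkRegexSketch {
    // Matches href="..." or src="..." with single or double quotes, case-insensitively.
    // Group 1 is the opening quote, group 2 is the attribute value.
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample markup; the real response body is not shown in the question.
        String html = "<a href=\"/\">Home</a> <img src='redback.jpg'>";
        System.out.println(extractUrls(html));
    }
}
```

You would feed it the response body, i.e. everything after the blank line that ends the HTTP headers.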
There is a sneaky workaround that abuses the code used by Swing to render HTML in editor panes. Class HTMLEditorKit.Parser can be used to perform (basic) HTML parsing for you (although it's limited to HTML 3.2...).
To use this, create an instance of ParserDelegator, implement a ParserCallback, and call the parse method. The trick is in writing the ParserCallback: the handleSimpleTag and handleStartTag methods give you access to the elements and their attributes. You should at least check for the attributes HTML.Attribute.HREF and HTML.Attribute.SRC.
(If you want to have better HTML parsing support, you can use DocumentParser instead, if you can create a correct DTD instance.)
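The ParserDelegator approach described above could be sketched like this. The class name SwingLinkExtractorSketch, the method extractUrls and the sample HTML are assumptions for illustration only; the callback just collects every HREF and SRC value it sees, in document order, without resolving relative URLs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the Swing HTML parser that ships with the JDK.
class SwingLinkExtractorSketch {
    static List<String> extractUrls(String html) throws IOException {
        final List<String> urls = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // elements with an end tag, e.g. <a href="...">
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // empty elements, e.g. <img src="...">
            }

            private void collect(MutableAttributeSet attrs) {
                HTML.Attribute[] wanted = {HTML.Attribute.HREF, HTML.Attribute.SRC};
                for (HTML.Attribute attr : wanted) {
                    Object value = attrs.getAttribute(attr);
                    if (value != null) {
                        urls.add(value.toString());
                    }
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample markup; the real response body is not shown in the question.
        String html = "<html><body><a href=\"/\">Home</a><img src=\"redback.jpg\"></body></html>";
        System.out.println(extractUrls(html));
    }
}
```

With a socket-based client you would pass in the response body (after stripping the HTTP headers), or wrap the socket's input stream in a Reader and hand that to parse directly.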