URL string replacing

 
jack nicolson
Greenhorn
Posts: 5
Hi all,
I have a small question. I want to extract the URL from an HTML page, like

<a href="http://www.yahoo.com/" /a>

I want to extract http://www.yahoo.com/ from the HTML page. The page has only one link, as above. I would like to know a regular expression pattern or a substring-based method for doing this.

waiting for your reply,
Jack,
 
dharmendra Rathor
Greenhorn
Posts: 17
If "<a href="http://www.yahoo.com/" /a>" is part of an HTML file and we have to retrieve the URL (http://www.yahoo.com/) from the HTML page, then we can use a tagged regular expression:

(<a href=")(http://[a-z,.]*/)(" /a>)

The value of the second tagged expression (\2) will give the desired output.
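For readers working in Java, the tagged-expression idea could be sketched with java.util.regex as below. This is only an illustration under stated assumptions: the class name HrefRegexSketch, the method firstHref, and the pattern itself are hypothetical, and the pattern is looser than the one above (it accepts any double-quoted href value and ignores case) rather than a copy of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegexSketch {
    // Assumed pattern: "<a", whitespace, then href="..." with a double-quoted value.
    // Group 1 captures everything between the quotes.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the first href value found, or null if the text contains no match.
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.yahoo.com/\" /a>";
        System.out.println(firstHref(html));
    }
}
```

Note that String.replaceAll() only removes or substitutes the matched text; to read a captured group you need a Matcher, as in the sketch.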
 
jack nicolson
Greenhorn
Posts: 5
Thanks for your reply.
However, when I used your expression I got the same result as before.
I did it this way:


// line holds the HTTP response string
line = line.replaceAll("<a href=","");
// line = line.replaceAll("http://[a-z,.]*/","");
line = line.replaceAll("/a>","");

Have I done this correctly?

The web page from which I want to extract the URL is:




<HTML>
<HEAD>
<TITLE>Moved Temporarily</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Temporarily</H1>
The document has moved <A HREF="http://www.yahoo.com/feeds/default/private/full/?gsessionid=w8O_URi_sRmSo66ZbxfhYQ">here</A>.
</BODY>
</HTML>

I want to retrieve only this string "http://www.yahoo.com/feeds/default/private/full/?gsessionid=w8O_URi_sRmSo66ZbxfhYQ"

Hope this will help you solve my issue.

Thanks
Jack,
 
Campbell Ritchie
Sheriff
Pie
Posts: 50258
79
  • Try using either String.indexOf() or a regular expression to match the location of <A href=.
  • Get the index and add the length of <A href= to it. You now have a start index.
  • From that start index, find the first occurrence of "> using the indexOf() method or a regular expression.
  • You now have start and finish indices. Use those to obtain a substring.
  • Put that substring into an ArrayList<URL> or an ArrayList<String>
  • Repeat until you reach the end of the String.
  • I think that will probably work. Try it.
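The steps above could be sketched roughly as follows. This is a sketch under assumptions, not a definitive implementation: the names HrefIndexOfSketch, indexOfIgnoreCase and extractHrefs are made up for illustration, the search ignores case so that both <A HREF= and <a href= are found, the start marker includes the opening quote, and the end of the URL is taken to be the closing double quote rather than ">.

```java
import java.util.ArrayList;
import java.util.List;

class HrefIndexOfSketch {
    // Case-insensitive variant of indexOf, built on String.regionMatches.
    static int indexOfIgnoreCase(String text, String target, int from) {
        for (int i = Math.max(from, 0); i + target.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, target, 0, target.length())) {
                return i;
            }
        }
        return -1;
    }

    // Collects every double-quoted href value that follows "<a href=".
    static List<String> extractHrefs(String html) {
        final String marker = "<a href=\"";
        List<String> urls = new ArrayList<>();
        int from = 0;
        while (true) {
            int tag = indexOfIgnoreCase(html, marker, from);
            if (tag < 0) break;                 // no more links
            int start = tag + marker.length();  // first character of the URL
            int end = html.indexOf('"', start); // closing quote of the URL
            if (end < 0) break;                 // unterminated attribute: stop
            urls.add(html.substring(start, end));
            from = end + 1;                     // continue after this link
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "The document has moved <A HREF=\"http://www.yahoo.com/feeds/default/private/full/"
                + "?gsessionid=w8O_URi_sRmSo66ZbxfhYQ\">here</A>.";
        System.out.println(extractHrefs(html));
    }
}
```

Checking each index for -1 before calling substring() is what keeps the loop from throwing when a marker is missing.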
     
    jack nicolson
    Greenhorn
    Posts: 5
    Thanks for your suggestion. I tried it accordingly, but I got an out-of-bounds exception:

    int i=line.indexOf("<A href=");
    out.println(i);
    int len = i+"<A href=".length();
    i = line.indexOf(">",len+1);
    out.println(i);
    line=line.substring(len+1,i+1);
    out.println(line);
    Please let me know if I have made a mistake in the code.

    Thanks,

    Jack.
     
    Joanne Neal
    Rancher
    Posts: 3742
    16
    The exception will tell you which line of your code it happened on. Look at the documentation for the methods on that line and see what could cause the exception.
     
    Campbell Ritchie
    Sheriff
    Pie
    Posts: 50258
    79
    Look closely through the details of the substring() method. You may be getting problems because of the +1.
    In case you are getting lines ending with /a> rather than . . .>link text</a> you might try getting the index of /a> as well, and using that if it is less than the index of >. There is a simple method in the Math class which can do that for you.
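    One way to read the Math suggestion is sketched below, assuming the method meant is Math.min(). The class EndIndexSketch and the method endIndex are hypothetical names. The sketch also guards against -1, because String.indexOf() is case-sensitive and returns -1 when nothing is found, and passing a negative or reversed index to substring() is one way to get a StringIndexOutOfBoundsException.

```java
class EndIndexSketch {
    // Returns the smaller of the indices of ">" and "/a>" at or after start,
    // ignoring whichever one was not found. Returns -1 if neither is present.
    static int endIndex(String line, int start) {
        int gt = line.indexOf(">", start);
        int slashA = line.indexOf("/a>", start);
        if (gt < 0) return slashA;
        if (slashA < 0) return gt;
        return Math.min(gt, slashA);
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://www.yahoo.com/\" /a>";
        int start = "<a href=".length();
        int end = endIndex(line, start);
        if (end >= start) {
            // The raw substring still carries the quotes and a trailing space,
            // so trim it and strip the quotes before using it as a URL.
            String raw = line.substring(start, end).trim();
            System.out.println(raw.replace("\"", ""));
        }
    }
}
```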
     