
regex problem

 
Willie Tsang
Greenhorn
Posts: 24
link=Pattern.compile("href=\"(.*?)[^>]*\">\" );
Is there anything wrong with my regex?
My program searches for links in this format: <a href="link">text</a>
 
Paul Clapham
Sheriff
Posts: 22832
Willie Tsang wrote: Is there anything wrong with my regex?


The best thing for you to do would be to try it and see if it does what you want it to do. It's not always easy for somebody to look at code and see exactly what it does, so besides being easier and faster (you don't have to wait for somebody to answer your post), trying it is also going to answer the question more reliably.
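
As a rough illustration of "try it and see", here is a minimal sketch that runs a pattern against one sample link and prints what the group captured. The posted line does not compile as written, so the pattern below is an assumed reading of it with the closing quote moved; the class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TryRegex {
    // Assumption: the intended pattern, with the string literal closed properly.
    static final Pattern LINK = Pattern.compile("href=\"(.*?)[^>]*\">");

    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"link\">text</a>";
        // Brackets make an empty capture visible in the output.
        System.out.println("captured: [" + firstGroup(sample) + "]");
    }
}
```

On this sample the match succeeds but the group comes back empty: (.*?) is reluctant, and [^>]* is free to swallow the URL. Moving the closing quote to just after the group, as in href=\"(.*?)\"[^>]*>, makes the group capture link instead.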
 
Winston Gutkowski
Bartender
Posts: 10575
Willie Tsang wrote: link=Pattern.compile("href=\"(.*?)[^>]*\">\" );
Is there anything wrong with my regex?

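Seems that there are lots of questions about parsing HTML today, and I'll say to you what I said to the others: regex is not a good choice for parsing markup, because markup is hierarchical. Elements can nest to any depth, and a regular expression has no way to keep track of that nesting.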

You're generally better off using something like JTidy to obtain well-formed HTML and then using a parser (of which there are tons; in fact, JTidy has one built in).
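
For cases where third-party libraries are not an option, the JDK itself ships an HTML parser in javax.swing.text.html. The following is only a sketch of the parser approach, not JTidy and not anything posted in this thread; it relies on Swing's rather dated parser (roughly HTML 3.2 era), and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkExtractor {
    // Collects the href value of every <a> start tag the parser reports.
    static List<String> hrefs(String html) {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.com/\">x</a></body></html>";
        System.out.println(hrefs(page));
    }
}
```

The callback only sees what the parser recognizes as tags, so text that merely looks like a link inside a comment is not picked up the way it can be with a plain regex scan.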

Winston
 
Willie Tsang
Greenhorn
Posts: 24
Thanks for the reply, I got it to work. But this is a homework assignment in which I have to use built-in code only, so I can't use JTidy. Thanks again for the advice.
 
Aditya Jha
Ranch Hand
Posts: 227
In my opinion, reg-ex can be used successfully to parse most forms of HTML, whether nicely or badly formatted. Using JTidy and other such means sounds logical and better prima facie, but it may not be applicable in many cases. For example, in one of my projects I had to write a passthrough web component: a sort of proxy that shows a remote page to the end client in such a manner that the end client has no direct communication with the remote server, a la the proxy sites that are used to bypass firewalls. My purpose was more official, though.

I used the following to scan URLs in the page, as I had to modify them to point to my intermediary server and let the client be completely independent of the actual page-hosting server. Also, I had little control over what technology was used to generate the remote page, or whether the page was well-formed HTML or not. Using JTidy (and an HTML parser) was an issue, because it caused the standalone <div> elements to be self-closing, and that caused some problems with IE (as I vaguely remember). Plus, I wanted to retain everything, from comments to errors due to wrongly formatted HTML, and pass it on to the end client with the fewest possible changes to the actual remote page code.

The reg-ex may seem to be a bit complex, but it can be taken as an exercise, just to trigger ideas.


This scans most places in HTML where a URL can be found. It is by no means a complete list, but it does scan most remote sites in a comprehensive manner.

@Willie - It's good that you have already found the solution. In general, to parse elements for href values, you may try using the pattern generated by:
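
The code that followed here did not survive in this copy of the thread. As a stand-in, and explicitly not the original pattern, here is a minimal sketch of one way to pull href values out of anchor tags with java.util.regex alone. It assumes attribute values are double-quoted, single-quoted, or unquoted, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefPattern {
    // <a ... href = "value" | 'value' | value, matched case-insensitively.
    // Groups 1, 2 and 3 hold the double-quoted, single-quoted and unquoted forms.
    static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns every href value found, in document order.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            String value = m.group(1) != null ? m.group(1)
                         : m.group(2) != null ? m.group(2)
                         : m.group(3);
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">x</a> <A class=k HREF='b.html'>y</A> <a href=c.html>z</a>";
        System.out.println(extract(html));
    }
}
```

This is still a text scan, so it will also match anchors that appear inside comments or scripts, and an attribute such as data-href would be picked up as well. For a homework-sized page that may be acceptable; for arbitrary pages a real parser is safer.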

EDIT: In hindsight, the code in my comment may be too complex to be in the beginner's forum. But, I guess it may be useful.
 