Regex to find URL from anchor tag

 
shwetank singh
Greenhorn
Hi Ranchers!

Please help me understand what may be wrong with the following:
error:
Thanks in advance!
 
Rob Spoor
Sheriff
Check all occurrences of "(?". You must either escape the "(" (which makes it an optional literal "(" in the regex), escape the "?", or add a ":" to make it a non-capturing group: "(?:assdsad)".
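To make the three options concrete, here is a minimal sketch. The pattern bodies ("abc") and the class and method names are made up for illustration; they are not taken from the original regex, which is not shown in the thread.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class GroupSyntaxDemo {
    // Returns true if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "(?" followed by something that is neither a group type nor a flag.
        System.out.println("(?abc)   compiles: " + compiles("(?abc)"));
        // Option 1: escape the "(" so that "\(?" means an optional literal parenthesis.
        System.out.println("\\(?abc   compiles: " + compiles("\\(?abc"));
        // Option 2: escape the "?" so the group starts with a literal question mark.
        System.out.println("(\\?abc)  compiles: " + compiles("(\\?abc)"));
        // Option 3: add ":" to turn it into a non-capturing group.
        System.out.println("(?:abc)  compiles: " + compiles("(?:abc)"));
    }
}
```

Which option is right depends on what the "(?" was meant to do: match a literal character, or group without capturing.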

However, I would change your regex slightly:
- use a group for the opening quote
- use a reluctant (non-greedy) capture all: .*?
- require the closing quote to be the same as the opening quote by using a back reference: \\1

That leaves you with just one small problem: what if there are no quote characters around the value? In HTML it's perfectly valid to write <a href=http://www.google.com>.
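The back reference is probably the least obvious of those three suggestions, so here is a small sketch of it in isolation. The sample attribute string and the class, field and method names are my own assumptions, chosen only to show the difference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBackrefDemo {
    // Group 1 captures the opening quote; \1 requires the same character to close the value.
    static final Pattern SAME_QUOTE = Pattern.compile("(['\"])(.*?)\\1");
    // For contrast: either kind of quote is allowed to close the value.
    static final Pattern ANY_QUOTE = Pattern.compile("(['\"])(.*?)(['\"])");

    // Returns the text of group 2 for the first match, or null if nothing matches.
    static String firstValue(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String attr = "href=\"it's here\"";
        System.out.println(firstValue(SAME_QUOTE, attr));
        System.out.println(firstValue(ANY_QUOTE, attr));
    }
}
```

With the back reference, the value runs up to the matching double quote. Without it, the reluctant .*? stops at the first quote character of either kind, which here is the apostrophe.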
 
Ranch Hand
What is the use of <link> and <text>?
It looks like you want to put the matched parts of the pattern into named groups?
 
shwetank singh
Greenhorn
Thanks, Rob, for the useful insight. I tried everything you suggested but can't crack it. I tried:
- removing the optional "(" or escaping it
- using ?: (I had already tried this but can't get it to work)
- requiring the opening quote using a back reference \\1 (I didn't actually understand this one)

Could you please tell me whether the approach is correct, or whether there is a simpler one?

@Raymond: yes, I am trying to do the same. Any suggestions?

Thanks!
 
Rob Spoor
Sheriff
Before Java 7, java.util.regex.Pattern has no support for named groups. Only numbered. (Java 7 and later accept named groups written as (?<name>...).)

Here's how I would have built this regex:
- make the entire thing case insensitive. That allows you to find <A as well as <a
- start with <a
- anything that doesn't close the tag, as a reluctant quantifier†: [^>]*?
- href
- any amount of whitespace: \s*
- =
- any amount of whitespace: \s*
- a capturing group for the opening quote: ('|")
- a capturing group with anything, as a reluctant quantifier, for the URL: (.*?)
- the closing quote, equal to the opening quote: \1
- again, anything that doesn't close the tag, as a reluctant quantifier†: [^>]*?
- a negative lookahead for /, to prevent a case of <a xxxxx/>: (?!/)
- the closing >

If you paste all that together you get a regex that should do what you need. In the future, I would build regexes the same way if I were you: write down what you think you need in words, bit by bit, then translate each of these bits into a separate little regex, then combine these regexes into one larger regex.

†These two "anything that doesn't close the tag" parts are for any other attributes, like target, name, id, etc.
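Pasted together, the steps above might look like the following sketch. The class, field and method names and the sample inputs are assumptions for illustration. There is one deliberate deviation, which is mine and not from the post: the sketch uses a negative lookbehind (?<!/) just before the closing >, because a lookahead (?!/) placed directly before > adds no restriction (the next character has to be > anyway for the match to continue).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefRegex {
    // Each piece mirrors one bullet of the list above.
    static final Pattern QUOTED_HREF = Pattern.compile(
            "<a"            // start with <a
            + "[^>]*?"      // anything that doesn't close the tag, reluctant
            + "href"
            + "\\s*=\\s*"   // = with any amount of whitespace on both sides
            + "('|\")"      // group 1: the opening quote
            + "(.*?)"       // group 2: the URL, reluctant
            + "\\1"         // the closing quote, equal to the opening quote
            + "[^>]*?"      // again, anything that doesn't close the tag
            + "(?<!/)"      // ASSUMPTION: lookbehind used in place of the (?!/) lookahead
            + ">",          // the closing >
            Pattern.CASE_INSENSITIVE);

    // Returns the URL of the first quoted href in html, or null if there is none.
    static String firstUrl(String html) {
        Matcher m = QUOTED_HREF.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("hello <a href='google.com'>link</a>"));
        System.out.println(firstUrl("<A HREF=\"http://example.com\" target=\"_blank\">x</A>"));
        System.out.println(firstUrl("<a href=\"x\" />"));
    }
}
```

Keep in mind that (.*?) can also match > characters, so on input with several tags and unbalanced quotes the URL group may run further than intended. Treat this as a starting point, not a general HTML parser.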


As for the non-quoted values, I ended up using a second regex for that. It looked the same, except that the opening quote, reluctant anything, and closing quote were replaced by a negative lookahead to prevent quotes, followed by any non-whitespace. After that came either > or whitespace followed by the last three parts of the above regex.
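A rough sketch of such a second regex follows. It is a reconstruction from the description, not the actual regex from the post. Two things are my own assumptions: the value is matched with [^\s>]+ instead of plain non-whitespace, so that it cannot swallow the closing >, and the slash check is written as a lookbehind (?<!/). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedHrefRegex {
    static final Pattern UNQUOTED_HREF = Pattern.compile(
            "<a[^>]*?href\\s*=\\s*"       // same start as the quoted version
            + "(?!['\"])"                 // negative lookahead: the value must not start with a quote
            + "([^\\s>]+)"                // group 1: the URL (ASSUMPTION: non-whitespace that is also not >)
            + "(?:>"                      // either the tag closes right away
            + "|\\s[^>]*?(?<!/)>)",       // or whitespace, other attributes, then > (ASSUMPTION: lookbehind)
            Pattern.CASE_INSENSITIVE);

    // Returns the URL of the first unquoted href in html, or null if there is none.
    static String firstUrl(String html) {
        Matcher m = UNQUOTED_HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("<a href=http://www.google.com>"));
        System.out.println(firstUrl("<A HREF=http://example.com target=_blank>x</A>"));
        System.out.println(firstUrl("<a href=\"http://example.com\">"));
    }
}
```

Running the quoted regex first and falling back to this one when it finds nothing would cover both forms.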
 
shwetank singh
Greenhorn
Thanks, Rob!
Got it:

String pattern = "<a[^>]*?href\\s*=\\s*((\'|\")(.*?)(\'|\"))[^>]*?(?!/)>";
System.out.println(m.group(1) + "<-- -->"+ m.start() + "<-- -->" + ss);

output :

'google.com'<-- -->6<-- -->hello link

I will take those quotes out too, and handle the case when there are no quotes. I will post back when done.

Thanks. I did take the thought step by step, but did not write the steps down! Nice learning from you!
 