Regular Expression for identifying/extracting HTML Codes

 
Jigar M Gohil
Greenhorn
Posts: 25
Hi All,
I am trying to build a regular expression for extracting HTML codes from a String.

I have built a simple expression: ".*&.*;.*".
But this also returns true for strings with spaces, like "kjj&A fast;gre".

I don't want to match a String that contains even a single space between "&" and ";".

And after finding a string that contains an HTML code, how can I extract that code using a regular expression?

Thanks & Regards.
 
Richard Tookey
Bartender
Posts: 1166
Err ... so change the "&.*;" part so that it is more restrictive and excludes spaces!

P.S. Parsing HTML is normally better done using an HTML to XML converter and parsing the XML.
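
One way to read the "more restrictive" suggestion is to replace the `.*` between `&` and `;` with a negated character class. The following is only a sketch under that assumption; the class name `NoSpaceEntityCheck` and the method `containsCandidate` are made up for illustration and do not come from the thread. It uses `find()` rather than `matches()`, so the leading and trailing `.*` are not needed.

```java
import java.util.regex.Pattern;

public class NoSpaceEntityCheck {
    // '&', then one or more characters that are not whitespace, ';' or '&', then ';'.
    // Excluding '&' keeps a stray ampersand from swallowing a later, real code.
    private static final Pattern CANDIDATE = Pattern.compile("&[^\\s;&]+;");

    // Returns true if the text contains "&...;" with no whitespace in between.
    static boolean containsCandidate(String text) {
        return CANDIDATE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsCandidate("D&amp; G"));          // true
        System.out.println(containsCandidate("D&#36;G"));           // true
        System.out.println(containsCandidate("A & some text ; B")); // false
        System.out.println(containsCandidate("kjj&A fast;gre"));    // false
    }
}
```

This pattern only rules out spaces; it still accepts things like "&xyz123;" that are not real HTML codes, so a tighter pattern may be needed depending on the input.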
 
Jigar M Gohil
Greenhorn
Posts: 25

Richard Tookey wrote: Err ... so change the "&.*;" part so that it is more restrictive and excludes spaces!



Can you please help me with the exact regular expression to use?

e.g. "D&amp; G" or "D&#36;G" should match and return true. [No spaces between & and ;]
But "A & some text ; B" should NOT match and should return false.

I came up with ".*&(#)*([\\w&&[^\\s]])*;.*".
Is there any other way?


Richard Tookey wrote:
P.S. Parsing HTML is normally better done using an HTML to XML converter and parsing the XML.



Here my source is not HTML or XML. It's a simple String that might contain an HTML code, which I need to scan for.
Do you still suggest the XML parser? Which one should I use?
 
Richard Tookey
Bartender
Posts: 1166
So you need a regular expression that matches (an & followed by 3 to 5 alpha characters followed by a semicolon) OR (an & followed by a hash (#) followed by 3 decimal digits followed by a semicolon). That is very basic regular expression stuff.
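
The description above can be sketched in Java roughly as follows, including the extraction step asked about earlier in the thread. The names `EntityExtractor`, `AS_DESCRIBED`, `RELAXED` and `extract` are hypothetical. `AS_DESCRIBED` is a literal reading of the description; note that it would not match "&#36;", which has only two digits, so `RELAXED` (one or more digits) is shown as an assumed variation that is not from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityExtractor {
    // Literal reading of the description: "&" + 3 to 5 letters + ";", or "&#" + exactly 3 digits + ";".
    static final Pattern AS_DESCRIBED = Pattern.compile("&(?:[A-Za-z]{3,5}|#\\d{3});");

    // Assumed relaxation (not from the thread): accept one or more digits so "&#36;" also matches.
    static final Pattern RELAXED = Pattern.compile("&(?:[A-Za-z]{3,5}|#\\d+);");

    // Collects every substring of text that the given pattern matches, in order.
    static List<String> extract(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract(AS_DESCRIBED, "D&amp; G and &#160;")); // [&amp;, &#160;]
        System.out.println(extract(AS_DESCRIBED, "D&#36;G"));             // []
        System.out.println(extract(RELAXED, "D&#36;G"));                  // [&#36;]
        System.out.println(extract(RELAXED, "A & some text ; B"));        // []
    }
}
```

An empty result list means the String contains no code, so the same method covers both the true/false check and the extraction.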


Would I still use something like JTidy and then DOM? Probably, since even though I'm a great fan of regular expressions, I would not want to write all the necessary regular expressions, and it is difficult to handle nested HTML tags using regular expressions.
 