
regex for html tags

 
Ranch Hand
Posts: 85
Hi All,

How do I create a regular expression for HTML tags? Mainly I need to remove

tags from a string, but leave the content intact. If it removes all HTML tags from the string, that is fine too. I am using String's replaceAll method, but since it needs a regex, I am not sure how to do that. Can anyone help me here? I need to turn this in by tomorrow.

Thanks!
Nina
 
Marshal
Posts: 70591
HTML tags are actually quite simple; they start with <, they end with > and can have anything in the middle. You should be able to create a regular expression easily.
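
For what it's worth, the description above (starts with <, ends with >, anything in the middle) could be written roughly as follows. This is only a minimal sketch of that naive idea; the class and method names are made up for illustration, and the pattern assumes that no tag contains a literal > inside it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NaiveTagStripper {
    // "<", then any run of characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes everything that looks like a tag and keeps the text in between.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints: Hello world
    }
}
```

The same pattern can be passed straight to String's replaceAll, as in `html.replaceAll("<[^>]*>", "")`.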
 
Ranch Hand
Posts: 266

Campbell Ritchie wrote: HTML tags are actually quite simple; they start with <, they end with > and can have anything in the middle. You should be able to create a regular expression easily.



@OP:
But be careful: if your HTML contains JavaScript, for example, things may go wrong. Take the following JavaScript:

The "<a" may be mistaken for the start of an anchor tag.
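
To make the pitfall concrete, here is a small sketch that runs a naive tag-stripping pattern over a line of JavaScript. The JavaScript line is a hypothetical stand-in written for this illustration, and the class and method names are made up as well.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class ScriptPitfallDemo {
    // Naive pattern: "<", anything that is not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Two comparisons; there is no HTML tag anywhere in this line.
        String js = "if (x<a && b>y) go();";
        // The span "<a && b>" is treated as a tag and deleted.
        System.out.println(stripTags(js)); // prints: if (xy) go();
    }
}
```

The comparison operators are swallowed, so the script text that comes out no longer means what it did before.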
 
Ranch Hand
Posts: 2908

Piet Verdriet wrote:

Campbell Ritchie wrote: HTML tags are actually quite simple; they start with <, they end with > and can have anything in the middle. You should be able to create a regular expression easily.



@OP:
But be careful: if your HTML contains JavaScript, for example, things may go wrong.

The "<a" may be mistaken for the start of an anchor tag.



Then how about validating the text found between < and >? If the extracted string contains word characters (\w) together with '=', treat it as an HTML tag; otherwise it may be code or other HTML data, so escape it.

('=' considered because of the tag like <input type="text"/>)
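
A rough sketch of that check might look like the following. It reads the suggestion literally (a word character followed by '=' means "tag"), and the class name, method name and the choice of &lt; / &gt; for escaping are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class AttributeHeuristicStripper {
    // Any span that starts with "<" and ends with ">", with no nested angle brackets.
    private static final Pattern ANGLE_SPAN = Pattern.compile("<([^<>]*)>");
    // A word character followed by '=', as in type="text".
    private static final Pattern HAS_ATTRIBUTE = Pattern.compile("\\w\\s*=");

    static String strip(String text) {
        Matcher m = ANGLE_SPAN.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            if (!HAS_ATTRIBUTE.matcher(m.group(1)).find()) {
                // Does not look like a tag with attributes: keep it, but escaped.
                out.append("&lt;").append(m.group(1)).append("&gt;");
            }
            // Otherwise it is treated as a tag and dropped.
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<input type=\"text\"/>ok")); // prints: ok
        System.out.println(strip("if (x<a && b>y)"));          // prints: if (x&lt;a && b&gt;y)
    }
}
```

Note that, read this literally, tags without attributes such as `<b>` are escaped rather than removed, and a comparison like `x<a == b>y` would still be taken for a tag, so the heuristic needs more work before it is usable.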
 
Rancher
Posts: 43016
You also need to think about comments (both JavaScript and HTML) as well as the contents of string constants in JavaScript, both of which can contain lots of stuff that screws up simple regexps.
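
Two quick illustrations of that, using a naive "anything between < and >" pattern. The inputs are invented for this example and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class CommentPitfallDemo {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // HTML comment containing ">": the match stops at the first ">",
        // so the rest of the comment leaks into the output.
        System.out.println(stripTags("<!-- 1 > 0 -->visible"));
        // prints:  0 -->visible   (with a leading space)

        // JavaScript string constants containing "<" and ">": everything
        // between them is deleted, including code outside the strings.
        System.out.println(stripTags("var s = \"a<b\"; var t = \"c>d\";"));
        // prints: var s = "ad";
    }
}
```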

(From a theoretical point of view, most programming languages are described by type-1 or type-2 grammars; trying to process them with weaker type-3 tools, such as regular expressions, will cause complications. See the Chomsky hierarchy for more info.)

If this were my problem, I'd use a library like TagSoup or NekoXNI that creates valid XML from HTML, and then use XML APIs to work with the result.
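
As a rough illustration of the parser-based route, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator) in place of TagSoup or NekoXNI, purely so that it runs without extra dependencies. That substitution is an assumption made for this example, not what is recommended above, and the class and method names are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, for illustration only.
final class ParserTextExtractor {
    // Lets a real HTML parser find the tags and collects only the text nodes.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                text.append(data);
            }
        };
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```

The bundled parser is quite dated, so for real-world HTML a dedicated library such as the ones named above is likely the better choice; the point here is only that the parser, not a regex, decides what is a tag.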
 
Piet Verdriet
Ranch Hand
Posts: 266

Ulf Dittmer wrote:...

If this were my problem, I'd use a library like TagSoup or NekoXNI that creates valid XML from HTML, and then use XML APIs to work with the result.



++ for that!
 