
Stripping out HTML from String

 
Maksim Ustinov
Greenhorn
Posts: 26
Hello,

I'm writing a web app in Java. I have code that generates HTML from a template and a database. The function returns this HTML as a String, and I need to take the <head> tag out of it.

My code looks like this:



As you can see, I need to remove <head>...</head> and the <body ...> tag from my code, leaving everything that is inside the body.
 
Eric Daly
Ranch Hand
Posts: 143
Do you know how to search a file? Basically search through the file, looking for the stuff you want to remove (or the first line you want to keep). You'll need to create a temporary file to copy the contents you want to keep from the original, and then when you're done, write the new stuff to the original file (or write to a new file).
 
Maksim Ustinov
Greenhorn
Posts: 26
That's not a problem. I already have the file, and its content is in a String. Now I just need to create a regular expression to remove it using the replaceAll() method, but I don't know how to write that regex.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34837
Maksim,
You are correct that using a regular expression is the best way to approach this. Whenever I use regular expressions, I start out small and make sure my regular expression does what I expect at each step.

For example, can you write a regular expression to:
1) Remove <head>?
2) Remove <head>...</head>?
3) Remove <body withABunchOfAttributes>?
4) Remove </body>?
5) Combine steps 2-4? (Hint: you need to use grouping parens for this one if you want to do it in one regular expression.)
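
The steps above can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name, constant names, and sample HTML are hypothetical, and the patterns assume reasonably well-formed markup with no `>` characters inside attribute values.

```java
class StripStepByStep {
    // (?i) makes matching case-insensitive; (?s) is the inline form of DOTALL,
    // so '.' also matches line terminators.
    static final String HEAD_OPEN  = "(?i)<head\\b[^>]*>";                 // step 1
    static final String HEAD_BLOCK = "(?is)<head\\b[^>]*>.*?</head\\s*>";  // step 2
    static final String BODY_OPEN  = "(?i)<body\\b[^>]*>";                 // step 3
    static final String BODY_CLOSE = "(?i)</body\\s*>";                    // step 4
    // Step 5: the alternatives joined inside one (non-capturing) group.
    static final String COMBINED =
            "(?is)(?:<head\\b[^>]*>.*?</head\\s*>|</?body\\b[^>]*>)";

    /** Removes the whole head element and the body tags, keeping the body content. */
    static String strip(String html) {
        return html.replaceAll(COMBINED, "");
    }

    public static void main(String[] args) {
        String html = "<html>\n<HEAD>\n<title>t</title>\n</HEAD>\n"
                + "<body bgcolor=\"white\">\n<p>Hi</p>\n</body>\n</html>";
        // Check each step on its own before combining them.
        System.out.println(html.replaceAll(HEAD_OPEN, ""));
        System.out.println(html.replaceAll(HEAD_BLOCK, ""));
        System.out.println(html.replaceAll(BODY_OPEN, ""));
        System.out.println(html.replaceAll(BODY_CLOSE, ""));
        System.out.println(strip(html));
    }
}
```

The reluctant quantifier `.*?` stops at the first `</head>` instead of the last one, and `\b` keeps `<head` from also matching `<header>`.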

This sounds like a strange requirement. Do you really want to remove all the HTML rather than just the head and body tags? In particular do you want the <html> and <table> tags present?

Also, take a look at the Pattern.DOTALL flag since you are matching across multiple lines. I know about this flag, use it frequently, and still manage to forget it on my first shot most of the time.
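
To see why the flag matters, here is a small sketch (the class and method names are made up for illustration). By default `.` does not match line terminators, so a pattern that has to span several lines only succeeds once `Pattern.DOTALL` is set.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    /** Tries to find a head element without DOTALL; '.' stops at newlines. */
    static boolean foundWithoutFlag(String s) {
        return Pattern.compile("<head>.*</head>").matcher(s).find();
    }

    /** Same pattern with DOTALL; '.' now matches newlines as well. */
    static boolean foundWithFlag(String s) {
        return Pattern.compile("<head>.*</head>", Pattern.DOTALL).matcher(s).find();
    }

    public static void main(String[] args) {
        String multiLine = "<head>\n<title>t</title>\n</head>";
        System.out.println(foundWithoutFlag(multiLine)); // prints false
        System.out.println(foundWithFlag(multiLine));    // prints true
    }
}
```

When calling `String.replaceAll` directly, the same effect is available by starting the pattern with the inline flag `(?s)`.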
 
Maksim Ustinov
Greenhorn
Posts: 26
Thanks Jeanne for your response.
Yes, I do need to delete the <html> and </html> tags, but that's not a problem; the problem is with the <HEAD> tags.

Here is what I came up with to take out those tags but I'm not sure if this is correct.



Please let me know how it can be optimized so that it handles any number of spaces and newlines and ignores everything in between.
 
Maksim Ustinov
Greenhorn
Posts: 26
I just made a few modifications to my regex, and here is what I've got:



One small question: how do I modify the <head> part?
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34837
Maksim,
Are you trying to delete everything between the head tags? (I think that's what you are trying to accomplish, but the regular expression is way too complicated for that, so I second-guessed my understanding.)

This matches everything between the head tags regardless of what is in between:
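
A pattern along these lines might look like the sketch below. It is an assumption for illustration, not necessarily the expression from the original post; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadMatcher {
    // Opening <head ...>, then anything at all (reluctantly), then </head>.
    // CASE_INSENSITIVE accepts <HEAD>; DOTALL lets '.' cross line breaks.
    static final Pattern HEAD = Pattern.compile(
            "<head\\b[^>]*>(.*?)</head\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text between the head tags, or null if no head element is found. */
    static String headContent(String html) {
        Matcher m = HEAD.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the input with the whole head element removed. */
    static String removeHead(String html) {
        return HEAD.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><HEAD>\n<title>t</title>\n</HEAD><body>x</body></html>";
        System.out.println(headContent(html));
        System.out.println(removeHead(html));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression on every call, which `String.replaceAll` would otherwise do.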
 