Reg Ex to remove html comments?

 
Chaz Andrews
Greenhorn
Posts: 23
Hello, I've been trying to write a regular expression that will let me use the replaceAll method to delete comments from a downloaded HTML page. However, I can't get it right. This is what I've got so far:

s = s.replaceAll("<!--.*-->", "");

Could someone tell me what I'm doing wrong? Thanks in advance.
 
Ulf Dittmer
Rancher
Posts: 43081
You don't say how this regexp fails, but there are at least two problems with it. Firstly, it doesn't handle multiple comments correctly: the regexp will delete everything from the start of the first comment to the end of the last comment. Read the javadocs of the Pattern class about the difference between greedy and reluctant quantifiers.
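
To illustrate the first point, here is a small sketch comparing the greedy `.*` with the reluctant `.*?` on a single line that contains two comments. The class name `GreedyVsReluctant`, the method names, and the sample string are made up for illustration.

```java
public class GreedyVsReluctant {
    // Greedy: .* grabs as much as possible, so the match runs from the
    // first "<!--" to the last "-->" on the line.
    static String stripGreedy(String s) {
        return s.replaceAll("<!--.*-->", "");
    }

    // Reluctant: .*? stops at the first "-->" it can reach,
    // so each comment is matched on its own.
    static String stripReluctant(String s) {
        return s.replaceAll("<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><!-- one --><p>b</p><!-- two --><p>c</p>";
        System.out.println(stripGreedy(html));    // <p>a</p><p>c</p>
        System.out.println(stripReluctant(html)); // <p>a</p><p>b</p><p>c</p>
    }
}
```

Note that the greedy version silently drops the `<p>b</p>` that sits between the two comments.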

Secondly, it may not correctly handle comments that span multiple lines. I'm not sure what the default way of handling multiline matches is for the java.util.regex package; make sure it uses the correct setting of the Pattern.MULTILINE flag. You may need to use Pattern and Matcher explicitly in order to set that (instead of using String.replaceAll, which doesn't expose flags).
[ August 11, 2007: Message edited by: Ulf Dittmer ]
 
Chaz Andrews
Greenhorn
Posts: 23
Sorry for not saying what was going wrong! It simply wasn't doing anything at all, not even taking out part of a comment. Thank you for your suggestions, I'll get to reading.
 
Jim Yingst
Wanderer
Posts: 18671
If you look at the javadoc for Pattern.MULTILINE, it only affects the meaning of ^ and $, which aren't relevant for Chaz's expression. The one that is relevant here is Pattern.DOTALL. Note that it's not necessary to use the Pattern and Matcher classes directly: you can also use the poorly-documented flags allowed by the (?idmsux) construct (look at Pattern's "special constructs" section). To get the effect of Pattern.DOTALL, just use (?s). Quoth the API: "The s is a mnemonic for 'single-line' mode, which is what this is called in Perl." Obviously this mnemonic suggests the complete opposite of what the flag actually does, which suggests that someone in Perl regex development was smoking crack or something, and the Java folks also failed to fix this idiotic error. Anyway, if you ignore the drug-addled etymology, (?s) does what you want here. To give a more complete solution, try:
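
A minimal sketch of the approach described above, not necessarily the exact snippet originally posted. It assumes the page is already held in a String; the class name `StripHtmlComments` and its method names are hypothetical. It shows the inline `(?s)` flag and, for comparison, the same pattern compiled with `Pattern.DOTALL`.

```java
import java.util.regex.Pattern;

public class StripHtmlComments {
    // Inline flag: (?s) turns on DOTALL so '.' also matches line terminators;
    // .*? is reluctant, so each comment is matched separately.
    static String stripInline(String s) {
        return s.replaceAll("(?s)<!--.*?-->", "");
    }

    // Same pattern, with the flag passed explicitly to Pattern.compile.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String stripWithFlag(String s) {
        return COMMENT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><!-- spans\ntwo lines --><p>b</p><!-- c -->";
        System.out.println(stripInline(html));   // <p>a</p><p>b</p>
        System.out.println(stripWithFlag(html)); // <p>a</p><p>b</p>
    }
}
```

Compiling the pattern once is the more natural choice if many pages are processed; for a one-off call the inline flag keeps everything inside String.replaceAll.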

[ August 12, 2007: Message edited by: Jim Yingst ]
 
Chaz Andrews
Greenhorn
Posts: 23
Jim, thanks a lot, you hit the nail on the head with that reply. In case anyone is reading this post in future, you put the question mark and s the wrong way around, so it should be:

s = s.replaceAll("(?s)<!--.*?-->", "");
 
Jim Yingst
Wanderer
Posts: 18671
Oops! Good thing I got it right in the accompanying text, so you could work out what I meant. I've corrected my post above, for the benefit of future readers.
 
Greenhorn
Posts: 3

Chaz Andrews wrote: Jim, thanks a lot, you hit the nail on the head with that reply. In case anyone is reading this post in future, you put the question mark and s the wrong way around, so it should be:

s = s.replaceAll("(?s)<!--.*?-->", "");

It works very well. Thank you.
 