Regular Expression to filter text from html file

 
Nitin Menon
Ranch Hand
Posts: 87
I need to extract the content words and keywords from a group of .html files. That is, I need everything in an HTML file except the HTML tags and whatever is written inside them. The one exception is the meta tag: there I need to extract the keywords specified in it. I have tried a few things, but they are not leading anywhere. Can anyone please help?
Thanks in advance!
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34870
What did you try?

I usually build up my regular expressions gradually. Can you match:
  • An open tag
  • A close tag
  • Both
  • An attribute
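
One way to build the expressions up gradually is to write one small pattern per item in the list and test each on its own. The sketch below is only an illustration: the class name `TagPatterns` and the helper `count` are made up for this example, and the attribute pattern assumes double-quoted values only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; one small pattern per building block.
class TagPatterns {
    // An open tag such as <p> or <a href="x">: "<", a tag name, anything up to ">".
    static final Pattern OPEN_TAG = Pattern.compile("<[A-Za-z][A-Za-z0-9]*\\b[^>]*>");

    // A close tag such as </p>.
    static final Pattern CLOSE_TAG = Pattern.compile("</[A-Za-z][A-Za-z0-9]*\\s*>");

    // Both: the slash after "<" becomes optional.
    static final Pattern ANY_TAG = Pattern.compile("</?[A-Za-z][A-Za-z0-9]*\\b[^>]*>");

    // An attribute of the form name="value" (assumption: double quotes only).
    static final Pattern ATTRIBUTE = Pattern.compile("([A-Za-z][\\w-]*)\\s*=\\s*\"([^\"]*)\"");

    // Counts the non-overlapping matches of a pattern in the input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Hi</p>";
        System.out.println("open tags:  " + count(OPEN_TAG, html));
        System.out.println("close tags: " + count(CLOSE_TAG, html));
        System.out.println("all tags:   " + count(ANY_TAG, html));
    }
}
```

Once each small pattern behaves as expected on a few sample strings, they can be combined into larger ones.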
    Martin Vajsar
    Sheriff
    Posts: 3752
    Wouldn't an HTML parser be better suited to the task? I personally wouldn't want to maintain code that parses HTML using regular expressions.

    I have no experience with HTML parsers personally, but searching for "Java HTML parser" turns up some promising links.
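
For illustration only, here is a rough sketch of the parser route using the HTML parser that ships with the JDK in `javax.swing.text.html`. The thread does not name a particular parser, so this choice is an assumption, and the class name `ParsedPage` is made up for the example. The sketch collects whatever text the parser reports and reads the keywords from `<meta name="keywords">`; it does not treat script or style contents specially.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical holder for the two things the original question asks for.
class ParsedPage {
    final List<String> keywords = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    static ParsedPage parse(String html) {
        final ParsedPage page = new ParsedPage();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Text between tags arrives here, one chunk at a time.
                page.text.append(data).append(' ');
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> has no closing tag, so it is reported as a simple tag.
                if (tag != HTML.Tag.META) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && "keywords".equalsIgnoreCase(name.toString())) {
                    for (String keyword : content.toString().split(",")) {
                        String trimmed = keyword.trim();
                        if (!trimmed.isEmpty()) {
                            page.keywords.add(trimmed);
                        }
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page;
    }
}
```

A dedicated third-party HTML parser would likely cope better with real-world pages; the point here is only that the tag handling is left to the parser instead of to hand-written patterns.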
     
    Nitin Menon
    Ranch Hand
    Posts: 87
    Sorry for the late reply; I was away. I found a solution: I wrote the regular expressions as a series of steps.
    Thank you, Martin and Jeanne!
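
The thread does not show the expressions that were actually used. As a hedged sketch of what a step-by-step approach could look like, the code below first pulls the keywords out of the meta tag, then strips script and style blocks, comments, and finally any remaining tags. The class name `HtmlTextFilter` and all patterns are assumptions made for this example. It expects the `name` attribute to come before `content` in the meta tag and does not decode entities such as `&amp;`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class illustrating a series of small regex steps.
class HtmlTextFilter {
    // Step 1: <meta name="keywords" content="..."> (name before content).
    private static final Pattern META_KEYWORDS = Pattern.compile(
            "<meta\\s+name\\s*=\\s*[\"']keywords[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Step 2: whole <script>...</script> and <style>...</style> blocks.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 3: comments.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Step 4: any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the comma-separated keywords of every keywords meta tag.
    static List<String> extractKeywords(String html) {
        List<String> keywords = new ArrayList<>();
        Matcher m = META_KEYWORDS.matcher(html);
        while (m.find()) {
            for (String keyword : m.group(1).split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    // Returns the text left after removing blocks, comments and tags.
    static String extractText(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Collapse the whitespace left behind by the removed markup.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" content=\"java, regex\">"
                + "</head><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractKeywords(html));
        System.out.println(extractText(html));
    }
}
```

The order of the steps matters: the keywords have to be read before the tags are stripped, and script and style blocks have to go before the generic tag step so that their contents do not end up in the text.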
     