Help needed for a regex

 
Keshav Khedkar
Greenhorn
Hi all,
This is the regex I am using: (<div)\s+id="article_body">(.*?((\1((.*?(\5|\8).*?|.*?)|.*?)\9)|(\1[^>]*?/>))){1,}(</div>)
to extract the complete <div id='article_body'> element. This element can contain other <div> tags as well as other tags, and there can be other <div> tags before or after it. My expression is not accurate.
The contents are as follows:

// Snip

Please help me get the right regex.
Thanks in advance.
Regards,
kk.
 
Wouter Oet
Saloon Keeper
Hi Keshav,

Please don't post huge amounts of code. Try to give a small example that explains your problem. Also tell us what you expected to happen and what actually happened.
 
Henry Wong
author
Marshal
Also, the regex provided doesn't seem to make much sense. I can't figure out the purpose of all the backreferences.

Henry
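
To make the point about backreferences concrete, here is a minimal sketch (the class name BackrefDemo and its helper are made up for illustration). In java.util.regex, \1 matches the exact text that group 1 captured; it does not re-apply group 1's sub-pattern, so it cannot express recursion into nested tags. Groups are numbered by the position of their opening parenthesis, counted left to right. By that count, \9 in the posted regex appears to point at the trailing (</div>) group, which has not captured anything yet at the point where \9 is evaluated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefDemo {
    // Returns true when the whole input matches the regex.
    public static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \1 repeats the text captured by group 1, not the pattern of group 1.
        System.out.println(matches("(<div|<p)x\\1", "<divx<div")); // true
        System.out.println(matches("(<div|<p)x\\1", "<divx<p"));   // false

        // Group numbers follow the order of the opening parentheses.
        Matcher m = Pattern.compile("((a)(b))(c)").matcher("abc");
        if (m.matches()) {
            // prints: ab a b c
            System.out.println(m.group(1) + " " + m.group(2) + " "
                    + m.group(3) + " " + m.group(4));
        }
    }
}
```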
 
Keshav Khedkar
Greenhorn
Hi all,
What I want is the whole <div id='article_body'> element from the contents of the attached file. The regex I provided tries to account for nesting: this element can be nested inside other <div> tags, and other <div> tags can be nested inside it.
My expression gives wrong results: it extracts the contents from article_body up to either the first </div> tag or the last </div> tag. Both are wrong; the extracted contents should end at the </div> tag that closes <div id='article_body'>.
I have numbered the groups in the regex from left to right (I don't know whether that is the right order).
The cases may be:
1) There are no tags inside the article_body element.
2) Nested tags, such as <div id='parent'><div id='article_body'><div>sss</div>ssdd<div><div />sfdfd</div></div></div>

To handle the nesting I used backreferences to groups.
Alternative solutions, such as a good open-source HTML parser, are also welcome; please suggest an HTML parser.
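
One way to get the matching </div> without a single giant regex is to find the opening tag and then count nesting depth while scanning the <div> and </div> tags that follow. The sketch below is only an illustration under some assumptions: the class and method names (DivExtractor, extractDiv) are made up, the target tag is assumed not to be self-closing, and comments, scripts, or attribute values containing ">" are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivExtractor {
    // Either an opening <div ...> (group 1 is "/" when self-closing, "" otherwise)
    // or a closing </div> (group 1 is null).
    private static final Pattern DIV_TAG =
            Pattern.compile("<div\\b[^>]*?(/?)>|</div\\s*>", Pattern.CASE_INSENSITIVE);

    // Returns the complete <div id="..."> element including nested divs,
    // or null when the element or its closing tag cannot be found.
    public static String extractDiv(String html, String id) {
        Pattern open = Pattern.compile(
                "<div\\b[^>]*\\sid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"'][^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher start = open.matcher(html);
        if (!start.find()) {
            return null;
        }
        int depth = 1; // we are inside the target div
        Matcher tag = DIV_TAG.matcher(html);
        boolean found = tag.find(start.end());
        while (found) {
            String selfClosing = tag.group(1);
            if (selfClosing == null) {
                depth--; // closing </div>
                if (depth == 0) {
                    return html.substring(start.start(), tag.end());
                }
            } else if (selfClosing.isEmpty()) {
                depth++; // nested opening <div ...>
            }
            // self-closing <div /> leaves the depth unchanged
            found = tag.find();
        }
        return null; // unbalanced markup
    }

    public static void main(String[] args) {
        String html = "<div id='parent'><div id='article_body'><div>sss</div>ssdd"
                + "<div><div />sfdfd</div></div></div>";
        // prints: <div id='article_body'><div>sss</div>ssdd<div><div />sfdfd</div></div>
        System.out.println(extractDiv(html, "article_body"));
    }
}
```

For real-world pages, an HTML parser is the more robust route. jsoup is one open-source option for Java; it parses the page into a document tree from which the element can be looked up by id.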
 