
Regex where some parts match on multiple lines, but other parts don't?

 
Jimi Svedenholm
Ranch Hand
Posts: 53
Hi,

I'm having problems writing a regex that puts <p>...</p> tags around each line, except for lines (i.e. one or more lines) that make up a <table>...</table>.

Example input:
------
abc
<table></table>
abcd
1234
<table></table>
t
<table>
...
</table>
123 456
------

Wanted output:

------
<p>abc</p>
<table></table>
<p>abcd</p>
<p>1234</p>
<table></table>
<p>t</p>
<table>
...
</table>
<p>123 456</p>
------

First I tried this:



But that gave me:

------
<p>abc</p>
<table></table>
<p>abcd</p>
<p>1234</p>
<table></table>
<p>t</p>
<p><table></p>
<p>...</p>
<p></table></p>
<p>123 456</p>
------


So I figured that I should use the dotall embedded flag (?s) also somehow, to make the table part also notice tables on multiple lines. First I tried putting the (?s) at the beginning of the regex string, but that only resulted in a single <p>...</p> enclosing the entire String.

I then tried to make only the negative lookahead part use the dotall. Unsure of the correct syntax I tried this:



But that gave me exactly the same output as the first regex above.

Can someone see if I have made some simple mistake here? Or is it not possible to solve this with a single regex replacement? I would really like to solve this with a regex, because the alternative would be to redesign some code that is part of a larger system, which currently takes regex strings and their replacement strings as the only way to manipulate the text.

Regards
/Jimi
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 37465
Jimi,
Take a look at Pattern.DOTALL. It's used to treat line breaks as regular characters.

You'll need to use the longer replace functionality - create a pattern and matcher to pass the pattern flag.
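
As a rough illustration of the Pattern/Matcher route described here, the sketch below compiles a pattern with and without Pattern.DOTALL and runs a replacement through a Matcher. The class name, the helper method, the sample input, and the "[table]" marker are all made up for this example; they are not from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllFlagDemo {
    // Hypothetical helper: replaces each <table>...</table> block with a marker,
    // compiling the pattern with the given flags.
    static String collapseTables(String input, int flags) {
        Pattern pattern = Pattern.compile("<table>.*?</table>", flags);
        Matcher matcher = pattern.matcher(input);
        return matcher.replaceAll("[table]");
    }

    public static void main(String[] args) {
        String input = "a\n<table>\n...\n</table>\nb";
        // Without DOTALL, '.' stops at line breaks, so the multi-line table is not matched.
        System.out.println(collapseTables(input, 0));
        // With DOTALL, '.' also matches line breaks, so the whole table is matched.
        System.out.println(collapseTables(input, Pattern.DOTALL));
    }
}
```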
 
Jimi Svedenholm
Ranch Hand
Posts: 53
Jeanne Boyarsky wrote:
Jimi,
Take a look at Pattern.DOTALL. It's used to treat line breaks as regular characters.

You'll need to use the longer replace functionality - create a pattern and matcher to pass the pattern flag.


Ok, so what you're saying is that what I want to do can't be done in a single replaceAll function call? Using a Pattern and a Matcher would require a total redesign, but I guess I just have to bite the bullet then. Thanks for your input, Jeanne.

/Jimi
 
Mike Simmons
Ranch Hand
Posts: 3090
No, I think Jeanne missed where you'd already talked about using (?s), which is a flag equivalent to Pattern.DOTALL. Using that may be part of the solution, but your problem is more complex than that.
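
To make the equivalence concrete, here is a small sketch comparing the compile-time flag, the embedded (?s) flag in a plain replaceAll call, and the scoped form (?s:...) that switches dotall on for only one part of the expression. The class name, method names, sample input, and marker strings are invented for illustration.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // Flag passed as a constant to Pattern.compile.
    static String withConstant(String s) {
        return Pattern.compile("<table>.*?</table>", Pattern.DOTALL)
                      .matcher(s)
                      .replaceAll("[table]");
    }

    // Same flag embedded in the regex, so plain String.replaceAll is enough.
    static String withInlineFlag(String s) {
        return s.replaceAll("(?s)<table>.*?</table>", "[table]");
    }

    // Flag scoped to one group: only the '.' inside (?s:...) crosses line breaks;
    // the trailing ".*" still stops at the end of the line.
    static String withScopedFlag(String s) {
        return s.replaceAll("<table>(?s:.*?)</table> .*", "[table and tail]");
    }

    public static void main(String[] args) {
        String input = "<table>\n1\n</table> tail\nnext";
        System.out.println(withConstant(input));
        System.out.println(withInlineFlag(input));
        System.out.println(withScopedFlag(input));
    }
}
```

The first two calls should produce the same text, which is why no switch to Pattern and Matcher is needed just to get dotall behaviour.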

Jimi wrote: First I tried putting the (?s) at the beginning of the regex string, but that only resulted in a single <p>...</p> enclosing the entire String.

Well, the next step might be to use a reluctant quantifier like ".*?" rather than a greedy quantifier like ".*". That might help you match a start tag to its closest matching end tag, rather than the farthest one.
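
A quick sketch of the difference, using a made-up input with two separate tables (the class and helper names are hypothetical): with dotall on, the greedy form runs from the first <table> to the last </table>, while the reluctant form stops at the nearest one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<table>\na\n</table>\nmiddle\n<table>\nb\n</table>";
        // Greedy: matches from the first <table> through the LAST </table>.
        System.out.println(firstMatch("(?s)<table>.*</table>", input));
        System.out.println("----");
        // Reluctant: matches from the first <table> through the NEAREST </table>.
        System.out.println(firstMatch("(?s)<table>.*?</table>", input));
    }
}
```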

But deeper problems await.

It's possible to nest tables within other tables - if this happens, how do you know which </table> tag matches the first <table ...> tag you see? This is a nontrivial problem. If we say that tables will never be nested in more than n other tables, it's possible to write a single hideously-complex regex that can match the tags up as desired. The bigger n is, the more complex the expression. Try it for n = 1, then n = 2. It's not pretty. I would probably shoot all my coworkers and my employer before writing one for n = 5 or 6.

Furthermore, I think it's impossible to write a single regex that can handle any nesting depth n. I'm pretty sure it was proven impossible for standard, "classic" regexes (those available in the early '90s, at least). And offhand, I don't see any way to do it even with the more elaborate regexes now available to us in Java and some other languages/libraries. Nestable structures are not a good fit for regular expressions.

If you can guarantee that your input will never have nested tables, great. Or maybe never nested more than 1 deep, OK. Or 2 deep... nah, it's better if I don't even go there. That way lies madness.
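
For the simple case only, here is one possible single-regex sketch that fits a plain replaceAll call. It is not the regex from the original post, and it relies on assumptions that go beyond what the thread states: tables are never nested, every <table tag starts at the beginning of a line, nothing else shares a line with a table tag, and tags are lowercase. The idea is to skip a line if it starts a table, or if a </table> can be reached from it without first passing another <table (meaning the line sits inside a table). The class and constant names are made up.

```java
class ParagraphWrapDemo {
    // (?m)                  ^ and $ match at line boundaries
    // (?!<table)            skip lines that open a table
    // (?!(?s:...</table>))  skip lines from which a </table> is reachable
    //                       without crossing a new <table, i.e. lines inside a table
    // (.+)                  the line itself (blank lines are left alone)
    static final String REGEX =
        "(?m)^(?!<table)(?!(?s:(?:(?!<table).)*?</table>))(.+)$";
    static final String REPLACEMENT = "<p>$1</p>";

    static String wrap(String input) {
        return input.replaceAll(REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        String input = "abc\n<table></table>\nabcd\n1234\n<table></table>\n"
                     + "t\n<table>\n...\n</table>\n123 456";
        System.out.println(wrap(input));
    }
}
```

Run on the example input from the first post, this should give the wanted output. Note that the lookahead rescans ahead from every candidate line, so it can be slow on large inputs, and it breaks as soon as any of the assumptions above stops holding.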

Ultimately, I think you'll probably be better off if you forget regexes for this problem, and use some library designed to parse HTML (or XML) instead. I haven't looked into this in a few years, but my first thought would be to use HTML Tidy to convert your HTML to valid XHTML. Then use XML tools to perform the transformations you need. Or maybe HTML Tidy can do what you want directly. This seems like the sort of thing that it works well for, but I'm not very familiar with its current capabilities. Still, if this were my problem to solve, HTML Tidy would be one of my first stops for further study.
 