a regular expression search and replace program

 
Ranch Hand
Posts: 49
Please help!

I am parsing an HTML page using regular expressions and replacing each value found with a new value.

The code is:


My problem is that when I check the text (the HTML of the web page) after the replacement has been done, nothing has changed. The regular expressions do find what I need (they worked for some sites), and the method check_res_url() does work.
Any suggestions would be greatly appreciated.
 
Ranch Hand
Posts: 266
Couple of questions:
- why are you escaping single quotes?
- what is the 'd' in (?id)?
- why do you add a flag like (?i) and also add Pattern.CASE_INSENSITIVE?
- why are you using Pattern.MULTILINE? (what do you think it does?)
- can you provide an SSCCE?

The last question is IMO the most important, but I'd like to see them all answered of course.
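
For reference on the second and third questions, a minimal sketch (not the poster's code; the class name FlagDemo and the sample inputs are made up for illustration). In java.util.regex the embedded flag (?i) has the same effect as passing Pattern.CASE_INSENSITIVE, and the d in (?id) is the embedded form of Pattern.UNIX_LINES, which makes \n the only line terminator recognised by ., ^ and $.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Case-insensitivity requested inside the expression itself.
    static final Pattern INLINE = Pattern.compile("(?i)href");
    // The same request made through the compile flag instead.
    static final Pattern CONSTANT = Pattern.compile("href", Pattern.CASE_INSENSITIVE);

    static boolean findsInline(String input) {
        return INLINE.matcher(input).find();
    }

    static boolean findsWithConstant(String input) {
        return CONSTANT.matcher(input).find();
    }

    // By default '.' refuses to match '\r' because it is a line terminator.
    // With (?d) only '\n' counts as one, so '.' may match '\r'.
    static boolean dotMatchesCarriageReturn(boolean unixLines) {
        String regex = unixLines ? "(?d)." : ".";
        return Pattern.matches(regex, "\r");
    }

    public static void main(String[] args) {
        System.out.println(findsInline("<a HREF=x>"));        // true
        System.out.println(findsWithConstant("<a HREF=x>"));  // true
        System.out.println(dotMatchesCarriageReturn(false));  // false
        System.out.println(dotMatchesCarriageReturn(true));   // true
    }
}
```

So one of the two case-insensitivity switches is redundant, and the d flag is probably not doing anything useful for this task.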
 
mj zammit
Ranch Hand
Posts: 49
It's true my regular expressions are not perfect.
As you pointed out, why bother placing (?i) when I already have Pattern.CASE_INSENSITIVE in place.
I am new to regular expressions, and those in the code provided are trial and error for now.
Even though they are not perfect, I am using them so I can build a working program first, and I will refine them later.

For Pattern.MULTILINE, I am still uncertain what it provides. If I understood correctly, when the string I am searching for is spread over 2 lines, Pattern.MULTILINE takes that into account.

As for why I am escaping single quotes: since HTML is not always well formed, I am assuming that attribute values can be found between " " (double quotes), between ' ' (single quotes), or with no quotes whatsoever.
Example: href = "value" or href = 'value' or href = value

But my question isn't about regular expressions. It is about how to replace a found match with a new string. I know I must use the replaceAll() method, but I don't think the logic in the code sample provided is correct.

What I am trying to do is parse an HTML page, for example:
<center>
<div align="center">

</div>
</center>

<table border="0">


I use the regular expression
(?id)<a\\s+(.*?)/?>
to find a tag. In this case I am looking for the <a> tag.
m4.group()
will give me the tag, which from the following example would be

I then parse this string to find the href attribute value using the second regular expression
HREF\\s*=\\s*(\"([^\"]+)\"|\'([^']+)\'|[^'\"])

m2.group(1)
would give me the attribute value.

This is the string I would like to change. For example, I want to change it to "www.hello.com".
But when I do
String text1 = m2.replaceAll(new_res_url);
the HTML page contained in the variable "text" is not automatically changed, and I don't seem to know how to do that.

Any suggestions on this matter will be greatly appreciated.
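
A likely explanation, sketched under assumptions because the original code is not shown: Java strings are immutable, and Matcher.replaceAll() returns a new String built from the input the matcher was created on. If m2 was created on the tag returned by m4.group(), then text1 is only a modified copy of that tag and "text" is never touched. replaceAll() also replaces the whole match (HREF = "value"), not just group 1. The class name ReplaceAllDemo, the simplified pattern and the sample HTML below are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Simplified stand-in for the href expression in the post (double quotes only).
    static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Mirrors the call in the post: the matcher works on the tag, not on the page.
    static String replaceInTag(String tag, String newResUrl) {
        Matcher m2 = HREF.matcher(tag);
        return m2.replaceAll(newResUrl);
    }

    public static void main(String[] args) {
        String text = "<p><a href=\"http://old.example/\">link</a></p>";
        String tag = "<a href=\"http://old.example/\">"; // what m4.group() would return

        String text1 = replaceInTag(tag, "www.hello.com");

        System.out.println(text1); // <a www.hello.com>  (the whole attribute is gone)
        System.out.println(text);  // unchanged: nothing ever wrote to this variable
    }
}
```

To change the page, the result has to be assigned back to "text", and the replacement has to rebuild the attribute around the new value instead of overwriting the whole match.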
 
Piet Verdriet
Ranch Hand
Posts: 266
posted 8 years ago
Thanks for the extra info, but can you post an SSCCE and explain what's going wrong? (an SSCCE is something I can simply copy and run on my machine, see the URL I linked to)
 
Piet Verdriet
Ranch Hand
Posts: 266
mj zammit wrote: It's true my regular expressions are not perfect.
As you pointed out, why bother placing (?i) when I already have Pattern.CASE_INSENSITIVE in place.


Okay, then leave one of them out.

mj zammit wrote: I am new to regular expressions, and those in the code provided are trial and error for now.


No problem.

mj zammit wrote: Even though they are not perfect, I am using them so I can build a working program first, and I will refine them later.


That is not a good way to learn, IMO. First get a better grasp of them: just guessing and solving things by trial and error is not the way to go. All IMO of course!

mj zammit wrote: For Pattern.MULTILINE, I am still uncertain what it provides. If I understood correctly, when the string I am searching for is spread over 2 lines, Pattern.MULTILINE


No, that is not what it does. In fact, it does something that is nearly the opposite of what you think it does. For an explanation, see the paragraph "Using ^ and $ as Start of Line and End of Line Anchors" from: http://www.regular-expressions.info/anchors.html
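
To make that concrete, a small sketch (class name MultilineDemo and the sample inputs are invented for illustration): Pattern.MULTILINE only changes where ^ and $ may match, so they also match at the start and end of each line. Whether . can cross a line break is controlled by a different flag, Pattern.DOTALL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Counts how many times "<a" is found at a position where ^ matches.
    static int countLineStartTags(String input, boolean multiline) {
        Pattern p = Pattern.compile("^<a", multiline ? Pattern.MULTILINE : 0);
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Checks whether a tag whose attributes start on the next line is found.
    static boolean findsTag(String input, boolean dotall) {
        Pattern p = Pattern.compile("<a.+?>", dotall ? Pattern.DOTALL : 0);
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String twoTags = "<a href='x'>\n<a href='y'>";
        System.out.println(countLineStartTags(twoTags, false)); // 1
        System.out.println(countLineStartTags(twoTags, true));  // 2

        String splitTag = "<a\nhref='x'>";
        System.out.println(findsTag(splitTag, false)); // false
        System.out.println(findsTag(splitTag, true));  // true
    }
}
```

Since the expressions in this thread contain no ^ or $, Pattern.MULTILINE has no effect on them.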

mj zammit wrote: As for why I am escaping single quotes: since HTML is not always well formed, I am assuming that attribute values can be found between " " (double quotes), between ' ' (single quotes), or with no quotes whatsoever.


No, I meant why do you add a backslash in front of a single quote? It is not necessary.
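
A quick check of that point (the class name QuoteEscapeDemo and the sample input are made up for illustration): inside a Java string literal, \' and ' produce the same character, so both spellings compile to the same regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteEscapeDemo {
    // The single-quote branch as written in the thread, with backslashes.
    static final String WITH_BACKSLASH = "\'([^']+)\'";
    // The same branch without the unnecessary backslashes.
    static final String WITHOUT_BACKSLASH = "'([^']+)'";

    // Returns the text between single quotes, or null if nothing matched.
    static String singleQuotedValue(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(WITH_BACKSLASH.equals(WITHOUT_BACKSLASH));             // true
        System.out.println(singleQuotedValue(WITHOUT_BACKSLASH, "href = 'value'")); // value
    }
}
```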

mj zammit wrote: But my question isn't about regular expressions.


I beg to differ. The reason why "it doesn't work" is most probably because you are not using the regex-api properly.
 
author
Sheriff
Posts: 23295
That is not a good way to learn, IMO. First get a better grasp of them: just guessing and solving things by trial and error is not the way to go. All IMO of course!


Agreed. Regexes can get convoluted really quickly, even in the hands of an experienced user. You can create a really big mess by using trial and error.

Henry
 
mj zammit
Ranch Hand
Posts: 49
I am managing to find the tags and the attribute with the regular expressions I have (for the time being, of course).
But what I don't understand is why the found attribute value is not being replaced by an optimized string in the HTML page itself (i.e. "text").
 
Piet Verdriet
Ranch Hand
Posts: 266
mj zammit wrote: I am managing to find the tags and the attribute with the regular expressions I have (for the time being, of course).
But what I don't understand is why the found attribute value is not being replaced by an optimized string in the HTML page itself (i.e. "text").


mj zammit, you seem not to be reading my replies. That is too bad, because I cannot help you if you can't answer the questions I pose to you. I have now asked you twice to post an SSCCE so that I can see exactly what you see. The problem description above tells the people wanting to help you absolutely nothing. They don't see what input text you're working with, they don't see what regex you're using, and they don't know what IS and what ISN'T matched. They don't know what you want it to match instead. And so on...

Help us help you!!
 
mj zammit
Ranch Hand
Posts: 49
I have a variable "text".
This variable contains the HTML of a web page.
I want to parse this HTML page to find the href values of each <a> tag.

My algorithm is as follows:
1. Get the HTML of a web page.
2. Place the HTML content of the web page in the variable "text".
3. Parse the HTML content to find the <a> tags. This is done using the regular expression "(?id)<a\\s+(.*?)/?>"
4. For each <a> tag found, parse it again to find the attribute value of href. This is done using the regular expression "HREF\\s*=\\s*(\"([^\"]+)\"|\'([^']+)\'|[^'\"])"
5. Get href's attribute value. This is done in the following line: res_url = m2.group(1);
6. Change the variable res_url. This is done using the method check_res_url().
7. Replace href's attribute value with the variable new_res_url. This is done using m2.replaceAll(new_res_url);

This replacement must show up in the variable "text" (the HTML of the web page).
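
The steps above could be sketched roughly as follows. This is not the code from the thread, which is not shown: the class name HrefRewriter, the simplified patterns and the UnaryOperator parameter standing in for check_res_url() are all assumptions made for illustration. The idea is to rebuild the page with Matcher.appendReplacement() and Matcher.appendTail(), so every rewritten tag ends up in the returned string, which the caller then assigns back to "text".

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRewriter {
    // Simplified stand-ins for the two expressions in the post.
    private static final Pattern A_TAG =
            Pattern.compile("<a\\s+[^>]*>", Pattern.CASE_INSENSITIVE);
    // Group 1: "href =" prefix. Group 2: the value, quoted or unquoted.
    private static final Pattern HREF = Pattern.compile(
            "(href\\s*=\\s*)(\"[^\"]*\"|'[^']*'|[^\\s>]+)", Pattern.CASE_INSENSITIVE);

    // Returns a copy of text in which every href value inside an <a> tag
    // has been passed through checkResUrl (hypothetical check_res_url()).
    static String rewrite(String text, UnaryOperator<String> checkResUrl) {
        Matcher tags = A_TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (tags.find()) {
            String newTag = rewriteTag(tags.group(), checkResUrl);
            // quoteReplacement stops '$' and '\' in URLs being read as group references.
            tags.appendReplacement(out, Matcher.quoteReplacement(newTag));
        }
        tags.appendTail(out); // copy whatever follows the last tag
        return out.toString();
    }

    private static String rewriteTag(String tag, UnaryOperator<String> checkResUrl) {
        Matcher href = HREF.matcher(tag);
        StringBuffer out = new StringBuffer();
        while (href.find()) {
            String value = href.group(2);
            String quote = "";
            char first = value.charAt(0);
            boolean quoted = (first == '"' || first == '\'')
                    && value.length() >= 2
                    && value.charAt(value.length() - 1) == first;
            if (quoted) {
                // Remember the quote style and strip it from the value.
                quote = String.valueOf(first);
                value = value.substring(1, value.length() - 1);
            }
            String rebuilt = href.group(1) + quote + checkResUrl.apply(value) + quote;
            href.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        href.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "<p><a href=\"http://old.example/\">link</a></p>";
        text = rewrite(text, url -> "www.hello.com"); // assign the result back
        System.out.println(text); // <p><a href="www.hello.com">link</a></p>
    }
}
```

For real-world pages an HTML parser is usually more robust than regular expressions, but the sketch keeps the two-pass structure described in the list.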



When I then check the HTML page after this method has run, no replacements were made where there should have been. Example instead of having
I wanted to see in the html page.

The regular expressions that I am using work for me (for now; I still have to go over them). But replaceAll() is not doing what I expected it to do. I am not sure whether it has something to do with the algorithm of my program.

 
Piet Verdriet
Ranch Hand
Posts: 266
posted 8 years ago
No, that still is no SSCCE.
Forget it, I can't help you.

Best of luck though.
 