
Help returning values from string using regex.

 
Cliff Karlsson
Greenhorn
Posts: 17
I just recently learned that I could use something like:


To get a web page's content saved to a variable or a file. But what is the easiest way to retrieve some info from the string/file? I have some basic knowledge of how regex works, but I mostly wonder how you use "capture groups", if that is the correct name for it.

If a file contains, for example:

"ddasdfno.,54543ddkaspd098304xz"

I know the regex is something like "\d+" for finding groups of digits. But how would I write some code to extract both groups of digits into new variables in Java?
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
First, capture groups are not something that can be explained in a few paragraphs. It would be a good idea to do some research first and come back here if you need clarification.

As for using the Scanner class with capture groups, it is pretty straightforward. If the regex is to be applied to a line, you can use the findInLine() method; otherwise you can use the findWithinHorizon() method. Once matched, you can use the match() method to get the result object, which in turn can be used to return all the groups that were captured in the last match.
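
A minimal sketch of that approach, applied to the sample string from the first post. The class name, the method name, and the choice of a two-group pattern are illustrative assumptions, not something taken from the original code:

```java
import java.util.Scanner;
import java.util.regex.MatchResult;

class ScannerGroups {

    // Returns the two digit runs captured from the line,
    // or an empty array if the pattern does not match.
    static String[] twoDigitGroups(String line) {
        try (Scanner scanner = new Scanner(line)) {
            // Group 1 and group 2 are runs of digits separated by at least one non-digit.
            String found = scanner.findInLine("(\\d+)\\D+(\\d+)");
            if (found == null) {
                return new String[0];
            }
            // match() exposes the groups captured by the last successful find.
            MatchResult result = scanner.match();
            return new String[] { result.group(1), result.group(2) };
        }
    }

    public static void main(String[] args) {
        String[] groups = twoDigitGroups("ddasdfno.,54543ddkaspd098304xz");
        System.out.println(groups[0]); // prints 54543
        System.out.println(groups[1]); // prints 098304
    }
}
```

If the number of digit groups is not known in advance, a single "\d+" pattern used in a loop with java.util.regex.Matcher.find() is probably the simpler choice.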

Henry
 
Cliff Karlsson
Greenhorn
Posts: 17
I am probably writing the code really inefficiently, but I have a new problem now. After saving a webpage to a file/string, some characters like " ' " are written as " ’ " in the file/string (exactly as shown if I select view source on the webpage).

I am trying to replace those segments, without success. I started thinking that there must be some way to encode the file correctly without using replaceAll(). Am I correct?

 
Cliff Karlsson
Greenhorn
Posts: 17
Hmm, the forum rendered the entity directly in my post. This is what I tried to enter between the second pair of quotation marks: &_#_8217,
 
Carey Brown
Saloon Keeper
Posts: 3329
46
Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows

On line 9, were you expecting the asterisk to be part of a regular expression, or the literal asterisk? The contains() method does not take a regular expression.

On line 10 you again use the asterisk. Since the asterisk is a regex metacharacter, you would need to escape it with a backslash to match it literally.

On line 1, the regex is greedy and will suck in everything until the last "</p>". I think you'll need to look up the syntax for the non-greedy equivalent.
 
Cliff Karlsson
Greenhorn
Posts: 17

This is what I tried, but the forum replaced the actual text: \&\#8217; (minus the backslashes) with "`".
Ironically, that is exactly what I am trying to do in the code.
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
143
It's a bad idea to try and replace HTML entities with your own replace calls. You're probably better off using a third party library such as Apache Commons, which includes a StringEscapeUtils.unescapeHtml() method.
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor

This regex may fail. If you have more than one <p></p> pair in the line, it will match all the way to the last </p>. As was said before, you want the non-greedy (lazy or reluctant) quantifier:

Pattern p = Pattern.compile("<p>(.+?)</p>");
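
To make the difference concrete, here is a small sketch that runs both quantifiers over the same input. The class name, the helper method, and the sample HTML are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphExtractor {

    // Greedy: .+ grabs as much as possible, so it runs to the last </p>.
    static final Pattern GREEDY = Pattern.compile("<p>(.+)</p>");
    // Reluctant: .+? grabs as little as possible, so it stops at the first </p>.
    static final Pattern RELUCTANT = Pattern.compile("<p>(.+?)</p>");

    // Collects capture group 1 from every match of the pattern.
    static List<String> captures(Pattern pattern, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>one</p><p>two</p>";
        System.out.println(captures(GREEDY, html));    // prints [one</p><p>two]
        System.out.println(captures(RELUCTANT, html)); // prints [one, two]
    }
}
```

Note that "." does not match line terminators by default, so a paragraph that spans several lines would also need Pattern.DOTALL.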
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor

As was said before, contains() takes a String as its argument, not a regex. 

if(list.get(i).contains(("& #8217;"))){ // without the space between the & and the #
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
143
Better safe than sorry, but I enjoy nitpicking anyway. If the HTML is valid, paragraphs shouldn't appear inside paragraphs anyway :P
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor

There are several things wrong with this code.  First, it is not a valid Java String.  There are no \& or \# escape codes in a String.  You might have meant this:

But there's no need to escape the & or the #; they are not metacharacters.   So we have this:

But this will fail at runtime because the regex is ill-formed. You have a quantifier (*) at the beginning that is quantifying nothing. You may have been trying to do this:

But this is wrong too.  You do not want to match anything before the ampersand and after the semicolon.  You probably just want this:

This will match the literal &#8217; and replace it with (´) everywhere it occurs in the String.
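
A minimal sketch of that final form, assuming the page text is held in a plain String. The class and method names are illustrative, and the replacement character is written as \u2019 (the right single quotation mark) as an assumption about what the entity should become:

```java
class EntityReplace {

    // replaceAll() takes a regex, but &, # and ; are not metacharacters,
    // so the entity can be written without any escaping.
    static String withReplaceAll(String text) {
        return text.replaceAll("&#8217;", "\u2019");
    }

    // replace() treats its arguments as literal text and also replaces
    // every occurrence, so no regex is involved at all.
    static String withReplace(String text) {
        return text.replace("&#8217;", "\u2019");
    }

    public static void main(String[] args) {
        String text = "It&#8217;s Cliff&#8217;s code";
        // contains() also takes literal text, not a regex.
        if (text.contains("&#8217;")) {
            System.out.println(withReplaceAll(text));
            System.out.println(withReplace(text));
        }
    }
}
```

Both methods give the same result here; replace() is the safer habit whenever the search text might contain regex metacharacters.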
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
Stephan van Hulst wrote:Better safe than sorry, but I enjoy nitpicking anyway. If the HTML is valid, paragraphs shouldn't appear inside paragraphs anyway :P

Ah, that's good to know.  Thanks for the info.
 
Carey Brown
Saloon Keeper
Posts: 3329
46
Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
Knute Snortum wrote:As was said before, contains() takes a String as its argument, not a regex. 
if(list.get(i).contains(("& #8217;"))){ // without the space between the & and the #

There are three possible HTML encodings of a right single quote:
"& #8217;"
"& #x2019;"
"& rsquo;"
Again, no actual space after the '&'.
As Knute points out, you don't want the backslashes.
Ref http://www.fileformat.info/info/unicode/char/2019/index.htm
If you get the text after it has been decoded, then the text should be in Unicode, which for Java would be "\u2019".
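
As a rough sketch of how capture groups could cover all three forms at once: the class name and pattern below are my own assumptions, and the code handles only numeric references plus the one named entity rsquo. For the full set of named entities a library is still the better route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RightQuoteDecoder {

    // Group 1: decimal reference, group 2: hex reference, group 3: the named entity rsquo.
    static final Pattern ENTITY =
            Pattern.compile("&(?:#(\\d{1,7})|#[xX]([0-9a-fA-F]{1,6})|(rsquo));");

    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint;
            if (m.group(1) != null) {
                codePoint = Integer.parseInt(m.group(1));
            } else if (m.group(2) != null) {
                codePoint = Integer.parseInt(m.group(2), 16);
            } else {
                codePoint = 0x2019; // rsquo is the right single quotation mark
            }
            // Leave the text untouched if the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("a&#8217;b&#x2019;c&rsquo;d");
        // All three entities decode to the same character.
        System.out.println(decoded.equals("a\u2019b\u2019c\u2019d")); // prints true
    }
}
```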
 