Regex

 
abalfazl hossein
Ranch Hand
Posts: 635


012456


My question is:
The start() method gives the index in the text where the found match starts.
\d* means that if a number is found, its index should be returned.

Why doesn't it print 3? 12345

Another question:
Why does it print 6?

0>a
.
.
.
5>f

What is this 6 in the output?

Thanks in advance
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
This is a hard concept to grasp with regexes. "\d*" means match zero or more digits. So when you find() with "ab34ef", you match at index 0 (the first character). Why? Because right at the first character there are zero digits. Do you see them? No? That's because there are zero of them. Why is start() = 6? Because at the very end of "ab34ef" there are zero digits.

What you probably mean is "\d+", which matches one or more digits. Now start() returns 2.
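
To make the difference concrete, here is a minimal sketch that collects start() for every successful find(). The class name StartIndexDemo and the helper starts are made up for illustration; only Pattern, Matcher, find() and start() come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartIndexDemo {

    // Hypothetical helper: collects matcher.start() for every successful find().
    static List<Integer> starts(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Zero or more digits: empty matches are allowed.
        System.out.println(starts("\\d*", "ab34ef")); // [0, 1, 2, 4, 5, 6]
        // One or more digits: only the real run of digits matches.
        System.out.println(starts("\\d+", "ab34ef")); // [2]
    }
}
```

With \d* the empty matches at 0, 1, 4, 5 and 6 show up next to the real match at 2; with \d+ only the match at 2 remains.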
 
Ulf Dittmer
Rancher
Posts: 42972
73
So, you're reading an SCJP preparation book :-)

That's a FAQ for readers of that book: http://www.coderanch.com/how-to/java/SCJP-FAQ#kb-regexp
 
abalfazl hossein
Ranch Hand
Posts: 635
What does the sixth index point to?

Why doesn't it print 3? 12345
 
Ulf Dittmer
Rancher
Posts: 42972
73
Did you read the link I posted?
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
There are zero digits at the end of the string. That's index 6.

It doesn't print 3 because it matches "34" at index 2.
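
A small sketch of why 3 is skipped, printing both start() and end() for each match. The class name MatchSpans and the helper spans are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {

    // Hypothetical helper: describes each match as "start-end" (end is exclusive).
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // The match that starts at 2 ends at 4, so the next search resumes at 4, not 3.
        System.out.println(spans("\\d*", "ab34ef")); // [0-0, 1-1, 2-4, 4-4, 5-5, 6-6]
    }
}
```

The greedy match 2-4 consumes both digits, so index 3 never gets a chance to start a match of its own.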
 
abalfazl hossein
Ranch Hand
Posts: 635
ab34ef

0>a
1>b
2>3
3>4
4>e
5>f

sixth?

index starts from zero, and we have six characters.not 7
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
I know it's weird, but just after the "f" is a null that matches zero digits.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
abalfazl hossein wrote:ab34ef

0>a
1>b
2>3
3>4
4>e
5>f

sixth?

index starts from zero, and we have six characters.not 7



Strings have start *and* end indexes. Just because the start index is zero, it doesn't mean that the "a" is a match. Think about it. Is an "a" a digit? Meaning should it be a match?

And BTW, why didn't you read the link provided by Ulf? It explains everything.

Henry
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
Well, actually, "a" *does* match \d*! There are exactly zero digits there. That's why \d* is usually not what you want. Try \d+.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Knute Snortum wrote:Well, actually, "a" *does* match \d*! There are exactly zero digits there. That's why \d* is usually not what you want. Try \d+.


I think you misunderstood my response. I said that "a" is not a match -- meaning it is not one of the matches that is returned when the regular expression is applied on "ab34ef". I wasn't changing the example of this topic.

Henry
 
Ulf Dittmer
Rancher
Posts: 42972
73
abalfazl hossein wrote:index starts from zero, and we have six characters.not 7

To put it another way, you have not read the article I pointed you to twice. Why not? You've been visiting this site for years, and have posted hundreds of times, but still you expect other people to do your work for you, when you could learn much faster by simply reading up on stuff - which people even take the time to point you to. I simply don't get that attitude.
 
Campbell Ritchie
Marshal
Posts: 56586
172
Knute Snortum wrote:I know it's weird, but just after the "f" is a null that matches zero digits.
No, there isn't a null character. That happens in C, but not in Java. C strings work completely differently. It even says so in the Java Language Specification.

You match 0 characters before the a, 0 characters before the b, etc and then 0 characters after the f. Does that get us to index 6?
 
Tess Jacobs
Ranch Hand
Posts: 71
3
http://www.coderanch.com/how-to/java/SCJP-FAQ#kb-regexp

BTW, the misspelling occurrances should be changed to occurrences
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
Campbell Ritchie wrote:
Knute Snortum wrote:I know it's weird, but just after the "f" is a null that matches zero digits.
No, there isn't a null character. That happens in C, but not in Java. C strings work completely differently. It even says so in the Java Language Specification.

You match 0 characters before the a, 0 characters before the b, etc and then 0 characters after the f. Does that get us to index 6?


You're right of course. I should have been more accurate in my description.
 
Tony Docherty
Bartender
Posts: 3271
82
Tess Jacobs wrote:
http://www.coderanch.com/how-to/java/SCJP-FAQ#kb-regexp

BTW, the misspelling occurrances should be changed to occurrences

Well spotted, have a cow.
I've now corrected this and whilst I was at it I also noticed earlier in the sentence 'preceeds' which I've corrected to precedes.
 
Tess Jacobs
Ranch Hand
Posts: 71
3
Thanks.
 
abalfazl hossein
Ranch Hand
Posts: 635
Ulf Dittmer wrote:
abalfazl hossein wrote:index starts from zero, and we have six characters.not 7

To put it another way, you have not read the article I pointed you to twice. Why not? You've been visiting this site for years, and have posted hundreds of times, but still you expect other people to do your work for you, when you could learn much faster by simply reading up on stuff - which people even take the time to point you to. I simply don't get that attitude.


I read that article and added some lines.



run:

indexOf(a) = 0

indexOf(b) = 1
indexOf(f) = 5
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
at java.lang.String.charAt(String.java:686)
at regextest.Regextest.main(Regextest.java:26)
Java Result: 1



String index out of range: 6

Still has doubt....if it is out of range, Why does it print....
 
Joanne Neal
Rancher
Posts: 3742
16
abalfazl hossein wrote:String index out of range: 6

Still has doubt....if it is out of range, Why does it print....

That's part of the error message. It's saying you tried to access index 6 of a String where the String doesn't have an index 6.
 
abalfazl hossein
Ranch Hand
Posts: 635
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regextest {

    public static void main(String[] args) {
        String stream = "ab34ef";
        // * is a greedy quantifier, so \d* matches zero or more digits.
        // A match may therefore be empty.
        Pattern pattern = Pattern.compile("\\d*");
        Matcher matcher = pattern.matcher(stream);

        // Print the start index of every match, including zero-length ones.
        while (matcher.find()) {
            System.out.print(matcher.start());
        }
    }
}


But it prints:

012456
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
abalfazl hossein wrote:
String index out of range: 6

Still has doubt....if it is out of range, Why does it print....


It is a zero-length match at position six. The charAt() method tries to get the character at position six -- which assumes that the match is at least one character long.

abalfazl hossein wrote:
I read that article, I add some lines.


It may be a good idea to read the article again. You really need to understand the concept, as it is a very simple regex -- and regular expressions can (and do) get much more complex.

Henry
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
Let's clarify terms: start() returns the starting index. Regex always starts its matches just *before* the starting index.


^a^b^3^4^e^f^
| | | | | | |
0 1 2 3 4 5 6


After it's tried all the other characters it tries to match just *after* the last. This is the starting index 6, which is not a 7th character, but a place just after the 6th character.

charAt(6) is looking for the character at index 6, which is the 7th character. There is no such beast, so it throws an exception.
 
Tess Jacobs
Ranch Hand
Posts: 71
3
Henry Wong wrote:It may be a good idea to read the article again.

I think the source of confusion and main reason why this question has been asked so many times since 2005 is that in java, nothing exists after the last char in a string (i.e. charAt(6) doesn't exist); however, in regex (which java implements via a standard library), a zero-length character does exist after the last char in a string. Maybe the SCJP FAQ article doesn't make this fact very clear.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Knute Snortum wrote:Let's clarify terms: start() returns the starting index. Regex always starts its matches just *before* the starting index.


^a^b^3^4^e^f^
| | | | | | |
0 1 2 3 4 5 6


After it's tried all the other characters it tries to match just *after* the last. This is the starting index 6, which is not a 7th character, but a place just after the 6th character.

charAt(6) is looking for the character at index 6, which is the 7th character. There is no such beast, so it throws an exception.



While I agree with most of this response, regarding indexes, I am not sure what is meant by "just *before* the starting index" and "just *after* the last". This implies that the rules of indexes for the regex matches, and the rules for indexes for strings, are different. The rules are exactly the same.

The last match has a starting index of 6 and an ending index of 6, which means that if the OP did this...



it would have been fine. And the result printed would have been a zero-length string.
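
The snippet from this post did not survive in the thread. The following is only a sketch of what it may have looked like, assuming the call was substring() with the match's start() and end(); the class name SubstringOfMatch and the helper matchedText are invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstringOfMatch {

    // Hypothetical helper: extracts the text of each match with substring(start, end)
    // instead of charAt(start), so zero-length matches are handled too.
    static List<String> matchedText(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(input.substring(matcher.start(), matcher.end()));
        }
        return result;
    }

    public static void main(String[] args) {
        // No exception is thrown; the last match, substring(6, 6), is simply empty.
        for (String text : matchedText("\\d*", "ab34ef")) {
            System.out.println("\"" + text + "\" (length " + text.length() + ")");
        }
    }
}
```

Unlike charAt(6), substring(6, 6) is a legal call on a six-character string, because the end index may equal the length.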

Tess Jacobs wrote:
Henry Wong wrote:It may be a good idea to read the article again.

I think the source of confusion and main reason why this question has been asked so many times since 2005 is that in java, nothing exists after the last char in a string (i.e. charAt(6) doesn't exist); however, in regex (which java implements via a standard library), a zero-length character does exist after the last char in a string. Maybe the SCJP FAQ article doesn't make this fact very clear.


There is no such thing as a zero-length character. It is a zero-length string. And from my example, it is absolutely possible to extract a zero-length substring from the end of a string.

In other words, this is a misunderstanding in regards to strings in general, and not specific to regular expressions.

Henry

PS... In rereading this response, it seems a bit blunt. Apologies. It was not meant to be so... Have some cows...
 
Knute Snortum
Sheriff
Posts: 4287
127
Chrome Eclipse IDE Java Postgres Database VI Editor
Thanks for the cow.

I learned regexes before I learned Java. My understanding of how to visualise how regexes match things was to think of the regex engine matching before a character or after a character. I don't know if this applies to Java or if it's even technically correct. However, it's how the book Mastering Regular Expressions refers to it, and even what is implied by this image in the JavaRanch FAQ.

All I'm saying is that it's a good way to visualise what's happening, technically correct or not.
 
Tess Jacobs
Ranch Hand
Posts: 71
3
Henry Wong wrote:There is no such thing as a zero-length character.

My mistake

I learnt from this webpage that a zero-length match can occur at several positions in an input string:
  • at the beginning of an input string
  • in between any two characters of an input string
  • after the last character of an input string

I thought that these positions were peculiar to regex. It's interesting to learn that Java recognizes these positions too.

    And thanks for the cow.
     
    Henry Wong
    author
    Sheriff
    Posts: 23295
    125
    C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
    Knute Snortum wrote:
    I learned regexes before I learned Java. My understanding of how to visualise how regexes match things was to think of the regex engine matching before a character or after a character. I don't know if this applies to Java or if it's even technically correct. However, it's how the book Mastering Regular Expressions refers to it, and even what is implied by this image in the JavaRanch FAQ.

    All I'm saying is that it's a good way to visualise what's happening, technically correct or not.


    Don't get me wrong, I think that anything that can help the OP visualize what is happening with regular expressions is a good thing. And I don't think that I said that it was technically correct or not (or I didn't mean to say that). What I was trying to say was that indexes with regex are the same as indexes with Java strings.

    In other words, I was merely pointing out that the OP was using core Java concepts as a counter example -- an example that was not only flawed, but based on misunderstanding of the core concept itself. And it was probably important to correct that misunderstanding first.

    Henry
     
    abalfazl hossein
    Ranch Hand
    Posts: 635
    I think the source of confusion and main reason why this question has been asked so many times since 2005 is that in java, nothing exists after the last char in a string (i.e. charAt(6) doesn't exist); however, in regex (which java implements via a standard library), a zero-length character does exist after the last char in a string. Maybe the SCJP FAQ article doesn't make this fact very clear.


    Yes, That article doesn't make this fact very clear.

    Now it's clear for me, Thanks all friends. Thank you
     
    Campbell Ritchie
    Marshal
    Posts: 56586
    172
    This problem is related to the zero-length String being a substring of every String, including itself; it can occur at any position except maybe inside a character. It is a bit like the empty Set ∅ being a subset of every set including itself.
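
A quick way to check this with plain String methods, no regex involved. This is only a sketch; the class name EmptySubstring and the helper emptyOccursAt are made up for illustration.

```java
class EmptySubstring {

    // Hypothetical helper: true if the empty string occurs at the given position of s.
    static boolean emptyOccursAt(String s, int position) {
        return s.startsWith("", position);
    }

    public static void main(String[] args) {
        String s = "ab34ef";
        // Positions 0..6 all hold the empty string; 7 is past the end.
        for (int position = 0; position <= 7; position++) {
            System.out.println(position + " -> " + emptyOccursAt(s, position));
        }
        System.out.println(s.contains(""));  // true
        System.out.println("".contains("")); // true
    }
}
```

A six-character string therefore has seven positions where the empty string occurs, which is one more than it has characters.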
     
    Tess Jacobs
    Ranch Hand
    Posts: 71
    3
    abalfazl hossein wrote:Now it's clear for me...


    From what I understand (thanks to Henry’s explanation), every index in a string contains a zero-length substring and may contain a character.

    The string "ab34ef" contains a zero-length substring at index 6, which can be accessed via substring(6). However, it does not contain a character at index 6, and so charAt(6) will throw java.lang.StringIndexOutOfBoundsException.

    Similarly, the string "" contains a zero-length substring at index 0, which can be accessed via substring(0). However, it does not contain a character at index 0, and so charAt(0) will throw java.lang.StringIndexOutOfBoundsException.
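
Both cases can be tried side by side. A minimal sketch follows; the class name SubstringVsCharAt and the helper charAtThrows are invented for this example.

```java
class SubstringVsCharAt {

    // Hypothetical helper: true if s.charAt(index) throws StringIndexOutOfBoundsException.
    static boolean charAtThrows(String s, int index) {
        try {
            s.charAt(index);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "ab34ef";
        System.out.println(s.substring(6).isEmpty());  // true: zero-length substring at the end
        System.out.println(charAtThrows(s, 6));        // true: there is no character at index 6
        System.out.println("".substring(0).isEmpty()); // true
        System.out.println(charAtThrows("", 0));       // true
    }
}
```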
     
    Ulf Dittmer
    Rancher
    Posts: 42972
    73
    abalfazl hossein wrote:
    Maybe the SCJP FAQ article doesn't make this fact very clear.

    Yes, That article doesn't make this fact very clear.

    That being the case, and since now things are clear, I think you would be well-placed to improve the FAQ entry. The FAQ is public, after all -you can log in with your Saloon account- and we encourage people to contribute to it. Don't be afraid to mess up something - if you accidentally delete something, it's easy to restore previous versions.
     
    Tess Jacobs
    Ranch Hand
    Posts: 71
    3
    I attempted to modify the FAQ but got this error message:

    The changes to this page were rejected because a banned word or phrase was used.

    BEFORE MODIFICATION ATTEMPT


    AFTER MODIFICATION ATTEMPT


    PAGE FORMATTING CODE
    The '6' comes from the last index in the String '''ab34ef''' i.e. index 6. This index contains a ''zero-length substring'' which can be accessed as follows

    J[
    "ab34ef".substring(6);
    ]

    Note that this index does not contain a ''character'' and so

    J[
    "ab34ef".charAt(6);
    ]

    will throw

    J[
    java.lang.StringIndexOutOfBoundsException
    ]

    FURTHER PROPOSED CHANGE
    Because a match of zero length is possible, the find() method will check the index following the last character of input.

    Because a match of zero length is possible, the find() method will find the zero-length substring at index 6.


     
    Ulf Dittmer
    Rancher
    Posts: 42972
    73
    Thanks Tess, have a cow for your contribution!

    I have just edited that into the page, and it worked fine; not sure why you got that error earlier, it's not applicable to the content anyway.
     
    Tess Jacobs
    Ranch Hand
    Posts: 71
    3
    Many thanks.
     