
Regex Question ...

 
jay vas
Ranch Hand
Posts: 407
Hi, I'm new to Java regexes. I need to build a regex that detects
all words in a paragraph that look like a string of amino acids.

So for example:

Ala-Cys-Ala, A-C-A, and ACA all represent possible amino acid sequences of alanine, cysteine and alanine. Is there a way to build a regex in Java that represents this? Currently I'm doing it with nested for loops. I've tried
[A|Ala|V|Val|L|Lys|M|Met|W|Trp|P|S|T|Thr|C|Y|Tyr|N|Asn|-|Q|D|E|K|R|H|X]++ but it returns false positive matches. For example, GAVs is returned as group(0) using the Java matcher, even though the 's' character is not in the expression?

Ala A
Arg R
Asn N
Asp D
Cys C
His H
Ile I
Leu L
Lys K
Met M
Phe F
Pro P
Ser S
Thr T
Trp W
Tyr Y
Val V
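
A note on the 's' false positive: in Java regexes, square brackets define a character class, so each character between them is a separate member and `|` is just another literal character. That makes the lowercase letters of "Lys" (including 's') valid matches on their own. A minimal sketch of the difference, using a shortened version of the posted pattern (the class and method names here are made up for illustration):

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Shortened version of the posted pattern. Inside [...] every character
    // is its own member, so this class is the set {A, |, l, a, L, y, s}.
    static final Pattern CHAR_CLASS = Pattern.compile("[A|Ala|L|Lys]+");

    // The same tokens written as real alternatives in a non-capturing group.
    static final Pattern ALTERNATION = Pattern.compile("(?:Ala|Lys|A|L)+");

    static boolean classMatches(String s) {
        return CHAR_CLASS.matcher(s).matches();
    }

    static boolean alternationMatches(String s) {
        return ALTERNATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(classMatches("As"));           // true: 'A' and 's' are both in the class
        System.out.println(classMatches("|"));            // true: '|' is a literal inside [...]
        System.out.println(alternationMatches("As"));     // false: "s" is not one of the alternatives
        System.out.println(alternationMatches("AlaLys")); // true: "Ala" then "Lys"
    }
}
```

Longer alternatives such as `Ala` are listed before `A` so that the three-letter form is tried first.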
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34686
Jay,
group(0) returns the whole matching string, not just the matching portion. Try putting your reg exp in parens and using group(1).
 
jay vas
Ranch Hand
Posts: 407
Well, I've gotten closer, but for some reason
EEEs matches... any ideas?

 
Henry Wong
author
Marshal
Posts: 21225


This pattern makes no sense... what is it that you are trying to do?

Henry
 
jay vas
Ranch Hand
Posts: 407
"([G|A|V|L|[Lys]{3}|M|F|W|P|S|T|[Thr]{3}|C|Y|[Trp]{3}|N|-|Q|D|E|K|R|H|X]){3,9}?";


The pattern means

Match a string which is
1) of length 3 through 9
where
2) all substrings in the string are a combination of
G, A, V, L, Lys, M, F, W, P, S, T, Thr, C, Y, Trp, N, -, Q, D, E, K, R, H, or X.


so

G-A-V-L-X-L matches
Lys-L-V-G-A-Trp-X-Trp matches
but


Lys-O-Lys-X wouldn't match (since O is not a valid amino acid).
Also
A-L-s-L-s-X wouldn't match either (s isn't an amino acid, although S is).
 
Henry Wong
author
Marshal
Posts: 21225

"([G|A|V|L|[Lys]{3}|M|F|W|P|S|T|[Thr]{3}|C|Y|[Trp]{3}|N|-|Q|D|E|K|R|H|X]){3,9}?";

The pattern means

Match a string which is
1) of length 3 through 9
where
2) all substrings in the string are a combination of
G, A, V, L, Lys, M, F, W, P, S, T, Thr, C, Y, Trp, N, -, Q, D, E, K, R, H, or X.


Sorry, but the pattern that you have doesn't do what you described. In fact, I am not even sure if some of the stuff in the pattern is even valid.

Assuming that the "-" is an optional separator, and not part of the sequence, this is probably closer to what you want...
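
The pattern originally posted at this point is not preserved, so what follows is only a minimal sketch of the same idea, not the original code. It assumes the token list from the description above (one-letter codes plus Lys, Thr and Trp), a length of 3 through 9 residues, and an optional "-" between residues. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class AminoSequenceSketch {
    // One residue: a three-letter code or a single one-letter code.
    // Assumption: only Lys, Thr and Trp are accepted in three-letter form,
    // as in the description in this thread.
    static final String RESIDUE = "(?:Lys|Thr|Trp|[GAVLMFWPSTCYNQDEKRHX])";

    // One residue followed by 2 to 8 more (3 to 9 in total), each
    // optionally preceded by a hyphen separator.
    static final Pattern SEQUENCE =
        Pattern.compile(RESIDUE + "(?:-?" + RESIDUE + "){2,8}");

    // True only if the whole word is a valid sequence.
    static boolean isSequence(String word) {
        return SEQUENCE.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "G-A-V-L-X-L", "Lys-L-V-G-A-Trp-X-Trp",
            "Lys-O-Lys-X", "A-L-s-L-s-X", "EEEs"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isSequence(s));
        }
    }
}
```

With these assumptions, the first two samples are accepted and the last three are rejected, because `matches()` requires the whole word to fit the pattern and neither 'O' nor lowercase 's' is a residue.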



Henry
 
jay vas
Ranch Hand
Posts: 407
Thanks! I'll try it and tell you the result. BTW, what does the ? after the - mean?
 
Rob Spoor
Sheriff
Posts: 20559
It means the - is optional.
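
As a small illustration: `?` after an element is a quantifier meaning "zero or one", so `-?` accepts the letters with or without a single hyphen between them. A minimal sketch, with a made-up two-letter pattern and hypothetical class and method names:

```java
import java.util.regex.Pattern;

public class OptionalHyphenDemo {
    // "-?" matches zero or one hyphen between 'A' and 'C'.
    static final Pattern A_C = Pattern.compile("A-?C");

    static boolean matches(String s) {
        return A_C.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("AC"));   // true: no hyphen
        System.out.println(matches("A-C"));  // true: one hyphen
        System.out.println(matches("A--C")); // false: two hyphens is more than "zero or one"
    }
}
```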
 