Split string on a word not just a character

 
Theodore David Williams
Ranch Hand
Posts: 102
Is there a way to split a string on a word?

i.e.
 
John Vorwald
Ranch Hand
Posts: 139
In a regular expression, "[]" denotes a character class: any single character inside the brackets is used as a delimiter.
To split on the whole word instead, you might try s.split("the").
 
Theodore David Williams
Ranch Hand
Posts: 102
Yeah, that works, thanks. I still have a problem in that I want to split on multiple words and characters, and I also want to ignore case.
I.e. can I split on the words and characters below?
'the', 'The'
'to', 'To', 'TO'
','
'/'
 
Rob Spoor
Sheriff
Posts: 20667
Put (?i) at the start of the regular expression. This is an embedded flag that makes the match case-insensitive. To match any of several words, separate them with the | (alternation) symbol.
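As a rough sketch of that suggestion (the class name, method name, and sample input are made up for illustration, and the word list is taken from the question above), a case-insensitive alternation might look like this:

```java
class SplitOnWords {
    // (?i) switches on case-insensitive matching for the whole pattern;
    // each | separates one alternative delimiter.
    static String[] splitOnWords(String input) {
        return input.split("(?i)the|to|,|/");
    }

    public static void main(String[] args) {
        // Illustrative input only; the pieces keep their surrounding spaces.
        for (String part : splitOnWords("alpha THE beta To gamma,delta/end")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Note that this pattern matches the letters wherever they occur, so it would also split inside longer words that happen to contain "the" or "to".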
 
Winston Gutkowski
Bartender
Posts: 10527
Theodore David Williams wrote: Yeah that works thanks. I still have a problem in that I want to split on multiple words and characters. And I also want to ignore case
I.E. can I split on the words and characters below? ...

One possibility is not to try to do everything at once. Regexes are good, but they're not all-powerful, and trying to incorporate every possible rule into one is likely to make for a very long and complicated pattern (and will probably lead to more mistakes).
What about this:
1. Use String.split("\\s+") to split the string into whitespace-delimited "words".
2. Eliminate "punctuation" with a String.replaceAll() pattern.
3. Use String.equalsIgnoreCase() to find the words you want to eliminate and pull out the words between them.

It will probably be slower, but we're likely talking fractions of seconds, and the resulting code will be a lot easier to change if you need to, and much more self-documenting.
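The three steps above could be sketched roughly as follows. This is only one reading of the suggestion: the class, the helper names, and the STOP_WORDS list are hypothetical, and the punctuation rule (strip it only from the ends of each word) is an assumption, so a '/' or ',' inside a word is left alone rather than treated as a split point.

```java
import java.util.ArrayList;
import java.util.List;

class StepwiseSplit {
    // Hypothetical delimiter words, borrowed from the question.
    private static final String[] STOP_WORDS = {"the", "to"};

    static boolean isStopWord(String word) {
        for (String stop : STOP_WORDS) {
            if (stop.equalsIgnoreCase(word)) {   // step 3: case-insensitive compare
                return true;
            }
        }
        return false;
    }

    static List<String> splitStepwise(String input) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Step 1: split into whitespace-delimited "words".
        for (String raw : input.trim().split("\\s+")) {
            // Step 2: drop punctuation at the start or end of the word.
            String word = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (word.isEmpty()) {
                continue;
            }
            if (isStopWord(word)) {
                flush(chunks, current);          // a delimiter word closes the chunk
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(word);
            }
        }
        flush(chunks, current);
        return chunks;
    }

    // Adds the accumulated chunk, if any, and resets the buffer.
    private static void flush(List<String> chunks, StringBuilder current) {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Because the comparison is on whole words, "then" is not mistaken for "the", which is the kind of rule that is awkward to express in a single regex.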

Winston
 
Matthew Brown
Bartender
Posts: 4568
Just to give a further example - the regex you've got so far will also split on the "word" "the" in "other" or "thesaurus". Yes, you can revise the expression further to cope with that, but Winston's advice is sensible.
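A small illustration of that problem, with one possible revision: the \b word-boundary anchor. The \b version is an assumption about how the expression might be revised, not something proposed in the thread, and the class and method names are made up.

```java
class WordBoundaryDemo {
    // Matches the letters "the" anywhere, even inside a longer word.
    static String[] naiveSplit(String s) {
        return s.split("(?i)the");
    }

    // \b requires a word boundary on both sides, so "other" is left intact.
    static String[] boundedSplit(String s) {
        return s.split("(?i)\\bthe\\b");
    }

    public static void main(String[] args) {
        System.out.println(naiveSplit("other").length);    // split inside the word
        System.out.println(boundedSplit("other").length);  // no split
    }
}
```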
 
John Vorwald
Ranch Hand
Posts: 139
You could put whitespace in your regex in order to split on whole words. \s means "any whitespace character" (space, tab, newline, etc.). In a Java string literal the backslash has to be doubled, and split() returns an array:
String[] parts = s.split("\\sthe\\s");
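A quick sketch of how that whitespace-delimited pattern behaves (class name, method name, and inputs are illustrative only). It splits on "the" in the middle of a string, but not when the word sits at the very start, since there is no whitespace in front of it:

```java
class WhitespaceSplit {
    // Requires exactly one whitespace character on each side of "the".
    static String[] splitOnThe(String s) {
        return s.split("\\sthe\\s");
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitOnThe("feed the cat")));
        System.out.println(String.join("|", splitOnThe("the cat")));
    }
}
```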

 
Rob Spoor
Sheriff
Posts: 20667
To also allow "the" at the start and end of the String, make that
 
Winston Gutkowski
Bartender
Posts: 10527
Rob Spoor wrote:To also allow "the" at the start and end of the String, make that

And if you want to allow for more than one whitespace character, you might need:
split("(\\s+|^)the(\\s+|$)")
and you may need to worry about whether you use greedy or reluctant quantifiers (to be honest, I don't know if it makes any difference).
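For what it's worth, here is a small sketch of that pattern in use (the class name, method name, and sample string are invented for illustration):

```java
class StartEndSplit {
    // "the" may be preceded by whitespace or the start of the string,
    // and followed by whitespace or the end of the string.
    static String[] splitOnThe(String s) {
        return s.split("(\\s+|^)the(\\s+|$)");
    }

    public static void main(String[] args) {
        for (String part : splitOnThe("the quick  the  lazy dog the")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

One thing to watch: when the string starts with the delimiter, split() returns an empty first element, while trailing empty strings are dropped. Callers may want to skip empty pieces.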

@Theodore: And the above pattern is just for one word. Do you see what I mean now about complexity?

Winston

 