Dividing a string into substrings

 
Ranch Hand
Posts: 206
1
I would like to take a large string of characters and divide it into substrings. The maximum length of a substring is 180 characters. Also, the substrings should not be cut off in the middle of a word. The substring should end with the last character of a word or with a period. It is ok if the substring begins in the middle of a sentence. However, I do not want the substrings to begin in the middle of a word. How can I do this?
 
Sheriff
Posts: 22845
43
Eclipse IDE Firefox Browser MySQL Database
Start at position 179 of the string (or at the end if it isn't that long) and work backwards until you hit the end of a word or a period. Chop that segment off and throw it into your pot, then repeat until there's nothing left. Note that the end of a word can be followed by a space, or by a non-period punctuation mark like a comma, or by the end of the string. Be careful not to go past the end of the string in the latter case.

Presumably you want to discard the space after the end of a segment, or perhaps you want to keep it at the beginning of the next segment? You said you wanted to keep a period at the end of a segment but you were silent about how to handle spaces.

Also if you have bad data it's possible that your string might start with a single word longer than 180 characters. Perhaps you just want to chop it off at 180 characters? Or throw an exception? Anyway you should be careful to deal with this, or you'll go past the beginning of the string.
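
The backward scan described above could be sketched in Java roughly as follows. This is a minimal sketch, not code from the thread: the names `BackwardScanSplitter` and `split` are made up for illustration. It assumes that segments break only at whitespace (so a period or comma stays attached to its word), that the separating whitespace is discarded, and that a single word longer than the limit is simply chopped at the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "work backwards from the limit" idea.
final class BackwardScanSplitter {

    /** Splits text into segments of at most maxLen characters, breaking at whitespace. */
    static List<String> split(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> segments = new ArrayList<>();
        int n = text.length();
        int start = 0;
        while (true) {
            // Skip separators so that no segment begins with whitespace.
            while (start < n && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
            if (start >= n) {
                break;
            }
            int end = Math.min(start + maxLen, n); // exclusive end of the window
            if (end < n && !Character.isWhitespace(text.charAt(end))) {
                // The window stops inside a word: walk back to the previous whitespace.
                int cut = end;
                while (cut > start && !Character.isWhitespace(text.charAt(cut - 1))) {
                    cut--;
                }
                if (cut > start) {
                    end = cut;
                }
                // Otherwise (assumption): one word is longer than maxLen,
                // so it is cut hard at the limit instead of looping forever.
            }
            segments.add(text.substring(start, end).trim());
            start = end;
        }
        return segments;
    }
}
```

For the original question, `BackwardScanSplitter.split(text, 180)` would return the segments in order. Throwing an exception for an over-long word instead of cutting it would be an equally reasonable choice.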
 
Master Rancher
Posts: 2045
75
That should work, but my first thought was: can't you use some regex for this, maybe checking the results for this length condition afterwards?

Greetz,
Piet
 
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
Piet Souris wrote: That should work, but my first thought was: can't you use some regex for this, maybe checking the results for this length condition afterwards?

Using regular expressions it is relatively easy to split at a word boundary after a minimum number of characters, but significantly more difficult to split just before a maximum number of characters. I spent a whole day looking at the problem in 2009 (reported in 'another place') and failed to find a satisfactory regular expression solution. I love regular expressions, but for this problem the best solution seems to be that proposed by Paul Clapham.

If anyone can provide a regular expression solution that does not involve several loops then please post it.
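
For comparison, here is one regex-based attempt, offered as a sketch under stated assumptions rather than a definitive answer. It relies on a greedy bounded quantifier followed by a lookahead for whitespace or the end of the input, with `\G` chaining each match to the previous one. It still needs a single `find()` loop, it breaks only at whitespace, and it does not address the open questions about punctuation. The names `RegexChunker` and `split` are hypothetical, and rejecting an over-long word with an exception is an assumption, not something specified in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: greedy "at most maxLen characters, then whitespace or end".
final class RegexChunker {

    static List<String> split(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        // \G        continue exactly where the previous match ended
        // \s*       skip the separating whitespace
        // (\S.{0,k}) one non-space character plus up to k more, taken greedily
        // (?=\s|\z) the chunk must be followed by whitespace or the end of input
        Pattern p = Pattern.compile(
                "\\G\\s*(\\S.{0," + (maxLen - 1) + "})(?=\\s|\\z)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        int consumed = 0;
        while (m.find()) {
            chunks.add(m.group(1).trim());
            consumed = m.end();
        }
        // Assumption: a word longer than maxLen stops the matching, so report it
        // rather than silently dropping the rest of the text.
        if (!text.substring(consumed).trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "no break found within " + maxLen + " characters at index " + consumed);
        }
        return chunks;
    }
}
```

Whether this counts as "satisfactory" is debatable: the length limit is baked into the pattern string, and any refinement of the rules makes the expression harder to read.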
 
lowercase baba
Bartender
Posts: 12565
49
Chrome Java Linux
Can you use a regex? maybe.

SHOULD you? probably not.

You should never approach a problem with "How do I use A to do B?" Instead, you should spend a lot of time thinking through what needs to be done (you failed to mention other punctuation, what to do if your string has a monetary value like "$4.87", etc).

Only once you know all the details on what needs to be done should you start asking "what is the RIGHT tool to do this?"
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
fred rosenberger wrote:Can you use a regex? maybe.

SHOULD you? probably not.

I think you need to justify this. I can't find a way to use regex for this task, but since the problem is one of text manipulation and regular expressions deal with text manipulation, I don't rule it out as a matter of course and I don't think you should.


You should never approach a problem with "How do I use A to do B?" Instead, you should spend a lot of time thinking through what needs to be done (you failed to mention other punctuation, what to do if your string has a monetary value like "$4.87", etc).

Only once you know all the details on what needs to be done should you start asking "what is the RIGHT tool to do this?"


To decide "what is the RIGHT tool to do this?" one must ask the question "can it be done using so-and-so tool?", and only when one has answered this for each tool in one's toolbox can one decide which tool to use. The possibility of a regex solution was raised as a question by Piet Souris (not the OP), presumably because regular expressions are in his toolbox and, even though he does not know how to do it, his experience makes the use of regular expressions a possibility, so an investigation is worth pursuing. As I have already said, I can't find a reasonable way to solve the OP's problem using regular expressions, but I'm not blind to the possibility that they may be a solution.




 
fred rosenberger
lowercase baba
Bartender
Posts: 12565
49
Chrome Java Linux
But the problem isn't defined well enough yet to start looking at what tools are right for the job.

Could I drive in a wood screw with a hammer? sure. Should I? probably not. Unless the problem is to pound a wood screw into a board as quickly as possible and not worry about tearing up the wood.

Can I build a house out of toothpicks and silly putty? Sure. Should I? probably not. Unless the problem is to build something for a Ripley's believe it or not type project.

So without knowing the specifics of what is really needed, I would say the answer is almost always "probably not".

 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
fred rosenberger wrote:
But the problem isn't defined well enough yet to start looking at what tools are right for the job.

So you cannot, as you have done, rule out regular expressions until the problem has been better defined. Using your logic you should also have ruled out Paul's suggested solution, yet you pointedly didn't.


Could I drive in a wood screw with a hammer? sure. Should I? probably not. Unless the problem is to pound a wood screw into a board as quickly as possible and not worry about tearing up the wood.

How can you know that regular expressions are the hammer to drive in the wood screw, especially since regular expressions are designed for text manipulation and, any way you look at it, the OP's problem will be one of text manipulation?

Can I build a house out of toothpicks and silly putty? Sure. Should I? probably not. Unless the problem is to build something for a Ripley's believe it or not type project.

Nobody has said we should build a house out of toothpicks and silly putty. What Piet asked is whether regular expressions can be used to solve a text manipulation problem posed by the OP.

So without knowing the specifics of what is really needed, I would say the answer is almost always "probably not".

That is far too simplistic, a cop-out, and smells of dogma rather than logic. I know from work I did in 2009 that I cannot find a regex solution to the OP's problem as currently defined, but that does not mean it cannot be done; it only means I do not have the expertise to do it or to say whether or not it is feasible. I cannot rule out regular expressions for the task, and unless you are an expert on regular expressions I don't see how you can. The problem as currently posed by the OP may need to be refined by him, but unless it is dramatically changed it will still be a text manipulation problem and as such may be amenable to a solution using regular expressions.
 
Paul Clapham
Sheriff
Posts: 22845
43
Eclipse IDE Firefox Browser MySQL Database
Regular expressions sound like a possibility to me. I wouldn't go that way because I'm not that good at them. That doesn't mean they are the wrong way to go, though, only the wrong way for ME to go.

However if Richard can't find a suitable regex then that says that it's the wrong way for almost everybody to go.
 
Piet Souris
Master Rancher
Posts: 2045
75
Indeed, if Richard can't find a suitable regex, then in all likelihood such a regex does not exist. My initial thought was simply to split the string as said, and then build substrings by adding these parts together, taking the length constraint into account. But it may be difficult to reassemble the characters that were split on, and it may not be possible at all. Paul's method, as said, is most likely applicable.
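
The split-then-reassemble idea could look roughly like the sketch below. The names `GreedyWordPacker` and `split` are hypothetical. It also shows the drawback mentioned here: because the text is split on runs of whitespace, the original separators (double spaces, newlines) are lost and every gap is rebuilt as a single space. Throwing an exception for a word longer than the limit is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split into words, then pack words greedily into segments.
final class GreedyWordPacker {

    static List<String> split(String text, int maxLen) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for empty or blank input
            }
            if (word.length() > maxLen) {
                // Assumption: treat an over-long word as bad data.
                throw new IllegalArgumentException("word longer than " + maxLen + ": " + word);
            }
            // Would adding a space plus this word exceed the limit? Then start a new segment.
            if (current.length() > 0 && current.length() + 1 + word.length() > maxLen) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' '); // original separator is not preserved
            }
            current.append(word);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```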

So, Fred (Victa!), could you provide us with some more details?

Greetz,
Piet
 
fred rosenberger
lowercase baba
Bartender
Posts: 12565
49
Chrome Java Linux
I am not a regex expert by any means. As I stated, it probably COULD be done with them - I don't know. But from what I do know of them, this seems extremely hard. We are not simply matching patterns, we are building a ton of logic into the regex. "break the string at a word break, but get as close to 180 without going over as you can. Oh, and include the period so long as that doesn't make it go over 180 too. And let's trim off the space on the next substring...or not..."

I think a regex could perhaps be used as PART of the solution, but I think much of the logic should be broken out into Java code. Remember, you can spend a week figuring out the perfect regex, but then a month later you (or someone else) will have to support and maintain it. And heaven forbid that your requirements then change...

I think the KISS principle applies. One regex that does all the OP requests seems insanely complicated to me.
 
Greenhorn
Posts: 4
"you can spend a week figuring out the perfect regex, but then a month later you " - been there, done that, Fred! BreakIterator deals with text boundaries; here is a solution to your use case:



Output:
Start at position 179 of the string (or at the end if it isn't that long) and work backwards until you hit the end of a word or a period. Chop that segment off and throw it into
your pot, then repeat until there's nothing left. Note that the end of a word can be followed by a space, or by a non-period punctuation mark like a comma, or by the end of the
string. Be careful not to go past the end of the string in the latter case.
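
A minimal sketch of how `java.text.BreakIterator` might be used for this, not necessarily the code that produced the output above. The names `BreakIteratorSplitter` and `split` are hypothetical. It assumes the line-break instance for `Locale.ENGLISH` is appropriate, that whitespace around each segment is trimmed, and that a run with no break opportunity inside the limit is cut hard at the limit. Note that the line instance can also report break opportunities after hyphens, so a hyphenated word may be divided across two segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: pack text between line-break opportunities into segments.
final class BreakIteratorSplitter {

    static List<String> split(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        BreakIterator it = BreakIterator.getLineInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int segStart = it.first();
        int lastFit = segStart; // last boundary where [segStart, lastFit) still fits
        for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
            if (text.substring(segStart, b).trim().length() <= maxLen) {
                lastFit = b;
                continue;
            }
            // Adding the next piece would overflow: emit what fits so far.
            if (lastFit > segStart) {
                addIfNotBlank(out, text.substring(segStart, lastFit));
                segStart = lastFit;
            }
            // Assumption: a single piece longer than maxLen is cut hard at the limit.
            while (text.substring(segStart, b).trim().length() > maxLen) {
                addIfNotBlank(out, text.substring(segStart, segStart + maxLen));
                segStart += maxLen;
            }
            lastFit = b;
        }
        addIfNotBlank(out, text.substring(segStart));
        return out;
    }

    private static void addIfNotBlank(List<String> out, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }
}
```

Calling `BreakIteratorSplitter.split(text, 180)` yields segments of at most 180 characters each. The boundary rules themselves come from the JDK rather than from hand-written logic, which is the main attraction of this approach.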
 
Paul Clapham
Sheriff
Posts: 22845
43
Eclipse IDE Firefox Browser MySQL Database
BreakIterator! Yeah! One of those classes which nobody ever remembers -- or never even heard of -- but when you need it, you really need it.

Here's a link to the Oracle tutorial about it: http://docs.oracle.com/javase/tutorial/i18n/text/boundaryintro.html
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
Paul Clapham wrote:BreakIterator! Yeah! One of those classes which nobody ever remembers


I'm guilty of not remembering. Many years ago I spent time going through the first page of the Javadoc for every class in Java 5, but it was probably a waste of time since I have not remembered more than 10% of them.
 
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Justin Musgrove wrote: "you can spend a week figuring out the perfect regex, but then a month later you " - been there, done that, Fred! BreakIterator deals with text boundaries; here is a solution to your use case:

Very nice, but please DontWriteLongLines; they make threads very hard to read.
I've broken yours up this time (the offending part was that enormous String literal). See how much better it looks now?

Winston