• Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

String find and replace

 
B M Tejaswi
Greenhorn
Posts: 11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Am trying to write code which replaces the words in really large input string with the encoded word.The encoded pair is given. I would like you guys to give me some inputs as to how it can be done better(optimization). Below is the code


Input string is :
This component is standard compliant|3rdparty api fails===component=C|standard=S|3rdparty=3|compliant=I

output :
This C is S I|3 api fails

string after '===' is the encoding pairs from which the LHS of '===' is encoded.

Any suggestions?
 
Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15627
46
Android IntelliJ IDE Java Scala Spring
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Can you isolate the problem? It's a lot of work to try to understand all the code you posted and things are missing, such as your Constants class, which would be especially interesting since you're passing values that are defined in there to String.split(...).

Can you give a concrete example of where String.split(...) doesn't do what you expected? What are the exact inputs and outputs?
 
B M Tejaswi
Greenhorn
Posts: 11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks for the reply.
AM able to achieve the expected output given the input..the only problem is how it can be optimized/how it can be further enhanced so that the runtime might reduce given a very long input string.
Below is the code snippet for Constants class:
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tejaswi Bm wrote:Thanks for the reply.
AM able to achieve the expected output given the input..the only problem is how it can be optimized/how it can be further enhanced so that the runtime might reduce given a very long input string.


Have you measured the performance and found that it doesn't meet your already-defined, specific requirements? If not, then there's no point in trying to optimize it. You'd just be guessing at what might be wrong and trying to make it "better" without having an actual target.

Having said that, one thing does pop to mind. I'm not going to read your code, but if you're using String.split() or String.replaceAll(), you're compiling the regex every time. If you're reusing the same regex then you might get better performance by calling Pattern.compile() once, reusing the same Pattern each time, and then creating a Matcher for each input String and calling its methods as needed.

 
B M Tejaswi
Greenhorn
Posts: 11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks Jeff!!
But i dont think my code is optimized and can handle really huge input string. I have used a HashMap to store the tokens, but i feel the same can be achieved with 'something else'. I am not aware of that 'something else'. So if you could point me in right direction??
 
Paul Clapham
Sheriff
Posts: 21558
33
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
My first thought would be to throw out your replace() method and simply use the replace() method which is already built into the String class.

Quite often people will write their own code rather than use built-in methods, disregarding the fact that the built-in methods were written by people with more experience and have been improved over the years. Often that's a mistake. At any rate you now ought to compare your code to the built-in code and see which is more "optimized" (by which it looks like you mean runs faster, rather than uses less memory or something else).
 
Jeff Verdegan
Bartender
Posts: 6109
6
Android IntelliJ IDE Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tejaswi Bm wrote:Thanks Jeff!!
But i dont think my code is optimized and can handle really huge input string. I have used a HashMap to store the tokens, but i feel the same can be achieved with 'something else'. I am not aware of that 'something else'.


Why? In what way does HashMap not meet your needs, or in what way does it seem inappropriate to you? There may be something better, but I can't recommend anything without a clear picture of how HashMap is falling short.

If you're associating a key with a value, Map is the way to go. If you don't care about the iteration order of the keys, values, or pairings, then HashMap is the default workhorse Map implementation to choose.
 
fred rosenberger
lowercase baba
Bartender
Posts: 12264
36
Chrome Java Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tejaswi Bm wrote:I don't think my code is optimized and can handle really huge input string.

without actual evidence - actual measurements of how long things take - what you "think" is irrelevant. this kind of this has been discussed ad nauseam in the performance forum (where this thread really belongs).

Never try and do optimization without measuring your performance. You will ALWAYS be served better writing clean, readable, understandable code first. OK...second. FIRST you should define what your performance requirements are. Then write your clean, readable, understandable code. Then test it to see if it meets your requirements.

If it does - you're done.

If not, use a profiler to find out exactly where it is slow, and spend your time optimizing that.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13077
6
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Personally I would read the input String as a stream, parsing out words and writing to an output stream - for each word, do a hash table lookup to see if it has a replacement - if so substitute in the output stream. Completely insensitive to problems of string replace in a really large string in memory.

Things to watch out for - case differences, word separators.

Bill
 
Winston Gutkowski
Bartender
Pie
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tejaswi Bm wrote:But i dont think my code is optimized and can handle really huge input string. I have used a HashMap to store the tokens, but i feel the same can be achieved with 'something else'. I am not aware of that 'something else'. So if you could point me in right direction??

Well, like the others I suspect you're obsessing about something which you don't yet know is a problem.

However, if you're absolutely bound and determined to research this problem as an exercise. I'd say they are a few things you need to concentrate on:
1. Searching: I'm not sure about how regexes (ie, Matcher.find()) work, but I do know that indexOf() is sometimes not an optimal search because it's sequential. For more general searches you may want to look at something like the Boyer-Moore-Horspool algorithm, but in your case (searching for a delimiter), it's almost certainly overkill, and I would say indexOf() should be be just fine.

2. Storage of your "token" information: A HashMap sounds perfectly reasonable for your tokens and replacements, but I'd say you don't want to split() the String you're searching.

My advice, starting from the beginning of the String:
a. Find a delimiter.
b. Check if the substring before it matches any of your tokens to be replaced. If it does, copy the replacement token to the output; otherwise just copy the substring.
c. Copy the delimiter to the output.
And repeat that with the substring starting at the character after the delimiter until there are no more delimiters to find. Chances are that you'll still have a "last token" to process, in which case do check (b) on it.

In your case you may have a little bit more work to do, but the basic pattern is the same, and it will run in O(n) time.

Winston
 
B M Tejaswi
Greenhorn
Posts: 11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
With the suggestions that i have got from the learned ones in this forum, i tried to alter the code and have come up with the following code :


I have performed some test and below are the reults-
Number of input words in string = 120000
Number of total encoding pairs = 4
Number of encoded words in output = 20000
Number of iterations = 1000
Total time in millis = 1009033
Avg time in millis = 1009

but am not happy with the results..I want an avg time of around 600 millis..please help
 
Winston Gutkowski
Bartender
Pie
Posts: 10571
64
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Tejaswi Bm wrote:With the suggestions that i have got from the learned ones in this forum, i tried to alter the code
...
but am not happy with the results..I want an avg time of around 600 millis..please help

Well first, your coding style could do with some improvement: Specifically, don't use '_' in field names. It may be great for Python, but it looks awful to most Java programmers. Also, StringTokenizer is a legacy class, and its use is discouraged. You can do exactly the same thing with a Scanner; although either one will be quite a bit slower than the technique I suggested in my previous post.

It seems to me that you're overthinking this thing by a mile. Substitution is very straightforward:
1. Find a token.
2. Check if it's in your HashMap:
  • if it is, add its substitute to the output.
  • 3. If it isn't, check if it's numeric:
  • if it is, add its "encrypted" version to the output.
  • 4. Otherwise, add it to the output as is.

    It also appears that you're replacing all your delimiters with spaces. That's fine, but I didn't see any suggestion that that's what you wanted.

    My advice: Write down ALL the rules for your "replace" in English before you write another line of code.

    Also: have a look at String.substring(). In general, its the fastest way of returning a substring (or eliminating part of a String) because it doesn't involve any copying of the internal character array.

    I'd also suggest that a numericTokenMap is overkill, and probably a lot slower than simply substituting numeric characters from an array, eg:particularly if you're only interested in the characters '0' to '9' (which is NOT the same thing as a 'numeric' character).

    HIH

    Winston
     
    William Brogden
    Author and all-around good cowpoke
    Rancher
    Posts: 13077
    6
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    My advice: Write down ALL the rules for your "replace" in English before you write another line of code.
    ++ good advice

    Be sure to include rules for handling recognizing words with differences in case/CASE/Case ...

    Your method is slow because it involves creation of a number of objects and tries to start with the full string in memory - for speed go with stream processing as Winston and I suggested.

    Bill
     
    B M Tejaswi
    Greenhorn
    Posts: 11
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    But i cannot read the input as stream since i need to prepare the encoding pairs..so i have to read the entire input string and then split based on delimiter!correct me if am wrong..if stream can be tokenized then please do guide me with the code.
     
    William Brogden
    Author and all-around good cowpoke
    Rancher
    Posts: 13077
    6
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    In your OP, you said the "encoding pair" is given. Where does a "encoding pair" come from?

    The Java io package has the StringReader class which will let you read a String as a stream of characters.

    Have you followed Winston's advice to write down ALL the rules of your desired "replace" function?

    Bill

     
    B M Tejaswi
    Greenhorn
    Posts: 11
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    Encoding pair is given as part of input string. Encoding pair is separated by "===" in the input string.
     
    Winston Gutkowski
    Bartender
    Pie
    Posts: 10571
    64
    Eclipse IDE Hibernate Ubuntu
    • Mark post as helpful
    • send pies
    • Quote
    • Report post to moderator
    B M Tejaswi wrote:Encoding pair is given as part of input string. Encoding pair is separated by "===" in the input string.

    Right, so there is some sort of format to your input that your program will have to deal with. This is all part of the "rules" that I mentioned above.

    Have you written them ALL down?

    Winston
     
    • Post Reply
    • Bookmark Topic Watch Topic
    • New Topic