Replace characters in a String with a space if they do NOT exist in an allowed set: fastest, best way?

 
Frank Jacobsen
Ranch Hand
Posts: 378
I have this character set of Latin and Scandinavian characters:





I have some strings to validate: if they contain characters that do NOT match this set, I must replace each such character with a space.
Is this possible via regex, or is there a better way? I am thinking about speed and nice, readable code.

Kind regards
Frank
 
Tim Cooke
Marshal
Posts: 5753
I don't think "validate" is the right term for this. Validation is asking the question "does this meet the criteria? Yes or No". What you are doing is a data transform with rules.

I'd be interested to see how you've gotten on so far with tackling the problem to know where to start with helping. Can you share your progress please?

I'm particularly interested in the data types and storage locations for the input data. Are they Java Strings? Are they files? How large are the input Strings?

You mention "best" and "fast" but don't give any criteria for those.
  • What does "best" mean to you?
  • How fast is fast enough?
    fred rosenberger
    lowercase baba
    Posts: 13091
    The best solution for doing things "fastest" is to buy better hardware - CPUs, memory, etc.
     
    Frank Jacobsen
    Ranch Hand
    Posts: 378
    They are Java Strings: there are 8 strings, 40 to 140 characters long, holding names and addresses.

    And yes, that is correct: it is a data transform with rules.

    I just tried these few lines. They will probably solve it, but I think there may be a better and faster method?

    Better = nicer code

    Fast   = I have to call this method 200,000 times every day. Will it run in less than 100 milliseconds? I can do the performance measurement myself, but maybe some of you know regex or other ways to do this more optimally.





     
    Tim Cooke
    Marshal
    Posts: 5753
    Your method takes a String and returns a StringBuffer? I might return a String, but maybe you have a good reason for returning a StringBuffer?

    I would write some tests to verify the behaviour with different input strings.

    Again, define your criteria for "better" and "faster"? How fast is fast enough?
     
    Carey Brown
    Bartender
    Posts: 10966
    Two things:

    Comparing >0 will fail for 'a'.

    And if you squeeze all the spaces out of your compare string it will run somewhat faster, though not noticeably.
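The first point can be illustrated with a minimal sketch. It assumes the check in question was written as indexOf(c) > 0; the original code was not preserved in this thread, so the class name, method names, and the short ALLOWED string below are hypothetical.

```java
class IndexOfCheck {
    // Hypothetical allowed set; the real list from the thread was not preserved.
    static final String ALLOWED = "abcABC";

    // Buggy: indexOf returns 0 for the first character of ALLOWED,
    // so '> 0' wrongly rejects it.
    static boolean wrongCheck(char c) {
        return ALLOWED.indexOf(c) > 0;
    }

    // Correct: indexOf returns -1 only when the character is absent.
    static boolean rightCheck(char c) {
        return ALLOWED.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(wrongCheck('a')); // false, although 'a' is allowed
        System.out.println(rightCheck('a')); // true
        System.out.println(rightCheck('#')); // false
    }
}
```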
     
    Tim Cooke
    Marshal
    Posts: 5753
    I find your "allowedStrings" format a little ambiguous, which makes it difficult for me to reason about it. Is whitespace a valid character? Or is the whitespace only used to separate characters? Are "Ø0" and "øA" single characters or two characters? And when I say character I mean a Java char or Character type. Two characters is a String.

    You might remove this ambiguity by representing your allowed characters as a Set of Character type Set<Character>.
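A rough sketch of the Set<Character> idea follows. The allowed characters here (ASCII letters, digits, space, and AE, O-slash, A-ring in both cases) are an assumption, since the original list is not shown in the thread; the class and method names are also made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class AllowedSet {
    // Assumed allowed characters; the non-ASCII ones are written as Unicode
    // escapes so the source compiles regardless of file encoding.
    private static final Set<Character> ALLOWED = new HashSet<>();
    static {
        String chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
                + "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
        for (char c : chars.toCharArray()) {
            ALLOWED.add(c);
        }
    }

    // Returns a copy of input with every character outside ALLOWED replaced by a space.
    static String sanitize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(ALLOWED.contains(c) ? c : ' ');
        }
        return sb.toString();
    }
}
```

With a Set, each element is unambiguously one Character, and whether a space is allowed is decided by whether ' ' is in the set.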
     
    Matthew Bendford
    Rancher
    Posts: 326
    You can use regex for this one: just define a character class with all allowed characters and add a ^ at its start as negation. Then you can use String.replaceAll(), which takes this regex, and give a single space as the replacement. I'm not sure how "fast" that will be, but it will most likely take less than your 100ms: when it's called 200k times (which is rather trivial, btw) it will surely get compiled down to more efficient machine code and run natively. That's the "just in time compilation" part of Java: if the VM sees something executed really often, it gets compiled down to lower-level code which can run directly on the host hardware instead of interpreting byte code over and over again.
    For a dataset of 200k entries I'm sure this will only take a couple of seconds. So 200k a day is not much; what's your code doing the other 23h59m30s?
    Also: don't use StringBuffer; use StringBuilder instead. Read the docs to learn why.
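A minimal sketch of the regex approach, under the assumption that the allowed set is ASCII letters, digits, space, and the six Scandinavian letters (the real list is not shown in the thread). It precompiles the pattern once instead of calling String.replaceAll(), which compiles the regex on each call; the class name is hypothetical.

```java
import java.util.regex.Pattern;

class RegexSanitizer {
    // [^...] is a negated character class: it matches any single character
    // that is NOT listed. The listed characters are an assumed allowed set.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^a-zA-Z0-9 \u00C6\u00D8\u00C5\u00E6\u00F8\u00E5]");

    // Replaces every character outside the allowed set with a single space.
    static String sanitize(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Frank is# a star *"));
    }
}
```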
     
    Frank Jacobsen
    Ranch Hand
    Posts: 378
    I use a string buffer because it is faster to append than to create a new string each time.

    Faster = the fastest way this can be done.

    I am now comparing >= 0 and have removed all the spaces in allowedStrings.

    They are all single characters.

    Stay tuned on this channel; I will try to write some tests on the code posted to check that it works, plus a performance test with 100000 strings.



    And thanks out there  
     
    Carey Brown
    Bartender
    Posts: 10966
    If you are going to set up a framework to do performance testing, you might as well test Matthew's regex approach as well.

     
    Tim Holloway
    Saloon Keeper
    Posts: 28420
    There are 86400 seconds in a day, so if the only thing the system was doing was string conversion, you'd have plenty of spare capacity. Although in actuality, I'd bet that during some times of the day you'd see more requests than in others.

    In any event, on modern CPUs, I'd estimate that you are extremely pessimistic at 100msec per conversion. More likely, it would take microseconds, not milliseconds, even with relatively inefficient processes.

    Java puts some particular constraints on this. Since Java Strings are immutable, you have to build a new string for each conversion. In languages like C, you could convert in place as long as you didn't change the overall string length. Say by changing ö to oe or ñ to n~. Some hardware, such as IBM mainframes could potentially go one step further, as they have actual machine-language instructions that can directly convert an entire string in one command.

    As for techniques, I don't think regexes are likely to be the fastest. A regex is typically compiled into a series of instructions for a regex finite-state machine (FSM) and the FSM then interprets the instructions. This can be faster and more compact than discrete Java code in some cases, but even faster would be a table lookup like mainframes do. The catch here is that in Unicode, the potential code point set is much larger than a simple ASCII/EBCDIC conversion table would be, although if you are working with a sparse table, you can break it up into sub-tables and use a relatively small set of "if" statements to locate and apply the correct sub-table.

    As always, though, don't trust anything just because it "looks" efficient. Benchmark it!
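The table-lookup idea could be sketched roughly as below. It assumes every allowed character has a code below 256 (true for ASCII and the Scandinavian letters used here, which are themselves an assumed set), so a single 256-entry boolean table suffices and anything above it is rejected by one range test. Names are hypothetical.

```java
class TableSanitizer {
    // One flag per char value 0..255; true means the character is allowed.
    private static final boolean[] ALLOWED = new boolean[256];
    static {
        // Assumed allowed set; the original list is not preserved in the thread.
        String chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
                + "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
        for (int i = 0; i < chars.length(); i++) {
            ALLOWED[chars.charAt(i)] = true;
        }
    }

    // Copies the input into a char[] once, overwrites disallowed characters
    // in that copy, and builds the result String from it.
    static String sanitize(String input) {
        char[] chars = input.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= ALLOWED.length || !ALLOWED[c]) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }
}
```

Each character costs one bounds test and one array read, independent of how many characters are allowed. Whether that beats indexOf() on a short string in practice is something only a benchmark can show.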
     
    Matthew Bendford
    Rancher
    Posts: 326

    Frank Jacobsen wrote:I use a string buffer, because it is faster to append than to create a new string each time.


    As, for some reason, it seems to matter that much to you, here's a quote from the StringBuffer docs:

    As of release JDK 5, this class has been supplemented with an equivalent class designed for use by a single thread, StringBuilder. The StringBuilder class should generally be used in preference to this one, as it supports all of the same operations but it is faster, as it performs no synchronization.

     
    Tim Holloway
    Saloon Keeper
    Posts: 28420
    Realistically, almost no one should be using StringBuffer anymore. Unless you've got some really weird app that builds Strings from several concurrent threads, and even then, timing considerations would have to apply.

    StringBuffer, like Vector, is either the result of dire pessimism on the part of Java's original designers or possibly used internally for critical JVM functions.

    But for the rest of us: StringBuilder.
     
    fred rosenberger
    lowercase baba
    Posts: 13091

    Frank Jacobsen wrote:
    Faster = The fastest way this can be done.


    If you really want "The fastest way this can be done", buy a better CPU. Then wait a few months, and there will be even BETTER hardware, so you can then buy that to get more speed, etc. Make sure you kill all other processes on whatever machine you are running so nothing else steals CPU time. If you are bound by disk reads, get a better disk drive. And so on...

    What we're trying to get at is that this is not a spec.  Something can always be done to make it faster. The question becomes "Is it worth it to make it faster than it is now?"  and/or "how fast is fast enough?".  If it can process each call in 10 microseconds, is it worth it to spend $1,000,000 to get it down to 9 microseconds?  That would be faster.  Then do you spend another $2million to get it down to 8?

    Best practices say you define what the speed needs to be before you start optimizing. Then you look at where your bottlenecks are, what it would take to improve each, and at what cost.  Then you decide which ones are worth the cost.  

    "The fastest way it can be done" is an unobtainable goal, as there is (for all intents and purposes) no end point.
     
    Tim Holloway
    Saloon Keeper
    Posts: 28420
    You could probably jack up the speed even further if you offloaded onto a graphics co-processor or other parallel processing system. That can gain you considerably more even than simple raw CPU power. And I'd note that actual CPU speeds seem to have pretty much topped out in recent years in favor of more and more GigaHertz cores.

    When I was young in the IT field I was in the OS support group for a large IBM mainframe. The only suitable programming language was Assembler, since Fortran, COBOL and PL/1 had certain overheads that OS internal code couldn't tolerate. I obsessed about both CPU efficiency and memory efficiency, as even the top-line floor-spanning water-cooled models back then were less powerful than an Apple Watch. I can emulate one at faster than original speed on a $35 Raspberry Pi (if they ever get over the shortage!)

    The IBM System/370 line was limited to 16MB address space. About the smallest JVM RAM footprint I've seen was around 128MB.

    All of which is to say that you're going to blow the doors off old-time computing with even your worst efforts on any modern machine - even an Apple Watch.

    In fact, the job that made me walk away from large enterprise business was so obsessed with people efficiency, that they'd actually demand crap speed, crap reliability, and (unspoken) crap security as long as I could just "git 'er Dun!".

    So give it a decent try, and unless someone has specific complaints about production performance, leave it at that. You'll have a more prosperous career.
     
    Frank Jacobsen
    Ranch Hand
    Posts: 378
    Maybe there is a faster way, but I thought it would take much longer. So I'm going with this solution.

    Tested it 100000 times in a loop with this string:

    String testString = "Frank is# a star *";

    42 milliseconds.



    Thanks for all your input out there.

    Kind regards
    Frank  
     
    Campbell Ritchie
    Marshal
    Posts: 80128
    I think that method is unreliable because it returns a mutable and thread‑unsafe reference type, viz. a StringBuilder. You don't know what users will do with that StringBuilder object. I suggest you call toString() on it and return a String reference.
    You have an inefficiency in that you are using the + operator on a String; it would be better to declare the String and compose it with +s in the same expression; what's more that will make the String object into a compile‑time constant and all the String composition will be done by the javac tool. It will also make line 4 short enough to read.
    Don't use milliseconds; use System.nanoTime() instead.
    Run your exercise twice and only time the second run; that will allow the runtime to make any optimisations before you start timing. For example, line 4 might be optimised to run only once.
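The timing advice above might look something like the following sketch. The sanitize method is a stand-in based on the indexOf() approach discussed in the thread (the original code is not preserved), and the allowed set, class name, and iteration count are assumptions. For serious measurements a dedicated harness such as JMH is generally more trustworthy than a hand-rolled loop.

```java
class TimingSketch {
    // Assumed allowed set, declared as a compile-time constant field.
    static final String ALLOWED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
            + "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";

    // Accumulates results so the loop's work cannot be discarded as unused.
    static int sink;

    // Stand-in for the method under test; returns a String, not a StringBuilder.
    static String sanitize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(ALLOWED.indexOf(c) >= 0 ? c : ' ');
        }
        return sb.toString();
    }

    // Runs sanitize() the given number of times and returns elapsed nanoseconds.
    static long timeRun(int iterations) {
        String testString = "Frank is# a star *";
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += sanitize(testString).length();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        timeRun(100_000);                 // warm-up run, result ignored
        long nanos = timeRun(100_000);    // only the second run is reported
        System.out.println("Second run: " + nanos / 1_000_000.0 + " ms");
    }
}
```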
     
    Campbell Ritchie
    Marshal
    Posts: 80128
    The above should be changed so lines 3‑12 create a field and for reasons of performance, they should not be in the method. Example. Note that only ASCII and Latin extended characters are accessible to the current implementation. I think array access will be faster (constant time complexity) than searching a String (=linear time complexity).
     
    Tim Holloway
    Saloon Keeper
    Posts: 28420

    Campbell Ritchie wrote:I think that method is unreliable because it returns a mutable and thread‑unsafe reference type, viz. a StringBuilder. You don't know what users will do with that StringBuilder object. I suggest you call toString() on it and return a String reference.
    You have an inefficiency in that you are using the + operator on a String; it would be better to declare the String and compose it with +s in the same expression; what's more that will make the String object into a compile‑time constant and all the String composition will be done by the javac tool. It will also make line 4 short enough to read.
    Don't use milliseconds; use System.nanoTime() instead.
    Run your exercise twice and only time the second run; that will allow the runtime to make any optimisations before you start timing. For example, line 4 might be optimised to run only once.



    I won't agree in principle, but the sample manufactures a unique StringBuilder each time it is invoked, so any multi-threading issues are only going to happen if the caller does something that's not thread-safe to it. Still, I concur. Return a String. No benefits to returning a StringBuilder unless you're going to do some sort of post-processing.

    The problem with "allowedStrings" isn't the use of the concatenation operator. The context makes it obvious that constant-folding optimisation can be done at compile time. However, we do have ways to split long strings in the source code for readability and I recommend them.

    Also, more importantly, since it's a constant string, make it static and final. That will help ensure optimisation and avoid possible errors when the code is maintained.

    Finally, indexOf() means a linear search of allowedStrings. As I said previously, it's much faster, though more complicated, to do table lookups (random versus sequential access). Or, for this particular case, a series of conditional statements. E.g.: if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || ...)

    For the allowed characters that aren't in a sequence, put them in a switch/case and the compiler will optimise them.
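A small sketch of the range-test-plus-switch variant, with the same caveat as elsewhere in this thread: the allowed set (ASCII letters, digits, space, and AE, O-slash, A-ring in both cases) is assumed, and the names are illustrative only.

```java
class RangeSanitizer {
    // Contiguous ranges are tested with comparisons; the scattered
    // characters go into a switch.
    static boolean isAllowed(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
            return true;
        }
        switch (c) {
            case ' ':
            case '\u00C6': // AE, upper case
            case '\u00D8': // O-slash, upper case
            case '\u00C5': // A-ring, upper case
            case '\u00E6': // ae, lower case
            case '\u00F8': // o-slash, lower case
            case '\u00E5': // a-ring, lower case
                return true;
            default:
                return false;
        }
    }

    // Replaces every disallowed character with a space and returns a String.
    static String sanitize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(isAllowed(c) ? c : ' ');
        }
        return sb.toString();
    }
}
```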

    (Sorry about delayed response. Tropical Storm Nicole took out my Internet).
     
    Campbell Ritchie
    Marshal
    Posts: 80128

    Tim Holloway wrote:. . . The problem with "allowedStrings" isn't the use of the concatenation operator. . . . constant-folding optimisation can be done at compile time.

    It is only a tiny delay but since OP didn't make anything final, (as you pointed out) the + operator in line 5 has to run at runtime rather than compile time.

    However, we do have ways to split long strings . . .

    As I showed; my concatenated Strings are compile‑time constants.
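What the language itself guarantees here can be checked with a small experiment, sketched below with made-up names. A concatenation of literals in a single expression is a constant expression, so javac folds it and the result is the same interned object as the equivalent literal. Appending to a non-final variable is not a constant expression, so the concatenation is performed at run time and yields a new String object. This only shows what javac and the language specification do; it says nothing about what the JIT compiler may optimise later.

```java
class ConstantFolding {
    // Constant expression: folded by javac into the single literal "abcdef".
    static final String FOLDED = "abc" + "def";

    // Not a constant expression: s is a non-final variable, so the
    // concatenation happens at run time and creates a new String.
    static String runtimeConcat() {
        String s = "abc";
        s += "def";
        return s;
    }

    public static void main(String[] args) {
        // Identity comparison (==) is used on purpose to reveal interning.
        System.out.println(FOLDED == "abcdef");               // true
        System.out.println(runtimeConcat() == "abcdef");      // false
        System.out.println(runtimeConcat().equals("abcdef")); // true
    }
}
```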
     
    Campbell Ritchie
    Marshal
    Posts: 80128

    Yesterday, I wrote:

    I missed a bit out; that should read
     
    Tim Holloway
    Saloon Keeper
    Posts: 28420

    Campbell Ritchie wrote:... since OP didn't make anything final, (as you pointed out) the + operator in line 5 has to run at runtime rather than compile time.



    Actually, that's not what I said. Since nothing alters or even reads allowedLetters between lines 4 and 5, it's a trivial optimisation to do compile-time folding. There is no need to do anything at runtime there.
     