HashSet case insensitive
Hello,

I want to split a string and put the pieces into a set, but without sorting them:

My code sorts the elements.





Can someone give me a hint?

Any suggestion would be helpful.

Thanks,


I'm assuming that, rather than sorted, you'd like them in the order they were found but still disallowing duplicates. If that's the case then I suggest you read the javadocs on LinkedHashSet.
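For readers following along, here is a minimal sketch of what LinkedHashSet gives you. The class name InsertionOrderDemo, the helper uniqueWords and the whitespace delimiter are assumptions made for illustration; the original poster's code is not shown in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class, not code from the thread.
class InsertionOrderDemo {

    // Splits on whitespace (an assumed delimiter) and keeps the first
    // occurrence of each word in the order it was encountered.
    static Set<String> uniqueWords(String text) {
        return new LinkedHashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // LinkedHashSet compares elements with String.equals(), which is
        // case sensitive, so "Hello" and "hello" are both kept.
        System.out.println(uniqueWords("pear Hello apple hello pear"));
        // prints [pear, Hello, apple, hello]
    }
}
```

Insertion order is preserved and exact duplicates are dropped, but the comparison is still case sensitive.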

I tried :



but LinkedHashSet doesn't accept the parameter String.CASE_INSENSITIVE_ORDER.

So, I want to eliminate duplicates regardless of case.

So, if I had hello and Hello, I want to keep only the first occurrence.

CASE_INSENSITIVE_ORDER implies "order", or sorting. I thought you didn't want sorting.
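To illustrate the point, a small sketch of what a TreeSet built with String.CASE_INSENSITIVE_ORDER does. The class name SortedIgnoreCaseDemo and the helper sortedUnique are invented here, and the whitespace delimiter is an assumption.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not code from the thread.
class SortedIgnoreCaseDemo {

    // The comparator decides both which strings count as duplicates
    // and the iteration order of the set.
    static Set<String> sortedUnique(String text) {
        Set<String> words = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        words.addAll(Arrays.asList(text.trim().split("\\s+")));
        return words;
    }

    public static void main(String[] args) {
        System.out.println(sortedUnique("pear Hello apple hello pear"));
        // prints [apple, Hello, pear]
    }
}
```

The duplicates are removed regardless of case and the first spelling is kept, but the encounter order is lost.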

If you want it case insensitive, do you have the option of storing only the lower case copy of each string?

A different approach would be to roll your own LinkedTreeSet class but I don't consider that a beginner option.
You have to consider the semantics of each data structure you use and see if it matches what you want. A mismatch between what you want to do and what the data structure is intended for makes finding good solutions difficult.

A Set normally doesn't imply any ordering. However, a TreeSet does. A List will preserve ordering. You can also sort a List if you want to.

Then there's your requirement to eliminate duplicates. A Set will not allow duplicates based on equals(). A Map has the same semantics for its keys. A List does allow duplicates though.

So, a reasonable solution may need to use a combination of TreeSet and List. Maybe even a Map and a List. Just make sure to use each in a way that is compatible with its inherent capabilities.
 

Dana Ucaed wrote: . . . put into set but without sorting: . . .

If you go and ask a mathematician, you will be told that sets do not support sorting or order as a default. So the iteration order of an ordinary set is unpredictable, and sets supporting some sort of predictable iteration order are special cases.
I thought I'd throw this out in case you find it suits your needs. It is similar to LinkedHashSet but is case insensitive. This would have been tedious to create by hand but the Eclipse IDE makes quick work of generating stubs, so, only a little bit of work on my part. See the main() method at the end which is just a quick and dirty sanity check.
Yes, one solution is to convert to lowercase, but then my output would not be correct.

I must store the original string.



Is this a class assignment?
So, you created a wrapper class around LinkedList.

Thanks Carey.

One nitpick: since String is final, I don't see the use of declaring a parameter of type Collection<? extends String>.
Eclipse created all the stubs. I'm not sure why it created those signatures. Perhaps it wasn't taking into account that the element type, String, is final (?).
I think Junilu hinted at this already, but why not simply use a HashSet where you store the lowercase strings, and an ArrayList where you store the real string if its lowercase version is not already present?
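A minimal sketch of that idea follows. The class name FirstSeenIgnoreCase and the method dedupe are made up for illustration, and lowercasing with Locale.ROOT is an assumption that only suits locale-neutral text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not code from the thread.
class FirstSeenIgnoreCase {

    // Returns the words in encounter order, keeping only the first
    // spelling of each word when compared without regard to case.
    static List<String> dedupe(Iterable<String> words) {
        Set<String> seen = new HashSet<>();      // lowercase copies, for lookup only
        List<String> result = new ArrayList<>(); // original strings, in order
        for (String word : words) {
            // Set.add returns false if the lowercase form was already present.
            if (seen.add(word.toLowerCase(Locale.ROOT))) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("Hello", "pear", "hello", "HELLO", "Pear")));
        // prints [Hello, pear]
    }
}
```

The set is only a lookup aid here; the list is what carries the original spelling and the order.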
Personally, I would wrap my set around a Map<CollationKey, String>, and pass a Collator that handles the normalization into the constructor.

You can extend AbstractSet, and forward most of the requests to the keyset of the map.
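A rough sketch of how such a wrapper might look, under a few assumptions: the class name CollatedStringSet is invented here, a LinkedHashMap is chosen to keep encounter order, iteration goes over the map's values rather than its key set, and null elements are not handled. It is not the class posted in this thread.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a string set whose notion of "duplicate" comes from a Collator.
class CollatedStringSet extends AbstractSet<String> {
    private final Collator collator;
    // Key: normalized form used for lookup. Value: the original string.
    private final Map<CollationKey, String> entries = new LinkedHashMap<>();

    CollatedStringSet(Collator collator) {
        this.collator = collator;
    }

    @Override
    public boolean add(String s) {
        // Keeps the first spelling seen for a given collation key.
        return entries.putIfAbsent(collator.getCollationKey(s), s) == null;
    }

    @Override
    public boolean contains(Object o) {
        return o instanceof String
                && entries.containsKey(collator.getCollationKey((String) o));
    }

    @Override
    public boolean remove(Object o) {
        return o instanceof String
                && entries.remove(collator.getCollationKey((String) o)) != null;
    }

    @Override
    public Iterator<String> iterator() {
        return entries.values().iterator();
    }

    @Override
    public int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // SECONDARY strength ignores case differences but still distinguishes accents.
        collator.setStrength(Collator.SECONDARY);
        CollatedStringSet set = new CollatedStringSet(collator);
        set.add("Hello");
        set.add("hello");
        set.add("World");
        System.out.println(set); // prints [Hello, World]
    }
}
```

The strength passed to the Collator decides what counts as the same string, so the normalization policy stays outside the set itself.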
How about creating your own datatype for this:




It contains your String in its original form and you can use it within any collection:
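The code from this post is not preserved here, so what follows is only a sketch of what such a datatype could look like. The class name CaseInsensitiveString and the precomputed key field are assumptions; the field name dataString is borrowed from a later reply in this thread.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;

// Hypothetical wrapper type: equality ignores case, toString keeps the original.
final class CaseInsensitiveString {
    private final String dataString; // original spelling, used for output
    private final String key;        // lowercase copy, used for equals/hashCode

    CaseInsensitiveString(String dataString) {
        this.dataString = Objects.requireNonNull(dataString);
        this.key = dataString.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CaseInsensitiveString
                && key.equals(((CaseInsensitiveString) o).key);
    }

    @Override
    public int hashCode() {
        // Derived from the same lowercase key as equals(), so the two agree.
        return key.hashCode();
    }

    @Override
    public String toString() {
        return dataString;
    }

    public static void main(String[] args) {
        Set<CaseInsensitiveString> set = new LinkedHashSet<>();
        for (String word : new String[] {"Hello", "hello", "World"}) {
            set.add(new CaseInsensitiveString(word));
        }
        System.out.println(set); // prints [Hello, World]
    }
}
```

Because equality lives in the element type, any standard collection (LinkedHashSet, HashMap keys, and so on) picks up the case-insensitive behaviour without a custom Set.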


output:
I love these discussions.
How about the equals() method also accepting an instanceof String?
 

Carey Brown wrote: How about the equals() method also accepting an instanceof String?


I see a few issues with that:
  • The set will not benefit from that change in any way.
  • The relation will no longer be an equivalence relation, because it breaks symmetry: x.equals(y) can be true while y.equals(x) is false.
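The symmetry problem can be shown with a small sketch. The class LenientKey is made up for illustration and deliberately includes the contested instanceof String branch; it is not the class from the earlier post.

```java
import java.util.Locale;

// Hypothetical class whose equals() also accepts a plain String.
final class LenientKey {
    private final String key; // lowercase copy used for comparison

    LenientKey(String dataString) {
        this.key = dataString.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof String) {
            // The contested branch: treat a raw String as a possible match.
            return key.equals(((String) o).toLowerCase(Locale.ROOT));
        }
        return o instanceof LenientKey && key.equals(((LenientKey) o).key);
    }

    @Override
    public int hashCode() {
        return key.hashCode();
    }

    public static void main(String[] args) {
        LenientKey x = new LenientKey("Hello");
        String y = "hello";
        System.out.println(x.equals(y)); // prints true
        // String.equals() only ever returns true for another String.
        System.out.println(y.equals(x)); // prints false
    }
}
```

String is final and its equals() cannot be taught about the wrapper, so the relation can never be made symmetric from the wrapper's side alone.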
The comparison in equals() could have been implemented with equalsIgnoreCase(), which may be more efficient and handles the case where this.dataString is mixed case.
Awesome! I agree that equalsIgnoreCase would be better.
Keep in mind that toLowerCase() and equalsIgnoreCase() do not work for many languages. Case conversions are locale sensitive.
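A well-known instance of this is the Turkish dotted and dotless i. The sketch below uses a hypothetical class LocaleCaseDemo with a trivial helper lower; only String.toLowerCase(Locale) and Locale.forLanguageTag are real API.

```java
import java.util.Locale;

// Hypothetical demo class showing that lowercasing depends on the locale.
class LocaleCaseDemo {

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale-neutral rules: "I" becomes "i".
        System.out.println(lower("TITLE", Locale.ROOT).equals("title"));      // prints true
        // Turkish rules: "I" becomes dotless i (U+0131), so the result differs.
        System.out.println(lower("TITLE", turkish).equals("t\u0131tle"));    // prints true
    }
}
```

The no-argument toLowerCase() uses the JVM's default locale, so the same program can deduplicate differently on differently configured machines.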
    Which is why you can't use equalsIgnoreCase in an equals method, because there is no matching hash code method. I prefer the toLowerCase solution. If you want to prevent the creation of new Strings you can create some utility methods (e.g. equalsLowerCase(String s1, String s2), hashCodeLowerCase(String s) and possibly even compareToLowerCase(String s1, String s2)).
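A sketch of what two of those utility methods might look like. The method names come from the post above, but the class name LowerCaseStrings and the per-character folding are assumptions; this is a locale-insensitive, char-by-char approach and so is only suitable for locale-neutral strings.

```java
// Hypothetical utility class; compares and hashes without allocating new Strings.
final class LowerCaseStrings {
    private LowerCaseStrings() {
    }

    // Folds a single char the same way for both equality and hashing,
    // so the two methods below stay consistent with each other.
    private static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    static boolean equalsLowerCase(String s1, String s2) {
        if (s1.length() != s2.length()) {
            return false;
        }
        for (int i = 0; i < s1.length(); i++) {
            if (fold(s1.charAt(i)) != fold(s2.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    static int hashCodeLowerCase(String s) {
        // Same polynomial as String.hashCode(), applied to the folded chars.
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + fold(s.charAt(i));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(equalsLowerCase("Hello", "hELLO"));                        // prints true
        System.out.println(hashCodeLowerCase("Hello") == hashCodeLowerCase("HELLO")); // prints true
    }
}
```

The important property is that both methods go through the same fold step, so equal strings always produce equal hash codes.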
    Or, you can just use a CollationKey, which is like a String but stripped of things like casing, accents and composition, depending on the strength of the Collator used. It has equals(), hashCode() and compareTo() methods that take these collation rules into account.
    CollatedSet is actually a bit of a misnomer, since the strings are not returned in the order defined by the collator, unless you construct it with a SortedMap implementation.
     

Rob Spoor wrote: Which is why you can't use equalsIgnoreCase in an equals method, because there is no matching hash code method. I prefer the toLowerCase solution. If you want to prevent the creation of new Strings you can create some utility methods (e.g. equalsLowerCase(String s1, String s2), hashCodeLowerCase(String s) and possibly even compareToLowerCase(String s1, String s2)).


I was looking at Java's source code for equalsIgnoreCase() and they do an interesting thing: they compare the chars to see if they're equal; if not, they compare the lower case of the chars; if those differ, they make yet another test comparing the upper case of the chars. This leads me to think that String#toLowerCase() is not symmetrical to String#toUpperCase(), and so computing a hash code based on a String returned from toLowerCase() might have an issue in some languages. How widespread is this issue? I couldn't say, but I doubt it will impact me. Not a good stance for a production-level programmer, but what's a body to do?
    You simply should never use toLowerCase(), unless you can guarantee that the strings are in a neutral language (such as string constants defined in your application, not intended for human reading).

    For locale sensitive normalization, use Collator.
    After thinking about this problem a little bit more, I determined it's impossible to write a valid Set implementation that does this.

    It's not possible to have a valid implementation for equals() and hashCode() and also have a valid implementation for removeAll() and retainAll(), and vice versa. You can use the class I wrote above, but then you must not let it implement the Set interface. It can extend AbstractCollection though.
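One way to see this kind of contract conflict is with the JDK's own TreeSet and String.CASE_INSENSITIVE_ORDER rather than the class from this thread. The sketch below (class name SetContractDemo is invented) shows the equals() side of the problem; it does not exercise removeAll() or retainAll().

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo: a case-insensitive set and an ordinary set disagree about equality.
class SetContractDemo {

    static Set<String> caseInsensitive(String... words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        Collections.addAll(set, words);
        return set;
    }

    public static void main(String[] args) {
        Set<String> ci = caseInsensitive("Hello");
        Set<String> plain = new HashSet<>(Collections.singleton("hello"));

        // Each side answers using its own contains(), so the results differ.
        System.out.println(ci.equals(plain)); // prints true
        System.out.println(plain.equals(ci)); // prints false
    }
}
```

Set.equals() is specified in terms of the elements' own equals(), so a set that uses a different notion of sameness cannot agree with ordinary sets in both directions.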
     

Carey Brown wrote: I was looking at Java's source code for equalsIgnoreCase() and they do an interesting thing: they compare the chars to see if they're equal; if not, they compare the lower case of the chars; if those differ, they make yet another test comparing the upper case of the chars. This leads me to think that String#toLowerCase() is not symmetrical to String#toUpperCase(), and so computing a hash code based on a String returned from toLowerCase() might have an issue in some languages. How widespread is this issue? I couldn't say, but I doubt it will impact me. Not a good stance for a production-level programmer, but what's a body to do?



    German is one of those languages... have a look at this little code example
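The snippet from this post is not preserved here. A sketch of the usual German illustration, the sharp s, might look like this; the class name SharpSDemo and the helper roundTrip are invented.

```java
import java.util.Locale;

// Hypothetical demo: upper-casing and lower-casing are not inverse operations.
class SharpSDemo {

    static String roundTrip(String s) {
        return s.toUpperCase(Locale.GERMAN).toLowerCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        String street = "stra\u00dfe"; // "strasse" spelled with a sharp s (U+00DF)

        // The sharp s upper-cases to two letters, "SS".
        System.out.println(street.toUpperCase(Locale.GERMAN)); // prints STRASSE
        // Lower-casing again does not bring the sharp s back.
        System.out.println(roundTrip(street));                 // prints strasse
        // equalsIgnoreCase works char by char, so the length difference makes it fail.
        System.out.println(street.equalsIgnoreCase("STRASSE")); // prints false
    }
}
```

So two strings a German reader would call the same word can end up with different lower-case forms, different upper-case forms, or different lengths, depending on which conversion you pick.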


         
         
    This is what comes from having learnt German as a first language: Paul's example looks perfectly normal to me. Only the bit I thought was normal on first reading was all the concatenation of method calls, since German doesn't have words. It has multiple concatenations of words none of them less than 0x87358bfa letters long.
Carey's solution works.

The simplest solution is to use regex, but I wanted to avoid regex.

I am very glad to see some discussion.



    How is using a regex going to help you in this situation?

