I'm working on a Java project where I need to create a filter for uploaded names.
The names are uploaded to the DB from all over the world, so, for example, I don't want to blacklist a name like "Alassandra", which has "ass" in it.
I would appreciate some guidance, pointers on where to look, or algorithm tips for creating this type of filter.
This seems pretty simple: check whether the word contains profanity. If not, the word is OK. If it does, check whether the word is in the list of acceptable names. If it is, the word is OK; otherwise the word is bad.
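The check described above could be sketched roughly as follows. This is only an illustration: the class name `NameFilter`, the method `isAllowed`, and the sample list contents are made up here, and a real filter would load its lists from a file or a table.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

class NameFilter {
    // Hypothetical sample data, not a real word list.
    private static final List<String> BLACKLIST = List.of("ass", "damn");
    private static final Set<String> WHITELIST = Set.of("alassandra", "cassandra");

    /** Returns true if the name may be stored. */
    static boolean isAllowed(String name) {
        String lower = name.toLowerCase(Locale.ROOT);

        // Step 1: does the name contain any blacklisted substring?
        boolean containsProfanity = BLACKLIST.stream().anyMatch(lower::contains);
        if (!containsProfanity) {
            return true;
        }

        // Step 2: it does, so accept it only if it is a known acceptable name.
        return WHITELIST.contains(lower);
    }
}
```

With these sample lists, `NameFilter.isAllowed("Alassandra")` is accepted through the whitelist, while a name that contains "ass" and is not whitelisted is rejected.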
That's pretty similar to what I was thinking. I was thinking of creating a whitelist and a blacklist. Before comparing a name to the lists, I should check for non-alphabetic symbols ($@|) and replace any I find with the letter they stand for. Then the application should compare the name against the blacklist and whitelist to see if there is a match.
The list will be pretty big because I need to have bad words from many different languages. Maybe I need to implement different blacklists/whitelists for different countries?
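A minimal sketch of that symbol-replacement step, assuming a small hand-written substitution table. The class `NameNormalizer` and the mappings themselves are guesses for illustration; some symbols are ambiguous (`|` could stand for "l" or "i"), so a real table would need more thought.

```java
import java.util.Locale;
import java.util.Map;

class NameNormalizer {
    // Assumed look-alike mappings; extend or change as needed.
    private static final Map<Character, Character> SUBSTITUTIONS = Map.of(
            '$', 's',
            '@', 'a',
            '|', 'l',
            '0', 'o',
            '1', 'i',
            '3', 'e');

    /** Lower-cases the name and replaces look-alike symbols with letters. */
    static String normalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
            char mapped = SUBSTITUTIONS.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }
}
```

The normalized string is what would then be compared against the blacklist and whitelist, so "@$$" is checked as "ass".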
The thing with profanity filters is that your implementation can range from very permissive to very strict. If you are too permissive, users will find a way to circumvent it. If you are too strict, you might block legitimate names (for example, a profanity filter that uses spell check might block a name like Anas Ashfaq). What you might want to do is implement a heuristic that produces a profanity score. If the score is too high, block the name; if it's low enough, allow it; if it's somewhere in between, flag it for an admin to look at. Ultimately, the best profanity filter is a human.
Also, just having a profanity filter in your application changes the behavior of the users. It discourages some users and emboldens others.
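One way the score idea might look in code. Everything specific here is an assumption for the sake of the example: the class `ProfanityScorer`, the per-word weights, and both thresholds are invented, and a real heuristic would likely use more signals than substring matches.

```java
import java.util.Locale;
import java.util.Map;

class ProfanityScorer {
    enum Verdict { ALLOW, REVIEW, BLOCK }

    // Hypothetical weights: higher means more offensive.
    private static final Map<String, Integer> WEIGHTS = Map.of("ass", 3, "crap", 8);
    // Hypothetical thresholds separating the three outcomes.
    private static final int REVIEW_THRESHOLD = 3;
    private static final int BLOCK_THRESHOLD = 8;

    /** Sums the weights of all blacklisted substrings found in the name. */
    static int score(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        int total = 0;
        for (Map.Entry<String, Integer> entry : WEIGHTS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    /** High score: block. Middle score: flag for an admin. Low score: allow. */
    static Verdict classify(String name) {
        int s = score(name);
        if (s >= BLOCK_THRESHOLD) {
            return Verdict.BLOCK;
        }
        if (s >= REVIEW_THRESHOLD) {
            return Verdict.REVIEW;
        }
        return Verdict.ALLOW;
    }
}
```

With these made-up numbers, "Alassandra" lands in the middle band and goes to an admin instead of being blocked outright.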
Thanks Jayesh. That is something my workmate and I have talked about, and it is probably the best way to go.
Do you have any tips on how I should create the filter (algorithms or an API)? I have read some tutorials on regex. I have also looked at some algorithms (Aho-Corasick) but didn't find them useful for my application.
Just played around with your link, William. My name: no match. My wife's name: no match. Barack: no match. Obama: no match. Mitt: match. Romney: match. I tried some of my kid's classmates. I don't know if it's a coincidence, but it matches all the white kids in his class. There is a definite bias towards European names in the Moby Words list.
In a previous life, we used Soundex to do spell checks on users' searches, and it worked pretty well. We had a blacklist of swear words, and we basically dropped blacklisted words from the search terms. So, if a user searched for "fuck you", we treated the search as "you". If the user misspelled a word, we used Soundex to spell-correct it. So, if they typed "fuk you", we would search for something like "fug you OR fugue you" (something like that).
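The term-dropping part of that could be sketched like this. It is not the original system's code: `SearchSanitizer` and `dropBlacklisted` are hypothetical names, and the one-word blacklist only mirrors the example in the post.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class SearchSanitizer {
    // Hypothetical blacklist holding only the word used in the post's example.
    private static final Set<String> BLACKLIST = Set.of("fuck");

    /** Removes blacklisted terms from a whitespace-separated search query. */
    static String dropBlacklisted(String query) {
        return Arrays.stream(query.trim().split("\\s+"))
                .filter(term -> !BLACKLIST.contains(term.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }
}
```

The Soundex-based spell correction is left out of this sketch; it would run on the terms that survive the filter.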
This won't work directly for the OP, but the way I would do it is to do some sort of string matching against blacklisted words. If the name matches a word in the blacklist, block the name. If it passes the blacklist check, compare its Soundex code with a brownlist of Soundex codes. If it matches, flag the name for review by an admin.
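A rough sketch of that two-stage check, with a hand-rolled American Soundex so it runs without extra libraries. The assumptions are mine: the class `SoundexNameCheck`, the one-word sample blacklist, exact matching as the "string matching" step, and a brownlist derived from the blacklist's own Soundex codes. Soundex is designed around English pronunciation, so it may behave poorly on names from other languages.

```java
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class SoundexNameCheck {
    enum Verdict { ALLOW, REVIEW, BLOCK }

    // Soundex digit for each letter A..Z; '0' means the letter is not coded.
    private static final String CODES = "01230120022455012623010202";

    // Hypothetical sample blacklist.
    private static final Set<String> BLACKLIST = Set.of("damn");
    // Assumption: the brownlist is simply the Soundex codes of the blacklist.
    private static final Set<String> BROWNLIST =
            BLACKLIST.stream().map(SoundexNameCheck::soundex).collect(Collectors.toSet());

    /** American Soundex: first letter plus three digits, zero-padded. */
    static String soundex(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder code = new StringBuilder().append(letters.charAt(0));
        char previous = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && code.length() < 4; i++) {
            char c = letters.charAt(i);
            char digit = CODES.charAt(c - 'A');
            if (digit != '0' && digit != previous) {
                code.append(digit);
            }
            // H and W do not separate letters that share a code; vowels do.
            if (c != 'H' && c != 'W') {
                previous = digit;
            }
        }
        while (code.length() < 4) {
            code.append('0');
        }
        return code.toString();
    }

    /** Exact blacklist hit: block. Sound-alike hit: review. Otherwise allow. */
    static Verdict check(String name) {
        if (BLACKLIST.contains(name.toLowerCase(Locale.ROOT))) {
            return Verdict.BLOCK;
        }
        if (BROWNLIST.contains(soundex(name))) {
            return Verdict.REVIEW;
        }
        return Verdict.ALLOW;
    }
}
```

Note that with this sample list the perfectly ordinary name "Dan" shares a Soundex code with the blacklisted word, which is why a sound-alike match only flags the name for an admin and does not block it.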