Creating a profanity filter for user generated content

 
Arvin Moradi
Greenhorn
Posts: 3
Hi guys!

I'm working on a Java project where I need to create a filter for uploaded names.
The names are uploaded to the database from all over the world, so, for example, I don't want to blacklist a name like "Alassandra" just because it contains "ass".

I would appreciate some guidance on where to look, or some algorithm tips for creating this kind of filter.

Regards
Arvin
 
Jesper de Jong
Java Cowboy
Posts: 16084
Welcome to the Ranch.

This seems pretty simple: Check if the word you are checking contains profanity. If not, then the word is OK. If yes, then check if the word is in the list of acceptable names. If yes, then the word is OK. Otherwise the word is bad.
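A minimal sketch of that check, assuming Java 9+ for `Set.of`. The class name `NameFilter` and both word lists are hypothetical placeholders; real lists would be loaded from a file or the database.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical class and lists, for illustration only.
class NameFilter {
    // Placeholder entries; a real deployment would load these from storage.
    static final Set<String> BLACKLIST = Set.of("ass", "damn");
    static final Set<String> ACCEPTED_NAMES = Set.of("alassandra", "cassandra");

    /** Returns true when the name may be stored. */
    static boolean isAllowed(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        // Step 1: does the name contain any blacklisted substring?
        boolean containsProfanity = BLACKLIST.stream().anyMatch(lower::contains);
        if (!containsProfanity) {
            return true;
        }
        // Step 2: it does, so it is only OK if it is a known acceptable name.
        return ACCEPTED_NAMES.contains(lower);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("Alassandra")); // true: contains "ass" but is on the accepted list
        System.out.println(isAllowed("Arvin"));      // true: no blacklisted substring
        System.out.println(isAllowed("Assface"));    // false: blacklisted substring, not an accepted name
    }
}
```

The substring scan is linear in the size of the blacklist, which is fine for a small list but is the part you would swap out for something smarter if the list gets large.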
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
For really big lists of bad words, look into the algorithms used in spelling checkers.

Bill
 
Arvin Moradi
Greenhorn
Posts: 3
Thanks Jesper de Jong and William Brogden!

That's pretty similar to what I was thinking. I was planning to create a whitelist and a blacklist. Before comparing a name to the lists, I should check for non-alphabetic symbols ($@|) and replace any I find with the corresponding letters. The application should then compare the result against the blacklist and whitelist to see whether there is a match.
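The symbol replacement step could be sketched roughly like this. The class name `SymbolNormalizer` and the mapping table are assumptions; in particular, `|` could just as well stand for "i" as for "l", so a fuller version would probably have to try both.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
class SymbolNormalizer {
    // Assumed substitutions for the symbols mentioned above.
    // '|' is ambiguous ("l" or "i"); this sketch picks "l".
    static final Map<Character, Character> SUBSTITUTIONS = Map.of(
            '$', 's',
            '@', 'a',
            '|', 'l');

    /** Lower-cases the input and replaces known look-alike symbols with letters. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            char mapped = SUBSTITUTIONS.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }
}
```

With this table, `SymbolNormalizer.normalize("@$$")` gives `"ass"`, which can then go through the blacklist and whitelist comparison as usual.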

The lists will be pretty big because I need bad words from many different languages. Maybe I need to implement different blacklists and whitelists for different countries?

EDIT:

Is regex a good way to do the filtering?
 
Jayesh A Lalwani
Rancher
Posts: 2759
The thing with profanity filters is that your implementation can range from very permissive to very strict. If you are too permissive, users will find a way to circumvent it. If you are too strict, you might block legitimate names (for example, a profanity filter that uses spell check might block a name like Anas Ashfaq). What you might want to do is implement a heuristic that produces a profanity score. If the score is too high, block the name; if it's low, allow it; if it's somewhere in between, flag it for an admin to look at. Ultimately, the best profanity filter is a human.

Also, just having a profanity filter in your application changes the behavior of the users. It discourages some users and emboldens others.
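One possible shape for such a score, as a rough sketch only. The class name `ProfanityScorer`, the per-word weights, and both thresholds are made-up values for illustration; a real heuristic would need tuning against real data.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical scorer, for illustration only.
class ProfanityScorer {
    enum Decision { ALLOW, REVIEW, BLOCK }

    // Assumed weights: a higher value means the substring is more likely to be offensive.
    static final Map<String, Integer> WEIGHTS = Map.of(
            "ass", 3,
            "damn", 6,
            "shit", 10);

    // Assumed thresholds separating the three outcomes.
    static final int REVIEW_THRESHOLD = 3;
    static final int BLOCK_THRESHOLD = 8;

    /** Sums the weights of every listed substring found in the name. */
    static int score(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        int total = 0;
        for (Map.Entry<String, Integer> entry : WEIGHTS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    /** High score: block. Low score: allow. In between: send to an admin. */
    static Decision decide(String name) {
        int s = score(name);
        if (s >= BLOCK_THRESHOLD) {
            return Decision.BLOCK;
        }
        if (s >= REVIEW_THRESHOLD) {
            return Decision.REVIEW;
        }
        return Decision.ALLOW;
    }
}
```

With these numbers, "Arvin" is allowed, "Alassandra" scores 3 and lands in the review queue instead of being rejected outright, and "Shithead" is blocked.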
 
Arvin Moradi
Greenhorn
Posts: 3
Thanks, Jayesh. That is something my workmate and I have talked about, and it is probably the best way to go.

Do you have any tips on how I should create the filter (algorithms or an API)? I have read some tutorials on regex. I have also looked at some algorithms (Aho-Corasick) but didn't find them useful for my application.

Regards
Arvin
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
You might find the Moby Words files useful in creating a "whitelist", for people's names for example.

Just to add to your complexity, consider all the "phonetic" spelling matches.

In my phonetic matching experiments I used Moby Word lists.

Bill
 
Jayesh A Lalwani
Rancher
Posts: 2759
Just played around with your link, William. My name: no match. My wife's name: no match. Barack: no match. Obama: no match. Mitt: match. Romney: match. I tried some of my kid's classmates. I don't know if it's a coincidence, but it matches all the white kids in his class. There is a definite bias towards European names in the Moby Words list.

In a previous life, we used Soundex to do spell checks on users' searches, and it worked pretty well. We had a blacklist of swear words, and we basically dropped blacklisted words from the search terms. So, if a user searched for "fuck you", we treated the search as "you". If the user misspelled it, we used Soundex to spell-correct it. So, if they typed "fuk you", we would search for something like "fug you OR fugue you" (something like that).

This won't work directly for the OP, but the way I would do it is to do some sort of string matching against blacklisted words. If the name matches a word in the blacklist, block the name. If it passes the blacklist check, compare its Soundex code with a brownlist of Soundex codes. If it matches, flag the name for review by an admin.
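A rough sketch of that two-stage check, with a hand-rolled American Soundex so it stays self-contained (a library such as Apache Commons Codec also ships a `Soundex` class). The class name `SoundexScreen`, the word list, and the choice to build the brownlist from the Soundex codes of the blacklisted words are all assumptions. Soundex only covers the letters A-Z, so this would not help with names in other scripts.

```java
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical screen, for illustration only.
class SoundexScreen {
    enum Result { ALLOW, REVIEW, BLOCK }

    // Placeholder blacklist; the brownlist is assumed to be its Soundex codes.
    static final Set<String> BLACKLIST = Set.of("damn", "crap");
    static final Set<String> BROWNLIST =
            BLACKLIST.stream().map(SoundexScreen::soundex).collect(Collectors.toSet());

    /** Exact blacklist hit blocks; a Soundex hit only flags for admin review. */
    static Result screen(String name) {
        Result result = Result.ALLOW;
        for (String token : name.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (BLACKLIST.contains(token)) {
                return Result.BLOCK;
            }
            if (BROWNLIST.contains(soundex(token))) {
                result = Result.REVIEW;
            }
        }
        return result;
    }

    /** American Soundex: first letter plus three digits, zero-padded. */
    static String soundex(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder code = new StringBuilder().append(letters.charAt(0));
        char previous = digit(letters.charAt(0));
        for (int i = 1; i < letters.length() && code.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char d = digit(c);
            if (d != '0' && d != previous) {
                code.append(d);
            }
            previous = d; // a vowel resets this to '0', so a repeated code counts again
        }
        while (code.length() < 4) {
            code.append('0');
        }
        return code.toString();
    }

    /** Soundex digit for a consonant, or '0' for letters that are not coded. */
    static char digit(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V':
                return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z':
                return '2';
            case 'D': case 'T':
                return '3';
            case 'L':
                return '4';
            case 'M': case 'N':
                return '5';
            case 'R':
                return '6';
            default:
                return '0';
        }
    }
}
```

Here "damn" is blocked and the misspelling "damm" is flagged because both encode to D500. The perfectly ordinary name "Dan" also encodes to D500 and gets flagged, which is exactly why a Soundex hit should go to a human rather than trigger an automatic block.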
 