I have a Java servlet that processes a GET parameter. Currently, I have an Apache mod_rewrite rule that only matches URLs in a fixed character set and passes matching URLs to the servlet. This works fine but ends up generating 404 errors for URLs with characters outside of that set. I would like to prevent those bounces by letting the servlet handle all input, potentially returning a blank page if the input is undesirable.
If I allow wildcard matching in the mod_rewrite rule then I need to be very careful to prevent vulnerabilities in the servlet's parameter processing. What are good practices to ensure that the string input is sanitized? I would want to truncate long strings, strip out HTML, and remove characters I don't want. Are there gotchas in string processing that might lead to code vulnerability?
Sounds like you are on the right track. These are valid concerns.
There are blacklists and there are whitelists. For maximum security (short of having no web site at all), you would only let users select from a fixed list. The other end of the spectrum is "anything goes", which is where you get things like SQL injection, among others. In between, you can filter the input. One way is to refuse to do ANYTHING with input that contains special characters: no greater-than, no less-than, no open or close parenthesis, and probably no ampersand, either. That is blacklist validation. Blacklists tend to grow longer as new technologies show up. They are tough to get right, and tougher to keep right. Browsers march along from version to version, and you never know what someone will point at your site, and that's just one source of error.
Depending on your application, you might wish to reject anything that is not a space, digit, or letter. Allowing in only what is on a constrained "good" list is whitelist validation. And, of course, apply the length limit and other constraints mentioned in your post.
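To make the whitelist idea concrete, here is a minimal sketch of a check that accepts only letters, digits, and spaces and enforces a length limit. The class name `ParamValidator`, the method `isAcceptable`, and the limit of 64 characters are made up for illustration; pick whatever fits your application.

```java
import java.util.regex.Pattern;

public class ParamValidator {
    // Hypothetical limit; choose a value that suits the application.
    private static final int MAX_LENGTH = 64;

    // Whitelist: ASCII letters, digits, and the space character only.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9 ]*");

    /** Returns true only if the input is non-null, short enough, and fully whitelisted. */
    public static boolean isAcceptable(String input) {
        if (input == null || input.length() > MAX_LENGTH) {
            return false;
        }
        // matches() requires the WHOLE string to fit the pattern.
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("hello world 42"));
        System.out.println(isAcceptable("<script>alert(1)</script>"));
    }
}
```

Rejecting the whole input, rather than trying to repair it, keeps the logic simple: the servlet can just return the blank page whenever `isAcceptable` is false.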
You can research more of this on your own. Consider the types of attacks you aim to thwart (SQL injection, other types of injection, buffer overrun, etc.). Useful search terms include "blacklist" and "whitelist".
Thanks for your reply and for validating my approach.
I am implementing a whitelist. I'm only allowing characters in a certain set into the user parameter string. No angle brackets, parentheses, question marks, semicolons, or ampersands: just alphanumerics, +, -, ., comma, and whitespace characters. And I'm truncating the string to prevent a buffer overrun attack.
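For reference, a rough sketch of that filter might look like the following. The names `ParamSanitizer` and `sanitize` and the 100-character limit are placeholders, not the actual code; the character set is the one listed above.

```java
public class ParamSanitizer {
    // Placeholder limit; the real value depends on the application.
    private static final int MAX_LENGTH = 100;

    /** Truncates the input, then removes every character outside the whitelist. */
    public static String sanitize(String raw) {
        if (raw == null) {
            return "";
        }
        // Truncate first so the regex never runs over an arbitrarily long string.
        String s = raw.length() > MAX_LENGTH ? raw.substring(0, MAX_LENGTH) : raw;
        // Keep only ASCII alphanumerics, '+', '-', '.', ',' and whitespace.
        return s.replaceAll("[^A-Za-z0-9+\\-.,\\s]", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a<b>&c;"));
        System.out.println(sanitize("1+2-3.0, ok"));
    }
}
```

One thing to watch for: if the value is read with `request.getParameter`, a literal `+` in the query string will already have been decoded to a space, so a real plus sign only arrives if the client sent it as `%2B`.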