I noticed a post on Dom’s blog asking for suggestions on how to prevent exploits in user-submitted HTML using PHP, and thought that I’d post an in-depth response regarding the security practices that should be followed when designing and building a site that accepts user generated content (UGC).
What Is UGC?
User generated content is at the heart of most Web 2.0 sites, from Facebook to Delicious via Digg, Flickr and Twitter. All these sites generate loads of traffic from data that their userbase submits for free. That sounds like a great deal until a malicious user discovers an exploit: suddenly the site is awash with Viagra spam, malware and popups, and legitimate users soon leave.
UGC simply means anything that your users enter which is later displayed on your website.
This includes usernames, comments, email addresses, blog entries (if you run a blogging platform), tweets etc. Note that this data doesn’t have to come from a form served by your server - JSON, XML, RSS feeds and third-party adverts can all contain undesirable markup.
Types Of Exploits Found In UGC
Exploits can take many forms but two of the most common are cross site scripting (XSS) and SQL injection, both of which can and should be prevented in server-side code.
A Common XSS Attack
An example of an XSS exploit could be found on a social website (with features such as those found on Facebook or Bebo). A bad guy creates a profile and lists his homepage as:
Then the bad guy runs a script that befriends every user he can find. Many of them will click on his homepage URL to see what he is all about. When they do, they’ll see an alert box like this:
More inventive attackers could use the exploit to automate friend messages, send spam, show Viagra adverts or even scrape sensitive data. All of these are very damaging for the site’s reputation and mean that the owner will be busy cleaning up for a long time.
Typical SQL Injection Attacks
Most UGC is stored in a relational database such as MySQL. SQL injection attacks exploit lazy (or naive) programmers who build up strings of SQL containing the raw data supplied and send them to the database, such as this example that searches for a user with the supplied name:
$sql = "SELECT * FROM users WHERE username = '" . $_POST['username'] . "'";
This will create a simple SQL select statement, like this (if I supply the username andy):
SELECT * FROM users WHERE username = 'andy'
A piece of working SQL, but it is vulnerable. An attacker might enter this as the username:
'; DELETE FROM users; --
That would result in the following SQL being created:
SELECT * FROM users WHERE username = ''; DELETE FROM users; --'
Readers familiar with SQL should now be experiencing a deep sense of dread (and possibly awe at the inventiveness of some people). The SQL code would delete every single entry from the users table if it were executed, even though the author thought he was writing a read-only SELECT statement. Scary stuff!
Preventing SQL Injection Attacks
Preventing attacks like the SQL injection outlined above is quite straightforward if you use a feature common to almost all RDBMS clients: prepared statements with bind variables.
Using bind variables would mean that the SQL would change to:
SELECT * FROM users WHERE username = ?
NB: The actual syntax of bind variables can vary between RDBMS platforms
So, whatever strings you provide to the database will behave just as you expect - they cannot be interpreted as SQL. There can be performance benefits too, since the database can parse a prepared statement once and reuse it with different values.
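As a minimal sketch of the idea, here is the same lookup written with a bind variable. It is in Python with an in-memory SQLite database purely so that it is self-contained; the table contents and the `find_user` helper are assumptions made for illustration. PHP's PDO accepts the same `?` placeholder style.

```python
import sqlite3

# In-memory SQLite database standing in for the site's real RDBMS (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)", [("andy",), ("beth",)])


def find_user(conn, username):
    """Look up a user with a bind variable instead of string concatenation."""
    cursor = conn.execute("SELECT * FROM users WHERE username = ?", (username,))
    return cursor.fetchall()


# The hostile input from the example above is now treated as a literal string.
hostile = "'; DELETE FROM users; --"
matches_for_andy = find_user(conn, "andy")
matches_for_hostile = find_user(conn, hostile)
remaining_users = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The hostile string simply fails to match any username, and both rows are still in the table afterwards.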
Guarding Against Cross Site Scripting Attacks
There is no simple action to prevent XSS attacks - you must analyse all possible user inputs and determine how they must be sanitised.
Regular expressions can help determine if a user supplies invalid data, but don’t rely on regexps alone - specify minimum and maximum lengths too. Determining what format the user-supplied data will take is vital for performing further content-specific checks.
Plain Text Input
Usernames and status updates usually take the form of plain text, but you should allow only a restricted character set (don’t forget that hackers can send a string of backspace or escape characters).
Other issues to consider for plain text input are:
- Is the data case sensitive?
- Internationalisation - will you support accented characters, or even non-Latin character sets?
- Will a username be used to create a URL, as Twitter does?
- Is the data displayed publicly? If so, it would be best to prevent people from using their email address as their username.
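The checks for a username could be sketched roughly as follows, in Python. The policy itself (3 to 20 characters, ASCII letters, digits and underscores, stored in lower case) is an assumption chosen for illustration; a site that supports accented or non-Latin names would need a wider character set.

```python
import re

# Hypothetical policy: 3-20 characters, ASCII letters, digits and underscores only.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")
MIN_LENGTH = 3
MAX_LENGTH = 20


def validate_username(raw):
    """Return a normalised username, or None if the input breaks the policy."""
    if not isinstance(raw, str):
        return None
    if not (MIN_LENGTH <= len(raw) <= MAX_LENGTH):
        return None
    # fullmatch checks the whole string, so control characters such as
    # backspace or escape anywhere in the input cause rejection.
    if USERNAME_PATTERN.fullmatch(raw) is None:
        return None
    # Store names lower-cased so "Andy" and "andy" cannot both be registered.
    return raw.lower()
```

Because `@` and `.` are outside the allowed set, email addresses are rejected as usernames, and every accepted name is safe to drop into a URL path.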
Remember to output your strings with correct HTML encoding. See PHP’s htmlentities(), Perl’s CGI module, the Ruby cgi escape etc.
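Here is a short sketch of encoding on output, using Python's standard `html.escape`, which plays a similar role to the functions named above. The `render_comment` helper and its markup are made up for the example.

```python
from html import escape


def render_comment(author, text):
    """Build an HTML fragment with both user-supplied values escaped."""
    # quote=True also escapes " and ', which matters inside attribute values.
    safe_author = escape(author, quote=True)
    safe_text = escape(text, quote=True)
    return '<p class="comment"><b>{}</b>: {}</p>'.format(safe_author, safe_text)


rendered = render_comment("andy", "<script>alert('hi')</script>")
```

The submitted script tag reaches the browser as visible text rather than as markup.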
URL Input
It is common for social sites and blogs to allow members or commenters to link to their homepage. In this case, you probably want to restrict the protocol of the URL to just HTTP and HTTPS (definitely not
Consider very carefully whether you should nofollow the links - if you don’t endorse the page, or it might be an affiliate link, then you should.
You also need to have a blacklist of disallowed domains, which should include the most common URL shorteners, since these can be abused to disguise a link’s real destination.
Other domains are often targets for spammers, so make sure that your list can be easily edited. You might want to consider using regular expressions here too, so that you could (for example) block *.blogspot.com.
Another check that your legitimate users will find reassuring is to test the URL against Google’s Safe Browsing API.
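The protocol and blacklist checks might be sketched like this in Python. The entries in `BLOCKED_DOMAINS` are only illustrative, and the function name is hypothetical. The Safe Browsing lookup is left out because it needs an API key and a network call.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
# Hypothetical blacklist; shell-style wildcards allow entries like *.blogspot.com.
BLOCKED_DOMAINS = ["bit.ly", "tinyurl.com", "*.blogspot.com"]


def is_acceptable_url(url, blocked=BLOCKED_DOMAINS):
    """Accept only http(s) URLs whose host is not on the blacklist."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False
    host = (parts.hostname or "").lower()
    if not host:
        return False
    # Both sides are lower case, so a case-sensitive match is enough.
    return not any(fnmatchcase(host, pattern) for pattern in blocked)
```

Note that a pattern such as `*.blogspot.com` matches subdomains only; the bare domain would need its own entry. Keeping the list in a database or config file rather than in code makes it easier to edit under pressure.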
File or Image Upload
If you allow users to upload their own avatars to your site you need to check them thoroughly. Only allow files in known formats, and check the MIME type as determined from the file’s contents, not the file extension.
While not strictly an XSS issue, multimedia files have been crafted before to cause buffer overruns, so ensure that you keep your server packages up to date with the latest patches.
Ensure that images are within specified size parameters or resize them on the server (to prevent page widening).
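A rough Python sketch of checking an upload by its contents rather than its name follows. The accepted formats and the 512 pixel limit are assumptions, and only PNG dimensions are read, to keep it short. Passing this check does not prove that the file is a well-formed image, so treat it as a first filter in front of a real image library.

```python
import struct

# Leading "magic" bytes of the formats this hypothetical site accepts.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}
MAX_WIDTH = 512   # assumed limits, in pixels
MAX_HEIGHT = 512


def sniff_image_type(data):
    """Return a MIME type based on the file's leading bytes, or None."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None


def png_size(data):
    """Read width and height from a PNG's IHDR chunk, or None if malformed."""
    if len(data) < 24 or data[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", data[16:24])


def accept_avatar(data):
    """Accept only PNG avatars within the size limits (other formats omitted)."""
    if sniff_image_type(data) != "image/png":
        return False
    size = png_size(data)
    if size is None:
        return False
    width, height = size
    return 0 < width <= MAX_WIDTH and 0 < height <= MAX_HEIGHT
```

A PHP script renamed to `avatar.png` fails the signature test straight away, whatever extension or MIME type the browser claims for it.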
HTML Input
Sanitising HTML input is a more complex process than sanitising other types of input, since HTML often contains unclosed tags, implicit attributes and general hackery.
Most HTML sanitisation methods involve building a parse tree from the input and traversing the tree to discard any elements and attributes that are deemed undesirable. Define a whitelist of allowed element/attribute combinations, NOT a blacklist.
Take extra care in preventing attributes like ONMOUSEOVER in all HTML elements. Beware of the STYLE attribute too.
For most HTML user generated content, the only element that should be allowed an attribute is the <A> (anchor) element, and the only attribute that it is allowed is HREF (in some circumstances you might allow ALT). Take care with the URLs supplied to HREF (and any SRC) attributes; see the section above on URL input for validation recommendations.
Look for HTML sanitizers for your preferred language - lots of other coders will have solved this problem before.
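To show the whitelist idea in miniature, here is a sketch in Python. It uses the standard library's event-based `HTMLParser` rather than a full parse tree, the `ALLOWED` table is a made-up policy, and it makes no attempt to rebalance unclosed tags. An established sanitiser library will cover far more edge cases, so read this as an illustration only.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical whitelist: element name -> attributes permitted on it.
ALLOWED = {"a": {"href"}, "b": set(), "i": set(), "em": set(), "strong": set(), "p": set()}
# Elements whose text content is discarded along with the tags.
DROP_CONTENT = {"script", "style"}


def _is_http_url(value):
    try:
        return urlsplit(value.strip()).scheme.lower() in ("http", "https")
    except ValueError:
        return False


class WhitelistSanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drops onmouseover, style and anything else unlisted
            if name == "href" and not _is_http_url(value):
                continue  # drops javascript: and other non-http(s) links
            kept.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitise_html(fragment):
    parser = WhitelistSanitiser()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Anything not named in the whitelist disappears by default, which is the property that a blacklist cannot offer. Relative links are dropped too, since they have no scheme; whether that is acceptable depends on the site.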
Checking for spam is also useful when accepting user supplied HTML content. Automattic’s Akismet has a great API and good third-party library support to help you out with this.
If you allow users to import an RSS feed from an untrusted domain, you need to do more than just validate it against the XML Schema Definition. You need to treat titles as plain text, descriptions as HTML and links as URLs, as discussed elsewhere in this post.
Aside from the XSS and SQL injection issues, there are a number of other sensible precautions that web application developers can take to minimise the impact of malicious users. You might think that these won’t affect you soon, but my advice is to get them in place before you need them. If your site comes under sustained attack, you’ll have plenty on your plate without needing to code and test some defenses.
Recording and testing against a set of IP addresses that are banned from your application is an excellent precaution and will deter many script kiddies and wannabe blackhats. Place the check early in your application code to reduce server load.
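A ban check of this kind could look roughly like the following Python sketch. The `BANNED_NETWORKS` list is hypothetical (the ranges shown are reserved for documentation), and a real application would load it from a database or config file.

```python
import ipaddress

# Hypothetical ban list; entries may be single addresses or CIDR ranges.
BANNED_NETWORKS = [
    ipaddress.ip_network(entry)
    for entry in ("203.0.113.7/32", "198.51.100.0/24")
]


def is_banned(remote_addr, banned=BANNED_NETWORKS):
    """Return True if the client address falls inside any banned network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # An unparseable address is treated as banned rather than waved through.
        return True
    return any(address in network for network in banned)
```

Calling `is_banned()` at the very top of the request handler means that a banned client costs one cheap lookup and nothing more.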
More sophisticated systems could use a variety of inputs (length of membership, country, IP address, number of previous posts, etc) together with a binary classifier to determine whether a user action is undesirable.
Scripts can type a lot faster than humans, so any user posting updates many times per second is likely to have evil intent. You can modify your session management code to slow down and eventually lock out such users.
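One simple way to spot this is a sliding-window counter per user, sketched here in Python. The class name and the default limits are assumptions, and the in-memory dictionary only works for a single server process; a shared store would be needed behind a load balancer. Escalating from slowing down to a full lockout is left out.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `limit` actions per user within a sliding `window` of seconds."""

    def __init__(self, limit=5, window=10.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.events = defaultdict(deque)

    def allow(self, user_id):
        now = self.clock()
        recent = self.events[user_id]
        # Forget timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```

The application would call `allow()` with the session's user id before accepting each post, and respond with an error or a delay when it returns `False`.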
Talk to your hosting company about this (there are specialist consultants that can help you with this too).
This really goes without saying - get your backup strategy sorted out now and test it regularly! Then test it again.
Sensible, Helpful Error Messages
Remember that the vast majority of your users will have good intentions. They might mis-type an email or not understand exactly what a URL is, so provide useful feedback when displaying an error message.
Use simple language and be very clear about just what is and isn’t allowed in each field.
This Is Lots Of Work
Yes it is, and it’s worth it. Taking solid, sensible precautions like these makes the difference between throwing something together and engineering.
Creative Commons licensed photo by Tancread.