Design, Usability and Security Dilemmas With User Generated Content
Sep 18, 2009 In Development By Karsten Januszewski
Who's Afraid of User Generated Content?
Allowing users to add their content—feedback, reviews, expertise, etc.—to a web page is ubiquitous these days. Whether we're talking about comments on a blog post or wiki articles, user generated content is everywhere.
The mechanisms for dealing with this type of content, however, are hardly standardized. There are usually three approaches. Users can either:
- Enter text, but not format
- Add HTML directly to the comments
- Use an alternative mark-up syntax
Each of these approaches has pros & cons. Here are just a few:
Approach #1—Text Only
Pro: Nobody can pollute the comments with awful images, formatting, or links.
Con: Nobody can enhance the comments with great images, formatting, or links. You can get around the hyperlink problem fairly easily (by converting http:// references into hyperlinks as the data exits the system), but this doesn't fix the formatting or images issue.
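To make the hyperlink workaround concrete, here is a minimal sketch in Python (standing in for whatever platform you run; the function name and the URL pattern are my own assumptions, not anything Mix Online ships). It escapes the comment first and only then wraps bare http:// and https:// references in anchors, so the text-only guarantee holds for everything else.

```python
import html
import re

# Assumption: a deliberately simple pattern. It runs on text that is already
# escaped, so raw quotes and angle brackets can never appear inside a match.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def render_plain_comment(raw_text):
    """Escape a plain-text comment, then turn bare URLs into hyperlinks."""
    escaped = html.escape(raw_text, quote=True)

    def to_anchor(match):
        url = match.group(0)
        return f'<a href="{url}" rel="nofollow">{url}</a>'

    return URL_PATTERN.sub(to_anchor, escaped)


print(render_plain_comment("See http://example.com/a?b=1&c=2 <script>"))
```

A pattern this simple will swallow trailing punctuation (a URL at the end of a sentence keeps its period), so treat it as a starting point rather than a finished linkifier.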
Approach #2—HTML
Pro: Users get a lot of power. They can customize pages, profiles, the whole bit. MySpace is an example of this. Some would argue that the success of MySpace is a result of their allowance of this behavior.
Con: First, there is a usability risk: you have to assume that users know HTML, or teach it to them on the fly. And then there's the design problem: allowing HTML means that users can do all kinds of crazy things, like embedding images, adding Flash or Silverlight objects, inserting styles, or running the banner tag. MySpace is an example of user-added HTML gone wild and, some would argue, it is the “problem” with MySpace.
Another option is to allow a narrow subset of HTML. Just the <a> and the <strong> tag? Or more?
Allowing HTML in user generated comments opens up big security issues – read on for an in-depth discussion of this.
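For the narrow-subset idea, one possible shape is an allowlist filter that parses the comment and rebuilds it from scratch, rather than trying to strip out the bad parts. The sketch below is Python using only the standard library; the class and function names, the choice of <a> and <strong>, and the http/https-only rule for links are all assumptions for illustration. It is not a substitute for a vetted sanitizing library.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

ALLOWED_TAGS = {"a", "strong"}       # the narrow subset floated above
ALLOWED_SCHEMES = {"http", "https"}  # assumption: only plain web links


def is_safe_href(href):
    """Accept only absolute http(s) URLs; javascript: and friends fail."""
    try:
        return urlsplit(href).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class AllowlistSanitizer(HTMLParser):
    """Rebuilds markup from scratch, keeping only allowlisted tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # tag is dropped; any text inside is still kept as text
        if tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            if not is_safe_href(href):
                return
            # Only href survives; onclick, style, etc. are never copied.
            self.parts.append(
                f'<a href="{escape(href, quote=True)}" rel="nofollow">'
            )
        else:
            self.parts.append(f"<{tag}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return
        # Close any inner tags first so the output stays well-formed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.parts.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def sanitize(user_html):
    parser = AllowlistSanitizer()
    parser.feed(user_html)
    parser.close()
    # Close whatever the user left open.
    closers = [f"</{tag}>" for tag in reversed(parser.open_tags)]
    return "".join(parser.parts + closers)
```

With this sketch, sanitize('<a href="javascript:alert(1)">x</a>') comes back as plain x, and the body of a dropped <script> element survives only as escaped, inert text. Because the output is rebuilt from the parsed pieces, attributes and tags that are not on the list never get copied through.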
Approach #3—Alternative Mark-up Syntax (Aka the wiki way)
Pro: Wikis, which use their own syntax for formatting, are a perfect example of this approach. And, there are other syntaxes out there. The nice part of using one of these syntaxes is that you avoid some of the problems with HTML, as far as security and license to do ill.
Con: Users are forced to learn a new language. And there are lots of languages out there: Textile, Markdown, Markdown with SmartyPants, MultiMarkdown, etc. Heck, Mix Online supports comments written with the Textile syntax, implemented using a library from CodePlex called Textile.NET, though we never tell you in the comment form. (Maybe that’s coming in version 2 – ask Nishant.) In fact, try adding a comment to Mix Online and use the Textile format – you’ll see it works.
It's All a Security Problem
No matter which approach you take, there is one big Universal Con to opening your doors to user generated content: security. User generated content makes all kinds of attacks possible—from SQL Injection to cross site scripting to who knows what.
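On the SQL injection side, the usual defense is to never build queries by gluing user text into a SQL string, and to pass it as a parameter instead. Here is a small sketch in Python with the standard-library sqlite3 module; the comments table and the save_comment helper are hypothetical, made up for this example.

```python
import sqlite3


def save_comment(conn, author, body):
    """Store a comment using bound parameters, never string concatenation."""
    conn.execute(
        "INSERT INTO comments (author, body) VALUES (?, ?)",
        (author, body),
    )
    conn.commit()


# Hypothetical in-memory schema, just enough to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")

# A classic injection attempt is stored as ordinary text, not executed.
save_comment(conn, "mallory", "'); DROP TABLE comments; --")
rows = conn.execute("SELECT author, body FROM comments").fetchall()
print(rows)
```

The database driver treats each bound value as data, so the hostile comment ends up as a harmless row. Parameterized commands work the same way in ADO.NET and most other data access layers.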
Some of the worry goes away with ASP.NET, because it has an attribute (validateRequest) that can prevent someone from inserting malicious content. But, if you want to allow HTML, you’ll have to turn off validateRequest, which is on by default. That means you have to write your own validation as data enters the system.
With validateRequest or your own home rolled solution, we are talking about checking data as it enters the system. What if something does slip through?
A more thorough procedure for the paranoid among us is to sanitize the data as it leaves the system as well. You can do this manually by encoding all output (HtmlEncode(), UrlEncode(), etc.). Or, in ASP.NET, you can pass all data through the Anti-Cross Site Scripting library (originally from the Microsoft Patterns and Practices group). Implementing this library is easy and I highly recommend it. You’ll notice the recent version of Oxite does just this.
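Encoding on the way out looks roughly like this. The sketch is Python, with html.escape and urllib.parse.quote playing the roles of HTML and URL encoding; the render_comment helper and the /users/ path are assumptions for illustration and not part of any library mentioned here.

```python
import html
from urllib.parse import quote


def render_comment(author, body):
    """Encode every user-supplied value for the context it lands in."""
    # URL context: percent-encode the author name inside the path.
    profile_url = "/users/" + quote(author, safe="")
    # HTML context: escape attribute values and element text separately.
    return (
        f'<p><a href="{html.escape(profile_url, quote=True)}">'
        f"{html.escape(author)}</a>: {html.escape(body)}</p>"
    )


print(render_comment("a b", "<script>alert(1)</script>"))
```

The point is that each value is encoded for where it ends up (URL path, attribute, or element text), so something that slipped past input validation is still rendered as text instead of markup.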
What Do You Think?
I leave you on an inconclusive note. All three approaches have pros/cons, and none is necessarily right. So I’m curious: which approaches do you take as web developers? Which do you prefer as users? Let us know in the comments – formatted with Textile if you’d like – or on Twitter.



Follow the Conversation
I think I’ve re-written my own “light” rich text editor five times since 1999. I hate it. Ultimately the issue comes down to how you parse the text that it creates, and as you probably know, the differences in browser implementations make that a daunting task.
Ultimately, for a lot of use cases, I end up doing plain text boxes that parse line breaks, and allow limited HTML, like i, b, em, blockquote, etc. Then for anchor tags, you make sure they don’t allow JavaScript. That has served me fairly well.
@Jeff Putz: Good point on making sure you don’t let JavaScript through on an anchor tag if you do let those through…
I actually believe that it is nearly impossible to block all malicious content (since the number of ways a bit of script can be encoded into the page is extremely large) unless you use a library like the AntiXSS one linked to from above. Until the most recent version (3.1) though, it was an all or nothing procedure. If you wanted some HTML to make it through then you couldn’t use that library. The 3.1 version adds a ‘GetSafeHTML’ method that does exactly that though: it allows only ‘safe’ tags through, and isn’t fooled by malicious content encoded in odd ways. (You can see some of the types of encoding I’m talking about here: http://www.owasp.org/index.php/OWASP_Testing_Guide_Appendix_C:_Fuzz_Vectors#Cross_Site_Scripting_.28XSS.29 )
Quick followup, I was actually looking for this page as an example: http://ha.ckers.org/xss.html
It includes a ton of great samples of XSS, using various encoding tricks and taking advantage of how different browsers interpret things. Seeing that list, and starting to use it as test material for some of my attempts to sanitize HTML, was what told me it was necessary to use a full library like the http://antixss.codeplex.com one. If you aren’t running on IIS/.NET I’m sure there are other similar libraries out there for PHP and other languages.
It looks nice and I am in need of this.