How can I sanitize user input but keep the content of <pre> tags? - coldfusion

I'm using CKEditor in Markdown format to submit user-created content. I would like to sanitize this content to remove malicious tags, but keep the formatting produced by the Markdown parser. I've tried two methods, and neither works.
Method one
<!--- Sanitize post content --->
<cfset this.text = HTMLEditFormat(this.text)>
<!--- Apply mark down parser --->
<cfx_markdown textIn="#this.text#" variable="parsedNewBody">
Problem: For some reason <pre> and <blockquote> are being escaped, so I'm unable to use them; only the escaped special characters appear. Other Markdown formatting, such as bold and italic, works well. Could it be that CKEditor does not apply Markdown correctly to <pre> and <blockquote>?
Example: If I were to type <pre><script>alert("!");</script></pre> I would get the following: <script>alert("!");</script>
Method two
Same as method one, but in reverse order: the sanitization takes place after the Markdown parser has done its work. This is effectively useless, since the sanitization function escapes all tags, whether malicious or created by the Markdown parser.
While I want to sanitize malicious content, I do want to keep basic HTML tags and the contents of <pre> and <blockquote> tags! Any ideas how?

There are two important sanitizations that need to be done on user generated content. First, you want to protect your database from SQL injection. You can do this by using stored procedures or the <cfqueryparam> tag, without modifying the data.
The other thing you want to do is protect your site from XSS and other content-display based attacks. The way you do this is by sanitizing the content on display. It would be fine, technically, to do it before saving, but generally the best practice is to store the highest fidelity data possible and only modify it for display. Either way, I think your problem is that you're doing this sanitization out of order. You should run the Markdown formatter on the content first, THEN run it through HTMLEditFormat().
It's also important to note that HTMLEditFormat will not protect you from all attacks, but it's a good start. You'll want to look into implementing OWASP utilities, which is not difficult in ColdFusion, as you can directly use the provided Java implementation.
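As a rough sketch of the "store the raw text, protect the query" half of this advice: the datasource variable, table name, and column name below are hypothetical and only there for illustration. The point is that <cfqueryparam> binds the value as a parameter, so the Markdown can be stored unmodified.

```cfml
<!--- Assumptions: application.dsn, the posts table and its body column are placeholders --->
<!--- form.body holds the raw Markdown exactly as the user submitted it --->
<cfquery name="insertPost" datasource="#application.dsn#">
    INSERT INTO posts (body)
    VALUES (<cfqueryparam value="#form.body#" cfsqltype="cf_sql_longvarchar">)
</cfquery>
```

Any escaping or whitelisting for XSS would then happen when the row is read back and displayed, not here.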

Why don't you just prepend and append the <pre> tags after parsing?
I mean, if you only care about the first and last <pre> and you don't have nested <pre> tags or similar. If your cfx tag strips <pre>, write a new wrapper method that checks whether <pre> exists and, if not, adds it. Also, if you use <pre> tags, I guess newline characters are important, so check what your cfx tag does with those.

Maybe HTMLEditFormat's twin, HTMLCodeFormat, is what you need?


How to sanitize form values to allow text-only

I understand that if a user needs to supply HTML code as part of a form input (e.g. in a textarea) then I use an Anti-Samy policy to filter out the hazardous HTML that's not permitted.
However, I have some text-fields and text-areas which should be text-only. No HTML code at all should be inserted into the DB from these fields.
I am trying to therefore sanitize the inputs so that only raw text is inserted into the database. I believe I can do this two ways:
Use a Regex expression to filter out HTML code e.g. #REReplaceNoCase(FORM.InputField, "[^a-zA-Z\d\s:]", "", "ALL")#
Use a strict text-only Anti-Samy policy
Which option is the correct/good-practice way to remove any user-entered HTML code from a text field? Or are there further options available to me?
While you could use AntiSamy to do it, I don't know how sensible that would be. It kind of defeats the purpose of its flexibility, I think. I'd be curious about the overhead, even if minimal, of running that as a filter compared with just a regex.
Personally I'd probably opt for the regex route in this scenario. Your example strips the angle brackets (along with most other punctuation) but leaves the tag names behind as text. Is that acceptable in your situation? (Understandable if it was just an example.) Perhaps use something like this:
reReplace(string, "<[^>]*>", "", "ALL");
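To make the behaviour of that pattern concrete, here is a small sketch. The input string is made up, and in practice it would be a form field. Note that the pattern removes the tags themselves but keeps whatever text sat between them, so the result should still be escaped on output.

```cfml
<cfscript>
// Hypothetical input; normally this would come from something like form.inputField
rawInput = 'Hello <b>world</b><script>alert(1)</script>';

// Remove anything that looks like a tag; the text between tags is kept
textOnly = reReplace(rawInput, "<[^>]*>", "", "ALL");   // "Hello worldalert(1)"

// Still escape on display, since stray < or & characters may remain
writeOutput(HTMLEditFormat(textOnly));
</cfscript>
```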

Is it safe to wrap an entire CFM page in a cfoutput tag

I am placing a <cfoutput> tag around my entire <html> tag. The ColdBox best practice guide states "When you are creating view templates, try to always surround it with 1 cfoutput tag, instead of nesting them all over the place."
But I have on occasion seen errors pop up where a <script> block containing JavaScript code is within the <cfoutput> tag. This is probably because ColdFusion sees a hash (#) and tries to parse it, but it can't because it's JavaScript.
So how does one get away with having a single <cfoutput> tag on a view page in which to place everything?
I am not aware of any significant security or performance concerns in regards to wrapping an entire page in cfoutput. Of course, you'll always need to be aware to escape any pound signs by doubling them up any time you're inside a cfoutput.
The best practices in that ColdBox guide are geared primarily toward readability and reducing clutter on the page. If you have large sections of the page that you don't want to escape pound signs on or if you like to use cfoutput's grouping functionality, there's nothing wrong with breaking up your cfoutputs in a way that makes sense.
In the olden days of CF there might have been more overhead, but these days I can't imagine it being more than a few nanoseconds, and that's once at compile time.
In my view files I tend to wrap all output in a single cfoutput tag.
You can escape # symbols in JavaScript, etc, by converting them to ##.
The simple answer to your question as posted is yes.
There are only two issues that I'm aware of to keep in mind:
Escape any single pound signs (#) by doubling them (##) wherever they occur in your code (e.g., CSS, JS, etc.), unless the pound signs are actually wrapped around a CFML function or variable.
If you're using "cfoutput query...", you will probably want to close the first "cfoutput" tag, and then reopen after the query output. Otherwise, you can run into issues when trying to group query output.
My preference is to use as few tags as possible, mostly for readability and to reduce clutter.
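A minimal sketch of a view wrapped in a single cfoutput, with the literal pound signs in CSS and JavaScript doubled. The variable name and the element id are made up for the example.

```cfml
<cfset pageTitle = "Dashboard"><!--- hypothetical view variable --->
<cfoutput>
<html>
<head>
    <title>#HTMLEditFormat(pageTitle)#</title>
    <style>
        /* literal pound sign, so it is doubled; the browser receives #336699 */
        h1 { color: ##336699; }
    </style>
</head>
<body>
    <h1 id="main">#HTMLEditFormat(pageTitle)#</h1>
    <script>
        // doubled here as well; the browser receives "#main"
        var heading = document.querySelector("##main");
    </script>
</body>
</html>
</cfoutput>
```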

Cleansing string / input in Coldfusion 9

I have been working with Coldfusion 9 lately (background in PHP primarily) and I am scratching my head trying to figure out how to 'clean/sanitize' input / string that is user submitted.
I want to make it HTMLSAFE, eliminate any javascript, or SQL query injection, the usual.
I am hoping I've overlooked some kind of function that already comes with CF9.
Can someone point me in the proper direction?
Well, for SQL injection, you want to use CFQUERYPARAM.
As for sanitizing the input for XSS and the like, you can use the ScriptProtect attribute in CFAPPLICATION, though I've heard that doesn't work flawlessly. You could look at Portcullis or similar 3rd-party CFCs for better script protection if you prefer.
This is an addition to Kyle's suggestions, not an alternative answer, but the comments panel is a bit rubbish for links.
Take a look at the ColdFusion string functions. You've got HTMLCodeFormat, HTMLEditFormat, JSStringFormat and URLEncodedFormat, all of which can help when working with content posted from a form.
You can also try to use the regex functions to remove HTML tags, but it's never a precise science. This ColdFusion based regex/html question should help there a bit.
You can also try to protect yourself from bots and known spammers using something like cfformprotect, which integrates Project Honeypot and Akismet protection amongst other tools into your forms.
You've got several options:
"Global Script Protection" Administrator setting, which applies a regular expression against post and get (i.e. FORM and URL) variables to strip out <script/>, <img/> and several other tags
Use isValid() to validate variables' data types (see my in depth answer on this one).
<cfqueryparam/>, which serves to create SQL bind parameters and validate the datatype passed to it.
That noted, if you are really trying to sanitize HTML, use Java, which ColdFusion can access natively. In particular, use the OWASP AntiSamy Project, which takes an HTML fragment and whitelists what values can be part of it. This is the same approach that sites like SO use to protect submissions, and it is a more secure approach to accepting markup content.
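For what it's worth, calling AntiSamy from CFML might look roughly like the sketch below. It assumes the AntiSamy jar and its dependencies are already on ColdFusion's classpath, and the policy file path is a placeholder; neither is something ColdFusion 9 provides out of the box.

```cfml
<cfscript>
// Assumption: an AntiSamy policy XML file exists at this (hypothetical) location
policyFile = expandPath("/config/antisamy-policy.xml");

// Load the whitelist policy and create the scanner
policy   = createObject("java", "org.owasp.validator.html.Policy").getInstance(policyFile);
antiSamy = createObject("java", "org.owasp.validator.html.AntiSamy").init();

// Scan the submitted markup; anything the policy does not allow is removed
results  = antiSamy.scan(form.body, policy);
safeHtml = results.getCleanHTML();
</cfscript>
```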
Sanitizing strings in ColdFusion, as in nearly any language, is very important, and how you do it depends on what you want to do with the string. Most mitigations are for:
saving content to database (e.g. <cfqueryparam ...>)
using content to show on next page (e.g. put url-parameter in link or show url-parameter in text)
saving files and using upload filenames and content
There is always a risk if you allow basically everything in the first step and then try to sanitize malicious code "away" by deleting or replacing characters (the blacklist approach).
The better and easier solution, whenever it is possible, is to use rereplace(...) with regular expressions that explicitly allow only the characters needed for your scenario. Typical use cases are inputs for numbers, lists, email addresses, URLs, names, zip codes, cities, etc.
For example if you want to ask for a email-address, you could use
<cfif reFindNoCase("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2,})$", stringtosanitize)>...ok, clean...<cfelse>...not ok...</cfif>
(or an own regex).
For HTML input or CSS input I would also recommend the OWASP Java HTML Sanitizer Project.
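A small sketch of the whitelist idea for two simple cases. The input values are made up; in practice they would be form or URL variables.

```cfml
<cfscript>
// Hypothetical inputs, for illustration only
rawIds   = "12, 7;DROP TABLE users, 30";
rawEmail = "someone@example.com";

// Whitelist: keep digits and commas, drop every other character
cleanIds = reReplace(rawIds, "[^0-9,]", "", "ALL");   // "12,7,30"

// For common formats, the built-in isValid() check can stand in for a hand-written regex
emailOk = isValid("email", rawEmail);
</cfscript>
<cfoutput>#cleanIds# / #yesNoFormat(emailOk)#</cfoutput>
```

The cleaned list would still go through <cfqueryparam> (for example with list="true") before reaching a query.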

Is there any performance implication in using one big <cfoutput> tag?

I'm being forced/paid to work on a legacy ColdFusion project (I'm usually a C# programmer), and one peculiarity of CF is that it has its own tags that are supposed to blend with HTML (a bad, bad decision, IMO, since it just confuses the hell out of me, even with the "starts with cf" rule).
Besides this, it uses the # character to indicate the start of CF "territory", much like <% in ASP.Net or $ in Spark or so many equivalents. But this only gets parsed if it is inside a <cfoutput> tag.
My question is: is there a problem with opening one <cfoutput> tag at the beginning of the file and closing it at the end, as opposed to using it only where I'm going to use the # character?
To illustrate here's some code:
<cfoutput>
Some text #SomeVar# Some text.<br />
Some Images some other things #AnotherVar#
</cfoutput>
versus:
Some text <cfoutput>#SomeVar#</cfoutput> Some text.<br/>
Some Images some other things <cfoutput>#AnotherVar#</cfoutput>
Granted, this might seem trivial for small content, but I'm talking about a whole page.
Depending on the page contents, either is fine. There may be a performance impact (minor) by putting all of your page inside the CFOUTPUT tag, because the CFML engine needs to parse and scan the contents of the tag for executable code. Outside of the CFOUTPUT tag, the CFML engine can ignore the page as static content.
If you have CSS and HTML code that uses pound signs (for example named anchors or Hex color codes), you need to escape all pound signs (by adding a second one like "##") when within a CFOUTPUT. Because of this, I generally only put the CFOUTPUT around code I specifically want the CF engine to run.
That said, the CFML engine pays a bit of a performance penalty for constantly opening and closing the CFOUTPUT. If you're looping over some content, put the CFOUTPUT around the entire loop, rather than opening and closing it in each iteration of the loop.
Also, if you're having trouble knowing what code is CFML and what isn't, you might want to get a better IDE/editor for CFML like CFEclipse. It color codes the tags and lets you see the difference between CFML and HTML tags immediately. It's open source.
One problem you might find is that cfoutput is often used to display queries, and a query cfoutput cannot be nested inside another cfoutput tag. Doing so will cause an 'Invalid tag nesting configuration' error:
<cfoutput query="qFriends">
<li>#qFriends.fname# #qFriends.lname#</li>
</cfoutput>
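One way around the nesting error, sketched here with a made-up query built in memory, is to keep the single outer cfoutput and loop over the query with cfloop instead:

```cfml
<!--- Hypothetical data so the example is self-contained --->
<cfset qFriends = queryNew("fname,lname", "varchar,varchar")>
<cfset queryAddRow(qFriends)>
<cfset querySetCell(qFriends, "fname", "Ada")>
<cfset querySetCell(qFriends, "lname", "Lovelace")>

<cfoutput>
<h2>Friends</h2>
<ul>
    <!--- cfloop may sit inside cfoutput; a second cfoutput query="..." may not --->
    <cfloop query="qFriends">
        <li>#qFriends.fname# #qFriends.lname#</li>
    </cfloop>
</ul>
</cfoutput>
```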
It should not be a big issue but be careful using hex-valued colors, you'll need to escape those with an extra #. If it was me, I would try to break down those huge chunks of content into smaller pieces. Let HTML, JS, Flash and CSS do their jobs and use CF for the server side.
If you want to put cfoutput at the beginning and end of the page, you have to use doubled pound signs (##) for color values.

Markdown and XSS

Ok, so I have been reading about markdown here on SO and elsewhere and the steps between user-input and the db are usually given as
convert markdown to html
sanitize html (w/whitelist)
insert into database
but to me it makes more sense to do the following:
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Am I missing something? This seems to me to be pretty nearly xss-proof
Please see this link:
> hello <a name="n"
> href="javascript:alert('xss')">*you*</a>
<p>hello <a name="n"
∴ you must sanitize after converting to HTML.
There are two issues with what you've proposed:
I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
Considerably more important: when using Markdown as the "native" formatting language and whitelisting the other available tags, you are limiting not just the input side of the world, but the output as well. In other words, if your display engine expects Markdown and only allows whitelisted content out, then even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected because you are sanitizing it upon display as well.
There are some good resources on the web about output sanitization:
Sanitizing user data: Where and how to do it
Output sanitization (One of my clients, who shall remain nameless and whose affected system was not developed by me, was hit with this exact worm. We have since secured those systems, of course.)
BizTech: Best Practices: Never heard of XSS?
Well certainly removing/escaping all tags would make a markup language more secure. However the whole point of Markdown is that it allows users to include arbitrary HTML tags as well as its own forms of markup(*). When you are allowing HTML, you have to clean/whitelist the output anyway, so you might as well do it after the markdown conversion to catch everything.
*: It's a design decision I don't agree with at all, and one that I think has not proven useful at SO, but it is a design decision and not a bug.
Incidentally, step 3 should be ‘output to page’; this normally takes place at the output stage, with the database containing the raw submitted text.
insert into database
convert markdown to html
sanitize html (w/whitelist)
use Text::Markdown ();
use HTML::StripScripts::Parser ();
my $hss = HTML::StripScripts::Parser->new(
    {   # whitelist configuration for HTML::StripScripts
        Context        => 'Document',
        AllowSrc       => 0,
        AllowHref      => 1,
        AllowRelURL    => 1,
        AllowMailto    => 1,
        EscapeFiltered => 1,
    },
    strict_comment => 1,   # remaining options are passed to HTML::Parser
    strict_names   => 1,
);
convert markdown to html
sanitize html (w/whitelist)
insert into database
Here, the assumptions are
Given dangerous HTML, the sanitizer can produce safe HTML.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Here the assumptions are
Given dangerous markdown, the sanitizer can produce markdown that when converted to HTML by a different program will be safe.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong than the simpler unsafeHTML->safeHTML function above.
As a concrete example, "remove all tags" assumes you can identify tags, and would not work against UTF-7 attacks. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert (full-width '<', exotic white-space characters stripped by markdown, SCRIPT) into a <script> tag.
The most secure would be:
sanitize markdown (remove all tags - no exceptions)
convert markdown to HTML
sanitize HTML
insert into a DB column marked risky
re-sanitize HTML every time you fetch that column from the DB
That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. This is often inefficient, but you can get pretty good security by storing a timestamp with HTML inserted so that you can tell which might have been inserted during the time when someone knew about an attack that gets past your sanitizer.