HTML entities are not displayed correctly in Textile because the ampersand is converted into &amp; by the system. Is there any way to input e.g. &#x2318; and actually get ⌘?
Wrapping the entity in ==&#x2318;== does disable Textile processing for that block. Maybe that's as good as it gets?
I'm using RedCloth.
After having had a look through RedCloth sources, there does not seem to be a way of doing this.
Maybe it's just as well, on second thought: Textile is not necessarily meant to output HTML, but HTML entities are entirely an HTML-specific solution. I'll just input the funky characters directly.
RedCloth does not handle hex values in HTML entities, but others, like &#8364; or &euro;, are processed properly.
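For what it's worth, the failure mode is just what any ampersand-escaping pass does to an entity: once the & is escaped, the entity becomes literal text. A quick Python illustration of the mechanism (not RedCloth itself, which is Ruby):

import html

entity = "&#x2318;"  # hex entity for the Command-key symbol

# Escaping the ampersand turns the entity into visible text...
print(html.escape(entity))                  # &amp;#x2318; -- renders literally as "&#x2318;"

# ...while the already-decoded character passes through untouched,
# which is why typing the character directly works.
print(html.escape(html.unescape(entity)))   # ⌘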
I'm using CKEditor in Markdown format to submit user-created content. I would like to sanitize this content of malicious tags, but I would like to keep the formatting produced by the Markdown parser. I've tried two methods that do not work.
Method one
<!--- Sanitize post content --->
<cfset this.text = HTMLEditFormat(this.text)>
<!--- Apply mark down parser --->
<cfx_markdown textIn="#this.text#" variable="parsedNewBody">
Problem: for some reason <pre> and <blockquote> are being escaped, so I'm unable to use them; only the escaped special characters appear. Other Markdown formatting works well, such as bold, italic, etc. Could it be that CKEditor does not apply Markdown correctly to <pre> and <blockquote>?
Example: if I were to type <pre><script>alert("!");</script></pre>, I would get the following shown as literal text: &lt;script&gt;alert("!");&lt;/script&gt;
Method two
Same as method one, but with the order reversed so that sanitization takes place after the Markdown parser has done its work. This is effectively useless, since the sanitization function escapes all the tags, malicious ones and those created by the Markdown parser alike.
While I want to sanitize malicious content, I do want to keep basic HTML tags and the contents of <pre> and <blockquote> tags. Any ideas how?
Thanks!
There are two important sanitizations that need to be done on user generated content. First, you want to protect your database from SQL injection. You can do this by using stored procedures or the <cfqueryparam> tag, without modifying the data.
The other thing you want to do is protect your site from XSS and other content-display based attacks. The way you do this is by sanitizing the content on display. It would be fine, technically, to do it before saving, but generally the best practice is to store the highest fidelity data possible and only modify it for display. Either way, I think your problem is that you're doing this sanitization out of order. You should run the Markdown formatter on the content first, THEN run it through HTMLEditFormat().
It's also important to note that HTMLEditFormat will not protect you from all attacks, but it's a good start. You'll want to look into implementing OWASP utilities, which is not difficult in ColdFusion, as you can directly use the provided Java implementation.
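The <cfqueryparam> point is language-agnostic: parameterize the query rather than concatenating user text into SQL. A minimal sketch of the same idea in Python (sqlite3 used purely for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")

user_text = "'); DROP TABLE posts; --"   # hostile input is stored harmlessly

# The placeholder keeps the input as data, never as SQL,
# and the stored text is not modified on the way in.
conn.execute("INSERT INTO posts (body) VALUES (?)", (user_text,))
conn.commit()

print(conn.execute("SELECT body FROM posts").fetchone()[0])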
Why don't you just prepend and append the <pre> tags after parsing?
I mean, if you only care about the first and last pre and you don't have nested pres or similar. If your cfx tag clears pre, make a new wrapper method that checks whether <pre> exists and, if not, adds it. Also, if you use pre tags I guess newline characters are important, so check what your cfx tag does with those.
Maybe HTMLEditFormat's twin, HTMLCodeFormat, is what you need?
My data coming from the database might contain some html. If I use
string dataFromDb = "Some text<br />some more <br><ul><li>item 1</li></ul>";
HttpContext.Current.Server.HtmlEncode(dataFromDb);
Then everything gets encoded and I see the escaped HTML as plain text on the screen.
However, I want the safe HTML in dataFromDb above to actually render.
I think what I'm trying to create is a whitelist to check against.
How do I go about doing this?
Is there some Regex already out there that can do this?
Check out this article; the AntiXSS library is also worth a look.
You should use the Microsoft AntiXSS library. I believe the latest version is available here. Specifically, you'll want to use the GetSafeHtmlFragment method.
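If you want to see what a whitelist-based cleaner does, here is a rough sketch of the idea in Python using the bleach library (the tag set is just an example matching the markup above; in .NET you would reach for AntiXSS as suggested):

import bleach

data_from_db = 'Some text<br />some more <br><ul><li>item 1</li></ul><script>alert(1)</script>'

# Only the listed tags survive; everything else is stripped or escaped.
safe_html = bleach.clean(
    data_from_db,
    tags={"br", "ul", "ol", "li", "p", "em", "strong"},
    attributes={},   # no attributes allowed at all
    strip=True,      # drop disallowed tags instead of escaping them
)

print(safe_html)     # the <br> and <ul><li> structure survives; the script tag does not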
So here is my situation, and the solution that I've come up with to solve the problem. I have created an application that includes TinyMCE to allow users to create HTML content for publishing. The user can include images in their markup, and drag/resize those images affecting the final Width/Height attributes in the IMG tag. This is all great, the users can include images and resize/relocate them to their desired appearance. But one big problem is that I am now sending a (possibly) much larger image to the client, only to have the browser resize the image into the requested Width/Height attributes. All that bandwidth and lost load time....
So my solution is to pre-process my users' markup content, scanning all of the img tags and parsing out the height/width/src attributes, then setting each img's src to a phpThumb request with the parsed height/width passed into the thumbnail URL. This will create my reduced-size image (optimising bandwidth at the expense of CPU and caching). What do you think about this solution? I've seen other posts where people were using mod_rewrite to do something similar, but I want to transform the content when the page is served rather than manipulate the image requests as they're received. Any thoughts about this design?
I need some help with the fine details, as my regex skills need some work, but I'm very short on time and promise to pay my technical-knowledge debt soon. To make the regexes easier, I can be sure of some things: only img tags that need this processing will have existing width="" and height="" attributes (with double quotes and lower-cased text, though I suppose matching case-insensitively would be better in case TinyMCE changes).
So: a regex to match only the necessary img tags, and maybe another three regexes to extract the src, the width, and the height?
Thanks everyone.
I think using regexes for this is a bad idea and you'd be better off parsing it with something like PHP Simple HTML DOM Parser; then you can do something like:
// Create a parser object and load the HTML from a string
// (assumes simple_html_dom.php has already been included)
$html = new simple_html_dom();
$html->load($your_posted_content);
// Find all images and read their attributes
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
Try this:
(?i)<img(?>\s+(?>src="([^"]*)"|width="([^"]*)"|height="([^"]*)"|\w+="[^"]*"))+
That will match any image tag, and if the src, width, and height attributes are present, their values will be stored in groups 1, 2, and 3 respectively. But it doesn't require any of those attributes to be there, so you'll want to verify that all three groups contain values before processing.
Generally speaking, regex is not good for HTML parsing. But in your case you may be able to get away with it if you're limiting the scope to something very narrow (i.e. only searching for the width=".." and height=".." attributes, or something like that).
A better solution might be to transfer the content from TinyMCE asynchronously, behind the scenes, process it server-side with a proper HTML/XML parser, and then update the content of the editor once that's done.
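In that parser-over-regex spirit, here is a rough Python sketch of the pre-processing step: walk the markup, pick out img tags that carry src, width and height, and point their src at a phpThumb-style URL (the phpThumb.php parameter names and the helper are assumptions; adjust them to your install):

from html.parser import HTMLParser
from urllib.parse import urlencode

class ImgScanner(HTMLParser):
    """Collect (src, width, height) for every <img> that has all three attributes."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        if "src" in a and "width" in a and "height" in a:
            self.images.append((a["src"], a["width"], a["height"]))

def thumb_url(src, width, height):
    # Hypothetical phpThumb request; check the parameter names against your phpThumb.php.
    return "phpThumb.php?" + urlencode({"src": src, "w": width, "h": height})

content = '<p><img src="photos/big.jpg" width="300" height="200" alt="x"></p>'
scanner = ImgScanner()
scanner.feed(content)

# Rewrite each matched src to the resized version in the stored markup.
for src, w, h in scanner.images:
    content = content.replace('src="%s"' % src, 'src="%s"' % thumb_url(src, w, h))

print(content)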
I'm using HTML_Template for templating in my C++-based web app (don't ask). I chose that because it was very simple and it turns out to be a good solution.
The only problem right now is that I would like to be able to include translatable strings in the HTML templates (HTML_Template does not really support that).
Ultimately, what I would like is to have a single file that contains all the strings to be translated. It can then be given to a translator and plugged back in to the app and used depending on which language the user chose in settings.
I've been going back and forth on some options and was wondering what others felt was the best choice (or if there's a better choice that isn't listed)
Extend HTML_Template to include a tag for holding the literal string to translate. So, for example, in the HTML I would put something like
<TMPL_TRANS "this is the text to translate"/>
Use a completely separate scheme for translation and preprocess the HTML files to generate the final template files (without the special translation lingo); there is a sketch of such a preprocessor after this list of options. For example, in the pre-processed file, translatable text would look like this:
{{this is the text to translate}}
and the final would look like:
this is the text to translate
Don't do anything and let the translators find the strings to translate in the HTML and JS files themselves.
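For option 2, the pre-processing pass can be a very small script: pull every {{...}} marker out of the template, look it up in the translation table for the chosen language, and write out the final template. A sketch in Python (the table and its contents are hypothetical; in practice it would be loaded from the file handed back by the translator):

import re

# Hypothetical translation table keyed by language code.
translations = {
    "de": {"this is the text to translate": "dies ist der zu übersetzende Text"},
}

def preprocess(template_text, lang):
    """Replace {{...}} markers with the translated string, falling back to the original text."""
    def lookup(match):
        source = match.group(1)
        return translations.get(lang, {}).get(source, source)
    return re.sub(r"\{\{(.*?)\}\}", lookup, template_text)

print(preprocess("<p>{{this is the text to translate}}</p>", "de"))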
You may want to consider arrays, if you aren't using them already.
A popular implementation for translating strings is to use tables and indices: one index for the language and a second index for the string. Create a function that returns strings based on these two indices:
const std::string& Get_String(unsigned int language_index, unsigned int string_index);
Each language would have a table of strings (or const char *). There would be a table of pointers to language tables, one for each supported language.
The biggest pain is to convert existing code to use this system.
Hope this helps.
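The same two-index lookup, sketched in Python rather than C++ just to show the shape (the language order and strings are hypothetical):

# One table of strings per language; the outer table is indexed by language.
STRINGS = [
    ["Hello", "Goodbye"],        # language 0: English
    ["Bonjour", "Au revoir"],    # language 1: French
]

def get_string(language_index, string_index):
    return STRINGS[language_index][string_index]

print(get_string(1, 0))  # Bonjour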
OK, so I have been reading about Markdown here on SO and elsewhere, and the steps between user input and the DB are usually given as:
convert markdown to html
sanitize html (w/whitelist)
insert into database
but to me it makes more sense to do the following:
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Am I missing something? This seems to me to be pretty nearly XSS-proof.
Please see this link:
http://michelf.com/weblog/2010/markdown-and-xss/
> hello <a name="n"
> href="javascript:alert('xss')">*you*</a>
Becomes
<blockquote>
<p>hello <a name="n"
href="javascript:alert('xss')"><em>you</em></a></p>
</blockquote>
∴ you must sanitize after converting to HTML.
There are two issues with what you've proposed:
I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
Considerably more important: when using Markdown as the "native" formatting language and whitelisting the other available tags, you are limiting not just the input side of the world, but the output as well. In other words, if your display engine expects Markdown and only allows whitelisted content out, then even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected, because you are sanitizing it upon display as well.
There are some good resources on the web about output sanitization:
Sanitizing user data: Where and how to do it
Output sanitization (One of my clients, who shall remain nameless and whose affected system was not developed by me, was hit with this exact worm. We have since secured those systems, of course.)
BizTech: Best Practices: Never heard of XSS?
Well, certainly removing/escaping all tags would make a markup language more secure. However, the whole point of Markdown is that it allows users to include arbitrary HTML tags as well as its own forms of markup (*). When you are allowing HTML, you have to clean/whitelist the output anyway, so you might as well do it after the Markdown conversion to catch everything.
*: It's a design decision I don't agree with at all, and one that I think has not proven useful at SO, but it is a design decision and not a bug.
Incidentally, step 3 should really be ‘output to page’: the conversion and sanitization normally take place at the output stage, with the database containing the raw submitted text:
insert into database
convert markdown to html
sanitize html (w/whitelist)
In Perl:
use Text::Markdown ();
use HTML::StripScripts::Parser ();
my $hss = HTML::StripScripts::Parser->new(
{
Context => 'Document',
AllowSrc => 0,
AllowHref => 1,
AllowRelURL => 1,
AllowMailto => 1,
EscapeFiltered => 1,
},
strict_comment => 1,
strict_names => 1,
);
$hss->filter_html(Text::Markdown::markdown(shift))
convert markdown to html
sanitize html (w/whitelist)
insert into database
Here, the assumptions are
Given dangerous HTML, the sanitizer can produce safe HTML.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Here the assumptions are
Given dangerous markdown, the sanitizer can produce markdown that when converted to HTML by a different program will be safe.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong than the simpler unsafeHTML->safeHTML function above.
As a concrete example, "remove all tags" assumes you can identify tags, and would not work against UTF-7 attacks. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert (full-width '<', exotic white-space characters stripped by markdown, SCRIPT) into a <script> tag.
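The simpler unsafeHTML->safeHTML route described above looks roughly like this in practice, sketched in Python with the markdown and bleach packages (the tag whitelist is only an example):

import bleach
import markdown

user_input = 'hello <a name="n" href="javascript:alert(\'xss\')">*you*</a>'

# 1. Convert markdown to HTML first, so the sanitizer sees the final markup.
raw_html = markdown.markdown(user_input)

# 2. Then whitelist tags, attributes and URL schemes on the HTML itself.
safe_html = bleach.clean(
    raw_html,
    tags={"p", "em", "strong", "a", "ul", "ol", "li", "pre", "code", "blockquote"},
    attributes={"a": ["href", "title"]},
    protocols=["http", "https", "mailto"],
)

print(safe_html)  # the javascript: href does not survive the whitelist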
The most secure would be:
sanitize markdown (remove all tags - no exceptions)
convert markdown to HTML
sanitize HTML
insert into a DB column marked risky
re-sanitize HTML every time you fetch that column from the DB
That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. Re-sanitizing on every fetch is often inefficient, but you can get pretty good security by storing a timestamp alongside the inserted HTML, so that you can tell which rows might have been written during the time when someone knew about an attack that gets past your sanitizer.
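A minimal sketch of that last pipeline in Python, assuming the bleach sanitizer and an sqlite3 table holding the risky HTML plus the insertion timestamp (all of the names here are made up for illustration):

import sqlite3
import time

import bleach

def sanitize(html):
    # Whatever your current whitelist is; re-run on every read so sanitizer upgrades apply retroactively.
    return bleach.clean(html, tags={"p", "em", "strong", "pre", "blockquote"}, attributes={})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, risky_html TEXT, inserted_at REAL)")

def store(html):
    conn.execute(
        "INSERT INTO posts (risky_html, inserted_at) VALUES (?, ?)",
        (sanitize(html), time.time()),           # sanitized once on the way in...
    )

def fetch(post_id):
    row = conn.execute(
        "SELECT risky_html, inserted_at FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    html, inserted_at = row
    return sanitize(html), inserted_at           # ...and again on the way out

store("<p>hi <script>alert(1)</script></p>")
print(fetch(1))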