django - ckeditor bug renders text/string in raw html format - django

am using django ckeditor. Any text/content entered into its editor renders raw html output on the webpage.
for ex: this is rendered output of ckeditor field (RichTextField) on a webpage;
<p><span style="color:rgb(0, 0, 0)">this is a test file ’s forces durin</span><span style="color:rgb(0, 0, 0)">galla’s good test is one that fails Thereafter, never to fail in real environment. </span></p>
I have been looking for a solution for a long time now but unable to find one :( There are some questions which are similar but none of those have been able to help. It will be helpful if any changes suggested are provided with the exact location where it needs to be changed. Needless to say I am a newbie.
Thanks

You need to mark the relevant variable that contains the html snippet in your template as safe
Obviously you should be sure, that the text comes from trusted users and is safe, because with the safe filter you are disabling a security feature (autoescaping) that Django applies per default.
If your ckeditor is part of a comment form and your mark the entered text as safe, anybody with access to the form could inject Javascipt and other (potentially nasty) stuff in your page.
The whole story is explained pretty well in the official docs: https://docs.djangoproject.com/en/dev/topics/templates/#automatic-html-escaping

Related

TYPO3templating and html tag

I made site package, on TYPO3, by by official documentations, everything ok,
but only one small problem,
when I watch the pages on the browser, I see the HTML TAGs,
I have not answer,
I tried already done package,
same problem....
what to do??
This always happens when you prepare HTML into a fluid-variable and output that variable directly.
In your case you do not assign to a variable, but you output the result from the viewhelper, which contains HTML-markup, directly.
To avoid that the HTML-markup is shown (escaped) you need to use an additional viewhelper: f:format.raw
either:
<f:format.raw><f:cObject typoscriptObjectPath="..." data="..."/></f:format.raw>
or:
{f:cObject(typoscriptObjectpath:'...', data:'...')->f:format.raw()}

content empty when using scrapy

Thanks for everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for Chinese stock market.
When I tried to get the first number "42177" just under the banner of this page (the number you see on that webpage may not be the number you see in the picture shown here, because it represents the number of times this article has been read and is updated realtime...), I always get an empty content. I am aware that this might be the dynamic content issue, but yet don't have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[#id='zwmbti']/div[#id='zwmbtilr']/span[#class='tc1']/text()").extract()
I think the xpath is set correctly and I have checked the return value of this response and it indeed told me that there is nothing under this directory. Results shown here:'read': [u'<div id="zwmbtilr"></div>']
If it has something, there should be something between <div id="zwmbtilr"> and </div>.
Really appreciated if you guys share any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There nothing inside the <div #id='zwmbtilr'></div>. If I enable the javascripts, I can see the content you want. So, as you already new, it is a dynamic content issue.
Your first option is try to identify the request generated by javascript. If you can do that, you can send the same request from scrapy. If you can't do it, the next option is usually to use some package with javascript/browser emulation or someting like that. Something like ScrapyJS or Scrapy + Selenium.

Django. How to make html(tinymce) input safe on form.clean?

For example, I have "Add comment" form on my django-powered website.
This form have text field with tinymce.
I want user to be able to use only p,strong,i,ul,ol,li tags. Because, result is html-code, I can't use strip_tags on my AddCommentForm.clean_text method. Also, I need to be sure, that result doesn't contain any vulnerabilities (js, iframe, etc)
I believe, that you can advice me a good solution for this))
This can be done at the TinyMCE side via some configuration parameters. While it's not 100% secure fro someone POSTing directly, it's better than nothing.
It should just be a matter of tweaking your valid_elements config in your TinyMCE setup to only allow what you want:
...
valid_elements : "p,strong/b,i/em,ul,ol,li",
...

Multiple pages html output from a .rst document in Django

I'm writing a Django app to serve some documentation written in RestructuredText.
I have many documents written in *.rst, each of them is quite long with many section, subsection and so on.
Display the whole document in a single page is not a problem using Django filters, but I'd rather have just the topic index on a first page, whit links to an URL where I can display a single section / subsection (which will need some 'previous | up | home | next' link I guess...). In a way similar to a 'multiple HTML page output' as in a docbook / XML to HTML conversion.
Can anyone point me to some direction to build a document tree of a *.rst document an parse a single section of it, or suggest a clever way to obtain a similar result?
Choice 1. Include URL links to the other parts of the document.
You write an index.rst, part1.rst, part2.rst, etc. And your index.rst has links to the other parts. This requires almost no work, except careful planning to make sure that your RST HTML links are correct.
There's no "parse". You just break your document into sections. Manually.
[This seems so obvious, I'm afraid to mention it.]
Choice 2. Use Sphinx. It manages table-of-contents and inter-document connections very nicely.
However, the Sphinx extensions to RST aren't handled directly by Django, so you'd need to save the Sphinx output and then display that in Django. We use the JSON HTML Builder (http://sphinx.pocoo.org/builders.html?highlight=json#sphinx.builders.html.JSONHTMLBuilder) output from Sphinx. Then we render these documents through a template.

Markdown and XSS

Ok, so I have been reading about markdown here on SO and elsewhere and the steps between user-input and the db are usually given as
convert markdown to html
sanitize html (w/whitelist)
insert into database
but to me it makes more sense to do the following:
sanitize markdown (remove all tags -
no exceptions)
convert to html
insert into database
Am I missing something? This seems to me to be pretty nearly xss-proof
Please see this link:
http://michelf.com/weblog/2010/markdown-and-xss/
> hello <a name="n"
> href="javascript:alert('xss')">*you*</a>
Becomes
<blockquote>
<p>hello <a name="n"
href="javascript:alert('xss')"><em>you</em></a></p>
</blockquote>
∴​ you must sanitize after converting to HTML.
There are two issues with what you've proposed:
I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
Considerably more important: When using Markdown as the "native" formatting language, and whitelisting the other available tags,you are limiting not just the input side of the world, but the output as well. In other words, if your display engine expects Markdown and only allows whitelisted content out, even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected because you are sanitizing it upon display, as well.
There are some good resources on the web about output sanitization:
Sanitizing user data: Where and how to do it
Output sanitization (One of my clients, who shall remain nameless and whose affected system was not developed by me, was hit with this exact worm. We have since secured those systems, of course.)
BizTech: Best Practices: Never heard of XSS?
Well certainly removing/escaping all tags would make a markup language more secure. However the whole point of Markdown is that it allows users to include arbitrary HTML tags as well as its own forms of markup(*). When you are allowing HTML, you have to clean/whitelist the output anyway, so you might as well do it after the markdown conversion to catch everything.
*: It's a design decision I don't agree with at all, and one that I think has not proven useful at SO, but it is a design decision and not a bug.
Incidentally, step 3 should be ‘output to page’; this normally takes place at the output stage, with the database containing the raw submitted text.
insert into database
convert markdown to html
sanitize html (w/whitelist)
perl
use Text::Markdown ();
use HTML::StripScripts::Parser ();
my $hss = HTML::StripScripts::Parser->new(
{
Context => 'Document',
AllowSrc => 0,
AllowHref => 1,
AllowRelURL => 1,
AllowMailto => 1,
EscapeFiltered => 1,
},
strict_comment => 1,
strict_names => 1,
);
$hss->filter_html(Text::Markdown::markdown(shift))
convert markdown to html
sanitize html (w/whitelist)
insert into database
Here, the assumptions are
Given dangerous HTML, the sanitizer can produce safe HTML.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database
Here the assumptions are
Given dangerous markdown, the sanitizer can produce markdown that when converted to HTML by a different program will be safe.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong than the simpler unsafeHTML->safeHTML function above.
As a concrete example, "remove all tags" assumes you can identify tags, and would not work against UTF-7 attacks. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert (full-width '<', exotic white-space characters stripped by markdown, SCRIPT) into a <script> tag.
The most secure would be:
sanitize markdown (remove all tags - no exceptions)
convert markdown to HTML
sanitize HTML
insert into a DB column marked risky
re-sanitize HTML every time you fetch that column from the DB
That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. This is often inefficient, but you can get pretty good security by storing a timestamp with HTML inserted so that you can tell which might have been inserted during the time when someone knew about an attack that gets past your sanitizer.