I'm using TinyMCE as my online editor, but I'm concerned about XSS attacks and the like.
I thought of replacing all < and >, but that doesn't seem to be an option with this kind of editor, and I'm not sure removing script tags is enough either (what about onclick, onmouseover, and other event attributes?).
What should be my approach to avoid such attacks?
You have to choose between security and convenience. A WYSIWYG editor like TinyMCE is very convenient: it lets non-experts update content, with or without HTML tags, through a web interface. It's the easy way to let someone non-technical edit HTML, and it comes with all kinds of hazards.
Giving users a TinyMCE interface to your database is essentially the same as handing them a database client to update the data directly.
Also, note that a great deal of cross-site scripting today is not malicious: Facebook, LinkedIn, YouTube, and similar integrations require script references to third-party domains.
So if you harden TinyMCE so that XSS cannot be introduced, it will be useless to a serious web developer in many scenarios.
But if you need to make an add/edit/update/delete editor XSS-proof, you have to validate and sanitize all input, and your best choice is to roll your own.
In theory you can eliminate XSS like this, but in practice it's difficult. There always seems to be something you've overlooked.
The best way I've found is to use a regular expression that permits only the specific tags you allow (<strong>, <em>, etc.) and removes all others. You also need to watch for attempts to circumvent your protection by users who encode characters.
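As a rough illustration of that whitelist idea, here is a minimal sketch, assuming Python on the server; the tag list and the sanitize helper are made up for the example, and a battle-tested sanitizer library is still the safer choice for real code:
import html
import re

ALLOWED_TAGS = ("strong", "em", "p", "br")

def sanitize(fragment):
    # Normalise entity-encoded characters first, so encodings such as &#60;
    # can't be used to smuggle markup past the filter in pipelines that decode
    # entities later. (Decoding also turns harmless entities back into markup,
    # so real code needs a more careful pass than this.)
    fragment = html.unescape(fragment)
    allowed = "|".join(ALLOWED_TAGS)
    # Drop any tag whose name is not on the whitelist.
    fragment = re.sub(r"</?(?!(?:%s)\b)[a-zA-Z][^>]*>" % allowed, "", fragment)
    # Strip attributes from the allowed tags, which kills onclick/onmouseover.
    fragment = re.sub(r"<(/?)(%s)\b[^>]*>" % allowed, r"<\1\2>", fragment)
    return fragment

print(sanitize('<p onclick="x()">hi <script>evil()</script></p>'))
#=> <p>hi evil()</p>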
For the site I am building I would like users to be able to provide embed codes for video and audio sites. I know this poses a security risk, so I wanted to find out how best to filter the provided HTML within Django so that only certain tags and certain sites are allowed.
Does anyone have any references on how I can accomplish this with Django?
You may be better off using a lightweight markup language and then converting it to HTML. This prevents users from playing games to get around whatever HTML checking you do. Fully and correctly checking HTML for gotchas is very difficult to do.
Doing it this way follows the school of "that which is not explicitly permitted is prohibited."
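A minimal sketch of that idea, assuming the third-party markdown and bleach packages (neither ships with Django); the tag and attribute whitelists below are purely illustrative:
import bleach
import markdown

ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li", "blockquote", "code"}
ALLOWED_ATTRS = {"a": ["href", "title"]}

def render_user_text(raw):
    # Lightweight markup in, HTML out...
    rendered = markdown.markdown(raw)
    # ...then strip anything that is not explicitly permitted.
    return bleach.clean(rendered, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)
Whitelisting the embed sites themselves (the original question) would still need an extra check on each URL's hostname before it is allowed through.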
How does one intelligently parse data returned by search results on a page?
For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML of each page and use some regexes to make the data work for my web service, but if any of the websites change the formatting of their pages, my code breaks!
RSS is indeed a marvelous option, but many sites don't have an XML/JSON-based search.
Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page and parse the results accordingly...
I've done some of this recently, and here are my experiences.
There are three basic approaches:
1. Regular Expressions.
Most flexible, easiest to use with loosely-structured info and changing formats.
Harder to do structural/tag analysis, but easier to do text matching.
Built-in validation of data formatting.
Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
Generally slower than 2 and 3.
Works well for lists of similarly-formatted items.
A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.
I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
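For instance, a rough Python sketch of option 1; the markup below is hypothetical, just to show lazy quantifiers and named capturing groups pulling fields out of repeated, similarly formatted blocks:
import re

# Hypothetical result block: <div class="result"><a href="...">Title</a> <span class="price">...</span>
ITEM_RE = re.compile(
    r'<div class="result">\s*'
    r'<a href="(?P<link>.*?)">(?P<title>.*?)</a>\s*'
    r'<span class="price">(?P<price>.*?)</span>',
    re.DOTALL,
)

def extract_items(page_html):
    return [m.groupdict() for m in ITEM_RE.finditer(page_html)]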
2. Convert HTML to XHTML and use XML extraction tools. Clean up HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
Tools: TagSoup, HTMLTidy, etc.
Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).
Most suitable for getting link structures, nested tables, images, lists, and so forth.
Should be faster than option 1, but slower than option 3.
Works well if content formatting changes/is variable, but document structure/layout does not.
If the data isn't structured by HTML tags, you're in trouble.
Can be used with option 1.
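Before moving on, here is a minimal Python sketch of option 2 using lxml.html, which tolerates messy markup and exposes the repaired tree to XPath (a stand-in for the TagSoup/HTMLTidy route; the class name in the query is hypothetical):
import lxml.html

def result_titles(page_html):
    tree = lxml.html.fromstring(page_html)  # parses tag soup into a proper element tree
    return tree.xpath('//div[@class="result"]/a/text()')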
3. Parser generator (ANTLR, etc.) -- create a grammar for parsing & analyzing the page.
I have not tried this because it was not suitable for my (messy) pages.
Most suitable if the HTML is highly structured, very constant, regular, and never changes.
Use this if there are easy-to-describe patterns in the document that don't involve HTML tags but do involve recursion or complex behaviors.
Does not require XHTML input.
FASTEST throughput, generally.
Big learning curve, but easier to maintain.
I've tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: a mix of XML and a pseudo-Java scripting language. If you like Java and XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.
Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds a tree, then selecting elements from that tree; it would be more maintainable.
Obviously the best way is some XML output from the engine with fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.
Bottom line, grepping specific elements with regexps would be grim. A better approach is to build a DOM-like model of the page and look for 'anchors' to character data in the tags.
Or send an email to the site stating a case for an XML API ... you might get hired!
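A small sketch of that DOM-and-anchors approach, assuming Python and the third-party BeautifulSoup package; the cache_link class is the hypothetical hook from the example above:
from bs4 import BeautifulSoup

def cache_links(page_html):
    soup = BeautifulSoup(page_html, "html.parser")
    # Anchor on a stable class name rather than on the surrounding markup.
    return [a.get("href") for a in soup.find_all("a", class_="cache_link")]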
You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from the SAX stream TagSoup provides
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title"); // query the tree with XPath, binding the XHTML namespace to "h"
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);
I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:
"//h:div[contains(#class,'question-summary')]/h:div[#class='summary']//h:h3"
You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:
Beautiful Soup (Python)
HTML Agility Pack (.NET)
There are also web services that do exactly what you're describing, both commercial and free. They scrape sites and offer web-service interfaces.
One generic web service that offers some screen scraping is Yahoo Pipes; there's a previous Stack Overflow question on that.
It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note that this is a Python module.
Unfortunately 'scraping' is the most common solution: as you said, attempting to parse HTML from websites. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn't result in bum data. Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.
Alternatively you can stick to the smaller datasets provided by APIs. Yahoo is working very hard to provide searchable data through APIs (see YDN), and I think the Amazon API opens up a lot of book data, etc.
Hope that helps a little bit!
EDIT: And if you're using PHP, I'd recommend SimpleHTMLDOM.
Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you could specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than using regexps.
Example from the Hpricot wiki:
require 'hpricot'

doc = Hpricot(open("qwantz.html"))
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
I am sure you could find a library that does similar things in .NET or Python, etc.
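For what it's worth, roughly the same query in Python with the third-party pyquery package (CSS selectors on top of lxml) might look like this:
from pyquery import PyQuery as pq

doc = pq(filename="qwantz.html")
comics = doc('div img[src^="http://www.qwantz.com/comics/"]')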
Try googling for screen scraping + the language you prefer.
I know several options for python, you may find the equivalent for your preferred language:
Beautiful Soup
mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
lxml: Python bindings for libxml2 and libxslt.
scrapemark: uses templates to scrape pieces of pages.
pyquery: allows you to make jQuery-style queries on XML/XHTML documents.
scrapy: a high-level scraping and web-crawling framework for writing spiders to crawl and parse web pages.
Depending on the website to scrape you may need to use one or more of the approaches above.
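As a rough sketch of the scrapy option, a spider for the book-search example might look like this; the URL and CSS selectors are placeholders for whatever the target site actually uses:
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://example.com/search?q=python"]  # placeholder search URL

    def parse(self, response):
        for row in response.css("div.result"):  # placeholder selector for one search result
            yield {
                "title": row.css("a.title::text").get(),
                "link": row.css("a.title::attr(href)").get(),
            }
You can run a standalone spider like this with scrapy runspider books_spider.py -o books.json to collect the scraped items as JSON.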
If you can use something like Tag Soup, that'd be a place to start. Then you could treat the page like an XML API, kinda.
It has Java and C++ implementations; it might work!
Parsley at http://www.parselets.com looks pretty slick.
It lets you define 'parslets' using JSON, in which you define what to look for on the page, and it then parses that data out for you.
As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html
There is a very interesting online service for parsing websites: https://loadsiteinmysql.site. This service splits a site into tags and loads them into MySQL tables, which lets you parse sites using MySQL syntax.
I'd like to create a technical wiki site, and it requires full use of HTML/CSS and maybe JavaScript when editing a page. Is this something I can easily configure in MediaWiki? If not, is there any other wiki software that you'd recommend?
Thanks!
You can enable raw HTML support by setting $wgRawHtml = true; in your LocalSettings.php:
http://www.mediawiki.org/wiki/Manual:$wgRawHtml
However, as noted above this is rather insecure for a public site. (If locked down to registered usage only by known folks it's ok -- but you need to trust your users.)
There are some links on that manual page to extensions organized around letting you put specific known bits of HTML/JS in your output code as well, which may or may not fit your needs better.
Well, while MediaWiki itself does not support this, there are some extensions that allow at least HTML in a page. See for example this extension list. SecureHTML might do what you are looking for.
That said, I'd like to point out that allowing raw HTML rather defeats the purpose of a wiki:
it can and will mess up formatting and create weird problems (clashes between generated and user-provided HTML)
it makes it hard/impossible to convert the wiki to other formats (such as to print it)
it makes searching harder
it makes any kind of security impossible (think XSS)
This is doubly true for allowing Javascript.
So I'd like to ask why you need this. If you need special formatting that MediaWiki does not offer, consider using (or writing) an extension for this.
If you really need arbitrary HTML, a Wiki might not be the best tool for you. You should consider a CMS, or just put HTML files into Subversion.
So what are you trying to do?
Use nowiki tags. Docs can be found here: https://www.mediawiki.org/wiki/Help:Formatting
Currently our team is using MoinMoin as a wiki for IT and it's so nice.
We want to promote wiki use among end-users because some of them are interested. On the wiki we'll share and edit the requirements of applications, for instance.
I don't think MoinMoin is the most user-friendly option (though I love using it), which is why we are looking for the most user-friendly wiki for end-users/customers.
For yourself MoinMoin is obviously user-friendly. =) Seriously, consider all of your users and try to figure out what kinds of usage patterns you have. MoinMoin is a reasonable choice since it's such a simple program. You can often help your non-programmer users by adding a feature or two to MoinMoin. Developers are up to speed with it, and you have all the content there already.
That said, MediaWiki is used for lots of general wikis out there today, including Wikipedia. One aspect of user friendliness is recognition: MediaWiki might feel more friendly because users are more familiar with how it works. MediaWiki is also widely adopted, so lots of extra features you might want to add to help your users are already written as extensions, and MediaWiki's extension API is really good, so you can easily automate your own verticals when the need arises. MediaWiki is reasonably feature-rich without being totally overloaded. It has categories and templates, which both come in handy for keeping things DRY and for using the wiki in various processes. It shares a lot of its syntax with MoinMoin, since both have the same ancestor (syntax-wise).
I'd probably go with Mediawiki.
Visit Wikimatrix.org to determine what features you need and what tool is best for you. I often mention Foswiki.org as a very nice and user-friendly tool, but it really depends on the features that you need.
I have yet to see any Wiki that is more end-user friendly than Confluence.
Just the most important reasons:
While other wikis say they have WYSIWYG editors, what they actually do is enclose selected text with markup when you click an icon. That is not WYSIWYG, that's code injection! In Confluence 5 all editing is done in a visual editor (you actually DO see what you get right away, straight in the editor), with the ability for power users to add macros (markup).
In almost all other wikis the users are entirely responsible for creating and maintaining the link hierarchy, which means broken links and orphaned pages will be the norm. In Confluence all pages are automatically added to a page hierarchy and sorted by name. You can enable the tree browser via the Documentation theme to make browsing the wiki convenient even without manually added links. Lastly, you can reorder pages in any order via drag & drop.
However, Confluence is rather costly for more than 10 users, but it is well worth it if you can afford it or don't need more than 10 editors. Pure "readers" do not count toward users if anonymous viewing is enabled.