HTML and Attribute encoding - xss

I came across a post on Meta SO and I'm curious about the subtle differences between unencoded and encoded HTML characters in HTML attributes, in terms of security, best practice, and browser support.

HTML encoding replaces certain characters that are semantically meaningful in HTML markup with equivalent character references that are displayed to the user without affecting how the markup is parsed.
The most significant and obvious characters are <, >, &, and ", which are replaced with &lt;, &gt;, &amp;, and &quot;, respectively. Additionally, an encoder may replace high-order characters with the equivalent HTML entity encoding, so content can be preserved and properly rendered even if the page is sent to the browser as ASCII.
HTML attribute encoding, on the other hand, only replaces the subset of those characters that could cause a string to break out of the attribute of an HTML element. Specifically, you'd typically just replace ", &, and < with &quot;, &amp;, and &lt;. This is because attributes, the data they contain, and the way a browser or HTML parser interprets them differ from how an HTML document and its elements are read.
In terms of how that relates to XSS, you want to properly sanitize strings from an outside source (such as the user) so they don't break your page, or more importantly, inject markup and script that can alter or destroy your application or affect your users' machines (by taking advantage of browser or platform vulnerabilities).
If you want to display user-generated content in your page, you'd HTML encode the string and then display it in your markup, and everything they entered will be displayed literally without worrying about XSS or broken markup.
If you needed to attach user-generated content to an element in an attribute (for example, a tooltip on a link), you'd attribute encode to make sure the content doesn't break the element's markup.
Could you just use the same function for HTML encoding to handle attribute encoding? Technically, yes. In the case of the meta question you linked, it sounds like they were taking HTML that was encoded and decoding it, then using that result as an attribute value, which results in encoded markup being displayed literally, if you follow.
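As a rough illustration of the two cases above, here is a minimal Python sketch using the standard library's html.escape. The helper names encode_for_html and encode_for_attribute are made up for this example. Note that html.escape with quote=True escapes > and ' in addition to the ", &, and < mentioned above, so it is a superset of the typical attribute encoder, which is why one function can technically serve both purposes.

```python
import html

def encode_for_html(text: str) -> str:
    # Element content: escape &, < and > so the text cannot open a tag.
    return html.escape(text, quote=False)

def encode_for_attribute(text: str) -> str:
    # Quoted attribute value: additionally escape " and ' so the text
    # cannot close the attribute's quotes.
    return html.escape(text, quote=True)

comment = 'Tom & "Jerry" <script>alert(1)</script>'

# User content shown in the page body.
body_markup = "<p>" + encode_for_html(comment) + "</p>"

# User content attached to an element as a tooltip.
link_markup = '<a href="#" title="' + encode_for_attribute(comment) + '">hover</a>'
```

Both results are only safe in the context they were encoded for; the attribute version assumes the attribute value is wrapped in quotes.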

I would recommend looking over OWASP XSS Prevention Rules 1 and 2.
A brief summary...
Rule 1 for HTML
Escape the following characters with HTML entity encoding ...
& --> &amp;
< --> &lt;
> --> &gt;
" --> &quot;
' --> &#x27;
/ --> &#x2F;
Rule 2 for HTML Common Attributes
Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute. The reason this rule is so broad is that developers frequently leave attributes unquoted. Properly quoted attributes can only be escaped with the corresponding quote. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and |.
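A minimal Python sketch of Rule 2 as summarized above might look like the following. The function name escape_attribute_value is hypothetical, and treating only ASCII letters and digits as "alphanumeric" is an assumption of this sketch; a maintained encoder library is preferable in production.

```python
import string

# Assumption: only ASCII letters and digits count as "alphanumeric".
SAFE_CHARS = set(string.ascii_letters + string.digits)

def escape_attribute_value(value: str) -> str:
    out = []
    for ch in value:
        if ch in SAFE_CHARS or ord(ch) >= 256:
            # Alphanumerics and characters outside the 0-255 range pass through.
            out.append(ch)
        else:
            # Everything else below 256 becomes a &#xHH; character reference.
            out.append(f"&#x{ord(ch):02X};")
    return "".join(out)
```

With this encoding, a value such as x onmouseover=alert(1) cannot break out of even an unquoted attribute, because the space and the = sign are both escaped.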

Related

Is it possible to put regex directly into XML content?

I have a XML file I use to manually route users to specific pages in a website.
Currently, we have separate entries for every variation of possible searches (plural, typos etc.). I would like to know if there is a way I can condense it with regex to something like so:
<OnSiteSearch>
...
<Search>
<SearchTerm>(horses?|cows?) for sale</SearchTerm>
<Destination>~/some/path.html</Destination>
</Search>
...
</OnSiteSearch>
Is something like this possible? I've looked online for regex and XML but it seems to be about validating content between the XML tags and not about using regex as the content.
Yes, a regex can be stored in XML as long as you mind XML escaping rules to keep the XML well-formed:
Element content: Escape any < as &lt; and any & as &amp; when writing
the regex; reverse the substitution before using the regex.
Attribute value: Follow rules for element content plus escape any " as
&quot; or any ' as &apos; to avoid conflict with chosen attribute
value delimiters.
CDATA: No escaping needed, but make sure your regex doesn't include
the string ]]>.
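To make the escaping rules concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The second Search entry, its ~/tack.html destination, and the find_destination helper are invented for illustration. An XML parser reverses the &lt; and &amp; substitutions on its own, so the text it returns can be handed to the regex engine directly.

```python
import re
import xml.etree.ElementTree as ET

# The second entry is a made-up example containing < and &, which must be
# written as &lt; and &amp; inside element content.
ROUTES_XML = """<OnSiteSearch>
  <Search>
    <SearchTerm>(horses?|cows?) for sale</SearchTerm>
    <Destination>~/some/path.html</Destination>
  </Search>
  <Search>
    <SearchTerm>(?&lt;!used )saddles? &amp; tack</SearchTerm>
    <Destination>~/tack.html</Destination>
  </Search>
</OnSiteSearch>"""

def find_destination(routes_xml, query):
    """Return the Destination of the first SearchTerm regex matching query."""
    root = ET.fromstring(routes_xml)
    for search in root.iter("Search"):
        # findtext() returns the unescaped text, e.g. "(?<!used )saddles? & tack".
        pattern = search.findtext("SearchTerm")
        if re.fullmatch(pattern, query, flags=re.IGNORECASE):
            return search.findtext("Destination")
    return None
```

Calling find_destination(ROUTES_XML, "cows for sale") returns the first destination, and a query that matches no pattern returns None.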

Character Truncation Issue

We are using the tag cfsavecontent then publishing to a pdf file. Certain characters seem to truncate the text after those characters.
These are the characters we have seen so far that cause the truncation of text
= > < 1
We have tried using this expression
REReplace(data,'<[^>]*>','','all')
<cfsavecontent variable="Abstract">
</cfsavecontent>
Use mimetype="text/plain" with your cfdocument to preserve text. cfdocument defaults to text/html (HTML), which is the reason why characters like < and > mess with your content.
Alternatively you can encode your content for HTML. There's htmlEditFormat (CF9-) and encodeForHtml (CF10+) to do so, e.g. <cfset Abstract = htmlEditFormat(Abstract)>.

How to get string of everything between these two em tag?

I want to get the string between the em tags, including any other HTML inside them.
for example:
<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>
output should be as:
UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown
please help me.
Thanks
Use the regular expression function like this:
REMatch("(?s)<em>.*?</em>", html)
See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html
The (?s) flag enables single-line (dot-all) mode, so that . also matches line feeds and the input is treated as one line. As Peter pointed out in a comment, this is not the default and therefore must be set.
The .*? matches all characters between <em> and </em>. The question mark after the quantifier makes it non-greedy, so that as few characters as possible are matched. This is needed in case the input HTML contains something like <em>foo</em><em>bar</em>, where a greedy match would otherwise run from the first <em> to the last </em>.
The returned array contains all matches found, i.e. all texts including html that was in <em> tags.
Note that this could fail in circumstances where </em> also occurs as attribute text and is incorrectly not HTML-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. inside JavaScript script tags). A regex cannot replace a full HTML/XML parser, and if you need 100% accuracy, you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
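For comparison, the same non-greedy, dot-all pattern can be sketched in Python with the standard re module. This is an illustration only, not ColdFusion code; the name em_contents is made up, and unlike REMatch, which returns the whole match including the tags, the capture group here returns only the inner HTML.

```python
import re

# re.DOTALL is Python's equivalent of the (?s) flag: "." also matches newlines.
EM_PATTERN = re.compile(r"<em>(.*?)</em>", re.DOTALL)

def em_contents(html_text):
    """Return the inner HTML of every <em>...</em> pair, matched non-greedily."""
    return EM_PATTERN.findall(html_text)

sample = (
    "<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />\n"
    "By: K.J.S. McKeown</em>"
)
inner = em_contents(sample)

# A greedy quantifier would swallow the tags between two adjacent elements.
greedy = re.findall(r"(?s)<em>(.*)</em>", "<em>foo</em><em>bar</em>")
```

Here inner holds one string that spans the line break, while greedy holds the single over-long match foo</em><em>bar.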
If your input is exactly in the format given above, you don't even need regex - just strip the outer tags:
<cfsavecontent variable="Input">[text from above]</cfsavecontent>
<cfset Output = mid( Input, 5, len(Input) - 9 ) />
If your input is more than this (i.e. a significant piece of HTML, or a full HTML document), regex is still not the ideal tool - instead, you should be using a HTML parser, such as JSoup:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Output = jsoup.parse(Input).select('em').html() />
(With CF8, this code requires placing the jsoup JAR file in CF's lib directory, or using a tool such as JavaLoader.)
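The parser-based approach is not specific to JSoup. As a rough sketch of the same idea in Python, the standard library's html.parser can collect the inner HTML of each top-level em element. The class name EmInnerHtml is invented for this example, and the sketch ignores comments and other constructs a full parser library would handle.

```python
from html.parser import HTMLParser

class EmInnerHtml(HTMLParser):
    """Collect the raw inner HTML of every top-level <em> element."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0      # how many <em> elements are currently open
        self.buffer = []    # pieces of the element being collected
        self.results = []   # finished inner-HTML strings

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.buffer.append(self.get_starttag_text())
        if tag == "em":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through unchanged.
        if self.depth:
            self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "em" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))
                self.buffer = []
                return
        if self.depth:
            self.buffer.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.buffer.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.buffer.append(f"&#{name};")

def em_inner_html(html_text):
    parser = EmInnerHtml()
    parser.feed(html_text)
    parser.close()
    return parser.results
```

Because the parser tracks nesting, an em inside another em stays part of the outer element's content instead of ending the match early.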
If you are using jquery you can do this also pretty easily.
$("em").html();
Will return all html between the em tags.
I had to remove any text that followed a particular tag. The HTML content was generated dynamically from a database that caters to 5 different languages, so I only had the div tag to help me. I am not sure why REMatch("(?s).*?", html) did not work for me. However, Ben helped me here (http://www.bennadel.com/blog/769-Learning-ColdFusion-8-REMatch-For-Regular-Expression-Matching.htm). My code looks like this:
<cfset extContentArr = REMatch("(?i)<div class=""inlineBlock"" style=""margin-right:30px;"">.+?</div>",qry_getContent.colval) />
<cfif !ArrayIsEmpty(extContentArr)>
<!--- Loop over the array and do whatever you need with the extracted matches; I just deleted them. --->
</cfif>

Filtering out formatting characters between consecutive XML tags with Xerces C++

I'd appreciate pointers on how to get (non-element) text between tags. For example given the element ABC I'd like to get the text ABC.
Currently, I'm able to use DefaultHandler::characters(const XMLCh *const chars, const XMLSize_t length) to get the characters between two consecutive start or end tags. Unfortunately, I'm getting unnecessary newlines and formatting spaces between parent tags and child elements. For example, in the XML below, I'm getting 5 extra formatting characters (one newline and four spaces):
<Parent> <!-- Newline here -->
    <Child>XYX</Child> <!-- Four spaces here -->
</Parent>
What is the best (standard) way of filtering out these formatting characters?
Solved. For posterity's sake, here's how I did it.
Because the desired characters appear between the consecutive start and end tags that define an element, in the method DefaultHandler::startElement() I store the local name at the start of an element and compare it with the next local name that is encountered.
If the next local name encountered belongs to a new element then the intervening characters must be formatting characters and should be ignored.
If however the next element encountered has the same local name then the intervening characters form the desired string.
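The approach described above can be sketched with Python's xml.sax, whose ContentHandler offers startElement, endElement, and characters callbacks that play the same role as the Xerces ones. This is an illustration of the idea rather than the original C++ code; the class name LeafTextHandler and the helper leaf_texts are invented for the example.

```python
import xml.sax

class LeafTextHandler(xml.sax.ContentHandler):
    """Keep only text that sits between a start tag and its own end tag."""

    def __init__(self):
        super().__init__()
        self.last_started = None  # name stored at the most recent start tag
        self.buffer = []          # characters seen since the last tag
        self.texts = []           # (element name, text) pairs that were kept

    def startElement(self, name, attrs):
        # A new element begins: anything buffered so far was formatting.
        self.last_started = name
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        # Same name as the last start tag: the buffer is the desired string.
        if name == self.last_started:
            self.texts.append((name, "".join(self.buffer)))
        self.last_started = None
        self.buffer = []

def leaf_texts(xml_bytes):
    handler = LeafTextHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts

document = b"<Parent>\n    <Child>XYX</Child>\n</Parent>"
```

For the document above, only the Child text is kept; the newline and the four spaces of indentation are discarded because they are followed by a different tag.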

Cross-site scripting: how to check the string

I have a string. This is the value of some attribute of some html tag.
How to check if this string contains javascript?
For example (SRC attribute of IMG tag):
1. <IMG src="javascript:alert('XSS')"> - contains script
2. <IMG src="JaVaScRiPt:alert('XSS')"> - contains script
3. <IMG javascript:alert('XSS')> - also contains javascript
You first have to canonicalize, then check. But I would look at HtmlPurifier or OWASP AntiSamy for that.
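To show what "canonicalize, then check" can mean for a URL attribute such as src, here is a deliberately small Python sketch. The function name has_allowed_scheme and the allow-list of http and https are assumptions of this example. It covers only the scheme check for a single attribute value and is not a substitute for a library such as HtmlPurifier or AntiSamy.

```python
import html
import re

# Assumption for this sketch: only these schemes are acceptable.
ALLOWED_SCHEMES = {"http", "https"}

def has_allowed_scheme(attr_value: str) -> bool:
    # Step 1, canonicalize: decode character references, then drop whitespace
    # and control characters that could be used to disguise the scheme.
    decoded = html.unescape(attr_value)
    cleaned = "".join(ch for ch in decoded if ord(ch) > 0x20)

    # Step 2, check: compare the scheme (if any) against the allow-list.
    match = re.match(r"([a-zA-Z][a-zA-Z0-9+.\-]*):", cleaned)
    if match is None:
        return True  # no scheme at all, i.e. a relative URL
    return match.group(1).lower() in ALLOWED_SCHEMES
```

Checking against an allow-list rather than searching for the word javascript means mixed-case and entity-encoded variants are rejected without having to enumerate them.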
It's pretty hard to do, as there are lots of odd and tricky ways to sneak JavaScript in.
HTMLPurifier has pretty complex parsing to filter out all potentially unsafe HTML if you must allow HTML input in the first place.
However, generally you shouldn't even try to do that, and simply always escape the string.
In PHP that is:
echo htmlspecialchars($string);
In JS you can use document.createTextNode() or jQuery's equivalent $(el).text() to safely insert text into DOM (those two methods don't require escaping).