We are using the cfsavecontent tag and then publishing the result to a PDF file. Certain characters seem to truncate the text that follows them.
These are the characters we have seen so far that cause the truncation:
= > < 1
We have tried using this expression:
REReplace(data,'<[^>]*>','','all')
<cfsavecontent variable="Abstract">
</cfsavecontent>
Use mimetype="text/plain" with your cfdocument to preserve text. cfdocument defaults to text/html (HTML), which is the reason why characters like < and > mess with your content.
Alternatively you can encode your content for HTML. There's htmlEditFormat (CF9-) and encodeForHtml (CF10+) to do so, e.g. <cfset Abstract = htmlEditFormat(Abstract)>.
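Both options might look roughly like the sketch below. The sample text and the output file names are made up for illustration, and encodeForHtml assumes CF10 or later (swap in htmlEditFormat on older versions).

```cfml
<!--- Hypothetical sample text containing characters that HTML treats as markup --->
<cfsavecontent variable="Abstract">Results: p < 0.05 and n >= 12</cfsavecontent>

<!--- Option 1: tell cfdocument that the content is plain text, not HTML --->
<cfdocument format="pdf" mimetype="text/plain"
            filename="#expandPath('./abstract-plain.pdf')#" overwrite="true">
<cfoutput>#Abstract#</cfoutput>
</cfdocument>

<!--- Option 2: keep the default HTML rendering, but encode the text first --->
<cfdocument format="pdf"
            filename="#expandPath('./abstract-html.pdf')#" overwrite="true">
<cfoutput><p>#encodeForHtml(Abstract)#</p></cfoutput>
</cfdocument>
```

Option 2 is probably the better fit if the PDF also needs real HTML formatting around the encoded text.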
I'm building a form through which users will be able to submit articles. My current regex allows only certain characters and works as it should, but I do not know how to allow quotes as well. Here is the code:
<cfif refind ("[^A-Z a-z 0-9\\+\-\?\!\.\,\(\)]+", trim(form.articleText)) and len (trim(form.articleText)) gte 15>
<cfset msg = "The article can not contain special characters.">
</cfif>
I tried using " as in C#, but it does not work!
Add quotes in your character class:
<cfif refind ("[^A-Za-z0-9 +?!.,()\\""'-]+", trim(form.articleText)) ...
anubhava's answer gives you what you've asked for, but the solution you probably need is actually something completely different: to use the ESAPI encodeForX functions in CF10 to encode the output appropriately for its context, such as encodeForHtml, instead of trying to restrict what characters can be written and constantly having to update it.
At most, you might want something such as:
<cfif refind('[[:cntrl:]]',form.articleText) >
<cfset msg = "The article can not contain control characters.">
</cfif>
Which will block unprintable control characters, whilst not preventing perfectly reasonable characters such as accented letters, currency symbols, and so on.
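A minimal sketch of the encode-on-output approach, assuming CF10+ and that the article is written into an HTML page. The sample article text and the div class are hypothetical.

```cfml
<!--- Hypothetical submitted text; in the real form this would be form.articleText --->
<cfset articleText = trim('He said "hello" & left <b>quickly</b>.')>

<cfoutput>
    <!--- Quotes, ampersands and angle brackets are encoded, so they display literally --->
    <div class="article">#encodeForHtml(articleText)#</div>
</cfoutput>
```

One thing to watch if you also keep the [[:cntrl:]] check: that class matches tabs and line breaks too, so text submitted from a multi-line textarea would be rejected.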
I came across a post on Meta SO and I'm curious about what are the subtle differences between un-encoded and encoded HTML characters, in HTML attributes, in contexts of: security, best-practice and browser support.
HTML encoding replaces certain characters that are semantically meaningful in HTML markup, with equivalent characters that can be displayed to the user without affecting parsing the markup.
The most significant and obvious characters are <, >, &, and ", which are replaced with &lt;, &gt;, &amp;, and &quot;, respectively. Additionally, an encoder may replace high-order characters with the equivalent HTML entity encoding, so content can be preserved and properly rendered even in the event the page is sent to the browser as ASCII.
HTML attribute encoding, on the other hand, only replaces a subset of those characters that are important to prevent a string of characters from breaking the attribute of an HTML element. Specifically, you'd typically just replace ", &, and < with &quot;, &amp;, and &lt;. This is because the nature of attributes, the data they contain, and how they are parsed and interpreted by a browser or HTML parser is different than how an HTML document and its elements are read.
In terms of how that relates to XSS, you want to properly sanitize strings from an outside source (such as the user) so they don't break your page, or more importantly, inject markup and script that can alter or destroy your application or affect your users' machines (by taking advantage of browser or platform vulnerabilities).
If you want to display user-generated content in your page, you'd HTML encode the string and then display it in your markup, and everything they entered will be displayed literally without worrying about XSS or broken markup.
If you needed to attach user-generated content to an element in an attribute (for example, a tooltip on a link), you'd attribute encode to make sure the content doesn't break the element's markup.
Could you just use the same function for HTML encoding to handle attribute encoding? Technically, yes. In the case of the meta question you linked, it sounds like they were taking HTML that was encoded and decoding it, then using that result as an attribute value, which results in encoded markup being displayed literally, if you follow.
I would recommend looking over OWASP XSS Prevention Rules 1 and 2.
A brief summary...
Rule 1 for HTML
Escape the following characters with HTML entity encoding ...
& --> &amp;
< --> &lt;
> --> &gt;
" --> &quot;
' --> &#x27;
/ --> &#x2F;
Rule 2 for HTML Common Attributes
Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute. The reason this rule is so broad is that developers frequently leave attributes unquoted. Properly quoted attributes can only be escaped with the corresponding quote. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and |.
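In ColdFusion 10 and later, the two contexts described above map onto the encodeForHtml and encodeForHtmlAttribute functions. The sketch below is only an illustration; the sample string and the link are made up.

```cfml
<!--- Hypothetical user-supplied text containing markup and quotes --->
<cfset userText = 'Tom & "Jerry" <script>alert(1)</script>'>

<cfoutput>
    <!--- Element content: HTML-encode (Rule 1) --->
    <p>#encodeForHtml(userText)#</p>

    <!--- Attribute value: attribute-encode and always quote the attribute (Rule 2) --->
    <a href="/help" title="#encodeForHtmlAttribute(userText)#">Help</a>
</cfoutput>
```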
I want to get the string between the em tags, including any other HTML inside them.
for example:
<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>
output should be as:
UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown
please help me.
Thanks
Use the regular expression function like this:
REMatch("(?s)<em>.*?</em>", html)
See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html
The (?s) sets the mode to single line, so that the input text is interpreted as one line even if it contains line feeds. As Peter pointed out in a comment, this is not the default and therefore must be set.
The .*? matches all characters in between <em> and </em>. The question mark after the quantifier makes it "non-greedy", so that as few characters as possible are matched. This is needed in case the input html contains something like <em>foo</em><em>bar</em>, where a greedy match would otherwise run from the first <em> to the last </em>.
The returned array contains all matches found, i.e. all texts including html that was in <em> tags.
Note that this could fail in circumstances where </em> also occurs as attribute text and is incorrectly not html-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. javascript script tags etc.). A regex cannot replace a full HTML/XML parser, and if you need 100% accuracy, you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
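Since REMatch returns each match with its surrounding <em> and </em> tags, one way to get only the inner content is to trim the tags off each match afterwards. This is a sketch using the sample input from the question; the variable names are illustrative.

```cfml
<!--- Sample input from the question, with a line feed after the <br /> --->
<cfset html = "<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />#chr(10)#By: K.J.S. McKeown</em>">

<!--- Each match still includes the surrounding <em> and </em> tags --->
<cfset matches = reMatch("(?s)<em>.*?</em>", html)>

<!--- Strip the 4-character opening tag and the 5-character closing tag --->
<cfset inner = []>
<cfloop array="#matches#" index="m">
    <cfset arrayAppend(inner, mid(m, 5, len(m) - 9))>
</cfloop>
```

This only works for bare <em> tags; an <em> with attributes would need a different way of stripping the opening tag.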
If your input is exactly in the format given above, you don't even need regex - just strip the outer tags:
<cfsavecontent variable="Input">[text from above]</cfsavecontent>
<cfset Output = mid( Input, 5, len(Input) - 9 ) />
If your input is more than this (i.e. a significant piece of HTML, or a full HTML document), regex is still not the ideal tool - instead, you should be using an HTML parser, such as JSoup:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Output = jsoup.parse(Input).select('em').html() />
(With CF8, this code requires placing the jsoup JAR file in CF's lib directory, or using a tool such as JavaLoader.)
If you are using jQuery, you can also do this pretty easily.
$("em").html();
Will return all html between the em tags.
See this fiddle
I had to remove any text that followed a particular tag. The HTML content was generated dynamically from a database that caters to 5 different languages, so I only had the div tag to help me. I am not sure why REMatch("(?s).*?", html) did not work for me. However, Ben helped me here (http://www.bennadel.com/blog/769-Learning-ColdFusion-8-REMatch-For-Regular-Expression-Matching.htm). My code looks like this:
<cfset extContentArr = REMatch("(?i)<div class=""inlineBlock"" style=""margin-right:30px;"">.+?</div>",qry_getContent.colval) />
<cfif !ArrayIsEmpty(extContentArr)>
<!--- Loop over the array and do whatever you need with each extract; I just deleted them. --->
</cfif>
I have a simple regex line to extract the src="" value from an image tag:
<cfset variables.attrSrc = REMatch("(?i)src\s*=\s*""[^""]+", variables.myImageTag) />
<!--- REMatch("(?i)src\s*=\s*""[^""]+" --->
However, while this works great, it doesn't appear to work with src='' attributes that use single quotes instead of double.
Ideally, I'd like it to work with both single quotes and double.
Any thoughts?
Thanks,
Michael.
(?i)src\s*=\s*(""[^""]+""|'[^']+')
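Dropped into the original REMatch call, the pattern could be used roughly as follows. The sample tag and the extra step that strips the src= prefix and the quotes are illustrative additions, not part of the answer above.

```cfml
<!--- Hypothetical image tag that uses single quotes --->
<cfset myImageTag = "<img alt='logo' src='images/logo.png'>">

<!--- Match src="..." or src='...' --->
<cfset found = reMatch("(?i)src\s*=\s*(""[^""]+""|'[^']+')", myImageTag)>

<cfif arrayLen(found)>
    <!--- Remove the leading src= and opening quote, and the trailing quote --->
    <cfset srcValue = reReplace(found[1], "(?i)^src\s*=\s*[""']|[""']$", "", "all")>
</cfif>
```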
What do you use? Replace() line break chars with <br>? What about spaces? Maybe replace 2 spaces with &nbsp;?
ParagraphFormat() sucks.
paragraphformat2()? http://www.cflib.org/udf.cfm/paragraphformat2
ReplaceNoCase(someString, "\n", "<br>","all")
One thing you may have to take into account is that different OSes treat line breaks differently. Windows uses CR/LF, while OS X and Unix use LF (classic Mac OS used CR). I have used a code block effectively in the past to manage the different possibilities when it comes to reading in text files. The same principles could apply here. It's not 100% perfect, but on the rare occasions it has failed me it was because of an odd method of how the file was created. I modified it to fit the general idea of what you are going after.
<cfset variables.CRLF = findNoCase(chr(10), variables.textFromTextarea) />
<cfif variables.CRLF>
<cfset variables.textFromTextarea = replaceNoCase(variables.textFromTextarea,"#chr(10)#","<br>","all") />
<cfelse>
<cfset variables.textFromTextarea = replaceNoCase(variables.textFromTextarea,"#chr(13)#","<br>","all") />
</cfif>
The idea here is that you are looking for the LF (chr(10)), which appears in both Windows (CR/LF) and Unix-style line endings. If found, replace on it. Otherwise replace on the CR. Maybe that will work for you.
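An alternative that avoids the branch is to normalize every line-ending style to a single LF first and then replace once. This is only a sketch, and the sample string is made up.

```cfml
<!--- Hypothetical text mixing Windows (CR/LF), old Mac (CR) and Unix (LF) line endings --->
<cfset text = "line one" & chr(13) & chr(10) & "line two" & chr(13) & "line three" & chr(10) & "line four">

<!--- Collapse CR/LF pairs first so they do not become two breaks --->
<cfset text = replace(text, chr(13) & chr(10), chr(10), "all")>
<!--- Convert any remaining lone CR to LF --->
<cfset text = replace(text, chr(13), chr(10), "all")>
<!--- Every line break is now a single LF; turn each into a <br> --->
<cfset text = replace(text, chr(10), "<br>", "all")>
```

If the text comes from users, encode it for HTML as well; the order of encoding and break replacement matters, because the encoder may rewrite the line-break characters or the inserted tags.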
I needed to convert text from a textarea input to HTML when outputting it on a webpage, preserving the line breaks. So, based on the accepted answer, I simply modified it to this, and it worked:
ReplaceNoCase(someString, Chr(10), "<br />","all")
Hope that helps anyone else.