How do we do regexp with negation with coldfusion? - coldfusion

I'm using Coldfusion.
The following syntax seems to remove all HTML tags for the str variable:
<cfset str = ReReplaceNoCase(str, "<[^>]*(?:>|$)", "", "ALL")>
However, I'd like to keep both <div> and </div> intact. How can I do that?

Instead of a regex, I would recommend using JSoup. It makes parsing and manipulating html fragments much easier.
Download and install JSoup. Create a Whitelist with the tags you wish to keep. Then scrub your html string with JSoup.clean(...):
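// Load the jsoup classes (the jsoup jar must be on the ColdFusion class path)
jsoup = createObject("java", "org.jsoup.Jsoup");
whiteList = createObject("java", "org.jsoup.safety.Whitelist");
// Start from an empty whitelist, then allow only <div>
cleanString = jsoup.clean( yourHTMLString, whiteList.none().addTags( [ "div" ] ) );
writeDump( cleanString );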
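If a regex is still preferred, a negative lookahead is one way to say "every tag except div". Below is a minimal sketch in Python rather than CFML; the function name strip_tags_except_div is made up for illustration, and I am assuming (not having tested it) that the same pattern carries over to ReReplaceNoCase.

```python
import re

# "<" followed by anything that is NOT "div" or "/div" (whole word),
# up to the closing ">" or the end of the string.
TAG_EXCEPT_DIV = re.compile(r"<(?!/?div\b)[^>]*(?:>|$)", re.IGNORECASE)


def strip_tags_except_div(html):
    """Remove every tag except <div ...> and </div>."""
    return TAG_EXCEPT_DIV.sub("", html)


if __name__ == "__main__":
    print(strip_tags_except_div('<div class="a"><b>Hi</b> <span>there</span></div>'))
```

The \b after div keeps tags such as <divider> from being mistaken for a div. A regex like this still cannot cope with a ">" inside an attribute value, which is why the parser approach above is the safer option.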

Related

Stripping non html tags/text from a string

I have a string which I need to send in an xml node to a third party application. That string is then parsed through a html parser over there. The string can have html, but problem occurs with non html tags. For example
<cfset str = "This mail was <b>sent</b> by Jen Myke <jmyke#mail.com> on June 20th.<br/> Click on <a href='http://google.com'>this link</a> for more information.">
There can be non-utf characters too in the string, which also cause issues, but I found an old blog post which can help remove non-utf.
<cfset str = reReplace(str, "[^\x20-\x7E]", "", "ALL")>
But I am unable to figure out how I can remove html look alikes.
Try wrapping the string with encodeForXML(). This should encode any non-ASCII character for use within an XML node.
<node>#encodeForXml(str)#</node>
If you need to pass data in an attribute, then
<node attr="#encodeForXmlAttribute(str)#"/>
Edit: You can try using getSafeHTML() before encoding the rest of the string. This will remove HTML tags from a string using an XML configuration file to set your AntiSamy settings. Check the docs for more info.
Try replacing
< with &lt;
> with &gt;
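Replacing every < and > would also escape the real HTML tags. One possible refinement is to escape only the tag look-alikes whose name is not on a list of known tags. A rough sketch in Python (not CFML); KNOWN_TAGS and escape_unknown_tags are hypothetical names, and the tag list is only an example that you would need to extend.

```python
import html
import re

# Hypothetical allow-list: anything else inside <...> is treated as plain text.
KNOWN_TAGS = {"a", "b", "br", "i", "p"}

# Something that looks like a tag; group 1 captures the would-be tag name.
TAG_LIKE = re.compile(r"<\s*/?\s*([^\s<>/]*)[^<>]*>")


def escape_unknown_tags(text):
    """Escape <...> sequences whose name is not a known HTML tag."""
    def repl(match):
        name = match.group(1).lower()
        if name in KNOWN_TAGS:
            return match.group(0)  # real tag: leave untouched
        return html.escape(match.group(0), quote=False)  # look-alike: &lt;...&gt;

    return TAG_LIKE.sub(repl, text)


if __name__ == "__main__":
    sample = ("This mail was <b>sent</b> by Jen Myke <jmyke#mail.com> on June 20th.<br/> "
              "Click on <a href='http://google.com'>this link</a>.")
    print(escape_unknown_tags(sample))
```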

Coldfusion Regex to convert a URL to lowercase

I am trying to convert URLs in a block of HTML to ensure they are lowercase.
Some of the links are a mix of uppercase and lowercase and they need to be converted to just lowercase.
It would be impossible to go round the site and redo every link, so I was looking to use a regex when outputting the text.
<p>Hello world Some link.</p>
Needs to be converted to:
<p>Hello world Some link.</p>
Using a ColdFusion Regex such as below (although this doesn't work):
<cfset content = Rereplace(content,'(http[*])','\L\1','All')>
Any help much appreciated.
I think I would use the lower case function, lCase().
Put your URL into a variable, if it's not already:
<cfset MyVar = "http://www.ThisSite.com">
Force it to lower case here:
<cfset MyVar = lCase(MyVar)>
Or here:
<cfoutput>
<a href="#lCase(MyVar)#">Some Link</a>
</cfoutput>
UPDATE: Actually, I see that what you are actually asking is how to generate your entire HTML page (or a big portion) and then go back through it, find all of the links, and then lower their cases. Is that what you are trying to do?
Since you have the HTML stored in a database, there is a bit more work that needs to be done than just using lcase(). I would wrap the functionality into a function that can be easily reused. Check out this code for an example.
content = '<p>Hello world Some link.</p>
<p>Hello world Some link.</p>
<p>Hello world <a href=''http://www.somelink.com/BLARG''>Some link</a>.</p>';
writeDump( content );
writeDump( fixLinks( content ) );
function fixLinks( str ){
    // collect every run of characters starting with "http" up to the next quote
    var links = REMatch( 'http[^"'']*', str );
    for( var link in links ){
        str = replace( str, link, lcase( link ), "ALL" );
    }
    return str;
}
This has only been tested in CF9 & CF10.
Using REMatch() you get an array of matches. You then simply loop over that array and use replace() with lcase() to make the links lowercase.
And...based on Leigh's suggestion, here is a solution in one line of code using REReplace()
REReplace( content, '(http[^"'']*)', '\L\1', 'all' )
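For comparison, here is the same idea sketched in Python, where a replacement callback plays the role of \L. This is only an illustration of the technique and not CFML; lowercase_urls is a made-up name and the sample markup is an assumption.

```python
import re

# Same pattern as above: "http" followed by everything up to the next quote.
URL_IN_QUOTES = re.compile(r"""http[^"']*""")


def lowercase_urls(content):
    """Lowercase every http... run that ends at a single or double quote."""
    return URL_IN_QUOTES.sub(lambda match: match.group(0).lower(), content)


if __name__ == "__main__":
    sample = '<p>Hello world <a href="http://www.SomeLink.com/BLARG">Some link</a>.</p>'
    print(lowercase_urls(sample))
```

As with the one-liner above, this lowercases the whole URL, including any query string, so it is not suitable if part of the URL is case-sensitive.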
Use a HTML parser to parse HTML, not regex.
Here's how you can do it with jQuery:
<!doctype html>
<script src="jquery.js"></script>
<cfsavecontent variable="HtmlCode">
<p>Hello world Some link.</p>
</cfsavecontent>
<pre></pre>
<script>
var HtmlCode = "<cfoutput>#JsStringFormat(HtmlCode)#</cfoutput>";
HtmlCode = jQuery('a[href]',HtmlCode).each( lowercaseHref ).end().html();
function lowercaseHref(index,item)
{
var $item = jQuery(item);
// prevent non-links from being changed
// (alternatively, can check for specific domain, etc)
if ( $item.attr('href').startsWith('#') )
return
$item.attr( 'href' , $item.attr('href').toLowerCase() );
}
jQuery('pre').text(HtmlCode);
</script>
This works for href attributes on a tags, but can of course be updated for other things.
It will ignore in-page links like <a href="#SomeId"> but not stuff like <a href="/HOME/#SomeId"> - if that's an issue you'd need to update the function to exclude page fragment part (e.g. split on # then rejoin, or whatever). Same goes if you might have case-sensitive querystrings.
And of course the above is just jQuery because I felt like it - you could also use a server-side HTML parser, like jSoup to achieve this.

Parse HTML meta tags in Scala

I'm trying to parse meta tags with Scala. I've tried just doing this with XML matching, like
`html // meta ...` etc,
but I'm getting a malformed-XML error because these meta tags on this particular page have no ending tag or ... /> enclosure.
So for the following HTML,
val html = """<meta name="description" content="This is some meta description">"""
I'm using the following regex matcher:
val metaDescription = """.*meta name="Description" content="([^"]+)"""".r
When I try to match with val metaDescription(desc) = html I get a scala.MatchError.
When I try with metaDescription.findAllIn(html) and iterate, I get the whole string--not just the description.
How can I just get the value inside content and nothing else?
EDIT
I got the result I wanted with:
metaDescription.findAllIn(html).matchData foreach {
desc => println(desc.group(1))
}
but that seems like a long way around. Is there a better solution?
Scala XML and TagSoup provides one way to use tag soup directly with Scala XML.
If you are open to alternatives then Scales Xml provides a similar useful approach to parse html via alternative SAX parsers:
val html = loadXmlReader(htmlStream, parsers = AlternateSAXFactoryPool)
Example factories for TagSoup and Nu.Validator are provided at that link.
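The general point is that a lenient HTML parser does not care whether the meta tag is closed. To illustrate that outside Scala, here is a small sketch using Python's standard html.parser module; MetaDescriptionParser and meta_description are names invented for this example.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remembers the content attribute of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute names arrive lowercased
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "description":
            self.description = attr_map.get("content")


def meta_description(html_text):
    """Return the meta description, or None if the page has none."""
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.description


if __name__ == "__main__":
    print(meta_description('<meta name="description" content="This is some meta description">'))
```

The name comparison is case-insensitive here, so name="Description" and name="description" are both found.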

actionscript htmltext. removing a table tag from dynamic html text

AS 3.0 / Flash
I am consuming XML which I have no control over.
the XML has HTML in it which i am styling and displaying in a HTML text field.
I want to remove all the html except the links.
Strip all HTML tags except links
is not working for me.
does any one have any tips? regEx?
the following removes tables.
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
but now I realize I need to keep the content that is in the tables, and I also need the links.
thanks!!!
cp
Probably shouldn't use regex to parse html, but if you don't care, something simple like this:
find /<table\b[^>]*>.*?<\/table\s*>/gs
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
clearTables (xml);
trace (xml.toXMLString()); // will output everything but the tables
function clearTables (xml : XML ) : void {
    // replace every direct <table> child, then recurse into the remaining elements
    xml.replace( "table", "");
    for each (var child:XML in xml.elements("*")) clearTables(child);
}

RegEx to modify urls in htmlText as3

I have some HTML text that I set into a TextField in Flash. I want to highlight links (either in a different colour or just by underlining them) and make sure the link target is set to "_blank".
I am really bad at RegEx. I found a handy expression on RegExr :
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
but I couldn't use it.
What I will be dealing with is this:
<a href="http://randomwebsite.web" />
I will need to do a String.replace()
to get something like this:
<u><a href="http://randomwebsite.web" target="_blank"/></u>
I'm not sure this can be done in one go. Priority is making sure the link has target set to blank.
I do not know how Action Script regexes work, but noting that attributes can appear anywhere in the tag, you can substitute <a target="_blank" href= for every <a href=. Something like this maybe:
var pattern:RegExp = /<a\s+href=/g;
var str:String = "<a href=\"http://stackoverflow.com/\">";
str = str.replace(pattern, "<a target=\"_blank\" href="); // replace() returns a new string
Copied from Adobe docs because I do not know much about AS3 regex syntax.
Now, manipulating HTML through regex is usually very fragile, but I think you can get away with it in this case. First, a better way to style the link would be through CSS, rather than using the <font> tag:
str = str.replace(pattern, "<a style=\"color:#00d\" target=\"_blank\" href=");
To surround the link with other tags, you have to capture everything in <a ...>anchor text</a> which is fraught with difficulty in the general case, because pretty much anything can go in there.
Another approach would be to use:
var start:RegExp = /<a\s+href=/g;
var end:RegExp = /<\/a>/g;
var str:String = "<a href=\"http://stackoverflow.com/\">";
// replace() returns a new string, so assign the result back
str = str.replace(start, "<font color=\"#0000dd\"><a target=\"_blank\" href=");
str = str.replace(end, "</a></font>");
As I said, I have never used AS and so take this with a grain of salt. You might be better off if you have any way of manipulating the DOM.
Something like this might appear to work as well:
var pattern:RegExp = /<a\s+href=(.+?)<\/a>/mg;
...
str = str.replace(pattern,
"<font color=\"#0000dd\"><a target=\"_blank\" href=$1</a></font>");
I recommend this simple test tool:
http://www.regular-expressions.info/javascriptexample.html
Here's a working example with a more complex input string.
var pattern:RegExp = /<a href="([\w:\/.\-]*)"[ ]* \/>/gi; // hyphen escaped so it is a literal, not a range
var str:String = 'hello world <a href="http://www.stackoverflow.com/" /> hello there';
var newstr:String = str.replace(pattern, '<li><a href="$1" target="_blank" /></li>');
trace(newstr);
What about this? I needed this for myself, and it looks for all links (a tags) with or without a target already.
var pattern:RegExp = /<a ( ( [^>](?!target) )* ) ((.+)target="[^"]*")* (.*)<\/a> /xgi;
str = str.replace(pattern, '<a$1$4 target="_blank"$5<\/a>');
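Another way to handle links with or without an existing target is to do it in two steps: match each opening a tag, drop any target attribute it already has, then append target="_blank". The sketch below shows the idea in Python rather than AS3 (the regex syntax is close, but I have not run it in Flash); force_blank_target and the two pattern names are invented for the example.

```python
import re

# Opening <a ...> tag: group 1 = attributes, group 2 = optional " /" before ">".
A_OPEN = re.compile(r"<a\b([^>]*?)(\s*/?)>", re.IGNORECASE)

# An existing target attribute with a double-quoted, single-quoted or bare value.
TARGET = re.compile(r"""\s+target\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def force_blank_target(html):
    """Make every <a> tag carry exactly one target="_blank" attribute."""
    def repl(match):
        attrs = TARGET.sub("", match.group(1))  # drop any old target
        return f'<a{attrs} target="_blank"{match.group(2)}>'

    return A_OPEN.sub(repl, html)


if __name__ == "__main__":
    print(force_blank_target('<a href="http://randomwebsite.web" />'))
    print(force_blank_target('<a target="_self" href="x">t</a>'))
```

Like the other regex answers, this assumes no ">" appears inside an attribute value.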