ColdFusion regex to convert a URL to lowercase

I am trying to convert URLs in a block of HTML to ensure they are lowercase.
Some of the links are a mix of uppercase and lowercase and they need to be converted to just lowercase.
It would be impossible to go round the site and redo every link, so I was looking to use a regex when outputting the text.
<p>Hello world Some link.</p>
Needs to be converted to:
<p>Hello world Some link.</p>
Using a ColdFusion Regex such as below (although this doesn't work):
<cfset content = Rereplace(content,'(http[*])','\L\1','All')>
Any help much appreciated.

I think I would use the lower case function, lCase().
Put your URL into a variable, if it's not already:
<cfset MyVar = "http://www.ThisSite.com">
Force it to lower case here:
<cfset MyVar = lCase(MyVar)>
Or here:
<cfoutput>
Some Link
</cfoutput>
UPDATE: I see that what you are actually asking is how to generate your entire HTML page (or a big portion of it) and then go back through it, find all of the links, and lowercase them. Is that what you are trying to do?

Since you have the HTML stored in a database, there is a bit more work that needs to be done than just using lcase(). I would wrap the functionality into a function that can be easily reused. Check out this code for an example.
content = '<p>Hello world Some link.</p>
<p>Hello world Some link.</p>
<p>Hello world <a href=''http://www.somelink.com/BLARG''>Some link</a>.</p>';
writeDump( content );
writeDump( fixLinks( content ) );
function fixLinks( str ){
var links = REMatch( 'http[^"'']*', str );
for( var link in links ){
str = replace( str, link, lcase( link ), "ALL" );
}
return str;
}
This has only been tested in CF9 & CF10.
Using REMatch() you get an array of matches. You then simply loop over that array and use replace() with lcase() to make the links lowercase.
And...based on Leigh's suggestion, here is a solution in one line of code using REReplace()
REReplace( content, '(http[^"'']*)', '\L\1', 'all' )
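For comparison, here is a minimal JavaScript sketch of the same idea (not ColdFusion): match everything from http up to the next quote and lowercase it in a replacement callback. The function name lowercaseUrls is made up for illustration, and like the regex above it assumes every URL sits inside a quoted attribute.

```javascript
// Hypothetical helper: lowercase every http... run that ends at a quote.
function lowercaseUrls(html) {
  // [^"']* stops at the closing quote of the attribute value.
  return html.replace(/http[^"']*/g, function (url) {
    return url.toLowerCase();
  });
}

console.log(lowercaseUrls('<a href="http://www.SomeLink.com/BLARG">Some link</a>'));
```

Only the matched URL is changed, so the link text keeps its original case.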

Use an HTML parser to parse HTML, not regex.
Here's how you can do it with jQuery:
<!doctype html>
<script src="jquery.js"></script>
<cfsavecontent variable="HtmlCode">
<p>Hello world Some link.</p>
</cfsavecontent>
<pre></pre>
<script>
var HtmlCode = "<cfoutput>#JsStringFormat(HtmlCode)#</cfoutput>";
HtmlCode = jQuery('a[href]',HtmlCode).each( lowercaseHref ).end().html();
function lowercaseHref(index,item)
{
var $item = jQuery(item);
// prevent non-links from being changed
// (alternatively, can check for specific domain, etc)
if ( $item.attr('href').startsWith('#') )
return
$item.attr( 'href' , $item.attr('href').toLowerCase() );
}
jQuery('pre').text(HtmlCode);
</script>
This works for href attributes on a tags, but can of course be updated for other things.
It will ignore in-page links like <a href="#SomeId"> but not stuff like <a href="/HOME/#SomeId"> - if that's an issue you'd need to update the function to exclude the page fragment part (e.g. split on # then rejoin, or whatever). Same goes if you might have case-sensitive querystrings.
And of course the above is just jQuery because I felt like it - you could also use a server-side HTML parser, like jSoup to achieve this.
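One possible way to handle the fragment and querystring caveat is sketched below in plain JavaScript. The helper name lowercaseHrefPath is hypothetical, and it assumes that everything from the first ? or # onwards should keep its case.

```javascript
// Hypothetical helper: lowercase only the part of an href that comes
// before the first "?" or "#", leaving query and fragment untouched.
function lowercaseHrefPath(href) {
  var cut = href.search(/[?#]/);
  if (cut === -1) {
    return href.toLowerCase(); // no query or fragment at all
  }
  return href.slice(0, cut).toLowerCase() + href.slice(cut);
}

console.log(lowercaseHrefPath('/HOME/#SomeId'));
```

Because an in-page link such as #SomeId has nothing before the #, it comes back unchanged, so the separate startsWith check would no longer be needed with this approach.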

Related

How do we do regexp with negation with coldfusion?

I'm using Coldfusion.
The following syntax seems to remove all HTML tags for the str variable:
ReReplaceNoCase(#str#,"<[^>]*(?:>|$)","","ALL")>
However, I'd like to keep both <div> and </div> intact. How can I do that?
Instead of a regex, I would recommend using JSoup. It makes parsing and manipulating html fragments much easier.
Download and install JSoup. Create a Whitelist with the tags you wish to keep. Then scrub your html string with JSoup.clean(...):
jsoup = createObject("java", "org.jsoup.Jsoup");
whiteList = createObject("java", "org.jsoup.safety.Whitelist");
cleanString = jsoup.clean( yourHTMLString, Whitelist.none().addTags( [ "div" ] ));
writeDump( cleanString );
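If a regex is still preferred over a parser, a negative lookahead is one way to express the "negation" asked about. This is a different technique from the JSoup answer above, shown here as a JavaScript sketch with a made-up function name; the same lookahead syntax would need to be checked against ColdFusion's regex engine before relying on it there.

```javascript
// Hypothetical helper: strip every tag except <div ...> and </div>.
function stripTagsExceptDiv(str) {
  // (?!\/?div\b) refuses to match when the tag name is exactly "div",
  // with or without a leading slash.
  return str.replace(/<(?!\/?div\b)[^>]*(?:>|$)/gi, '');
}

console.log(stripTagsExceptDiv('<div><b>Hi</b> <span class="x">there</span></div>'));
```

The \b matters: without it, tags whose names merely start with "div" would also be kept.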

JavaScript client side (user input) find-and-replace Hyperlinks

I want a clientside user to be able to insert text in text input box, click 'replace' and have a list of hyperlinks replaced accordingly. Anchor text will stay the same, but the hyperlink will change.
My problem: I am only getting the first hyperlink to change. I have a fiddle set up with two links, and you can see only the first changes. I want a list of, say, 20 links to change at once.
jsfiddle.net/TKxuf/
HTML:
<input id="replace" type="text" value="newphrase" />
<input onclick="doReplace()" type="button" value="Replace!" />
<br/>
<p id="list">Google Keyword Search</p>
<p id="list">Yahoo Keyword Search</p>
JavaScript:
function doReplace() {
var s = "keyword";
var r = document.getElementById('replace').value;
var oldtext = document.getElementById('list').innerHTML;
var newtext = oldtext.replace( s, r );
console.log(s);
console.log(r);
console.log(document.getElementById('list'));
document.getElementById('list').innerHTML = newtext;
}
I can't work out why you'd want the original strings in the HTML page code to begin with, so I'd suggest that you may have a problem with your approach. Note also that it's illegal in HTML to have more than one element with the same Id, which mostly explains why getElementById only returns one item. Also external urls must be preceded by http:// too.
I usually use jQuery these days - in jQuery you could simply swap id="list" to class="list" and use $('.list') to get a list of them all. $('.list').each(function() { var item = this; /* manipulation code here */ }); would allow you to change them all, but you may have to do some reading.
In any case, I still think that your approach is wrong.
What I'd do is have a normal JavaScript array of URLs, with replaceable keys that are difficult to confuse as part of the URL, e.g.
var addresses = [
{ text: "Google Keyword Search", url: "http://google.com?q=%keyword%" },
{ text: "Yahoo Keyword Search", url: "http://yahoo.com?q=%keyword%" }
];
When your user searches, you then build up your new html code into a string by iterating through the array:
var output = '';
for (var i = 0; i<addresses.length; i++) {
var item = addresses[i];
output += '<p>'+item.text+'</p>';
}
Note: I haven't checked this code, but you should be able to get the idea. You'd then write out the entire list by setting innerHTML on the list container.
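A fuller sketch of that loop, under some assumptions: the function name buildLinks is hypothetical, the user's text is URL-encoded before substitution, and each entry is rendered as a paragraph containing one link.

```javascript
var addresses = [
  { text: "Google Keyword Search", url: "http://google.com?q=%keyword%" },
  { text: "Yahoo Keyword Search", url: "http://yahoo.com?q=%keyword%" }
];

// Hypothetical helper: build the HTML for every link in one pass.
function buildLinks(list, keyword) {
  var output = '';
  for (var i = 0; i < list.length; i++) {
    var item = list[i];
    // Swap the placeholder for the (encoded) text the user typed.
    var href = item.url.replace('%keyword%', encodeURIComponent(keyword));
    output += '<p><a href="' + href + '">' + item.text + '</a></p>';
  }
  return output;
}

console.log(buildLinks(addresses, 'newphrase'));
```

The returned string could then be assigned to the innerHTML of a single container element, which avoids the duplicate-id problem entirely.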
Hope that this helps.
Best Regards,
Mark Rabjohn
Integrated Arts Limited

Trouble With Regex and Velocity

I just started using dotCMS for work to modify the existing website. I am trying to create a widget that takes a custom structure field called urlTitle. It takes the Title of an event and makes it url friendly. Here is a tutorial describing the urlTitle
I have a regex that works fine in JavaScript. My problem is that when I try to use the same regex in Velocity, I run into trouble.
Here is the javascript from the tutorial:
<script>
function updateDisplayURLTitle(){
// get the title entered by the user
var plainTitle = dojo.byId("title");
// make a friendly url
var urlTitle = plainTitle.value.toLowerCase();
urlTitle= urlTitle.replace(/^\s+|\s+$/g,"");
urlTitle = urlTitle.replace(/[^a-zA-Z 0-9]+/g,' ');
urlTitle = urlTitle.replace(/\s/g, "-");
while(urlTitle.indexOf("--") > -1){
urlTitle = urlTitle.replace("--",'-');
}
// set the values of the display place holder and the custom field
// the is to hold the div open
dojo.byId("displayURLTitle").innerHTML = urlTitle;
dojo.byId("urlTitle").value=urlTitle;
}
// attach this the text1 field onchange
dojo.connect(dojo.byId("title"), "onchange", null, "updateDisplayURLTitle");
// populate the field on load
dojo.addOnLoad(updateDisplayURLTitle);
</script>
<div id="displayURLTitle" style="height:20px"> </div>
Then here is my velocity code for my widget:
#set($nowsers = $date.format('yyyyMMddHHmmss', $date.getDate()))
#set($con = $dotcontent.pull("+structureName:calendarEvent +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797 conhost:SYSTEM_HOST) +calendarEvent.startDate:[$nowsers TO 21001010101000]",1,"calendarEvent.startDate"))
<ul>
#foreach($event in $con)
<li>
$event.title
<p> $event.description</p>
</li>
#set($temp = $event.title.toLowerCase())
#set($temp = $temp.replaceAll('/^\s+|\s+$/g', ""))
#set($temp = $temp.replaceAll('/[^a-zA-Z 0-9]+/g', " "))
#set($temp = $temp.replaceAll('/\s/g', "-"))
$temp
$temp
#end
My goal is to have the regex from the JavaScript work in Velocity. Right now it doesn't work, and I'm not that skilled with regex; so far my research has led me nowhere.
Another thing I can't figure out is what the /g does. I can't find it on any regex resource website.
I figured it out. It turns out that the / in front of the regex pattern and the trailing /g were causing the pattern to fail, so they must not be needed for the method used in Velocity.
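Some background on why that works: in JavaScript the slashes only delimit a regex literal, and the trailing g is the global flag, meaning "replace every match". Java's String.replaceAll, which is what Velocity ends up calling on a string, takes the bare pattern as a string and already replaces all matches. The JavaScript sketch below makes the distinction visible by building each regex from a plain pattern string; the function name slugify is made up, and the last step uses \s+ so that runs of spaces collapse to a single dash without the while loop.

```javascript
// Hypothetical helper: the same clean-up, with patterns given as plain
// strings (no surrounding slashes) and the flag passed separately.
function slugify(title) {
  var s = title.toLowerCase();
  s = s.replace(new RegExp("^\\s+|\\s+$", "g"), "");   // trim both ends
  s = s.replace(new RegExp("[^a-zA-Z 0-9]+", "g"), " "); // drop punctuation
  s = s.replace(new RegExp("\\s+", "g"), "-");          // spaces to dashes
  return s;
}

console.log(slugify("  Summer Concert 2013  "));
```

The pattern strings passed to RegExp here are the same text that would go inside the quotes of a replaceAll call.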

RegEx to modify urls in htmlText as3

I have some HTML text that I set into a TextField in Flash. I want to highlight links (either in a different colour or just by underlining them) and make sure the link target is set to "_blank".
I am really bad at RegEx. I found a handy expression on RegExr :
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
but I couldn't use it.
What I will be dealing with is this:
<a href="http://randomwebsite.web" />
I will need to do a String.replace()
to get something like this:
<u><a href="http://randomwebsite.web" target="_blank"/></u>
I'm not sure this can be done in one go. Priority is making sure the link has target set to blank.
I do not know how ActionScript regexes work, but noting that attributes can appear anywhere in the tag, you can substitute <a target="_blank" href= for every <a href=. Something like this maybe:
var pattern:RegExp = /<a\s+href=/g;
var str:String = "<a href=\"http://stackoverflow.com/\">";
str = str.replace(pattern, "<a target=\"_blank\" href="); // replace() returns a new string
Copied from Adobe docs because I do not know much about AS3 regex syntax.
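Since AS3 and JavaScript share the same regex literal syntax, the substitution can be tried out in JavaScript first. This is only a sketch: the function name addTargetBlank is hypothetical, and it assumes href is the first attribute and that the tag has no target attribute yet.

```javascript
// Hypothetical helper: insert target="_blank" in front of href.
function addTargetBlank(html) {
  // replace() returns a new string; the input is not modified.
  return html.replace(/<a\s+href=/g, '<a target="_blank" href=');
}

console.log(addTargetBlank('<a href="http://randomwebsite.web" />'));
```

Tags where another attribute comes before href are left untouched by this pattern, which is one reason the approach is fragile.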
Now, manipulating HTML through regex is usually very fragile, but I think you can get away with it in this case. First, a better way to style the link would be through CSS, rather than using the <font> tag:
str = str.replace(pattern, "<a style=\"color:#00d\" target=\"_blank\" href=");
To surround the link with other tags, you have to capture everything in <a ...>anchor text</a> which is fraught with difficulty in the general case, because pretty much anything can go in there.
Another approach would be to use:
var start:RegExp = /<a\s+href=/g;
var end:RegExp = /<\/a>/g;
var str:String = "<a href=\"http://stackoverflow.com/\">";
// replace() returns a new string, so assign the result back
str = str.replace(start, "<font color=\"#0000dd\"><a target=\"_blank\" href=");
str = str.replace(end, "</a></font>");
As I said, I have never used AS and so take this with a grain of salt. You might be better off if you have any way of manipulating the DOM.
Something like this might appear to work as well:
var pattern:RegExp = /<a\s+href=(.+?)<\/a>/mg;
...
str.replace(pattern,
"<font color=\"#0000dd\"><a target=\"_blank\" href=$1</a></font>");
I recommend this simple test tool:
http://www.regular-expressions.info/javascriptexample.html
Here's a working example with a more complex input string.
// The hyphen goes last in the character class so it is not read as a range
var pattern:RegExp = /<a href="([\w:\/._-]*)"[ ]* \/>/gi;
var str:String = 'hello world <a href="http://www.stackoverflow.com/" /> hello there';
var newstr = str.replace(pattern, '<li><a href="$1" target="_blank" /></li>');
trace(newstr);
What about this? I needed this for myself, and it looks for all links (a-tags) with or without a target already.
var pattern:RegExp = /<a ( ( [^>](?!target) )* ) ((.+)target="[^"]*")* (.*)<\/a> /xgi;
str.replace(pattern, '<a$1$4 target="_blank"$5<\/a>');

The regular expression for finding the image url in <img> tag in HTML using VB .Net code

I want to extract the image URL from any website. I am reading the source through WebRequest. I want a regular expression which will fetch the image URL from this content, i.e. the src value in the <img> tag.
I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
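// HtmlWeb comes from the HtmlAgilityPack library
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
// Select every <img> element that has a src attribute
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var node in nodes)
{
Console.WriteLine(node.GetAttributeValue("src", ""));
}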
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.
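Both the normal case and the edge case can be reproduced with a few lines of JavaScript. The function name extractImgSrc is made up for this sketch; the pattern is the one given above with the global flag added.

```javascript
// Hypothetical helper: collect capture group 1 for every <img> match.
function extractImgSrc(html) {
  var pattern = /<img .*?src=["']?([^'">]+)["']?.*?>/g;
  return Array.from(html.matchAll(pattern), function (m) {
    return m[1];
  });
}

console.log(extractImgSrc('<img class="test" src="/content/img/so/logo.png" alt="logo homepage">'));
console.log(extractImgSrc('<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">'));
```

The second call picks up the text after the first src= it meets, which is the one inside the title attribute. An HTML parser avoids this because it reads attributes by name rather than by position.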