ColdFusion Parse URLs in text - regex

Can anyone help with a function that will parse all urls into valid html links in a text string?
For example:
"Welcome to www.nerds4life.com. View our articles at nerds4life.com or at http://nerds4life.com or also http://www.nerds4life.com"
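would become:
"Welcome to <a href="http://www.nerds4life.com">www.nerds4life.com</a>. View our articles at <a href="http://nerds4life.com">nerds4life.com</a> or at <a href="http://nerds4life.com">http://nerds4life.com</a> or also <a href="http://www.nerds4life.com">http://www.nerds4life.com</a>"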
What would be the best way to approach this: a regex (and if so, how?), or looping through each word in the text (which I would think is less efficient)?
Thanks

Again, there may be a more elegant regex.
Feel free to search for well-tested regexes for finding URLs if this one falls short.
<cfset myText = "Welcome to www.nerds4life.com. View our articles at nerds4life.com or at http://nerds4life.com or also http://www.nerds4life.com or at https://foo.com or http://123.com" />
<cfset myNewText = rereplaceNoCase( myText, '((http(s)?://)?((www\.)?\w+\.\w{2,6}))', '<a href="http\3://\4">\1</a>', 'all' ) />
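For experimenting with the pattern outside ColdFusion, here is a minimal sketch in Python that reuses the same regular expression. The function name linkify and the choice to default to http:// when no scheme is present are assumptions made for illustration, not part of the original answer.

```python
import re

# Same pattern as the rereplaceNoCase() call above:
# optional scheme, optional "www.", then name.tld (tld of 2 to 6 word characters)
URL_PATTERN = re.compile(r'((http(s)?://)?((www\.)?\w+\.\w{2,6}))', re.IGNORECASE)


def linkify(text):
    """Wrap every match in an anchor tag (hypothetical helper)."""
    def to_anchor(match):
        shown = match.group(1)                 # the URL exactly as written in the text
        scheme = match.group(2) or "http://"   # assumption: default to http:// when missing
        host = match.group(4)                  # host part, with "www." if it was present
        return f'<a href="{scheme}{host}">{shown}</a>'

    return URL_PATTERN.sub(to_anchor, text)
```

Note that the pattern is deliberately loose: it stops at the top-level domain, so any path after the host is left outside the link, and it will also match things like file.txt that merely look like a host name.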

This will match any URL in the string that starts with http or www and is terminated by a space:
<cfset myString = "Welcome to www.nerds4life.com. View our articles at nerds4life.com or at http://nerds4life.com or also http://www.nerds4life.com">
<cfset URLinString = rereplaceNoCase(myString, '(((http(s)?://)|(www))\.?(\S+))', '<a href="\1">\1</a>', 'all')>

Related

Retrieving codes within parenthesis in ColdFusion version 4.5

I have a string retrieved from the database that can contain a series of codes in either {} or [] brackets as well as plain, user entered text. For example, each of the following would be possible values:
[code]
[code1][code2]
{code}
{code1}{code2}
{code1} Some user entered text. {code2}{code3} Some more user entered text.
Etc. etc.
What I need to do using ColdFusion is extract the codes within the {} and [] brackets so I can retrieve their descriptions from a database. For example:
{code1} Some user entered text. {code2}{code3} Some more user entered text.
Would become a list similar to:
{code1}|{code2}|{code3}
Normally I could just do something like REMatch but unfortunately I'm stuck doing this on a server running ColdFusion version 4.5 (groan) so my options are limited.
I'm thinking maybe I could do some Replaces on the string to convert it into a pipe-delimited list that I can then easily process, but I'm not sure if there might be a more straightforward approach. I'm not even really sure what a sensible way to do this with Replace would be.
<cfset myString = "{code1} Some user entered text {code2}{code3} More user entered text" />
<cfset myArray = listToArray(myString, "{[") />
<cfloop index="i" from="1" to="#arrayLen(myArray)#">
<cfset myArray[i] = "{" & listFirst(myArray[i], "}]") & "}" />
</cfloop>
<cfdump var="#myArray#" />
<hr>
<cfset myList = arrayToList(myArray, "|") />
<cfdump var="#myList#" />
TryCF.com Gist:
https://trycf.com/gist/6035ddc5cd3daa81bc0943f1af33323a/lucee5?theme=monokai
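For comparison, on an engine where REMatch-style matching is available, the same extraction can be done with a single pattern. The sketch below is in Python rather than CFML (so it does not help on ColdFusion 4.5 itself), and the names CODE_PATTERN and extract_codes are made up for illustration. Unlike the list-based approach above, it keeps each code's original bracket type and skips any plain text that appears before the first code.

```python
import re

# One code is either {...} or [...], with no nested brackets of the same kind inside
CODE_PATTERN = re.compile(r'\{[^{}]*\}|\[[^\[\]]*\]')


def extract_codes(text):
    """Return the bracketed codes in order of appearance, brackets included."""
    return CODE_PATTERN.findall(text)


codes = extract_codes("{code1} Some user entered text. {code2}[code3] Some more user entered text.")
pipe_list = "|".join(codes)  # pipe-delimited list, as asked for in the question
```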

Extract URL from SCRIPT portion of HTML Code with RegEx

I have one URL which is inside a <script> tag and I need to extract that URL:
Using ReMatchNoCase(), I can find the script and place it in an array.
<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>
To extract the URL, I am using the following code
<cfset ulink = reMatchNoCase("<SCRIPT.*?>.*?</SCRIPT>", data)>
<cfset link = Replacenocase(Replace(listLast(ulink[1],'='),'"','','ALL'),';</script>','','all')>
This works, but is there a cleaner way to do this?
Because ReFind/NoCase() is not designed to return the actual substring, this is about as simple as you're going to get.
<cfset data='<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>'>
<cfset ulink = reMatchNoCase("<SCRIPT.*?>.*?</SCRIPT>", data)>
<cfset link = Rematchnocase("http[^""']*",ulink[1])>
<cfoutput>#link[1]#</cfoutput>
That is a little simpler than what you're doing. Alternatively, you could call ReFindNoCase() with subexpressions and pull the URL out with Mid(ulink[1], ...), but that is no simpler.
The regular expression I use to match the URL is very generic, but it should easily do for the task. It simply captures everything until it finds a quote or apostrophe.
I did also think of this
<cfset data='<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>'>
<cfset ulink = rereplacenocase(data,"[\s\S]*?(<script.*?>[\s\S]*?(http[^""']*)[\s\S]*?</script>)[\s\S]*","\2","ALL")>
<cfoutput>#ulink#</cfoutput>
which is possibly better, but it is so much nastier to read and is less reliable for dealing with multiple <script> tags if that should arise.
Personally, I'd go with the first route. With RegEx, sometimes the "lazier" you try to be, the shakier the whole thing becomes. It's best to define the best pattern you can to attain your goal and in ColdFusion, I believe the first route is the best route.
You can do the following:
<cfset data = '<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>' />
<cfset start = REFindNoCase("<script[^>]*>", data) />
<cfset match = REMatchNoCase("https?://[^'""]*(?=.*</script>)(?!.*<script>)", mid(data, start, len(data) - start + 1)) />
In the second line I am finding the position of the <script> open tag (even though not absolutely necessary for this particular piece of data). In the 3rd line I find any URLs within the <script> tag. I use positive lookahead to make sure that there is a </script> end tag following, and negative lookahead to make sure there is not another <script> tag.
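The two-step idea used in the answers above (isolate each script block first, then look for a URL inside it) can be sketched in Python as follows. The function name urls_in_scripts is hypothetical, and the URL pattern is the same generic "everything up to the next quote or apostrophe" rule, so it is a rough illustration rather than a robust extractor.

```python
import re

# Step 1: capture the body of every <script ...>...</script> block, case-insensitively
SCRIPT_BLOCK = re.compile(r'<script[^>]*>(.*?)</script>', re.IGNORECASE | re.DOTALL)
# Step 2: inside a body, a URL runs from http:// or https:// up to the next quote or apostrophe
URL_IN_SCRIPT = re.compile(r'''https?://[^'"]*''', re.IGNORECASE)


def urls_in_scripts(html_text):
    """Return every URL found inside any script block, in document order."""
    urls = []
    for body in SCRIPT_BLOCK.findall(html_text):
        urls.extend(URL_IN_SCRIPT.findall(body))
    return urls
```

Working block by block keeps the behavior predictable when the page contains several script tags, which is the case the single large replace pattern handles poorly.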

remove some code from the url pagination

I have the following code, but I am very weak at regular expressions. I am using ColdFusion
and I want to remove part of the URL before every next-page call:
http://beta.mysite.com/?jobpage=2page=2#brands
What I am trying to do is: if jobpage exists, remove jobpage=2 from the URL. The 2 is dynamic; it can be 1, 2, 3 and so on.
I tried ListFirst, ListLast and GetToken, but none of them helped.
This should do it for you
<cfset myurl = "http://beta.mysite.com/?jobpage=2page=2##brands" />
<cfoutput>#myurl#</cfoutput><br><br>
<cfset myurl = ReReplaceNoCase(myurl,"(jobpage=[0-9]+[\&]?)","","ALL") />
<cfoutput>#myurl#</cfoutput>
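The same substitution, written as a small Python sketch so its behavior can be checked on a few inputs. The function name drop_jobpage is hypothetical; the pattern mirrors the one above (the literal text jobpage=, one or more digits, and an optional trailing ampersand).

```python
import re


def drop_jobpage(url):
    """Remove every "jobpage=<digits>" pair, plus one trailing "&" if present."""
    return re.sub(r'jobpage=[0-9]+&?', '', url, flags=re.IGNORECASE)
```

One limitation worth knowing: when jobpage is the last parameter (for example ?page=2&jobpage=3), the ampersand in front of it is left behind, so the result may need an extra clean-up step.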

How do I replace text in all href attributes of anchor tags?

I need to replace the text inside all href values. I think a regular expression is the way to do it, but I'm no regex pro. Any thoughts on how I'd do the following using ColdFusion?
so it is changed to:
Thanks!
Here's an update to the question: I have this code and need the pattern below:
<cfset matches = ReMatch('<a[^>]*href="http[^"]*"[^>]*>(.+?)</a>', arguments.htmlCode) />
<cfdump var="#matches#">
<cfset links = arrayNew(1)>
<cfloop index="a" array="#matches#">
<cfset arrayAppend(links, rereplace(a, 'need regex'," {clickurl}","all"))>
</cfloop>
<cfdump var="#links#">
Here's how to do it with jSoup HTML parser:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Dom = jsoup.parse( InputHtml ) />
<cfset Dom.select('a[href]').attr('href','{replaced}') />
<cfset NewHtml = Dom.html() />
(On CF9 and earlier, this requires placing the jsoup jar in CF's lib directory, or using JavaLoader.)
Using a HTML parser is usually better than using regex, not least because it's easier to maintain and understand.
Here's an imperfect way of doing it with a regex:
<cfset NewHtml = InputHtml.replaceAll
( '(?<=<a.{0,99}?\shref\s{0,99}?=\s{0,99}?)(?:"[^"]+|''[^'']+)(["'])'
, '$1{replaced}$1'
)/>
Which hopefully demonstrates why using a tool such as jsoup is definitely the way to go...
(btw, the above is using the Java regex engine (via string.replaceAll), so it can use the lookbehind functionality, which doesn't exist in CF's built-in regex (rereplace/rematch/etc))
Update, based on the new code sample you've provided...
Here is an example of how to use jsoup for what you're doing - it might still need some updates (depending on what {clickurl} is eventually going to be doing), but it currently functions the same as your sample code is attempting:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset links = jsoup.parse( Arguments.HtmlCode )
<!--- select all links beginning http and change their href --->
.select('a[href^=http]').attr('href',' {clickurl}')
<!--- get HTML for all links, then split into array. --->
.outerHtml().split('(?<=</a>)(?!$)')
/>
<cfdump var=#links# />
That middle bit is all a single cfset, but I split it up and added comments for clarity. (You could of course do this with multiple variables and 3+ cfsets if you preferred that.)
Again, it's not a regex, because what you're doing involves parsing HTML, and regex is not designed for parsing tag-based syntax, so isn't very good at it - there are too many quirks and variations with HTML and describing them in a single regex gets very complicated very quickly.
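To illustrate the parser-over-regex point without jsoup, here is a rough sketch that uses Python's standard html.parser module instead (a swapped-in tool, not the library recommended above). The class name HrefRewriter, the function rewrite_hrefs and the default "{clickurl}" value are all hypothetical. It rewrites the href of every anchor whose href starts with http and re-emits everything else.

```python
from html import escape
from html.parser import HTMLParser


class HrefRewriter(HTMLParser):
    """Re-emit the HTML it is fed, swapping http(s) hrefs on <a> tags."""

    def __init__(self, new_href):
        super().__init__(convert_charrefs=True)
        self.new_href = new_href
        self.out = []

    def _emit_tag(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            # Only anchors whose href starts with "http" are rewritten
            if tag == "a" and name == "href" and value and value.lower().startswith("http"):
                value = self.new_href
            # Valueless attributes (e.g. "disabled") are emitted as a bare name
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives with entities decoded, so escape it again on the way out
        self.out.append(escape(data, quote=False))


def rewrite_hrefs(html_text, new_href="{clickurl}"):
    parser = HrefRewriter(new_href)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

This is deliberately minimal: comments and doctype declarations are dropped, tag and attribute names come back lowercased, and script or style bodies would be escaped incorrectly. A full HTML parser such as jsoup handles those cases, which is part of why it is the better choice here.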

How to determine if a full name has a space in it?

I have a field where a user can enter their first and last name to fill out my form. Sometimes users enter only their first name, and that results in empty fields in my database. PLEASE keep in mind that I cannot change this method completely because this form is part of a bigger project and it is being used by other websites of my company.
This is the part of the code that I need the validation around. I already have a validation that ensures that the field is not empty, but I need one more to ensure that the field has two items in it separated by a space.
<input name="fullname" class="fullname" type="text" value="#fullname#" maxlength="150"/>
<cfif fullname eq '' and check2 eq 'check2'>
<br /><span style="color:red">*you must enter your full name</span></cfif>
The check2 eq 'check2' test checks whether the form has already been submitted, to guard against a user submitting their data twice.
I thought of using regular expressions to do that, but unfortunately I am not very familiar with how to use regex in CF9 and the online documentation threw me off a bit.
I was also thinking of using "Find" or "FindOneOf". Any thoughts on that?
Also, I am trying to avoid using jQuery, JavaScript, etc., so please try to keep your suggestions based on CF code if possible.
Any help or different suggestions on how to tackle this issue will be very appreciated.
No regex is needed for this. A slightly simpler solution:
<cfset form.fullname = "Dave " />
<cfif listLen(form.fullname," ") GT 1> <!--- space-delimited list, no need for trimming or anything --->
<!--- name has more than one 'piece' -- is good --->
<cfelse>
<!--- name has only one 'piece' -- bad --->
</cfif>
You could do something like this for server side validation:
<cfscript>
TheString = "ronger ddd";
TheString = trim(TheString); // get rid of beginning and ending spaces
SpaceAt = reFind(" ", TheString); // find the index of a space
// no space found -- one word
if (SpaceAt == 0) {
FullNameHasSpace = false;
// at least one space was found -- more than one word
} else {
FullNameHasSpace = true;
}
</cfscript>
<cfoutput>
<input type="text" value="#TheString#">
<cfif FullNameHasSpace eq true>
<p>found space at position #SpaceAt#</p>
<p>Your data is good.</p>
<cfelse>
<p>Did not find a space.</p>
<p>Your data is bad.</p>
</cfif>
</cfoutput>
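Both answers come down to the same test: after ignoring leading and trailing spaces, does the name contain more than one space-separated piece? A minimal sketch of that check in Python, with a hypothetical function name:

```python
def has_first_and_last(full_name):
    """True when the name has at least two whitespace-separated pieces."""
    # split() with no arguments splits on runs of whitespace and ignores
    # leading and trailing spaces, so "Dave " counts as a single piece
    return len(full_name.split()) > 1
```

Like the listLen() version, this only checks the shape of the input. It cannot tell whether the two pieces really are a first and a last name.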