How can I strip this URL of everything before "http://"? - regex

I'm doing some web scraping with ColdFusion and mostly everything is working fine. The only other issues I'm getting is that some URL's come through with text behind them that is now causing errors.
Not sure what's causing it, but it's probably my regex. Anyhow, there's a distinct pattern where text appears before the "http://". I'd like to simply remove everything before it.
Any chance you could help?
Take this string for example:
"I'M OBSESSED WITH MY BEAUTIFUL FRIEND" src="http://scs.viceland.com/feed/images/uk_970014338_300.jpg
I'd much appreciate your help as regex isn't something I've managed to make time for - hopefully I will some day!
Thanks.
EDIT:
Hi,
I thought it might be helpful to post my entire function, since it could be my initial REGEX that is causing the issue. Basically, the funcion takes one argument. In this case, it's the contents of a HTML file (via CFHTTP).
In some cases, every URL looks and works fine. If I try digg.com for example it works...but it dies on something like youtube.com. I guess this would be down to their specific HTML formatting. Either way, all I ever need is the value of the SRC attribute on image tags.
Here's what I have so far:
<cffunction name="extractImages" returntype="array" output="false" access="public" displayname="extractImages">
<cfargument name="fileContent" type="string" />
<cfset var local = {} />
<cfset local.images = [] />
<cfset local.imagePaths = [] />
<cfset local.temp = [] />
<cfset local.images = reMatchNoCase("<img([^>]*[^/]?)>", arguments.fileContent) />
<cfloop array="#local.images#" index="local.i">
<cfset local.temp = reMatchNoCase("(""|')(.*)(gif|jpg|jpeg|png)", local.i) />
<cfset local.path = local.temp />
<cfif not arrayIsEmpty(local.path)>
<cfset local.path = trim(replace(local.temp[1],"""","","all")) />
<cfset arrayAppend(local.imagePaths, local.path) />
</cfif>
<cfif isValid("url", local.path)>
<cftry>
<cfif fileExists(local.path)>
<cfset arrayAppend(local.imagePaths, local.path) />
</cfif>
<cfcatch type="any">
<cfset application.messagesObject.addMessage("error","We were not able to obtain all available images on this page.") />
</cfcatch>
</cftry>
</cfif>
</cfloop>
<cfset local.imagePaths = application.udfObject.removeArrayDuplicates(local.imagePaths) />
<cfreturn local.imagePaths />
</cffunction>
This function WORKS. However, on some sites, not so. It looks a bit over the top but much of it is just certain safeguards to make sure I get valid image paths.
Hope you can help.
Many thanks again.
Michael

Take a look at ReFind() or REFindNoCase() - http://cfquickdocs.com/cf9/#refindnocase
Here is a regex that will work.
<cfset string = 'IM OBSESSED WITH MY BEAUTIFUL FRIEND" src="http://scs.viceland.com/feed/images/uk_970014338_300.jpg' />
<cfdump var="#refindNoCase('https?://[-\w.]+(:\d+)?(/([\w/_.]*)?)?',string, 1, true)#">
You will see a structure returned with a POS and LEN keys. Use the first element in the POS array to see where the match starts, and the first element in the LEN array to see how long it is. You can then use these values in the Mid() function to grab just that matching URL.

I'm not familiar with ColdFusion, but it seems to me that you just need a regex that looks for http://, then any number of characters, then the end of the string.

Related

Problem with anchor links using resolveurl

I'm using <cfhttp> to pull in content from another site (coldfusion) and resolveurl="true" so all the links work. The problem I'm having is resolveurl is making the anchor links (href="#search") absolute links as well breaking them. My question is is there a way to make resolveurl="true" bypass anchor links somehow?
For starters, let's use the tutorial code from Adobe.com posted in the comments. You'll want to do something similar.
<cfhttp url="https://www.adobe.com"
method="get" result="httpResp" timeout="120">
<cfhttpparam type="header" name="Content-Type" value="application/json" />
</cfhttp>
<cfscript>
// Find all the URLs in a web page retrieved via cfhttp
// The search is case sensitive
result = REMatch("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?", httpResp.Filecontent);
</cfscript>
<!-- Now, Loop through those URLs--->
<cfoutput>
<cfloop array="#result#" item="item" index="index">
<cfif LEFT(item, 1) is "##">
<!---Your logic if it's just an anchor--->
<cfelse>
<!---Your logic if it's a full link--->
</cfif>
<br/>
</cfloop>
</cfoutput>
If it tries to return a full URL before the anchor as you say, (I've been getting inconsistent results with resolveurl="true") hit it with this to only grab the bit you want.
<cfoutput>
<cfloop array="#result#" item="item" index="index">
#ListLast(item, "##")#
</cfloop>
</cfoutput>
What this code does is grab all the URLs, and parse them for anchors.
You'll have to decide what to do next inside your loop. Maybe preserve the values and add them to a new array, so you can save it somewhere with the links fixed?
It's impossible to assume in a situation like this.
There does not appear to be a way to prevent CF from resolving the hashes. In our usage of it the current result is actually beneficial since when we present content from another site we usually want the user to be sent there.
Here is a way to replace link href values with just anchor if one is present using regular expressions. I'm sure there are combinations of issues that could occur here if really malformed html.
<cfsavecontent variable="testcontent">
<strong>test</strong>
go to google
go to section
</cfsavecontent>
<cfset domain = replace("current.domain", ".", "\.", "all") />
<cfset match = "(href\s*=\s*(""|'))\s*(http://#domain#[^##'""]+)(##[^##'""]+)\s*(""|')" />
<cfset result = reReplaceNoCase(testcontent, match, "\1\4\6", "all") />
<cfoutput><pre>#encodeForHTML(result)#</pre></cfoutput>
Output
<strong>test</strong>
go to google
<a href="#section>go to section</a>
Another option if you are displaying the content in a normal page with js/jquery available is to run through each link on display and update it to just be the anchor. This will be less likely error with malformed html. Let me know if you have any interest in that approach.

Convert Special Characters to HTML - ColdFusion

I need to convert a lot of special characters to their html format and I am trying to do this with a function that is using ReplaceList but something is wrong with the function or the values I am passing to it.
This is the function
<cffunction name="HtmlUnEditFormat" access="public" returntype="string" output="no" displayname="HtmlUnEditFormat" hint="Undo escaped characters">
<cfargument name="str" type="string" required="Yes" />
<cfscript>
var lEntities = "&##xE7;,&##xF4;,&##xE2;,Î,Ç,È,Ó,Ê,&OElig,Â,«,»,À,É,≤,ý,χ,∑,′,ÿ,∼,β,⌈,ñ,ß,„,´,·,–,ς,®,†,⊕,õ,η,⌉,ó,­,>,φ,∠,‏,α,∩,↓,υ,ℑ,³,ρ,é,¹,<,¢,¸,π,⊃,÷,ƒ,¿,ê, ,∅,∀, ,γ,¡,ø,¬,à,ð,ℵ,º,ψ,⊗,δ,ö,°,≅,ª,‹,♣,â,ò,ï,♦,æ,∧,◊,è,¾,&,⊄,ν,“,∈,ç,ˆ,©,á,§,—,ë,κ,∉,⌊,≥,ì,↔,∗,ô,∞,¦,∫,¯,½,¤,≈,λ,⁄,‘,…,œ,£,♥,−,ã,ε,∇,∃,ä,μ,¼, ,≡,•,←,«,‾,∨,€,µ,≠,∪,å,ι,í,⊥,¶,→,»,û,ο,‚,ϑ,∋,∂,”,℘,‰,²,σ,⋅,š,¥,ξ,±,ℜ,þ,〉,ù,√,‍,∴,↑,×, ,θ,⌋,⊂,⊇,ü,’,ζ,™,î,ϖ,‌,〈,˜,ú,¨,∝,ϒ,ω,↵,τ,⊆,›,∏,",‎,♠";
var lEntitiesChars = "ç,ô,â,Î,Ç,È,Ó,Ê,Œ,Â,«,»,À,É,?,ý,?,?,?,Ÿ,?,?,?,ñ,ß,„,´,·,–,?,®,‡,?,õ,?,?,ó,­,>,?,?,?,?,?,?,?,?,³,?,é,¹,<,¢,¸,?,?,÷,ƒ,¿,ê,?,?,?,?,?,¡,ø,¬,à,ð,?,º,?,?,?,ö,°,?,ª,‹,?,â,ò,ï,?,æ,?,?,è,¾,&,?,?,“,?,ç,ˆ,©,á,§,—,ë,?,?,?,?,ì,?,?,ô,?,¦,?,¯,½,¤,?,?,?,‘,…,œ,£,?,?,ã,?,?,?,ä,?,¼, ,?,•,?,«,?,?,€,µ,?,?,å,?,í,?,¶,?,»,û,?,‚,?,?,?,”,?,‰,²,?,?,š,¥,?,±,?,þ,?,ù,?,?,?,?,×,?,?,?,?,?,ü,’,?,™,î,?,?,?,˜,ú,¨,?,?,?,?,?,?,›,?,"",?,?";
</cfscript>
<cfreturn ReplaceList(arguments.str, lEntities, lEntitiesChars) />
</cffunction>
This is how I am calling it:
<cfoutput>
<cfloop query="local.q" startrow="2">
#HtmlUnEditFormat(consultServiceType)# <br />
</cfloop>
</cfoutput>
These are the strings I am passing to it:
Security?
Security Guard®
Alarm System©
Private Investigator;
I am not getting any errors back (I had a cftry in the function before) and the strings come back the same
EDIT:
I've tried using #FindNoCase('©',consultServiceType)# and is returning 0 so I guess something is wrong with the string I am passing in?
You're using CF11, did you try EncodeForHTML() ?
The accepted answer is the better approach (don't reinvent the wheel), but your function isn't working because you have lEntities and lEntitiesChars mixed up.
<cffunction name="HtmlUnEditFormat" access="public" returntype="string" output="no" displayname="HtmlUnEditFormat" hint="Undo escaped characters">
<cfargument name="str" type="string" required="Yes" />
<cfscript>
var lEntities = "&##xE7;,&##xF4;,&##xE2;,Î,Ç,È,Ó,Ê,&OElig,Â,«,»,À,É,≤,ý,χ,∑,′,ÿ,∼,β,⌈,ñ,ß,„,´,·,–,ς,®,†,⊕,õ,η,⌉,ó,­,>,φ,∠,‏,α,∩,↓,υ,ℑ,³,ρ,é,¹,<,¢,¸,π,⊃,÷,ƒ,¿,ê, ,∅,∀, ,γ,¡,ø,¬,à,ð,ℵ,º,ψ,⊗,δ,ö,°,≅,ª,‹,♣,â,ò,ï,♦,æ,∧,◊,è,¾,&,⊄,ν,“,∈,ç,ˆ,©,á,§,—,ë,κ,∉,⌊,≥,ì,↔,∗,ô,∞,¦,∫,¯,½,¤,≈,λ,⁄,‘,…,œ,£,♥,−,ã,ε,∇,∃,ä,μ,¼, ,≡,•,←,«,‾,∨,€,µ,≠,∪,å,ι,í,⊥,¶,→,»,û,ο,‚,ϑ,∋,∂,”,℘,‰,²,σ,⋅,š,¥,ξ,±,ℜ,þ,〉,ù,√,‍,∴,↑,×, ,θ,⌋,⊂,⊇,ü,’,ζ,™,î,ϖ,‌,〈,˜,ú,¨,∝,ϒ,ω,↵,τ,⊆,›,∏,",‎,♠";
var lEntitiesChars = "ç,ô,â,Î,Ç,È,Ó,Ê,Œ,Â,«,»,À,É,?,ý,?,?,?,Ÿ,?,?,?,ñ,ß,„,´,·,–,?,®,‡,?,õ,?,?,ó,­,>,?,?,?,?,?,?,?,?,³,?,é,¹,<,¢,¸,?,?,÷,ƒ,¿,ê,?,?,?,?,?,¡,ø,¬,à,ð,?,º,?,?,?,ö,°,?,ª,‹,?,â,ò,ï,?,æ,?,?,è,¾,&,?,?,“,?,ç,ˆ,©,á,§,—,ë,?,?,?,?,ì,?,?,ô,?,¦,?,¯,½,¤,?,?,?,‘,…,œ,£,?,?,ã,?,?,?,ä,?,¼, ,?,•,?,«,?,?,€,µ,?,?,å,?,í,?,¶,?,»,û,?,‚,?,?,?,”,?,‰,²,?,?,š,¥,?,±,?,þ,?,ù,?,?,?,?,×,?,?,?,?,?,ü,’,?,™,î,?,?,?,˜,ú,¨,?,?,?,?,?,?,›,?,"",?,?";
</cfscript>
<cfreturn ReplaceList(arguments.str, lEntitiesChars, lEntities) />
</cffunction>
<cfoutput>#htmluneditformat("Company?")#</cfoutput>
Further, #ReplaceList()# in both ACF and Railo/Lucee recurse through the list, which means the order of the lists matter. With the fix I suggest, ? becomes &le;. A fix to this would be to move & and the code for it to the beginning of each list.
Consider this simple piece of code
<cfoutput>#replacelist("abc","a,b","b,c")#</cfoutput>
You would probably expect the output to be "bcc", but that's not how ReplaceList works, it works something more like this
<cfset sx = "abc">
<cfset listf = "a,b">
<cfset listr = "b,c">
<cfloop from="1" to="#listlen(listf)#" index="i">
<cfset sx = replace(sx,listgetat(listf,i),listgetat(listr,i),"ALL")>
<!--- iteration one replaces a with b to make bbc --->
<!--- iteration two replaces b with c to make ccc --->
</cfloop>
I'm not suggesting that someone use this code when CF has the built in functionality, I'm merely explaining why it doesn't work and a pitfall of ReplaceList().

How can I retrieve a (quasi) Array from a URL string?

I would like to achieve something I can easily do in .net.
What I would like to do is pass multiple URL parameters of the same name to build an array of those values.
In other words, I would like to take a URL string like so:
http://www.example.com/Test.cfc?method=myArrayTest&foo=1&foo=2&foo=3
And build an array from the URL parameter "foo".
In .net / C# I can do something like this:
[WebMethod]
myArrayTest(string[] foo)
And that will build a string array from the variable "foo".
What I have done so far is something like this:
<cffunction name="myArrayTest" access="remote" returntype="string">
<cfargument name="foo" type="string" required="yes">
This would output:
1,2,3
I'm not thrilled with that because it's just a comma separated string and I'm afraid that there may be commas passed in the URL (encoded of course) and then if I try to loop over the commas it may be misinterpreted as a separate param.
So, I'm stumped on how to achieve this.
Any ideas??
Thanks in advance!!
Edit: Sergii's method is more versatile. But if you are parsing the current url, and do not need to modify the resulting array, another option is using getPageContext() to extract the parameter from the underlying request. Just be aware of the two quirks noted below.
<!--- note: duplicate forces the map to be case-INsensitive --->
<cfset params = duplicate(getPageContext().getRequest().getParameterMap())>
<cfset quasiArray = []>
<cfif structKeyExists(params, "foo")>
<!--- note: this is not a *true* CF array --->
<!--- you can do most things with it, but you cannot append data to it --->
<cfset quasiArray = params["foo"]>
</cfif>
<cfdump var="#quasiArray#">
Well, if you're OK with parsing the URL, following "raw" method may work for you:
<cffunction name="myArrayTest" access="remote" output="false">
<cfset var local = {} />
<!--- parse raw query --->
<cfset local.args = ListToArray(cgi.QUERY_STRING, "&") />
<!--- grab only foo's values --->
<cfset local.foo = [] />
<cfloop array="#local.args#" index="local.a">
<cfif Left(local.a, 3) EQ "foo">
<cfset ArrayAppend(local.foo, ListLast(local.a, "=")) />
</cfif>
</cfloop>
<cfreturn SerializeJSON(local.foo) />
</cffunction>
I've tested it with this query: ?method=myArrayTest&foo=1&foo=2&foo=3,3, looks to work as expected.
Bonus. Railo's top tip: if you format the query as follows, this array will be created automatically in URL scope ?method=myArrayTest&foo[]=1&foo[]=2&foo[]=3,3.
listToArray( arguments.foo ) should give you what you want.

Use coldfusion to get x number of words total including keyword

Okay this is part of my search results project, in it, I have description being returned from multiple tables. All of that part works 100%.
I currently use a trim_text function, which I pass a string, and how many words I want to keep.
However, now I need to modify it to make sure the keyword/search term is in the returning description to help show the validity of that in the search results.
Here below is the existing trim_text function, that I need your help to modify.
<cffunction name="trim_text" output="false" access="remote" returntype="string">
<cfargument name="string" type="string" required="true">
<cfargument name="word_limit" type="integer" required="false">
<cfparam name="word_limit" default=20>
<cfparam name="snippet" default="">
<cfparam name="return_string" default="">
<cfset return_string = "">
<cfset return_string = reReplace( string, "</?\w+(\s*[\w:]+\s*=\s*(""[^""]*""|'[^']*'))*\s*/?>", " ", "all" ) />
<cfset return_string = reReplace( trim( return_string ), "\s+", " ", "all" ) />
<cfset snippet = reMatch( "([^\s]+\s?){1,#word_limit#}", return_string ) />
<cfif !arrayLen( snippet )>
<cfreturn "" />
</cfif>
<cfset charCount = listlen(snippet[1]) />
<cfset wordCount = ( (word_limit * (arrayLen( snippet ) - 1)) + listLen( snippet[ arrayLen( snippet ) ], " " ) ) />
<cfif charCount gt 190>
<cfreturn left(snippet[1],190) & "..." />
</cfif>
<cfset return_string = snippet[1] & "..." />
<cfreturn return_string />
</cffunction>
So my end goal is a description that contains the keyword.
So for example.
Let us say I am searching for the keyword 'business'
And I get the correct search result, however the description doesn't have that word in the description shown, since we are limiting the description to 25 words, via the trim_text function. It makes all the descriptions look similar in size. But doesn't help prove the validity of results where the keyword is further down in the description.
Any questions? I hope I made this very clear.
I am using Coldfusion 8 Standard. I am testing this on my development server.
Thank You...
Sounds like you need to find the position of the keyword in the string, and then take the characters either side.
Treat your string as a list, and use whitespace characters and punction as the delimters.
Something like this:
<cfset wordFoundPos = listFindNoCase(string, searchTerm, " ,.-:;") />
Say that returns 42 - that is, the searchTerm is the 42nd word.
Convert that to a character position like so:
<cfset charPos = findnocase(1, string, searchTerm) />
Then grab the characters either side of that character:
<cfset context = mid(190, charPos-90, string) />
You'll need to detect when the searchterm is found too close to the start or end of the string to avoide errors, and to work out when to append and/or prepend ellipses to the context.

Resolving variables inside a Coldfusion string

My client has a database table of email bodies that get sent at certain times to customers. The text for the emails contains ColdFusion expressions like Dear #firstName# and so on. These emails are HTML - they also contain all sorts of HTML mark-up. What I'd like to do is read that text from the database into a string and then have ColdFusion Evaluate() that string to resolve the variables. When I do that, Evaluate() throws an exception because it doesn't like the HTML markup in there (I also tried filtering the string through HTMLEditFormat() as an intermediate step for grins but it didn't like the entities in there).
My predecessor solved this problem by writing the email text out to a file and then cfincluding that. It works. It's seems really hacky though. Is there a more elegant way to handle this using something like Evaluate that I'm not seeing?
What other languages often do that seems to work very well is just have some kind of token within your template that can be easily replaced by a regular expression. So you might have a template like:
Dear {{name}}, Thanks for trying {{product_name}}. Etc...
And then you can simply:
<cfset str = ReplaceNoCase(str, "{{name}}", name, "ALL") />
And when you want to get fancier you could just write a method to wrap this:
<cffunction name="fillInTemplate" access="public" returntype="string" output="false">
<cfargument name="map" type="struct" required="true" />
<cfargument name="template" type="string" required="true" />
<cfset var str = arguments.template />
<cfset var k = "" />
<cfloop list="#StructKeyList(arguments.map)#" index="k">
<cfset str = ReplaceNoCase(str, "{{#k#}}", arguments.map[k], "ALL") />
</cfloop>
<cfreturn str />
</cffunction>
And use it like so:
<cfset map = { name : "John", product : "SpecialWidget" } />
<cfset filledInTemplate = fillInTemplate(map, someTemplate) />
Not sure you need rereplace, you could brute force it with a simple replace if you don't have too many fields to merge
How about something like this (not tested)
<cfset var BaseTemplate = "... lots of html with embedded tokens">
<cfloop (on whatever)>
<cfset LoopTemplate = replace(BaseTemplate, "#firstName#", myvarforFirstName, "All">
<cfset LoopTemplate = replace(LoopTemplate, "#lastName#", myvarforLastName, "All">
<cfset LoopTemplate = replace(LoopTemplate, "#address#", myvarforAddress, "All">
</cfloop>
Just treat the html block as a simple string.
CF 7+: You may use regular expression, REReplace()?
CF 9: use Virtual File System
If the variable is in a structure from, something like a form post, then you can use "StructFind". It does exactly as you request. I ran into this issue when processing a form with dynamic inputs.
Ex.
StructFind(FORM, 'WhatYouNeed')