Extract URL from SCRIPT portion of HTML Code with RegEx

I have one URL which is inside a <script> tag and I need to extract that URL:
Using ReMatchNoCase(), I can find the script and place it in an array.
<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>
To extract the URL, I am using the following code
<cfset ulink = reMatchNoCase("<SCRIPT.*?>.*?</SCRIPT>", data)>
<cfset link = Replacenocase(Replace(listLast(ulink[1],'='),'"','','ALL'),';</script>','','all')>
This works, but is there a cleaner way to do this?

Because ReFind/NoCase() is not designed to return the actual substring, this is about as simple as you're going to get.
<cfset data='<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>'>
<cfset ulink = reMatchNoCase("<SCRIPT.*?>.*?</SCRIPT>", data)>
<cfset link = Rematchnocase("http[^""']*",ulink[1])>
<cfoutput>#link[1]#</cfoutput>
Which is a little simpler than what you're doing. Alternatively, you could use Mid(ulink[1]...) with the subexpression positions returned by ReFindNoCase(), but that is no simpler.
The regular expression I use to match the URL is very generic, but it should easily do for the task. It simply matches everything from "http" up to the next quote or apostrophe.
I did also think of this
<cfset data='<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>'>
<cfset ulink = rereplacenocase(data,"[\s\S]*?(<script.*?>[\s\S]*?(http[^""']*)[\s\S]*?</script>)[\s\S]*","\2","ALL")>
<cfoutput>#ulink#</cfoutput>
which is possibly better, but it is so much nastier to read and is less reliable for dealing with multiple <script> tags if that should arise.
Personally, I'd go with the first route. With RegEx, sometimes the "lazier" you try to be, the shakier the whole thing becomes. It's best to define the best pattern you can to attain your goal and in ColdFusion, I believe the first route is the best route.
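As a rough sketch of the first route wrapped up for reuse: the helper below applies the same two patterns to every <script> block, which also covers the multiple-tag case. The name extractScriptUrls is made up for illustration (it is not a built-in), and the merge argument of arrayAppend() assumes CF10+ or Lucee.

```cfml
<cfscript>
// Hypothetical helper (not a built-in): collects URLs from every <script> block.
function extractScriptUrls(required string html) {
    var urls = [];
    // Step 1: isolate each <script>...</script> block, case-insensitively.
    var blocks = reMatchNoCase("<script[^>]*>[\s\S]*?</script>", arguments.html);
    for (var block in blocks) {
        // Step 2: inside the block, match from "http" up to the next quote or apostrophe.
        // The third argument (true) merges the matches into urls instead of nesting them.
        arrayAppend(urls, reMatchNoCase("http[^""']*", block), true);
    }
    return urls;
}

data = '<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>';
links = extractScriptUrls(data);
writeOutput(links[1]);
</cfscript>
```

For the sample string this should give a one-element array holding the URL. The patterns are the same generic ones used above, so anything else in a script block that starts with "http" will be picked up as well.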

You can do the following:
<cfset data = '<SCRIPT LANGUAGE="JavaScript" > //alert("a chance stuff"); document.location.href="https://mypage.cfm"; </SCRIPT>' />
<cfset start = REFindNoCase("<script[^>]*>", data) />
<cfset match = REMatchNoCase("https?://[^'""]*(?=.*</script>)(?!.*<script[^>]*>)", mid(data, start, len(data) - start + 1)) />
In the second line I am finding the position of the <script> open tag (even though not absolutely necessary for this particular piece of data). In the third line I find any URLs within the <script> tag. I use positive lookahead to make sure that there is a </script> end tag following, and negative lookahead to make sure there is not another <script> open tag (with or without attributes) after it.

Related

Problem with anchor links using resolveurl

I'm using <cfhttp> to pull in content from another site (ColdFusion) with resolveurl="true" so that all the links work. The problem is that resolveurl also turns the anchor links (href="#search") into absolute links, which breaks them. Is there a way to make resolveurl="true" bypass anchor links?
For starters, let's use the tutorial code from Adobe.com posted in the comments. You'll want to do something similar.
<cfhttp url="https://www.adobe.com"
method="get" result="httpResp" timeout="120">
<cfhttpparam type="header" name="Content-Type" value="application/json" />
</cfhttp>
<cfscript>
// Find all the URLs in a web page retrieved via cfhttp
// The search is case sensitive
result = REMatch("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?", httpResp.Filecontent);
</cfscript>
<!--- Now, loop through those URLs --->
<cfoutput>
<cfloop array="#result#" item="item" index="index">
<cfif LEFT(item, 1) is "##">
<!---Your logic if it's just an anchor--->
<cfelse>
<!---Your logic if it's a full link--->
</cfif>
<br/>
</cfloop>
</cfoutput>
If it returns a full URL before the anchor, as you say (I've been getting inconsistent results with resolveurl="true"), use this to grab only the bit you want.
<cfoutput>
<cfloop array="#result#" item="item" index="index">
#ListLast(item, "##")#
</cfloop>
</cfoutput>
What this code does is grab all the URLs, and parse them for anchors.
You'll have to decide what to do next inside your loop. Maybe preserve the values and add them to a new array, so you can save it somewhere with the links fixed?
It's impossible to say which is right without knowing more about your situation.
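One way to do the "add them to a new array" step, as a sketch only: the loop below keeps just the fragment of any URL that has one. The result array here is stand-in sample data (the example.com URLs are invented), since the real contents depend on what cfhttp returns.

```cfml
<cfscript>
// Stand-in for the REMatch() result above; these URLs are invented for illustration.
result = ["https://example.com/page##search", "https://example.com/other"];
anchors = [];
for (item in result) {
    // A literal hash sign has to be doubled inside a CFML string.
    if (find("##", item)) {
        // Keep only the fragment, restoring the leading hash sign.
        arrayAppend(anchors, "##" & listLast(item, "##"));
    }
}
writeDump(anchors);
</cfscript>
```

URLs without a fragment are skipped here; whether you would rather keep them unchanged is one of the decisions left to your own loop logic.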
There does not appear to be a way to prevent CF from resolving the hashes. In our usage of it the current result is actually beneficial since when we present content from another site we usually want the user to be sent there.
Here is a way to replace link href values with just the anchor, if one is present, using regular expressions. I'm sure there are combinations of issues that could occur with badly malformed HTML.
<cfsavecontent variable="testcontent">
<strong>test</strong>
go to google
go to section
</cfsavecontent>
<cfset domain = replace("current.domain", ".", "\.", "all") />
<cfset match = "(href\s*=\s*(""|'))\s*(http://#domain#[^##'""]+)(##[^##'""]+)\s*(""|')" />
<cfset result = reReplaceNoCase(testcontent, match, "\1\4\5", "all") />
<cfoutput><pre>#encodeForHTML(result)#</pre></cfoutput>
Output
<strong>test</strong>
go to google
<a href="#section">go to section</a>
Another option, if you are displaying the content in a normal page with JS/jQuery available, is to run through each link on display and update it to be just the anchor. This is less likely to break on malformed HTML. Let me know if you have any interest in that approach.

Trying to find absolute string using REFind

We have a list of reserved keywords that we don't want our clients to be able to use in our system. So we perform a search using REFind.
Here is the code:
<cfset reservedKeywords = "stop,end,quit,cancel,help,test">
<cfset foundArray = REFind("(?i)(" & ListChangeDelims(reservedKeywords, "|") & ")"
, form.keyword, 1, true)>
<cfif foundArray.pos[1] gt 0>
<cfoutput>
<script language="JavaScript">
alert('Keyword "#mid(form.keyword, foundArray.pos[1], foundArray.len[1])#" has been reserved.');
history.go(-1);
</script>
</cfoutput>
<cfabort>
</cfif>
So everything works great, but we run into a problem when the submitted keyword contains one of the reserved words. So if "Blended" is submitted, it is flagged as containing the reserved word "end".
Is there a way to perform an absolute search where it takes into account the whole keyword?
I've been trying to edit and play around with the code but just can't get it to work.
Any suggestions would be greatly appreciated.
Thank you!
Use listFindNoCase() instead of REFind().
Your current check asks whether any element of the list occurs anywhere inside form.keyword, which is why 'blended' gets flagged as reserved when it shouldn't be. Instead, you should check whether form.keyword as a whole equals one of the items in the list. It is a subtle but important distinction.
reservedWords = "stop,end,quit,cancel,help,test";
reservedWordUsed = listFindNoCase( reservedWords, form.keyword );
if( reservedWordUsed ){
// do the JS stuff here
}
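If you would rather stay with a regular expression, anchoring the alternation with ^ and $ should give the same whole-word behaviour. This is a sketch rather than the method above: isReservedKeyword is a hypothetical helper name, and it assumes the reserved words contain no regex metacharacters.

```cfml
<cfscript>
// Hypothetical helper: true only when the whole keyword equals a reserved word.
function isReservedKeyword(required string keyword, required string reservedList) {
    // Builds ^(stop|end|quit|...)$ so a partial hit such as "end" in "Blended" is rejected.
    var pattern = "^(" & listChangeDelims(arguments.reservedList, "|") & ")$";
    return reFindNoCase(pattern, trim(arguments.keyword)) gt 0;
}

reservedKeywords = "stop,end,quit,cancel,help,test";
writeOutput(isReservedKeyword("Blended", reservedKeywords)); // not reserved
writeOutput("<br>");
writeOutput(isReservedKeyword("END", reservedKeywords));     // reserved
</cfscript>
```

listFindNoCase() remains the simpler choice; the regex form is mainly useful if you later need extra rules in the same pattern.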

remove some code from the url pagination

I have the following code, but I am very weak with regular expressions. I am using ColdFusion
and I want to remove part of the URL before every next-page call:
http://beta.mysite.com/?jobpage=2page=2#brands
What I am trying to do: if jobpage exists, remove jobpage=2 from the URL. The 2 is dynamic; it can be 1, 2, 3 and so on.
I tried listFirst, listLast and getToken, but they did not help.
This should do it for you
<cfset myurl = "http://beta.mysite.com/?jobpage=2page=2##brands" />
<cfoutput>#myurl#</cfoutput><br><br>
<cfset myurl = ReReplaceNoCase(myurl,"(jobpage=[0-9]+[\&]?)","","ALL") />
<cfoutput>#myurl#</cfoutput>

Parsing og: tags with ColdFusion regex

If one wants to extract/match Open Graph (og:) tags from html, using regex (and ColdFusion 9+), how would one go about doing it?
And the tricky bit is that it has to cover both possible variations of tag formation, as in the following examples:
<meta property="og:type" content="website" />
<meta content="website" property="og:type"/>
So far all I got is this:
<cfset tags = ReMatch('(og:)(.*?)>',html_content)>
It does match both of the links, however only the first type has the content bit returned with it. And content is something that I require.
Just to make it absolutely clear, the desired output should be an array with all of the OG tags (they could be type, image, author, description, etc.). That means it should be flexible and not based on the og:type example alone.
Of course if it's possible, the ideal output would be a struct with the first column being the name of tag, and the second containing the value (content). But that can be achieved with the post processing and is not as important as extracting the tags themselves.
Cheers,
Simon
So you want an array like ['og:author','og:type', 'og:image'...]?
Try using a regex like og:([\w]+)
That should give you a start. You will have duplicates if you have two of the same og:foo meta tags.
You can look at JSoup also to help parse the HTML for you. It makes it a lot easier.
There are a few good blog posts on using it in CFML
jQuery-like parsing in Java
Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup
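To build on the regex hint above, here is a rough two-step sketch that copes with either attribute order: first match each whole <meta> tag that has an og: property, then pull property and content out of that tag separately. parseOgTags is a hypothetical name, the og:title sample value is invented, and the patterns assume quoted attribute values with no embedded quotes or > characters. A parser such as jsoup remains the more robust choice.

```cfml
<cfscript>
// Hypothetical helper: returns a struct keyed by tag name, e.g. og:type -> website.
function parseOgTags(required string html) {
    var og = {};
    var attr = "\s*=\s*[""']";  // shared fragment: "=" plus the opening quote
    // Step 1: every <meta> tag carrying an og: property, whatever the attribute order.
    var metas = reMatchNoCase("<meta\s[^>]*property" & attr & "og:[^>]*>", arguments.html);
    for (var tag in metas) {
        // Step 2: extract each attribute value on its own.
        var prop = reReplaceNoCase(tag, "^[\s\S]*property" & attr & "(og:[^""']*)[\s\S]*$", "\1");
        // Skip tags that have no content attribute at all.
        if (reFindNoCase("\scontent" & attr, tag)) {
            og[prop] = reReplaceNoCase(tag, "^[\s\S]*\scontent" & attr & "([^""']*)[\s\S]*$", "\1");
        }
    }
    return og;
}

html_content = '<meta property="og:type" content="website" /><meta content="Example" property="og:title"/>';
writeDump(parseOgTags(html_content));
</cfscript>
```

Because the result is a struct, a repeated og: property keeps only its last value; collect into an array of structs instead if duplicates matter.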
Ok, so after the suggestion from @abbottmw (thank you very much!), here's the solution:
Download Jsoup jar file from here: http://jsoup.org/download
Then initiate it like this:
<cfhttp url="...." result="oghtml" /> <!--- to get your HTML content --->
<cfscript>
paths = [expandPath("/lib/jsoup.jar")]; // or wherever you decide to place the file
loaderObj = createObject("component", "javaloader.JavaLoader").init(paths);
jsoup = loaderObj.create("org.jsoup.Jsoup");
doc = jsoup.parse(oghtml.fileContent); // cfhttp returns a struct; the HTML is in fileContent
tags = doc.select("meta[property*=og:]");
</cfscript>
<cfloop index="e" array="#tags#">
<cfoutput>
#e.attr("property")# | #e.attr("content")#<br />
</cfoutput>
</cfloop>
And that is it. The complete list of og tags is in the [tags] array.
Of course it's not the regex solution, which was originally requested, but hey, it works!

How do I replace text in all href attributes of anchor tags?

I need to replace the text inside all href values. I think a regular expression is the way to do it, but I'm no regex pro. Any thoughts on how I'd do the following using ColdFusion?
so it is changed to:
Thanks!
Here's an update to the question: I have this code and need the pattern below:
<cfset matches = ReMatch('<a[^>]*href="http[^"]*"[^>]*>(.+?)</a>', arguments.htmlCode) />
<cfdump var="#matches#">
<cfset links = arrayNew(1)>
<cfloop index="a" array="#matches#">
<cfset arrayAppend(links, rereplace(a, 'need regex'," {clickurl}","all"))>
</cfloop>
<cfdump var="#links#">
Here's how to do it with jSoup HTML parser:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Dom = jsoup.parse( InputHtml ) />
<cfset Dom.select('a[href]').attr('href','{replaced}') />
<cfset NewHtml = Dom.html() />
(On CF9 and earlier, this requires placing the jsoup jar in CF's lib directory, or using JavaLoader.)
Using a HTML parser is usually better than using regex, not least because it's easier to maintain and understand.
Here's an imperfect way of doing it with a regex:
<cfset NewHtml = InputHtml.replaceAll
( '(?<=<a.{0,99}?\shref\s{0,99}?=\s{0,99}?)(?:"[^"]+|''[^'']+)(["''])'
, '$1{replaced}$1'
)/>
Which hopefully demonstrates why using a tool such as jsoup is definitely the way to go...
(btw, the above is using the Java regex engine (via string.replaceAll), so it can use the lookbehind functionality, which doesn't exist in CF's built-in regex (rereplace/rematch/etc))
Update, based on the new code sample you've provided...
Here is an example of how to use jsoup for what you're doing - it might still need some updates (depending on what {clickurl} is eventually going to be doing), but it currently functions the same as your sample code is attempting:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset links = jsoup.parse( Arguments.HtmlCode )
<!--- select all links beginning http and change their href --->
.select('a[href^=http]').attr('href',' {clickurl}')
<!--- get HTML for all links, then split into array. --->
.outerHtml().split('(?<=</a>)(?!$)')
/>
<cfdump var="#links#" />
That middle bit is all a single cfset, but I split it up and added comments for clarity. (You could of course do this with multiple variables and 3+ cfsets if you preferred that.)
Again, it's not a regex, because what you're doing involves parsing HTML, and regex is not designed for parsing tag-based syntax, so isn't very good at it - there are too many quirks and variations with HTML and describing them in a single regex gets very complicated very quickly.