Using Jsoup (the Java html parser) with Handlebarsjs

I'm using jsoup, and when I display the content it returns I get:
{{#ifcond="" pagetitle="" this.name}}
I read it with local.htmlObj.body().html().
I need it to be returned as:
{{#ifCond PAGETITLE this.NAME}}
Here is what I'm doing:
<cfset paths = [] />
<cfset paths[1] = expandPath("/javaloader/lib/jsoup-1.7.1.jar") />
<cfset loader = createObject("component", "javaloader.JavaLoader").init( paths ) />
<cfset local.jsoupObj = loader.create( "org.jsoup.Jsoup" ) />
<cfset local.htmlObj = local.jsoupObj.parse( local.template ) />
<cfloop array="#local.htmlObj.select('.sidebar_left')#" index="element">
<cfif element.attr('section') EQ "test">
<cfset element.append('HTML HERE') />
</cfif>
</cfloop>
local.template is my template, assembled from many different Handlebars files that I'm pulling from different places. I'm constructing one Handlebars file that gets returned.

The problem is that jsoup tries to repair invalid HTML while parsing, before it lets you access the document. A slightly easier-to-understand example of this behavior can be seen if you parse the following HTML (taken from another question):
<p>
<table>[...]</table>
</p>
It will return:
<p></p>
<table>[...]</table>
In your case the Handlebars code is seen as attributes, which in valid HTML always have a value (think checked="checked"), so jsoup lowercases the names and adds empty values. As far as I can tell there is no way to disable this behavior. It's really the wrong tool for the job you are trying to do. A cleaner approach would be to read the document as a stream, save it to a string, and work on that string directly.

Related

Problem with anchor links using resolveurl

I'm using <cfhttp> to pull in content from another site (ColdFusion) with resolveurl="true" so all the links work. The problem is that resolveurl also turns the anchor links (href="#search") into absolute links, which breaks them. Is there a way to make resolveurl="true" skip anchor links?

For starters, let's use the tutorial code from Adobe.com posted in the comments. You'll want to do something similar.
<cfhttp url="https://www.adobe.com"
method="get" result="httpResp" timeout="120">
<cfhttpparam type="header" name="Content-Type" value="application/json" />
</cfhttp>
<cfscript>
// Find all the URLs in a web page retrieved via cfhttp
// The search is case sensitive
result = REMatch("https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?", httpResp.Filecontent);
</cfscript>
<!--- Now loop through those URLs --->
<cfoutput>
<cfloop array="#result#" item="item" index="index">
<cfif LEFT(item, 1) is "##">
<!---Your logic if it's just an anchor--->
<cfelse>
<!---Your logic if it's a full link--->
</cfif>
<br/>
</cfloop>
</cfoutput>
If it returns a full URL before the anchor, as you say (I've been getting inconsistent results with resolveurl="true"), use this to grab only the part you want.
<cfoutput>
<cfloop array="#result#" item="item" index="index">
#ListLast(item, "##")#
</cfloop>
</cfoutput>
What this code does is grab all the URLs, and parse them for anchors.
You'll have to decide what to do next inside your loop. Maybe preserve the values and add them to a new array, so you can save it somewhere with the links fixed?
It's impossible to assume in a situation like this.

There does not appear to be a way to prevent CF from resolving the hashes. In our usage of it the current result is actually beneficial since when we present content from another site we usually want the user to be sent there.
Here is a way to replace link href values with just the anchor, if one is present, using regular expressions. I'm sure there are combinations of issues that could occur here with really malformed HTML.
<cfsavecontent variable="testcontent">
<strong>test</strong>
go to google
go to section
</cfsavecontent>
<cfset domain = replace("current.domain", ".", "\.", "all") />
<cfset match = "(href\s*=\s*(""|'))\s*(http://#domain#[^##'""]+)(##[^##'""]+)\s*(""|')" />
<cfset result = reReplaceNoCase(testcontent, match, "\1\4\5", "all") />
<cfoutput><pre>#encodeForHTML(result)#</pre></cfoutput>
Output
<strong>test</strong>
go to google
<a href="#section">go to section</a>
Another option, if you are displaying the content in a normal page with JS/jQuery available, is to run through each link on display and update it to be just the anchor. This is less likely to break on malformed HTML. Let me know if you have any interest in that approach.

Excluding items from a list in coldfusion by type

Is there a way to exclude certain items by file type in a list in ColdFusion?
Background: I just integrated a compression tool into an existing application and ran into a problem: the previous developer's code automatically grabs each file from the upload destination on the server and pushes it to the Network Attached Storage (NAS). The aim now is to stop the NAS migration code from moving all files, so that only non-PDF files are moved. What I want to do is loop through the variable that stores the names of the uploaded files, exclude the PDFs from the list, then pass the list on to the NAS code, so all non-PDFs are moved and all uploaded PDFs remain on the server. Working with this code is a challenge because no one commented or documented anything, and I've been trying several approaches.
<cffile action="upload" destination= "c:\uploads\" result="myfiles" nameconflict="makeunique" >
<cfset fileSys = CreateObject('component','cfc.FileManagement')>
<cfif Len(get.realec_transactionid)>
<cfset internalOnly=1 >
</cfif>
**This line below is what I want to loop through and exclude file names
with pdf extensions **
<cfset uploadedfilenames='#myfiles.clientFile#' >
<CFSET a_insert_time = #TimeFormat(Now(), "HH:mm:ss")#>
<CFSET a_insert_date = #DateFormat(Now(), "mm-dd-yyyy")#>
**This line calls their method from another cfc that has all the file
migration methods.**
<cfset new_file_name = #fileSys.MoveFromUploads(uploadedfilenames)#>
**Once it moves the file to the NAS, it inserts the file info into the
DB table here**
<cfquery name="addFile" datasource="#request.dsn#">
INSERT INTO upload_many (title_id, fileDate, filetime, fileupload)
VALUES('#get.title_id#', '#dateTimeStamp#', '#a_insert_time#', '#new_file_name#')
</cfquery>
<cfelse>
<cffile action="upload" destination= #ExpandPath("./uploaded_files/zip.txt")# nameconflict="overwrite" >
</cfif>
Update 6/18
Trying the recommended code helps with sorting out file types when tested outside of the application, but any time it's integrated into the application to operate on the variable uploadedfilenames, the rest of the application fails: the multi-file upload module just throws a status 500 error and no errors are reported in the CF logs. I've found that simply running a cfloop on another variable not related to anything in the code still causes the error.

As I understand it, you want to filter out file names with a specific file type/extension (e.g. pdf) from the main list uploadedfilenames. This is one of the easiest ways:
<cfset lFileNames = "C:\myfiles\proj\icon-img-12.png,C:\myfiles\proj\sample-file.ppt,C:\myfiles\proj\fin-doc1.docx,C:\myfiles\proj\fin-doc2.pdf,C:\myfiles\proj\invoice-temp.docx,C:\myfiles\proj\invoice-final.pdf" />
<cfset lResultList = "" />
<cfset fileExtToExclude = "pdf" />
<cfloop list="#lFileNames#" index="fileItem" delimiters=",">
<cfif ListLast(ListLast(fileItem,'\'),'.') NEQ fileExtToExclude>
<cfset lResultList = ListAppend(lResultList,"#fileItem#") />
</cfif>
</cfloop>
This is easily done using only the list functions provided by ColdFusion. I would recommend wrapping this code in a function for easy handling. Another way would be to run a regular expression over the list (if you're looking for a more general solution, outside the context of ColdFusion).
Now, applying the solution to your problem:
<cfset uploadedfilenames='#myfiles.clientFile#' >
<cfset lResultList = "" />
<cfset fileExtToExclude = "pdf" />
<cfloop list="#uploadedfilenames#" index="fileItem" delimiters=",">
<cfif ListLast(ListLast(fileItem,'\'),'.') NEQ fileExtToExclude>
<cfset lResultList = ListAppend(lResultList,fileItem) />
</cfif>
</cfloop>
<cfset uploadedfilenames = lResultList />
<!--- rest of your code continues --->
The result list lResultList is copied to the original variable uploadedfilenames.
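The suggestion to wrap the loop in a function could be sketched roughly as follows. The function name excludeByExtension and its argument names are illustrative assumptions, not part of the original code, and the check is simplified to look only at the text after the last dot in each item:

```cfml
<cffunction name="excludeByExtension" returntype="string" output="false">
    <cfargument name="fileList" type="string" required="true" />
    <cfargument name="extToExclude" type="string" required="true" />
    <cfargument name="delimiter" type="string" required="false" default="," />
    <cfset var result = "" />
    <cfset var fileItem = "" />
    <cfloop list="#arguments.fileList#" index="fileItem" delimiters="#arguments.delimiter#">
        <!--- Keep items with no dot at all, and items whose last dot-separated part
              is not the excluded extension. NEQ compares case-insensitively. --->
        <cfif listLen(fileItem, ".") LT 2 OR listLast(fileItem, ".") NEQ arguments.extToExclude>
            <cfset result = listAppend(result, fileItem, arguments.delimiter) />
        </cfif>
    </cfloop>
    <cfreturn result />
</cffunction>

<!--- Usage with the variable from the question --->
<cfset uploadedfilenames = excludeByExtension(myfiles.clientFile, "pdf") />
```

Keeping the filter in one function means the NAS call only ever sees the already-filtered list, and the extension to exclude can be changed in one place.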

I hope I'm not misunderstanding the question, but why don't you just wrap all of that in an if-statement that reads the full file name? Whether the files are coming one by one or through a delimited list, it should be easy to work around.
<cfif !listContains(ListName, '.pdf')>
OR
<cfif FileName does not contain '.pdf'>
then
all the code you posted

How do I replace text in all href attributes of anchor tags?

I need to replace the text inside all href values. I think a regular expression is the way to do it, but I'm no regex pro. Any thoughts on how I'd do the following using ColdFusion?
so it is changed to:
Thanks!
Here's an update to the question: I have this code and need the pattern below:
<cfset matches = ReMatch('<a[^>]*href="http[^"]*"[^>]*>(.+?)</a>', arguments.htmlCode) />
<cfdump var="#matches#">
<cfset links = arrayNew(1)>
<cfloop index="a" array="#matches#">
<cfset arrayAppend(links, rereplace(a, 'need regex'," {clickurl}","all"))>
</cfloop>
<cfdump var="#links#">

Here's how to do it with jSoup HTML parser:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Dom = jsoup.parse( InputHtml ) />
<cfset Dom.select('a[href]').attr('href','{replaced}') />
<cfset NewHtml = Dom.html() />
(On CF9 and earlier, this requires placing the jsoup's jar in CF's lib directory, or using JavaLoader.)
Using a HTML parser is usually better than using regex, not least because it's easier to maintain and understand.
Here's an imperfect way of doing it with a regex:
<cfset NewHtml = InputHtml.replaceAll
( '(?<=<a.{0,99}?\shref\s{0,99}?=\s{0,99}?)(?:"[^"]+|''[^'']+)(["'])'
, '$1{replaced}$1'
)/>
Which hopefully demonstrates why using a tool such as jsoup is definitely the way to go...
(btw, the above is using the Java regex engine (via string.replaceAll), so it can use the lookbehind functionality, which doesn't exist in CF's built-in regex (rereplace/rematch/etc))
Update, based on the new code sample you've provided...
Here is an example of how to use jsoup for what you're doing - it might still need some updates (depending on what {clickurl} is eventually going to be doing), but it currently functions the same as your sample code is attempting:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset links = jsoup.parse( Arguments.HtmlCode )
<!--- select all links beginning http and change their href --->
.select('a[href^=http]').attr('href',' {clickurl}')
<!--- get HTML for all links, then split into array. --->
.outerHtml().split('(?<=</a>)(?!$)')
/>
<cfdump var=#links# />
That middle bit is all a single cfset, but I split it up and added comments for clarity. (You could of course do this with multiple variables and 3+ cfsets if you preferred that.)
Again, it's not a regex, because what you're doing involves parsing HTML, and regex is not designed for parsing tag-based syntax, so isn't very good at it - there are too many quirks and variations with HTML and describing them in a single regex gets very complicated very quickly.

How can I strip this URL of everything before "http://"?

I'm doing some web scraping with ColdFusion and mostly everything is working fine. The only remaining issue is that some URLs come through with extra text in front of them, which is now causing errors.
Not sure what's causing it, but it's probably my regex. Anyhow, there's a distinct pattern where text appears before the "http://". I'd like to simply remove everything before it.
Any chance you could help?
Take this string for example:
"I'M OBSESSED WITH MY BEAUTIFUL FRIEND" src="http://scs.viceland.com/feed/images/uk_970014338_300.jpg
I'd much appreciate your help as regex isn't something I've managed to make time for - hopefully I will some day!
Thanks.
EDIT:
Hi,
I thought it might be helpful to post my entire function, since it could be my initial regex that is causing the issue. Basically, the function takes one argument. In this case, it's the contents of an HTML file (via CFHTTP).
In some cases, every URL looks and works fine. If I try digg.com for example it works...but it dies on something like youtube.com. I guess this would be down to their specific HTML formatting. Either way, all I ever need is the value of the SRC attribute on image tags.
Here's what I have so far:
<cffunction name="extractImages" returntype="array" output="false" access="public" displayname="extractImages">
<cfargument name="fileContent" type="string" />
<cfset var local = {} />
<cfset local.images = [] />
<cfset local.imagePaths = [] />
<cfset local.temp = [] />
<cfset local.images = reMatchNoCase("<img([^>]*[^/]?)>", arguments.fileContent) />
<cfloop array="#local.images#" index="local.i">
<cfset local.temp = reMatchNoCase("(""|')(.*)(gif|jpg|jpeg|png)", local.i) />
<cfset local.path = local.temp />
<cfif not arrayIsEmpty(local.path)>
<cfset local.path = trim(replace(local.temp[1],"""","","all")) />
<cfset arrayAppend(local.imagePaths, local.path) />
</cfif>
<cfif isValid("url", local.path)>
<cftry>
<cfif fileExists(local.path)>
<cfset arrayAppend(local.imagePaths, local.path) />
</cfif>
<cfcatch type="any">
<cfset application.messagesObject.addMessage("error","We were not able to obtain all available images on this page.") />
</cfcatch>
</cftry>
</cfif>
</cfloop>
<cfset local.imagePaths = application.udfObject.removeArrayDuplicates(local.imagePaths) />
<cfreturn local.imagePaths />
</cffunction>
This function works. However, on some sites it doesn't. It looks a bit over the top, but much of it is just safeguards to make sure I get valid image paths.
Hope you can help.
Many thanks again.
Michael

Take a look at ReFind() or REFindNoCase() - http://cfquickdocs.com/cf9/#refindnocase
Here is a regex that will work.
<cfset string = 'IM OBSESSED WITH MY BEAUTIFUL FRIEND" src="http://scs.viceland.com/feed/images/uk_970014338_300.jpg' />
<cfdump var="#refindNoCase('https?://[-\w.]+(:\d+)?(/([\w/_.]*)?)?',string, 1, true)#">
You will see a structure returned with POS and LEN keys. Use the first element in the POS array to see where the match starts, and the first element in the LEN array to see how long it is. You can then use these values in the Mid() function to grab just the matching URL.
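A minimal sketch of that Mid() step, using the sample string from this answer. The variable names input, match and imageUrl are illustrative only:

```cfml
<cfset input = 'IM OBSESSED WITH MY BEAUTIFUL FRIEND" src="http://scs.viceland.com/feed/images/uk_970014338_300.jpg' />
<!--- The fourth argument (true) asks for the POS and LEN arrays instead of a plain position --->
<cfset match = reFindNoCase("https?://[-\w.]+(:\d+)?(/([\w/_.]*)?)?", input, 1, true) />
<cfset imageUrl = "" />
<!--- pos[1] is 0 when nothing matched, so only call Mid() on a real match --->
<cfif match.pos[1] GT 0>
    <cfset imageUrl = mid(input, match.pos[1], match.len[1]) />
</cfif>
<cfoutput>#imageUrl#</cfoutput>
```

With the sample string, imageUrl ends up holding just the part starting at http://, with the leading text and the src=" prefix dropped.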

I'm not familiar with ColdFusion, but it seems to me that you just need a regex that looks for http://, then any number of characters, then the end of the string.
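In ColdFusion, that idea could look roughly like the sketch below. It assumes the URL always runs to the end of the string, as in the sample; the variable names raw, found and cleaned are illustrative:

```cfml
<cfset raw = 'IM OBSESSED WITH MY BEAUTIFUL FRIEND" src="http://scs.viceland.com/feed/images/uk_970014338_300.jpg' />
<!--- Match from the first http:// or https:// through to the end of the string --->
<cfset found = reMatchNoCase("https?://.*$", raw) />
<cfset cleaned = "" />
<!--- reMatchNoCase returns an empty array when there is no match --->
<cfif arrayLen(found)>
    <cfset cleaned = found[1] />
</cfif>
<cfoutput>#cleaned#</cfoutput>
```

This is cruder than a pattern that describes the URL itself: anything that follows the URL in the string is kept as well, so it only suits input where the URL is the last thing on the line.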

Resolving variables inside a Coldfusion string

My client has a database table of email bodies that get sent at certain times to customers. The text for the emails contains ColdFusion expressions like Dear #firstName# and so on. These emails are HTML - they also contain all sorts of HTML mark-up. What I'd like to do is read that text from the database into a string and then have ColdFusion Evaluate() that string to resolve the variables. When I do that, Evaluate() throws an exception because it doesn't like the HTML markup in there (I also tried filtering the string through HTMLEditFormat() as an intermediate step for grins but it didn't like the entities in there).
My predecessor solved this problem by writing the email text out to a file and then cfincluding that. It works. It seems really hacky though. Is there a more elegant way to handle this using something like Evaluate() that I'm not seeing?

What other languages often do, and it seems to work very well, is put some kind of token in the template that can be easily found and replaced. So you might have a template like:
Dear {{name}}, Thanks for trying {{product_name}}. Etc...
And then you can simply:
<cfset str = ReplaceNoCase(str, "{{name}}", name, "ALL") />
And when you want to get fancier you could just write a method to wrap this:
<cffunction name="fillInTemplate" access="public" returntype="string" output="false">
<cfargument name="map" type="struct" required="true" />
<cfargument name="template" type="string" required="true" />
<cfset var str = arguments.template />
<cfset var k = "" />
<cfloop list="#StructKeyList(arguments.map)#" index="k">
<cfset str = ReplaceNoCase(str, "{{#k#}}", arguments.map[k], "ALL") />
</cfloop>
<cfreturn str />
</cffunction>
And use it like so:
<cfset map = { name : "John", product_name : "SpecialWidget" } />
<cfset filledInTemplate = fillInTemplate(map, someTemplate) />

Not sure you need rereplace, you could brute force it with a simple replace if you don't have too many fields to merge
How about something like this (not tested)
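<cfset BaseTemplate = "... lots of html with embedded tokens" />
<!--- Loop over your recipients; "recipients" is a placeholder query name --->
<cfloop query="recipients">
<!--- Pound signs are doubled so CF treats each token as literal text instead of evaluating it --->
<cfset LoopTemplate = replace(BaseTemplate, "##firstName##", myvarforFirstName, "All") />
<cfset LoopTemplate = replace(LoopTemplate, "##lastName##", myvarforLastName, "All") />
<cfset LoopTemplate = replace(LoopTemplate, "##address##", myvarforAddress, "All") />
</cfloop>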
Just treat the html block as a simple string.

CF 7+: You may use regular expression, REReplace()?
CF 9: use Virtual File System

If the variable is in a structure, from something like a form post, then you can use StructFind. It does exactly what you ask. I ran into this issue when processing a form with dynamic inputs.
Ex.
StructFind(FORM, 'WhatYouNeed')
