Scraping data hurdles with ColdFusion

I am working on a scraper and need to extract information from a website whose index page links to inner pages.
Here is my small attempt at scraping. I am able to get the names of the states, but now I need to go deeper, and I admit I am very bad at regular expressions.
Please guide me. Here is my code:
<cfscript>
function stripHTML(str) {
str = reReplaceNoCase(str, "<style.*?>(.*?)</style>","","all");
str = reReplaceNoCase(str, "<script.*?>(.*?)</script>","","all");
str = reReplaceNoCase(str, "<.*?>","","all");
str = reReplaceNoCase(str, "^.*?>","");
str = reReplaceNoCase(str, "<.*$","");
return trim(str);
}
</cfscript>
<cfset start = false>
<cfhttp url="http://www.mapsofindia.com/pincode/" method="get"></cfhttp>
<cfset status = cfhttp.Statuscode>
<cfif status IS '200 OK'>
<cfset start = true>
</cfif>
<cfif start>
<cfset string = cfhttp.filecontent />
<cfset StartText = '<table cellpadding=4 cellspacing=0 align="center" border="0" class=extrtable width="90%">' />
<cfset Start = FindNoCase(StartText, string, 1) />
<cfset EndText='<td style="width:50%"> </td></tr></table><br /><br />' />
<cfset Length=Len(StartText) />
<cfset End = FindNoCase(EndText, string, Start) />
<cfset parse = Mid(string, Start+Length, End-Start-Length) />
<cfset parse = Insert('+',parse,FindNoCase("</a>", parse))>
<cfset parse = trim(parse) />
<cfset dbData = stripHTML(ListQualify(parse,'+'))>
<cfloop list="#createLst#" index="k" delimiters="+">
<cfquery name="insert_data" datasource="#request.dsn#">
INSERT INTO cw_states(state,countrycode,country)
VALUES('#k#','IN','India')
</cfquery>
</cfloop>
<table align="center">
<cfoutput>#parse#<br></cfoutput>
</tr>
</table>
</cfif>
This displays the cfoutput on the page correctly, but I am unable to insert the data into the database.
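One likely cause, going only by the code shown: the loop iterates over `createLst`, which is never defined, while the stripped list is stored in `dbData`. A minimal sketch of the insert loop, assuming `dbData` holds the '+'-delimited state names and reusing the `request.dsn` datasource and `cw_states` table from the code above; `stateName` is a name introduced here for illustration.

```cfml
<!--- Assumption: dbData is the '+'-delimited list built by stripHTML(ListQualify(parse,'+')) --->
<cfloop list="#dbData#" index="stateName" delimiters="+">
    <cfset stateName = trim(stateName)>
    <!--- Skip blanks and the stray commas that ListQualify leaves between items --->
    <cfif len(stateName) AND stateName NEQ ",">
        <cfquery name="insert_data" datasource="#request.dsn#">
            INSERT INTO cw_states (state, countrycode, country)
            VALUES (
                <cfqueryparam value="#stateName#" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="IN" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="India" cfsqltype="cf_sql_varchar">
            )
        </cfquery>
    </cfif>
</cfloop>
```

Using `cfqueryparam` rather than quoting `#k#` directly also keeps a scraped value containing an apostrophe from breaking the statement.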
Question #2: Each state link leads to another page that lists the cities of that state. I want to link to my own page, like index2.cfm?url=#link#, which will extract the cities and insert them into my cities table. Each city link should in turn go to index3.cfm?url=#link#, which will open the page of zip codes and insert those into the database.
Please guide me.
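For the second part, a rough sketch of turning each scraped anchor into a link to a follow-up page. The sample HTML, the `example.com` URLs, and the variable names are made up for illustration; `index2.cfm` is the page named in the question. `urlEncodedFormat()` keeps the scraped address intact when it is passed as a query-string value.

```cfml
<!--- Hypothetical sample; in the real page this would be the "parse" fragment --->
<cfset sampleHtml = '<td><a href="http://www.example.com/state-a/">State A</a></td><td><a href="http://www.example.com/state-b/">State B</a></td>'>
<!--- Collect every complete anchor element --->
<cfset anchors = reMatchNoCase('<a\s[^>]*href="[^"]+"[^>]*>.*?</a>', sampleHtml)>
<cfoutput>
<cfloop array="#anchors#" index="anchor">
    <!--- Pull out the href value and the visible link text --->
    <cfset link = reReplaceNoCase(anchor, '^.*?href="([^"]+)".*$', '\1')>
    <cfset label = trim(reReplace(anchor, '<[^>]+>', '', 'all'))>
    <a href="index2.cfm?url=#urlEncodedFormat(link)#">#htmlEditFormat(label)#</a><br>
</cfloop>
</cfoutput>
```

index2.cfm could then read `url.url`, fetch it with cfhttp, and repeat the same extraction for cities; index3.cfm would do the same for zip codes.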

Related

How can I keep <sup> & <sub> html tags in my coldfusion string but get rid of all other html tags?

How can I keep <sup>whatever</sup> & <sub>whatever</sub> html tags in my coldfusion string but get rid of all other html tags?
Although there are many ways to run regex magic in CF, I still prefer some Java here to walk through content and capture stuff.
<!--- string with tags to strip --->
<cfsavecontent variable="stringToStrip">
<p class="something">
Hello <sup>World</sup>
</p>
<div>
<div style="border: 1px solid;">foo</div>
<sub class="example">bar</sub>
</div>
</cfsavecontent>
<!--- regex to capture all tag occurrences --->
<cfset stripRegEx = "<[^>]+>">
<cfset result = createObject("java", "java.lang.StringBuilder").init()>
<cfset matcher = createObject("java", "java.util.regex.Pattern").compile(stripRegEx).matcher(stringToStrip)>
<cfset last = 0>
<cfloop condition="matcher.find()">
<!--- append content before next capture --->
<cfset result.append(
stringToStrip.substring(
last,
matcher.start()
)
)>
<!--- full tag capture --->
<cfset capture = matcher.group(
javaCast("int", 0)
)>
<!--- keep only sub/sup tags --->
<cfif reFindNoCase("^</?su[bp]\b", capture)>
<cfset result.append(capture)>
</cfif>
<!--- continue at last cursor --->
<cfset last = matcher.end()>
</cfloop>
<!--- append remaining content --->
<cfset result.append(
stringToStrip.substring(last)
)>
<!--- final result --->
<cfset result = result.toString()>
<cfoutput>#result#</cfoutput>
Output is:
Hello <sup>World</sup>
foo
<sub class="example">bar</sub>
I think you could also use a negative lookahead in a regex replace like so:
stripped_string = reReplaceNoCase(source_string, '<(?!/?su[bp]\b)[^>]+>', '', 'all' );
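A quick way to try the lookahead version on its own. The sample markup is made up for illustration; the pattern is the one from the line above.

```cfml
<cfscript>
// Hypothetical sample input
source_string = '<p class="x">Hello <sup>World</sup></p><div>foo <sub class="example">bar</sub></div>';
// The lookahead rejects any tag whose name is sub or sup (opening or closing),
// so only the other tags are matched and removed
stripped_string = reReplaceNoCase(source_string, '<(?!/?su[bp]\b)[^>]+>', '', 'all');
// stripped_string is now: Hello <sup>World</sup>foo <sub class="example">bar</sub>
writeOutput(htmlEditFormat(stripped_string));
</cfscript>
```

The `\b` matters: without it, a tag such as `<summary>` would not be affected, but something like `<subtitle>` would be kept by mistake.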

Query created from Query returned from cfspreadsheet not having proper values

Today I came across a very odd case while reading values from a spreadsheet, filtering them on a condition, and creating a new spreadsheet from the filtered data. Here are my steps:
Read Excel sheet
<cfspreadsheet action="read" src="#local.sFilePath#" excludeHeaderRow="true" headerrow ="1" query="local.qExcelData" sheet="1" />
Create a holding Query
<cfset local.columnNames = "LoanNumber,Product," />
<cfset local.qSuccessData = queryNew(local.columnNames,"VarChar,VarChar") />
Filter the Excel returned query on a condition and add the valid ones into the new Holding query
<cfloop query="local.qExcelData" >
<cfif ListFind(local.nExceptionRowList,local.qExcelData.currentrow) EQ 0>
<cfset queryAddRow(local.qSuccessData) />
<cfset querySetCell(local.qSuccessData, 'LoanNumber', local.qExcelData['Loan Number']) />
<cfset querySetCell(local.qSuccessData, 'Product', local.qExcelData['Product']) />
</cfif>
</cfloop>
Create the new spreadsheet
<cfspreadsheet action="write" query="local.qSuccessData" filename="#local.sTempSuccessFile#" overwrite="true">
However, I am getting the following content in my Excel sheet:
Loannumber Product
coldfusion.sql.column#87875656we coldfusion.sql.column#89989ER
Please help me get this working.
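I believe the query loop is not mapping values to the holding query properly. Bracket notation without a row index refers to the column itself rather than the current row's value, which is consistent with the coldfusion.sql.column text in your output.
Please modify your loop to pass the row index, as below: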
<cfloop query="local.qExcelData" >
<cfif ListFind(local.nExceptionRowList,local.qExcelData.currentrow) EQ 0>
<cfset queryAddRow(local.qSuccessData) />
<cfset querySetCell(local.qSuccessData, 'LoanNumber', local.qExcelData['Loan Number'][currentRow]) />
<cfset querySetCell(local.qSuccessData, 'Product', local.qExcelData['Product'][currentRow]) />
</cfif>
</cfloop>

Error with multiple PDF Generation in CF

I'm getting an error when producing a multi-page PDF.
The pages attribute is not specified for the MERGE action in the cfpdf tag.
The line that is causing the issue is: <cfpdf action="merge" source="#ArrayToList(variables.pdfList)#" destination="promega.pdf" overwrite="yes" />
I tried looking in Adobe's documentation but cannot find a pages attribute for the merge action. Thoughts?
<!--- Append PDF to list for merge printing later --->
<cfset ArrayAppend(variables.pdfList, "#expandPath('.')#\general.pdf") />
<cfset variables.userAgenda = GetAttendeeSchedule(
variables.event_key,
variables.badgeNum
) />
<!--- Field CFID is the id of the agenda item; use this for certificate selection --->
<cfif variables.userAgenda.recordcount>
<cfloop query="variables.userAgenda">
<cfset variables.title = Trim(variables.userAgenda.CUSTOMFIELDNAMEONFORM) />
<cfpdfform source="#expandPath('.')#\promega_certificate.pdf" destination="#cfid#.pdf" action="populate">
<cfset variables.startdate = replace(CUSTOMFIELDSTARTDATE, "T", " ") />
<cfpdfformparam name="WORKSHOP" value="#variables.title#">
<cfpdfformparam name="NAME" value="#variables.badgeInfo.FirstName# #variables.badgeInfo.LastName#">
<cfpdfformparam name="STARTDATE" value="#DateFormat(variables.startdate, "medium" )#">
</cfpdfform>
<!--- Append PDF to list for merge printing later --->
<cfset ArrayAppend(variables.pdfList, "#expandPath('.')#\#cfid#.pdf") />
</cfloop>
</cfif>
<cfif ArrayLen(variables.pdfList)>
<cfpdf action="merge" source="#ArrayToList(variables.pdfList)#" destination="promega.pdf" overwrite="yes" />
<!--- Delete individual files --->
<cfloop list="#ArrayToList(variables.pdfList)#" index='i'>
<cffile action="delete" file="#i#" />
</cfloop>
<cftry>
<cffile action="delete" file="#expandPath('.')#\general.pdf" />
<cfcatch></cfcatch>
</cftry>
<cfheader name="Content-Disposition" value="attachment;filename=promega.pdf">
<cfcontent type="application/octet-stream" file="#expandPath('.')#\promega.pdf" deletefile="Yes">
<cflocation url="index.cfm" addtoken="false" />
</cfif>
This happens when source is a single file rather than a comma-separated list of files. My guess is that, given a single file, cfpdf expects to extract pages from it rather than merge, hence the complaint about the missing pages attribute.
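If that diagnosis is right, one workaround is to merge only when there are at least two files and simply copy the file when there is one. A sketch under that assumption, reusing `variables.pdfList` and the `promega.pdf` destination from the question; `variables.mergedFile` and the absolute path built with `expandPath` are choices made here for illustration.

```cfml
<cfset variables.mergedFile = expandPath('.') & "\promega.pdf">
<cfif ArrayLen(variables.pdfList) GT 1>
    <!--- Two or more files: a normal merge --->
    <cfpdf action="merge" source="#ArrayToList(variables.pdfList)#" destination="#variables.mergedFile#" overwrite="yes" />
<cfelseif ArrayLen(variables.pdfList) EQ 1>
    <!--- A single file: nothing to merge, so copy it to the destination --->
    <cffile action="copy" source="#variables.pdfList[1]#" destination="#variables.mergedFile#" />
</cfif>
```

In the question's code the single-file case would arise when the attendee has no agenda items, since only general.pdf is then in the list.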
I tried the following on my coldfusion 9 machine and it worked just fine:
<cfset strPath = GetDirectoryFromPath(GetCurrentTemplatePath()) />
<Cfset pdflist = arrayNew(1)>
<Cfset pdflist[1] = "#strPath#page1.pdf">
<Cfset pdflist[2] = "#strPath#page2.pdf">
<cfpdf action="merge" source="#ArrayToList(pdflist)#" destination="#strPath#merged.pdf" overwrite="yes" />
You could try to merge the pages like this and check whether you still get an error:
<cfset strPath = GetDirectoryFromPath(GetCurrentTemplatePath()) />
<Cfset pdflist = arrayNew(1)>
<Cfset pdflist[1] = "#strPath#page1.pdf">
<Cfset pdflist[2] = "#strPath#page2.pdf">
<cfpdf action="merge" destination="#strPath#merged.pdf" overwrite="yes">
<Cfloop from=1 to="#arraylen(pdflist)#" index="x">
<cfpdfparam source="#pdfList[x]#">
</cfloop>
</cfpdf>

Coldfusion REReplace (to parse Twitter Feed)

I have a twitter feed in the format:
1. Username: Blah blah http://something.com #hashtag
2. Username: Blah blah http://something.com #hashtag
3. Username: Blah blah http://something.com #hashtag
I've removed the username, but how do I wrap tags (for styling) around the http:// parts and the #hashtags?
Here is my current coldfusion code:
<cfset feedurl="http://twitter.com/statuses/user_timeline/MyUserID.rss" />
<cffeed
source="#feedurl#"
properties="feedmeta"
query="feeditems" />
<ul>
<cfoutput query="feeditems">
<cfsavecontent variable="twitterString">
#content#
</cfsavecontent>
<li>#REReplace(twitterString, "UserName: ", "")#</li>
</cfoutput>
</ul>
This worked for me:
<cfset feedurl="http://twitter.com/statuses/user_timeline/jakefeasel.rss" />
<cffeed
source="#feedurl#"
properties="feedmeta"
query="feeditems" />
<ul>
<cfoutput query="feeditems">
<cfsavecontent variable="twitterString">
#REReplace(content, "UserName: ", "")#
</cfsavecontent>
<cfset formattedString = twitterString>
<cfloop array='#[{"regex" = 'http://\S+', "class" = "URL"}, {"regex" = "##\w+", "class" = "hashTag"}]#' index="regexStruct">
<cfset currentPos = 0>
<cfset matches = ReFindNoCase(regexStruct.regex, twitterString, currentPos, true)>
<cfloop condition="matches.len[1] IS NOT 0">
<cfset formattedString = Replace(formattedString, mid(twitterString, matches.pos[1], matches.len[1]), "<span class='#regexStruct.class#'>" & mid(twitterString, matches.pos[1], matches.len[1]) & "</span>")>
<cfset currentPos = matches.pos[1] + matches.len[1]>
<cfset matches = ReFindNoCase(regexStruct.regex, twitterString, currentPos, true)>
</cfloop>
</cfloop>
<li>
#formattedString#
</li>
</cfoutput>
</ul>
You'll obviously have to provide styles for the "URL" and "hashTag" classes.
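An alternative sketch that lets `reReplace` do the wrapping with backreferences instead of looping over matches. The sample tweet is made up; the class names match the answer above. The hashtag pattern requires the `#` to follow whitespace or the start of the string, so a fragment inside a URL (such as `http://something.com/#top`) is not wrapped a second time.

```cfml
<cfscript>
// Hypothetical sample; a literal pound sign must be doubled inside a CFML string
tweet = "Blah blah http://something.com ##hashtag";
// Wrap URLs first; \1 is the matched URL
tweet = reReplaceNoCase(tweet, "(https?://\S+)", "<span class='URL'>\1</span>", "all");
// Then wrap hashtags; \1 restores the leading whitespace, \2 is the hashtag itself
tweet = reReplace(tweet, "(^|\s)(##\w+)", "\1<span class='hashTag'>\2</span>", "all");
writeOutput(tweet);
</cfscript>
```

Inside the feed loop, the same two calls could be applied to the string that has already had the username removed.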

Parse images from a webpage with ColdFusion

I need to get the images from a webpage's source.
I can use cfhttp with method GET and htmlEditFormat() to read the HTML from that page. Now I need to loop through the content to get all the image URLs (src).
Can I use reMatch() or reFind() etc., and if so, how?
Please help!
If I'm not clear, I can try to clarify.
It can be very difficult to reliably parse html with regex.
Here's a function that will probably trip up on a lot of bad cases, but might work if you just need something quick and dirty.
<cffunction name="getSrcAttributes" access="public" output="No" returntype="array">
<cfargument name="pageContents" required="Yes" type="string" />
<cfset var continueSearch = true />
<cfset var cursor = "" />
<cfset var startPos = 1 />
<cfset var finalPos = 0 />
<cfset var imgSrc = "" />
<cfset var images = ArrayNew(1) />
<cfloop condition="continueSearch eq true">
<!--- find the next src= followed by an opening quote --->
<cfset cursor = REFindNoCase("src\s*=\s*[""']", arguments.pageContents, startPos, true) />
<cfif cursor.pos[1] neq 0>
<cfset startPos = (cursor.pos[1] + cursor.len[1]) />
<!--- the value ends at the next quote or whitespace character --->
<cfset finalPos = REFindNoCase("[""'\s]", arguments.pageContents, startPos) />
<cfif finalPos gt startPos>
<cfset imgSrc = Mid(arguments.pageContents, startPos, finalPos - startPos) />
<cfset ArrayAppend(images, imgSrc) />
<cfelseif finalPos eq 0>
<!--- unterminated attribute: nothing more to read --->
<cfset continueSearch = false />
</cfif>
<cfelse>
<cfset continueSearch = false />
</cfif>
</cfloop>
<cfreturn images>
</cffunction>
Note: I can't verify at the moment that this code works.
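Since the question asks about `reMatch()`, here is a sketch along those lines: collect the `<img>` tags first, then pull the `src` value out of each. The sample HTML and variable names are made up for illustration, and like any regex approach it will miss unquoted or unusual attributes.

```cfml
<cfscript>
// Hypothetical sample; in practice this would be cfhttp.fileContent
pageContents = '<p><img src="/a.png" alt="A"><IMG class="x" SRC=''http://example.com/b.jpg''></p>';
// Step 1: every <img ...> tag, regardless of case
imgTags = reMatchNoCase("<img\b[^>]*>", pageContents);
sources = [];
for (i = 1; i <= arrayLen(imgTags); i++) {
    // Step 2: replace the whole tag with just its quoted src value
    src = reReplaceNoCase(imgTags[i], "^.*?\bsrc\s*=\s*[""']([^""']*)[""'].*$", "\1");
    // If nothing was replaced, the tag had no quoted src attribute
    if (src neq imgTags[i]) {
        arrayAppend(sources, src);
    }
}
// For the sample above this prints: /a.png,http://example.com/b.jpg
writeOutput(arrayToList(sources));
</cfscript>
```

Restricting the search to `<img>` tags first avoids picking up the `src` of scripts and iframes, which the cursor-based function would also return.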
Using a browser and jQuery to query all the img tags out of the DOM might be easier.