scraping data hurdles with coldfusion

I am working on a scraper and need to extract information from a website; the page I am scraping basically gives links to inner pages.
Here is my small attempt at scraping. I am able to get the names of the states, but I need to go deeper now, and I admit I am very bad at regular expressions.
Please guide. Here is my code:
<cfscript>
function stripHTML(str) {
str = reReplaceNoCase(str, "<*style.*?>(.*?)</style>","","all");
str = reReplaceNoCase(str, "<*script.*?>(.*?)</script>","","all");
str = reReplaceNoCase(str, "<.*?>","","all");
str = reReplaceNoCase(str, "^.*?>","");
str = reReplaceNoCase(str, "<.*$","");
return trim(str);
}
</cfscript>
<cfset start = false>
<cfhttp url="http://www.mapsofindia.com/pincode/" method="get"></cfhttp>
<cfset status = cfhttp.Statuscode>
<cfif status IS '200 OK'>
<cfset start = true>
</cfif>
<cfif start>
<cfset string = cfhttp.filecontent />
<cfset StartText = '<table cellpadding=4 cellspacing=0 align="center" border="0" class=extrtable width="90%">' />
<cfset Start = FindNoCase(StartText, string, 1) />
<cfset EndText='<td style="width:50%"> </td></tr></table><br /><br />' />
<cfset Length=Len(StartText) />
<cfset End = FindNoCase(EndText, string, Start) />
<cfset parse = Mid(string, Start+Length, End-Start-Length) />
<cfset parse = Insert('+',parse,FindNoCase("</a>", parse))>
<cfset parse = trim(parse) />
<cfset dbData = stripHTML(ListQualify(parse,'+'))>
<cfloop list="#createLst#" index="k" delimiters="+">
<cfquery name="insert_data" datasource="#request.dsn#">
INSERT INTO cw_states(state,countrycode,country)
VALUES('#k#','IN','India')
</cfquery>
</cfloop>
<table align="center">
<cfoutput>#parse#<br></cfoutput>
</tr>
</table>
</cfif>
The cfoutput lists the states on the page correctly, but I am unable to insert them into the database.
Question #2: Each state link leads to another page that lists the cities of that state. I want my page to link to something like index2.cfm?url=#link#, which would extract the cities and insert them into my cities table. The city links there should in turn point to index3.cfm?url=#link#, which would open the zip code page and insert those zip codes into the database.
Please guide.
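One thing visible in the code above: the insert loop iterates over createLst, which is never defined, while the stripped text is stored in dbData. A minimal sketch of one way to handle both questions follows. It assumes the table HTML is taken straight from Mid() (before the Insert('+') step), that the site writes its links as double-quoted href attributes, and that index2.cfm is the hypothetical next-level page from the question. The names tableHtml, anchorPattern, anchors, m, href and stateName are made up for illustration.

```cfml
<!--- Sketch only: tableHtml is the raw table markup cut out with the question's Mid() call --->
<cfset tableHtml = Mid(string, Start+Length, End-Start-Length)>
<!--- Assumption: every state is a link of the form <a ... href="...">Name</a> --->
<cfset anchorPattern = '<a\s[^>]*href="([^"]+)"[^>]*>([^<]+)</a>'>
<cfset anchors = reMatchNoCase(anchorPattern, tableHtml)>
<cfloop array="#anchors#" index="anchor">
    <!--- Re-run the pattern on one anchor to get the subexpression positions --->
    <cfset m = reFindNoCase(anchorPattern, anchor, 1, true)>
    <cfset href = mid(anchor, m.pos[2], m.len[2])>
    <cfset stateName = trim(mid(anchor, m.pos[3], m.len[3]))>
    <!--- cfqueryparam keeps quotes in scraped text from breaking the SQL --->
    <cfquery name="insert_data" datasource="#request.dsn#">
        INSERT INTO cw_states (state, countrycode, country)
        VALUES (
            <cfqueryparam value="#stateName#" cfsqltype="cf_sql_varchar">,
            <cfqueryparam value="IN" cfsqltype="cf_sql_varchar">,
            <cfqueryparam value="India" cfsqltype="cf_sql_varchar">
        )
    </cfquery>
    <!--- Pass the scraped link on to the next-level page --->
    <cfoutput><a href="index2.cfm?url=#urlEncodedFormat(href)#">#htmlEditFormat(stateName)#</a><br></cfoutput>
</cfloop>
```

The same pattern could be reused on index2.cfm and index3.cfm by fetching url.url with cfhttp and changing only the target table, provided the city and zip pages use similar link markup.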

Related

How can I keep <sup> & <sub> html tags in my coldfusion string but get rid of all other html tags?

How can I keep <sup>whatever</sup> & <sub>whatever</sub> html tags in my coldfusion string but get rid of all other html tags?

Although there are many ways to run regex magic in CF, I still prefer some Java here to walk through content and capture stuff.
<!--- string with tags to strip --->
<cfsavecontent variable="stringToStrip">
<p class="something">
Hello <sup>World</sup>
</p>
<div>
<div style="border: 1px solid;">foo</div>
<sub class="example">bar</sub>
</div>
</cfsavecontent>
<!--- regex to capture all tag occurrences --->
<cfset stripRegEx = "<[^>]+>">
<cfset result = createObject("java", "java.lang.StringBuilder").init()>
<cfset matcher = createObject("java", "java.util.regex.Pattern").compile(stripRegEx).matcher(stringToStrip)>
<cfset last = 0>
<cfloop condition="matcher.find()">
<!--- append content before next capture --->
<cfset result.append(
stringToStrip.substring(
last,
matcher.start()
)
)>
<!--- full tag capture --->
<cfset capture = matcher.group(
javaCast("int", 0)
)>
<!--- keep only sub/sup tags --->
<cfif reFindNoCase("</?su[bp]", capture)>
<cfset result.append(capture)>
</cfif>
<!--- continue at last cursor --->
<cfset last = matcher.end()>
</cfloop>
<!--- append remaining content --->
<cfset result.append(
stringToStrip.substring(last)
)>
<!--- final result --->
<cfset result = result.toString()>
<cfoutput>#result#</cfoutput>
Output is:
Hello <sup>World</sup>
foo
<sub class="example">bar</sub>

I think you could also use a negative lookahead in a regex replace like so:
stripped_string = reReplaceNoCase(source_string, '<(?!/?su[bp]\b)[^>]+>', '', 'all' );
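
To see the lookahead variant on a concrete input, here is a small sketch that applies it to a one-line version of the sample markup from the first answer. The input string is an assumption made up for this illustration; the pattern is the one given above.

```cfml
<!--- Sketch only: a condensed, single-line version of the sample markup --->
<cfset source_string = '<p class="something">Hello <sup>World</sup></p><div>foo <sub class="example">bar</sub></div>'>
<!--- Remove every tag except opening/closing sup and sub tags --->
<cfset stripped_string = reReplaceNoCase(source_string, '<(?!/?su[bp]\b)[^>]+>', '', 'all')>
<!--- htmlEditFormat makes the surviving tags visible in the browser --->
<cfoutput>#htmlEditFormat(stripped_string)#</cfoutput>
```

With this input the remaining string should be Hello <sup>World</sup>foo <sub class="example">bar</sub>. The \b after su[bp] ensures that only tag names that are exactly sub or sup are kept.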

Query created from Query returned from cfspreadsheet not having proper values

Today I came across a very odd case while reading values from a spreadsheet, filtering them on a condition, and creating a new spreadsheet from the filtered data. Here are my steps:
Read Excel sheet
<cfspreadsheet action="read" src="#local.sFilePath#" excludeHeaderRow="true" headerrow ="1" query="local.qExcelData" sheet="1" />
Create a holding Query
<cfset local.columnNames = "LoanNumber,Product," />
<cfset local.qSuccessData = queryNew(local.columnNames,"VarChar,VarChar") />
Filter the Excel returned query on a condition and add the valid ones into the new Holding query
<cfloop query="local.qExcelData" >
<cfif ListFind(local.nExceptionRowList,local.qExcelData.currentrow) EQ 0>
<cfset queryAddRow(local.qSuccessData) />
<cfset querySetCell(local.qSuccessData, 'LoanNumber', local.qExcelData['Loan Number']) />
<cfset querySetCell(local.qSuccessData, 'Product', local.qExcelData['Product']) />
</cfif>
</cfloop>
Create the new spreadsheet
<cfspreadsheet action="write" query="local.qSuccessData" filename="#local.sTempSuccessFile#" overwrite="true">
However, I am getting the following content in my Excel sheet:
Loannumber Product
coldfusion.sql.column#87875656we coldfusion.sql.column#89989ER
Please help me get this to work.

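I believe the query loop is not copying the cell values into the holding query. Bracket notation such as local.qExcelData['Loan Number'] without a row index refers to the whole column object rather than the current row's value, which is why the sheet shows coldfusion.sql.column entries.
Please index the column by the current row, as below: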
<cfloop query="local.qExcelData" >
<cfif ListFind(local.nExceptionRowList,local.qExcelData.currentrow) EQ 0>
<cfset queryAddRow(local.qSuccessData) />
<cfset querySetCell(local.qSuccessData, 'LoanNumber', local.qExcelData['Loan Number'][currentRow]) />
<cfset querySetCell(local.qSuccessData, 'Product', local.qExcelData['Product'][currentRow]) />
</cfif>
</cfloop>

Error with multiple PDF Generation in CF

I'm getting an error when producing a multi-page PDF.
The pages attribute is not specified for the MERGE action in the cfpdf tag.
The line that is causing the issue is: <cfpdf action="merge" source="#ArrayToList(variables.pdfList)#" destination="promega.pdf" overwrite="yes" />
I tried looking in Adobe's documentation but cannot find a pages attribute for the merge action. Thoughts?
<!--- Append PDF to list for merge printing later --->
<cfset ArrayAppend(variables.pdfList, "#expandPath('.')#\general.pdf") />
<cfset variables.userAgenda = GetAttendeeSchedule(
variables.event_key,
variables.badgeNum
) />
<!--- Field CFID is the id of the agenda item; use this for certificate selection --->
<cfif variables.userAgenda.recordcount>
<cfloop query="variables.userAgenda">
<cfset variables.title = Trim(variables.userAgenda.CUSTOMFIELDNAMEONFORM) />
<cfpdfform source="#expandPath('.')#\promega_certificate.pdf" destination="#cfid#.pdf" action="populate">
<cfset variables.startdate = replace(CUSTOMFIELDSTARTDATE, "T", " ") />
<cfpdfformparam name="WORKSHOP" value="#variables.title#">
<cfpdfformparam name="NAME" value="#variables.badgeInfo.FirstName# #variables.badgeInfo.LastName#">
<cfpdfformparam name="STARTDATE" value="#DateFormat(variables.startdate, "medium" )#">
</cfpdfform>
<!--- Append PDF to list for merge printing later --->
<cfset ArrayAppend(variables.pdfList, "#expandPath('.')#\#cfid#.pdf") />
</cfloop>
</cfif>
<cfif ArrayLen(variables.pdfList)>
<cfpdf action="merge" source="#ArrayToList(variables.pdfList)#" destination="promega.pdf" overwrite="yes" />
<!--- Delete individual files --->
<cfloop list="#ArrayToList(variables.pdfList)#" index='i'>
<cffile action="delete" file="#i#" />
</cfloop>
<cftry>
<cffile action="delete" file="#expandPath('.')#\general.pdf" />
<cfcatch></cfcatch>
</cftry>
<cfheader name="Content-Disposition" value="attachment;filename=promega.pdf">
<cfcontent type="application/octet-stream" file="#expandPath('.')#\promega.pdf" deletefile="Yes">
<cflocation url="index.cfm" addtoken="false" />
</cfif>

This happens when source is a single file rather than a comma-separated list of files. I'm guessing that with a single source file, cfpdf expects to extract some pages from it rather than combine files, hence the complaint about the pages attribute.
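If that guess is right, one possible workaround is to merge only when the list holds more than one file and simply copy the file otherwise. The sketch below reuses variables.pdfList and the promega.pdf destination from the question; the copy branch is an assumption about the desired behavior, not something taken from the original code.

```cfml
<cfif arrayLen(variables.pdfList) GT 1>
    <!--- Two or more files: a normal merge --->
    <cfpdf action="merge" source="#arrayToList(variables.pdfList)#" destination="promega.pdf" overwrite="yes" />
<cfelseif arrayLen(variables.pdfList) EQ 1>
    <!--- Only one file (for example, an attendee with no agenda items): copy it instead of merging --->
    <cffile action="copy" source="#variables.pdfList[1]#" destination="#expandPath('.')#\promega.pdf" />
</cfif>
```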

I tried the following on my ColdFusion 9 machine and it worked just fine:
<cfset strPath = GetDirectoryFromPath(GetCurrentTemplatePath()) />
<Cfset pdflist = arrayNew(1)>
<Cfset pdflist[1] = "#strPath#page1.pdf">
<Cfset pdflist[2] = "#strPath#page2.pdf">
<cfpdf action="merge" source="#ArrayToList(pdflist)#" destination="#strPath#merged.pdf" overwrite="yes" />
You could try to merge the pages like this and check whether you still get an error:
<cfset strPath = GetDirectoryFromPath(GetCurrentTemplatePath()) />
<Cfset pdflist = arrayNew(1)>
<Cfset pdflist[1] = "#strPath#page1.pdf">
<Cfset pdflist[2] = "#strPath#page2.pdf">
<cfpdf action="merge" destination="#strPath#merged.pdf" overwrite="yes">
<Cfloop from=1 to="#arraylen(pdflist)#" index="x">
<cfpdfparam source="#pdfList[x]#">
</cfloop>
</cfpdf>

Coldfusion REReplace (to parse Twitter Feed)

I have a twitter feed in the format:
1. Username: Blah blah http://something.com #hashtag
2. Username: Blah blah http://something.com #hashtag
3. Username: Blah blah http://something.com #hashtag
I've removed the username, but how do I wrap tags (for styling) around the http:// parts and the #hashtags?
Here is my current ColdFusion code:
<cfset feedurl="http://twitter.com/statuses/user_timeline/MyUserID.rss" />
<cffeed
source="#feedurl#"
properties="feedmeta"
query="feeditems" />
<ul>
<cfoutput query="feeditems">
<cfsavecontent variable="twitterString">
#content#
</cfsavecontent>
<li>#REReplace(twitterString, "UserName: ", "")#</li>
</cfoutput>
</ul>

This worked for me:
<cfset feedurl="http://twitter.com/statuses/user_timeline/jakefeasel.rss" />
<cffeed
source="#feedurl#"
properties="feedmeta"
query="feeditems" />
<ul>
<cfoutput query="feeditems">
<cfsavecontent variable="twitterString">
#REReplace(content, "UserName: ", "")#
</cfsavecontent>
<cfset formattedString = twitterString>
<cfloop array='#[{"regex" = 'http://\S+', "class" = "URL"}, {"regex" = "##\w+", "class" = "hashTag"}]#' index="regexStruct">
<cfset currentPos = 0>
<cfset matches = ReFindNoCase(regexStruct.regex, twitterString, currentPos, true)>
<cfloop condition="matches.len[1] IS NOT 0">
<cfset formattedString = Replace(formattedString, mid(twitterString, matches.pos[1], matches.len[1]), "<span class='#regexStruct.class#'>" & mid(twitterString, matches.pos[1], matches.len[1]) & "</span>")>
<cfset currentPos = matches.pos[1] + matches.len[1]>
<cfset matches = ReFindNoCase(regexStruct.regex, twitterString, currentPos, true)>
</cfloop>
</cfloop>
<li>
#formattedString#
</li>
</cfoutput>
</ul>
You'll obviously have to provide styles for the "URL" and "hashTag" classes (the class names must match the case used in the code).
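
A shorter alternative, sketched here under the assumption that plain backreferences are acceptable, is to let reReplace wrap each match directly. The tweet variable is a made-up stand-in for the feed content with the username already removed; the class names are the ones used above.

```cfml
<!--- Sketch only: "tweet" stands in for the cleaned-up feed content --->
<cfset tweet = "Blah blah http://something.com ##hashtag">
<!--- Wrap every URL; \1 is the text captured by the parentheses --->
<cfset formatted = reReplace(tweet, "(https?://\S+)", "<span class=""URL"">\1</span>", "all")>
<!--- Wrap every hashtag; ## is how CFML escapes a literal # in a string --->
<cfset formatted = reReplace(formatted, "(##\w+)", "<span class=""hashTag"">\1</span>", "all")>
<cfoutput>#formatted#</cfoutput>
```

For the sample line this should produce Blah blah <span class="URL">http://something.com</span> <span class="hashTag">#hashtag</span>. One caveat: a URL that itself contains a # fragment would have that fragment wrapped a second time by the hashtag pass.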

parse images from webpage coldfusion

I need to get the images from a web page's source.
I can use cfhttp with method get and htmlEditFormat() to read the HTML from that page; now I need to loop through the content to get all the image URLs (the src attributes).
Can I use reMatch() or reFind() for this, and if so, how?
Please help!
If I'm not clear, I can try to clarify.

It can be very difficult to reliably parse html with regex.

Here's a function that will probably trip up on a lot of bad cases, but might work if you just need something quick and dirty.
<cffunction name="getSrcAttributes" access="public" output="No">
<cfargument name="pageContents" required="Yes" type="string" />
<cfset var continueSearch = true />
<cfset var cursor = "" />
<!--- string positions are 1-based, so start searching at 1 --->
<cfset var startPos = 1 />
<cfset var finalPos = 0 />
<cfset var imgSrc = "" />
<cfset var images = ArrayNew(1) />
<cfloop condition="continueSearch eq true">
<!--- find the next src=" or src=' --->
<cfset cursor = REFindNoCase("src\s*=\s*[""']", arguments.pageContents, startPos, true) />
<cfif cursor.pos[1] neq 0>
<cfset startPos = (cursor.pos[1] + cursor.len[1]) />
<!--- the value ends at the next quote or whitespace character --->
<cfset finalPos = REFindNoCase("[""'\s]", arguments.pageContents, startPos) />
<!--- skip empty or unterminated values instead of calling Mid() with a bad length --->
<cfif finalPos gt startPos>
<cfset imgSrc = Mid(arguments.pageContents, startPos, finalPos - startPos) />
<cfset ArrayAppend(images, imgSrc) />
</cfif>
<cfelse>
<cfset continueSearch = false />
</cfif>
</cfloop>
<cfreturn images>
</cffunction>
Note: I can't verify at the moment that this code works.
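Since the question asks about reMatch(), here is a rough sketch of that route as well. It assumes the src values are quoted, and the URL, the result name page, and the variables imgTags, imgTag, found and imageUrls are placeholders invented for this example. Like any regex approach, it will miss or mangle unusual markup.

```cfml
<!--- Sketch only: the URL is a placeholder --->
<cfhttp url="http://www.example.com/" method="get" result="page" />
<!--- Collect every <img ...> tag from the raw HTML (no htmlEditFormat needed for parsing) --->
<cfset imgTags = reMatchNoCase("<img\b[^>]*>", page.fileContent)>
<cfset imageUrls = []>
<cfloop array="#imgTags#" index="imgTag">
    <!--- Subexpression 1 captures the quoted src value --->
    <cfset found = reFindNoCase("\bsrc\s*=\s*[""']([^""']+)[""']", imgTag, 1, true)>
    <cfif arrayLen(found.pos) GTE 2 AND found.pos[2] GT 0>
        <cfset arrayAppend(imageUrls, mid(imgTag, found.pos[2], found.len[2]))>
    </cfif>
</cfloop>
<cfdump var="#imageUrls#">
```

Restricting the first pass to img tags avoids picking up the src attributes of script or iframe elements.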

Using a browser and jQuery to query all the img tags out of the DOM might be easier.
