Searching a folder (recursively) for duplicate photos using ColdFusion? - coldfusion

After moving and backing up my photo collection a few times I have several duplicate photos, with different filenames in various folders scattered across my PC. So I thought I would write a quick CF (9) page to find the duplicates (and can then add code later to allow me to delete them).
I have a couple of queries:-
At the moment I am just using file size to match the image file, but I presume matching EXIF data or matching hash of image file binary would be more reliable?
The code I lashed together sort of works, but how could this be done to search outside web root?
Is there a better way?
<cfdirectory
name="myfiles"
directory="C:\ColdFusion9\wwwroot\images\photos"
filter="*.jpg"
recurse="true"
sort="size DESC"
type="file" >
<cfset matchingCount=0>
<cfset duplicatesFound=0>
<table border=1>
<cfloop query="myFiles" endrow="#myfiles.recordcount#-1">
<cfif myfiles.size is myfiles.size[currentrow + 1]>
<!---this file is the same size as the next row--->
<cfset matchingCount = matchingCount + 1>
<cfset duplicatesFound=1>
<cfelse>
<!--- the next file is a different size --->
<!--- if there have been matches, display them now --->
<cfif matchingCount gt 0>
<cfset sRow=#currentrow#-#matchingCount#>
<cfoutput><tr>
<cfloop index="i" from="#sRow#" to="#currentrow#">
<cfset imgURL=#replace(directory[i], "C:\ColdFusion9\wwwroot\", "http://localhost:8500/")#>
<td><img height=200 width=200 src="#imgURL#\#name[i]#"></td>
</cfloop></tr><tr>
<cfloop index="i" from="#sRow#" to="#currentrow#">
<td width=200>#name[i]#<br>#directory[i]#</td>
</cfloop>
</tr>
</cfoutput>
<cfset matchingCount = 0>
</cfif>
</cfif>
</cfloop>
</table>
<cfif duplicatesFound is 0><cfoutput>No duplicate jpgs found</cfoutput></cfif>

This is a pretty fun task, so I decided to give it a try.
First, some testing results on my laptop with 4GB RAM, 2x2.26Ghz CPU and SSD: 1,143 images, total 263.8MB.
ACF9: 8 duplicates, took ~2.3 s
Railo 3.3: 8 duplicates, took ~2.0 s (yay!)
I used a great tip from this SO answer to pick the best hashing option.
So, here is what I did:
<cfsetting enablecfoutputonly="true" />
<cfset ticks = getTickCount() />
<!--- this is great set of utils from Apache --->
<cfset digestUtils = CreateObject("java","org.apache.commons.codec.digest.DigestUtils") />
<!--- cache containers --->
<cfset checksums = {} />
<cfset duplicates = {} />
<cfdirectory
action="list"
name="images"
directory="/home/trovich/images/"
filter="*.png|*.jpg|*.jpeg|*.gif"
recurse="true" />
<cfloop query="images">
<!--- change the delimiter to \ if you're on Windows --->
<cfset ipath = images.directory & "/" & images.name />
<cffile action="readbinary" file="#ipath#" variable="binimage" />
<!---
This is slow as hell with any encoding!
<cfset checksum = BinaryEncode(binimage, "Base64") />
--->
<cfset checksum = digestUtils.md5hex(binimage) />
<cfif StructKeyExists(checksums, checksum)>
<!--- init cache using original on 1st position when duplicate found --->
<cfif NOT StructKeyExists(duplicates, checksum)>
<cfset duplicates[checksum] = [] />
<cfset ArrayAppend(duplicates[checksum], checksums[checksum]) />
</cfif>
<!--- append current duplicate --->
<cfset ArrayAppend(duplicates[checksum], ipath) />
<cfelse>
<!--- save originals only into the cache --->
<cfset checksums[checksum] = ipath />
</cfif>
</cfloop>
<cfset time = NumberFormat((getTickcount()-ticks)/1000, "._") />
<!--- render duplicates without resizing (see options of cfimage for this) --->
<cfoutput>
<h1>Found #StructCount(duplicates)# duplicates, took ~#time# s</h1>
<cfloop collection="#duplicates#" item="checksum">
<p>
<!--- display all found paths of duplicate --->
<cfloop array="#duplicates[checksum]#" index="path">
#HTMLEditFormat(path)#<br/>
</cfloop>
<!--- render only last duplicate, they are the same image any way --->
<cfimage action="writeToBrowser" source="#path#" />
</p>
</cfloop>
</cfoutput>
Obviously, you can easily use the duplicates struct (one array of paths per checksum) to review the results and/or run a cleanup job.
Have fun!

I would recommend splitting the checking code out into a function that accepts only a filename.
Then use a global struct to check for duplicates: the key would be "size" or "size_hash", and the value would be an array containing all filenames that match this key.
Run the function on all JPEG files in all the different directories, then scan the struct and report every entry that has more than one file in its array.
If you want to show an image outside your webroot, you can serve it via <cfcontent file="#filename#" type="image/jpeg">
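The approach described above could be sketched roughly as follows. This is a minimal, untested sketch: the function name addFileToGroups, the variable names, and the directory C:\photos are hypothetical, and it uses java.security.MessageDigest for the hash, which is one option among several. Since cfdirectory takes an absolute path, the folder does not have to be under the web root.

```cfml
<!--- Hypothetical helper: files one path under a "size_hash" key in the groups struct --->
<cffunction name="addFileToGroups" returntype="void" output="false">
    <cfargument name="filePath" type="string" required="true">
    <cfargument name="groups" type="struct" required="true">
    <cfset var info = getFileInfo(arguments.filePath)>
    <cfset var bytes = fileReadBinary(arguments.filePath)>
    <cfset var md = createObject("java", "java.security.MessageDigest").getInstance("MD5")>
    <!--- key is the file size plus the hex MD5 of the file contents --->
    <cfset var key = info.size & "_" & binaryEncode(md.digest(bytes), "hex")>
    <cfif not structKeyExists(arguments.groups, key)>
        <cfset arguments.groups[key] = []>
    </cfif>
    <cfset arrayAppend(arguments.groups[key], arguments.filePath)>
</cffunction>

<!--- structs are passed by reference, so the function fills this one in place --->
<cfset groups = {}>
<!--- C:\photos is an assumed location; any absolute path readable by the server works --->
<cfdirectory action="list" name="photos" directory="C:\photos"
    filter="*.jpg" recurse="true" type="file">
<cfloop query="photos">
    <cfset addFileToGroups(photos.directory & "\" & photos.name, groups)>
</cfloop>

<!--- report only the keys that collected more than one file --->
<cfoutput>
<cfloop collection="#groups#" item="key">
    <cfif arrayLen(groups[key]) gt 1>
        <p>#HTMLEditFormat(arrayToList(groups[key], ", "))#</p>
    </cfif>
</cfloop>
</cfoutput>
```

Hashing every file is the simplest version. If the collection is large, it may be worth grouping by size first and hashing only the files whose size collides.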

Related

Custom CFInclude for file customization

Our code base has quite a bit of the following example as we allow a lot of our base pages to be customized to our customers' individual needs.
<cfif fileExists("/custom/someFile.cfm")>
<cfinclude template="/custom/someFile.cfm" />
<cfelse>
<cfinclude template="someFile.cfm" />
</cfif>
I wanted to create a custom CF tag to boilerplate this as a simple <cf_custominclude template="someFile.cfm" />, however I ran into the fact that custom tags are effectively blackboxes, so they aren't pulling in local variables that exist prior to the start of the tag, and I can't reference any variable that was created as a result of the tag from importing the file.
E.G.
<!--- This is able to use someVar --->
<!--- Pulls in some variable named "steve" --->
<cfinclude template="someFile.cfm" />
<cfdump var="#steve#" /> <!--- This is valid, however... --->
<!--- someVar is undefined for this --->
<!--- Pulls in steve2 --->
<cf_custominclude template="someFile.cfm" />
<cfdump var="#steve2#" /> <!--- This isn't valid as steve2 is undefined. --->
Is there a means around this, or should I utilize some other language feature to accomplish my goal?
Well, I question doing this at all, but I know we all get handed code at times that we have to deal with, and what a struggle it is to get people to refactor.
This should do what you want. One important thing to note is that your custom tag must have a closing tag or it won't work! Just use the self-closing form, as you had it above:
<cf_custominclude template="someFile.cfm" />
This should do the trick, named as you had it: custominclude.cfm
<!--- executes at start of tag --->
<cfif thisTag.executionMode eq 'Start'>
<!--- store a list of keys we don't want to copy, prior to including template --->
<cfset thisTag.currentKeys = structKeyList(variables)>
<!--- control var to see if we even should bother copying scopes --->
<cfset thisTag.includedTemplate = false>
<!--- standard include here --->
<cfif fileExists(expandPath(attributes.template))>
<cfinclude template="#attributes.template#">
<!--- set control var / flag to copy scopes at close of tag --->
<cfset thisTag.includedTemplate = true>
</cfif>
</cfif>
<!--- executes at closing of tag --->
<cfif thisTag.executionMode eq 'End'>
<!--- if control var / flag set to copy scopes --->
<cfif thisTag.includedTemplate>
<!--- only copy vars created in the included page --->
<cfloop list="#structKeyList(variables)#" index="var">
<cfif not listFindNoCase(thisTag.currentKeys, var)>
<!--- copy from include into caller scope --->
<cfset caller[var] = variables[var]>
</cfif>
</cfloop>
</cfif>
</cfif>
I tested it and it works fine; it should also work when nested. Good luck!
<!--- Pulls in steve2 var from include --->
<cf_custominclude template="someFile.cfm" />
<cfdump var="#steve2#" /> <!--- works! --->

ColdFusion looping over an array with an empty/undefined field

I am downloading data from an API from one of our vendors. The data is an array but some of the fields are empty and come over as undefined. I am able to get most of the information out with a loop but when I add the field "notes" it fails with the error of:
"Element notes is undefined in a CFML structure referenced as part of an expression. The specific sequence of files included or processed is:
C:\websites\Fire\Reports\xml_parse\Crewsense_payroll_loop.cfm, line:
21 "
When I look at the dump I see that the field shows as "undefined". I've run out of ideas. Any help would be greatly appreciated. I've included the entire code and a link to the dump showing the array.
<cfhttp url="https://api.crewsense.com/v1/payroll? access_token=as;lkdfj;alskdfj;laksdfj&token_type=bearer&start=2019-01-05%2019:00:00&end=2019-01-06%2007:59:00" method="GET" resolveurl="YES" result="result">
</cfhttp>
<cfoutput>
<cfset ApiData = deserializeJSON(result.filecontent)>
<cfset API_ArrayLength = arraylen(ApiData)>
<cfloop index="i" from="1" to=#API_ArrayLength#>
#i# #ApiData[i]["name"]#
#ApiData[i]["employee_id"]#
#ApiData[i]["start"]#
#ApiData[i]["end"]#
#ApiData[i]["total_hours"]#
#ApiData[i]["work_type"]#
#ApiData[i]["work_code"]#
#ApiData[i]["user_id"]#
#ApiData[i]["notes"]# <---Fails here when added--->
<cfset i = i+1>
<br>
</cfloop>
<cfdump var="#ApiData#">
</cfoutput>
Dump
When dealing with data structures that have optional elements you will need to check for their existence before trying to access them. Otherwise you will get that error. I have added a snippet with an if condition utilizing the structKeyExists() function to your code as an example.
<cfhttp url="https://api.crewsense.com/v1/payroll? access_token=as;lkdfj;alskdfj;laksdfj&token_type=bearer&start=2019-01-05%2019:00:00&end=2019-01-06%2007:59:00" method="GET" resolveurl="YES" result="result">
</cfhttp>
<cfoutput>
<cfset ApiData = deserializeJSON(result.filecontent)>
<cfset API_ArrayLength = arraylen(ApiData)>
<cfloop index="i" from="1" to=#API_ArrayLength#>
#i# #ApiData[i]["name"]#
#ApiData[i]["employee_id"]#
#ApiData[i]["start"]#
#ApiData[i]["end"]#
#ApiData[i]["total_hours"]#
#ApiData[i]["work_type"]#
#ApiData[i]["work_code"]#
#ApiData[i]["user_id"]#
<cfif structKeyExists(ApiData[i],"notes")>
#ApiData[i]["notes"]# <!--- Show 'notes' if it exists --->
<cfelse>
'notes' is not available <!--- Do something here (or not) --->
</cfif>
<cfset i = i+1>
<br>
</cfloop>
<cfdump var="#ApiData#">
</cfoutput>

ColdFusion - How to find a string that's always preceded and followed by certain characters

Let's say I have a string that is as follows:
position":1,"type":"row","answer_id":"9541943203"},{"text":"Creating a report or view","visible":true,"position":2,"type":"row","answer_id":"9541943204"},{"text":"Editing a report or view","visible":true,"position":3,"type":"row","answer_id":"9541943205"},{"text":"Saving a report or view","visible":true,"position":4,"type":"row","answer_id":"9541943206"},
How can I get the values of every answer_id?
I know the value I want is always preceded by "answer_id":" and it's always followed by "},.
How do I compile a list of those values?
e.g. 9541943203, 9541943204, 9541943205, 9541943206
Dump of deserialized JSON:
You can work with JSON in ColdFusion (almost) the same way you do in Javascript.
<!--- your JSON string here --->
<cfset sourceString = '{ "data": { ... } }'>
<!--- we will add all answer_id to this array --->
<cfset arrayOfAnswerIDs = []>
<!--- in case something goes wrong, we will store the reason in this variable --->
<cfset errorMessage = "">
<cftry>
<!--- deserialize the JSON so ColdFusion can work with it --->
<cfset sourceJSON = deserializeJSON(sourceString)>
<!--- validate the structure of the JSON (expected keys, expected types) --->
<cfif (
structKeyExists(sourceJSON, "data") and
isStruct(sourceJSON["data"]) and
structKeyExists(sourceJSON["data"], "pages") and
isArray(sourceJSON["data"]["pages"])
)>
<!--- iterate pages --->
<cfloop array="#sourceJSON["data"]["pages"]#" index="page">
<!--- skip pages that do not contain questions --->
<cfif (
(not isStruct(page)) or
(not structKeyExists(page, "questions")) or
(not isArray(page["questions"]))
)>
<cfcontinue>
</cfif>
<!--- iterate questions --->
<cfloop array="#page["questions"]#" index="question">
<!--- skip questions that do not have answers --->
<cfif (
(not isStruct(question)) or
(not structKeyExists(question, "answers")) or
(not isArray(question["answers"]))
)>
<cfcontinue>
</cfif>
<!--- iterate answers --->
<cfloop array="#question["answers"]#" index="answer">
<!--- skip invalid answer objects --->
<cfif not isStruct(answer)>
<cfcontinue>
</cfif>
<!--- fetch the answer_id --->
<cfif (
structKeyExists(answer, "answer_id") and
isSimpleValue(answer["answer_id"])
)>
<!--- add the answer_id to the array --->
<cfset arrayOfAnswerIDs.add(
answer["answer_id"]
)>
</cfif>
</cfloop>
</cfloop>
</cfloop>
<cfelse>
<cfset errorMessage = "Pages missing or invalid in JSON.">
</cfif>
<cfcatch type="Application">
<cfset errorMessage = "Failed to deserialize JSON.">
</cfcatch>
</cftry>
<!--- show result in HTML --->
<cfoutput>
<!--- display error if any occurred --->
<cfif len(errorMessage)>
<p>
#errorMessage#
</p>
</cfif>
<p>
Found #arrayLen(arrayOfAnswerIDs)# answers with an ID.
</p>
<ol>
<cfloop array="#arrayOfAnswerIDs#" index="answer_id">
<li>
#encodeForHtml(answer_id)#
</li>
</cfloop>
</ol>
</cfoutput>
You might want to track all unexpected skips during the processing, so consider having an array of errors instead of a single string.
If you want the answer_id in a list, use
arrayToList(arrayOfAnswerIDs, ",")
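If all you have is a fragment like the one in the question, which cannot be deserialized on its own, a regular expression is a possible fallback. The following is only a sketch under that assumption: it relies on every ID being a run of digits directly after "answer_id":", and the variable names fragment, matches and ids are made up for illustration. Parsing the full JSON as shown above remains the more robust route.

```cfml
<!--- shortened fragment from the question; not valid JSON by itself --->
<cfset fragment = 'position":1,"type":"row","answer_id":"9541943203"},{"text":"Creating a report or view","visible":true,"position":2,"type":"row","answer_id":"9541943204"},'>
<!--- reMatch returns every full match, e.g. "answer_id":"9541943203" --->
<cfset matches = reMatch('"answer_id":"[0-9]+"', fragment)>
<cfset ids = []>
<cfloop array="#matches#" index="m">
    <!--- strip everything that is not a digit, leaving only the ID --->
    <cfset arrayAppend(ids, reReplace(m, "[^0-9]", "", "all"))>
</cfloop>
<!--- print the IDs as a comma-separated list --->
<cfoutput>#arrayToList(ids, ", ")#</cfoutput>
```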

Randomize cfoutput

The following code sets the body background (BG) depending on whether the /home URL is active; otherwise it uses a different BG. pageID = 206 contains the /home BG image and pageID = 207 contains the fallback BG image. We currently have it set up this way so the client can go into the CMS and change the BG image without any trouble.
However, the client would like the ability to add additional backgrounds and have them randomize. In theory I would add a page to the CMS and include the pageID and add randomize. I'm familiar with ColdFusion but this is a little over my head. Any input or direction would be greatly appreciated.
<cfsavecontent variable="HOME_BG">
<cfoutput>
<cfset includeID = '206'><cfinclude template='/PageInclude/PageInclude.cfm'/>
</cfoutput>
</cfsavecontent>
<cfset BG1 = getToken(HOME_BG,3, """") >
<cfsavecontent variable="COMM_BG">
<cfoutput>
<cfset includeID = '207'><cfinclude template='/PageInclude/PageInclude.cfm'/>
</cfoutput>
</cfsavecontent>
<cfset BG2 = getToken(COMM_BG,3, """") >
<body onload="checkCookie()" style="background-image: url(<cfoutput><cfif fpath EQ "home">#BG1#<cfelse>#BG2#</cfif></cfoutput>);">
I was looking for a way to randomize between two different <cfset> values but didn't have much luck.
Solution using an array and randrange
<cfset bgHomeIDArray = ArrayNew(1)> <!--- Create the array --->
<cfset ArrayAppend(bgHomeIDArray, 214)> <!--- Adds an array value --->
<cfset ArrayAppend(bgHomeIDArray, 215)> <!--- Adds an array value --->
<cfsavecontent variable="HOME_BG">
<cfoutput>
<cfset includeID = bgHomeIDArray[randRange(1, arrayLen(bgHomeIDArray))]> <cfinclude template='/PageInclude/PageInclude.cfm'/>
</cfoutput>
</cfsavecontent>
<cfset BG1 = getToken(HOME_BG,3, """") >
<cfset bgCommIDArray = ArrayNew(1)> <!--- Create the array --->
<cfset ArrayAppend(bgCommIDArray, 216)> <!--- Adds an array value --->
<cfset ArrayAppend(bgCommIDArray, 217)> <!--- Adds an array value --->
<cfsavecontent variable="COMM_BG">
<cfoutput>
<cfset includeID = bgCommIDArray[randRange(1, arrayLen(bgCommIDArray))]><cfinclude template='/PageInclude/PageInclude.cfm'/>
</cfoutput>
</cfsavecontent>
<cfset BG2 = getToken(COMM_BG,3, """") >
<body onLoad="checkCookie()" style="background-image: url(<cfoutput><cfif fpath EQ "home">#BG1#<cfelse>#BG2#</cfif></cfoutput>?03202014);">

ColdFusion searching robots.txt for specific page exception

We're adding some functionality to our CMS whereby when a user creates a page, they can select an option to allow/disallow search engine indexing of that page.
If they choose to disallow indexing, then something like the following would apply:
<cfif request.variables.indexable eq 0>
<cffile
action = "append"
file = "C:\websites\robots.txt"
output = "Disallow: /blocked-page.cfm"
addNewLine = "yes">
<cfelse>
<!--- check if the page is already disallowed in robots.txt and remove the line if it is --->
</cfif>
It's the <cfelse> clause I need help with.
What would be the best way to parse robots.txt to see if this page had already been disallowed? Would it be a cffile action="read", then do a find() on the read variable?
Actually, the check on whether the page has already been disallowed would probably go further up, to avoid double-adding.
You keep the list of pages in a database and each page record has an indexable bit, right? If so, a simpler and more reliable approach would be to regenerate robots.txt each time a page is added, deleted, or has its indexable bit changed.
<!--- TODO: query for non-indexable pages (the getPages query used below) --->
<!--- lock the code to prevent concurrent changes --->
<cflock name="robots.txt" type="exclusive" timeout="30">
<!--- flush the file, or simply start with writing something --->
<cffile
action = "write"
file = "C:\websites\robots.txt"
output = "Sitemap: http://www.mywebsite.tld/sitemap.xml"
addNewLine = "yes">
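<!--- Disallow rules only apply inside a User-agent group, so open one first --->
<cffile
action = "append"
file = "C:\websites\robots.txt"
output = "User-agent: *"
addNewLine = "yes">
<!--- append a Disallow entry for each non-indexable page --->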
<cfloop query="getPages">
<!--- we assume that page names are not entered by user (= safe names) --->
<cffile
action = "append"
file = "C:\websites\robots.txt"
output = "Disallow: /#getPages.name#.cfm"
addNewLine = "yes">
</cfloop>
</cflock>
Sample code is not tested, be aware of typos/bugs.
Using robots.txt for this purpose is a bad idea. robots.txt is not a security measure, and you're handing "evildoers" a list of pages that you don't want indexed.
You're much better off using the robots meta tag, which will not provide anyone with a list of pages that you don't want indexed, and gives you greater control of the individual actions a robot can perform.
Using the meta tags, you would simply output the tags when generating the page as usual.
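As a rough illustration of the meta tag approach, something like the following could sit inside the page's head section. It assumes the same request.variables.indexable flag used in the question and is a sketch, not tested against your CMS.

```cfml
<!--- inside <head>; request.variables.indexable is the flag from the question --->
<cfif request.variables.indexable eq 0>
    <!--- ask crawlers not to index this page --->
    <meta name="robots" content="noindex">
</cfif>
```

With this approach there is nothing to parse or clean up: a page that becomes indexable again simply stops emitting the tag.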
<!--- dummy page to block --->
<cfset request.pageToBlock = "/blocked-page.cfm" />
<!--- read in current robots.txt --->
<cffile action="read" file="#expandPath('robots.txt')#" variable="data" />
<!--- build a struct of all blocked pages --->
<cfset pages = {} />
<cfloop list="#data#" delimiters="#chr(13)##chr(10)#" index="i">
<cfset pages[listLast(i,' ')] = '' />
</cfloop>
<cfif request.variables.indexable eq 0>
<!--- If the page is not yet blocked add it --->
<cfif not structKeyExists(pages, request.pageToBlock)>
<cffile action="append" file="C:\websites\robots.txt"
output="Disallow: #request.pageToBlock#" addNewLine="yes" />
<!--- if this runs inside a loop, add the page to the struct for the next iteration --->
<cfset pages[request.pageToBlock] = '' />
</cfif>
</cfif>
This should do it. Read in the file, loop over it, and build a struct of the blocked pages. Only add a new page if it's not already blocked.