ColdFusion regex to detect first URL without image extension - regex

I've been digging around Stack Overflow this evening, but nothing has worked so far. What I'm trying to achieve is to extract a URL from a string (mainly HTML) that doesn't have an image extension at the end. So if the given HTML string has a URL ending with .jpg and then, a few lines down, another URL without any image extension, the regex should get that second one and stop. Alternatively, it can return all the 'good' URLs and just leave out the images.
So far I've got:
<cfset c= reMatch('(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+##]*[\w\-\@?^=%&/~\+##])?',htmlString)>
I know the image-detecting part should be somewhere towards the end, but so far my attempts have only managed to freeze the server.
Sample string to match:
'<tr> <td style="vertical-align: top; padding-right: 12px;"><img src="http://static01.nyt.com/images/2016/01/31/us/why-iowaALT/why-iowaALT-thumbStandard.jpg" /></td> <td> <h6 style="font-size: 10px; font-weight: normal; text-transform: uppercase; color: ##000000; margin: 0; margin-bottom: 2px"></h6> <h1 style="font-weight: normal; font-family: georgia,"times new roman",times,serif; font-size: 23px; margin: 0; margin-bottom: 4px"><a href="http://p.nytimes.com/email/re?location=InCMR7g4BCJTYuyKqXu41s2MxgEX9Okc&user_id=7b8478da99b24f28abb9c2f1be86c807&email_type=eta&task_id=1454290534529254&regi_id=0" style="color: ##004276; text-decoration: none !important;">'
Note: it needs to work with ColdFusion's regex engine, which is at times a bit limited.
Thanks!

Assume your code already extracts the valid links correctly and stores them in an array. All you have to do then is loop over that array and return the first URL that doesn't end with an image extension.
// anchored so that "jpg" in the middle of a URL doesn't count; extend the list as needed
list_of_extensions = '\.(bmp|jpe?g|png|gif)(\?|$)';
for (my_url in urls) {
if (not reFindNoCase(list_of_extensions, my_url)) {
return my_url; // first URL that is not an image; assumes this loop sits inside a function
}
}

You can switch to Java for what you are trying to achieve like this:
<!--- Java Regular Expression Pattern Object --->
<cfset local.objPattern = createObject(
"java",
"java.util.regex.Pattern"
).compile(
javaCast( "string", '(?:https?:\/\/)[^"]*(?<!\.jpg)(?=\")' )
)
/>
<!--- Get Pattern Matcher for your html content --->
<cfset local.objMatcher = local.objPattern.matcher(
JavaCast( "string", local.htmlContent )
) />
<!--- Find Matching URLs --->
<cfloop condition="local.objMatcher.find()">
<cfdump var="#local.objMatcher.group()# <br>">
</cfloop>
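To leave out more than just .jpg, the lookbehind can list several extensions. The following is only a sketch of that idea, still relying on Java's regex engine through createObject; the sample htmlContent string and the extension list are made up for illustration.

```cfml
<cfscript>
// Hypothetical sample input; substitute your own HTML string.
htmlContent = '<img src="http://example.com/pic.JPG" /> <a href="http://example.com/page?id=1">link</a>';

// (?i) makes the match case-insensitive; the negative lookbehind rejects
// URLs whose last characters before the closing quote are an image extension.
urlPattern = createObject("java", "java.util.regex.Pattern").compile(
    javaCast("string", '(?i)https?://[^"]*(?<!\.(?:jpg|jpeg|png|gif|bmp))(?=")')
);
matcher = urlPattern.matcher(javaCast("string", htmlContent));

// find() stops at the first match, so this yields the first non-image URL (or "").
firstUrl = matcher.find() ? matcher.group() : "";
writeOutput(firstUrl);
</cfscript>
```

Like the pattern above, this only finds URLs that are immediately followed by a double quote, so links in single-quoted or unquoted attributes would be missed.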

Related

Removing Specific HTML Tags with CFML

I need a regex to remove all instances of <FONT>, including any properties it might have inside it, like <FONT size=2 face=Verdana>, and its closing tag </FONT>. In the string I get back, the font tag can contain any property with different values for those properties, and the HTML structure is not consistent. This is one example of what I get as a string:
<UL>
<LI><FONT size=2 face=Verdana>random text<STRONG>random text</STRONG>random text<SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; FONT-SIZE: 11pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><SPAN style="mso-spacerun: yes"> </SPAN>random text</SPAN> </FONT></LI>
<LI><FONT size=2 face=Verdana><FONT size=2 face=Verdana><STRONG>random text</STRONG></FONT></LI> <LI>random text</FONT></LI>
<LI><FONT size=2 face=Verdana>random text</FONT></LI>
<LI><FONT size=2 face=Verdana>random text</FONT></LI>
and this is what i would like it to look like after using the regex:
<UL>
<LI>random text<STRONG>random text</STRONG>random text<SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; FONT-SIZE: 11pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><SPAN style="mso-spacerun: yes"> </SPAN>random text</SPAN></LI>
<LI><STRONG>random text</STRONG></LI>
<LI>random text</LI>
<LI>random text</LI>
<LI>random text</LI>
I have tried different variations and I've been able to remove the <FONT part but not its properties, the ending >, or the closing tag </FONT>
This is an example of what I'm using:
loc.result = rereplace(arguments.htmlString, "\\<FONT[^*\\>", "", "ALL");
I apologize for my bad regex code, so any hints or suggestions would be greatly appreciated!
As written by others before, don't use REGEX for that. Use an HTML parser like JSoup.
Download the JSoup jar file and save it somewhere on your classpath, and then use the following function (cfscript syntax, tested with Lucee, but should work with any CFML engine):
<cfscript>
/** removes the given tag from the input html while keeping its contents */
function removeTag(input, tagname){
var Jsoup = createObject("java", "org.jsoup.Jsoup");
var doc = Jsoup.parse(arguments.input);
var body = doc.body().child(0);
var tags = body.select(arguments.tagname);
for (var tag in tags){
for (var attr in tag.attributes().asList())
tag.removeAttr(attr.getKey());
}
var result = body.toString();
result = replace(result, "<#arguments.tagname#>", "", "all");
result = replace(result, "</#arguments.tagname#>", "", "all");
return result;
}
</cfscript>
Then just call the function with the HTML code that you want to clean, e.g.:
cleanHtml = removeTag(inputHtml, "font");
To test your example, I added the following:
<cfsavecontent variable="input">
<UL>
<LI><FONT size=2 face=Verdana>random text 1<STRONG>random text 2</STRONG>random text 3<SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; FONT-SIZE: 11pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><SPAN style="mso-spacerun: yes"> </SPAN>random text 4</SPAN> </FONT></LI>
<LI><FONT size=2 face=Verdana><FONT size=2 face=Verdana><STRONG>random text 5</STRONG></FONT></LI> <LI>random text 5</FONT></LI>
<LI><FONT size=2 face=Verdana>random text 6</FONT></LI>
<LI><FONT size=2 face=Verdana>random text 7</FONT></LI>
</cfsavecontent>
<cfdump var="#{ output: removeTag(input, "font"), input: input }#">
And the output is as follows: [output dump not preserved]
I recommend also reading my blog post Harnessing the Power of Java in CFML
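If Jsoup is available anyway, its Elements.unwrap() method may be a shorter route than stripping attributes and then replacing the bare tags as strings. This is a rough sketch rather than a drop-in replacement: unwrapTag is a hypothetical name, and it assumes the Jsoup jar is on the classpath as described above.

```cfml
<cfscript>
/** Sketch: removes every occurrence of a tag but keeps its children, via Jsoup's unwrap(). */
function unwrapTag(required string input, required string tagname) {
    // Assumes the Jsoup jar is on the classpath, as in the answer above.
    var Jsoup = createObject("java", "org.jsoup.Jsoup");
    // parseBodyFragment() wraps the snippet in a body element without needing a full document.
    var doc = Jsoup.parseBodyFragment(arguments.input);
    // Elements.unwrap() drops the matched elements and moves their children up into the parent.
    doc.select(arguments.tagname).unwrap();
    return doc.body().html();
}
</cfscript>
```

It would be called the same way, e.g. cleanHtml = unwrapTag(inputHtml, "font"). Keep in mind that Jsoup normalizes the markup while parsing (tag names come back lowercased and unbalanced tags get closed), so the result will not be a character-for-character copy of the input.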
Regex can be made like so: <\/?FONT.*?> (test and example).
But overall do not use regex for HTML/XML parsing. Here's why: https://stackoverflow.com/a/1732454/2610466
UPDATE: Fixed answer as per better understanding of the question
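For completeness, here is how a pattern of that kind might be applied from CFML. It is a minimal sketch that assumes the input is a short, made-up fragment, and it uses [^>]* instead of .*? so the match cannot run past the end of the tag.

```cfml
<cfscript>
// Hypothetical sample fragment in the style of the question.
input = '<LI><FONT size=2 face=Verdana>random text <STRONG>bold</STRONG></FONT></LI>';

// </?font  matches both the opening and the closing tag,
// \b       keeps tags such as <fontx> from matching,
// [^>]*    swallows any attributes up to the closing >.
cleaned = reReplaceNoCase(input, "</?font\b[^>]*>", "", "all");

// encodeForHTML() (ColdFusion 10+ / Lucee) is only here so the tags show up in a browser.
writeOutput(encodeForHTML(cleaned));
</cfscript>
```

An attribute value that itself contains a > character would still trip this up, which is one of the reasons a parser is usually the safer choice.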

How to remove all attributes from a div tag except class, id and style using ColdFusion?

I have a tag with attributes like this.
Example:
<div class="issue-document" id="assr_0335-5985_1991_num_76_1_1619" itemprop="hasPart" itemscope="" itemtype="https://facebook.com">
<cfset mystring = 'This is some text. It is true that <div class="issue-document" id="assr_0335-5985_1991_num_76_1_1619" itemprop="hasPart" itemscope="" itemtype="https://facebook.com">Harry Potter</div> is a good, but is better'>
<cfset MyReplace = ReReplaceNoCase(mystring,"<div [^>]*>","","ALL")>
<cfoutput><pre>Original string: #mystring#
Without link: #myreplace#</pre></cfoutput>
I need to remove only attributes like itemscope and itemprop; the rest of the attributes (id, class, style) should stay. I want to do this in ColdFusion with a regular expression. Can anyone please help me find a solution?
<cfset myReplace = reReplaceNoCase(mystring, '\b(itemscope|itemprop)="[^"]*"', "", "all") />
https://trycf.com/gist/8d84d7d355f7c54e5533eeed22d097ba/acf2016?theme=monokai
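If the goal is to keep only class, id and style and drop whatever else appears, a whitelist with a negative lookahead may be easier to maintain than listing every unwanted attribute. The sketch below is one possible way to do that, assuming all attributes are written as name="value" with double quotes; the shortened sample string and the variable names are illustrative only.

```cfml
<cfscript>
// Shortened, hypothetical version of the sample string from the question.
mystring = 'It is true that <div class="issue-document" id="assr_1" itemprop="hasPart" itemscope="" itemtype="https://facebook.com">Harry Potter</div> is good';

// Collect every opening div tag so that other tags are left untouched.
divTags = reMatchNoCase("<div\b[^>]*>", mystring);

for (divTag in divTags) {
    // Remove each name="value" pair unless the name is class, id or style.
    cleanTag = reReplaceNoCase(divTag, '\s+(?!(class|id|style)=)[\w-]+="[^"]*"', "", "all");
    mystring = replace(mystring, divTag, cleanTag, "all");
}

writeOutput(encodeForHTML(mystring));
</cfscript>
```

Attribute values that contain something looking like name="value" after a space, or attributes using single quotes, are not handled here; for markup like that an HTML parser is the more robust option.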

replacing the fonts with rereplace or replace

I need some help here with regard to replace, or maybe reReplace.
I am trying to replace font-family: anything with font-family: swiss7.
But if the value is font-family: BebasNeue; I want that font left untouched with no font-size added, while the other fonts should get a font-size added.
I am following this tutorial, but somehow it's not matching what I need to achieve:
https://www.sitekickr.com/snippets/coldfusion/strip-css-styles
Update: well I thought the question was simpler. If you want to see my original answer just see the edit history for this post. This answer uses a similar principle though where you search for an item that you don't want removed and then push it back in with the replace.
<cfset teststr1 = "font-family: BebasNeue;" />
<cfset teststr2 = "font-family: Verdana" />
<cfset teststr3 = "font-family: BebasNeue; font-family: Times New Roman; color: red" />
<cfset search1 = "(font-family:\s*)((BebasNeue)|[\w ]+)(;)?" />
<cfset replace1 = "font-family: \3swiss7\4" />
<cfset search2 = "BebasNeueswiss7" />
<cfset replace2 = "BebasNeue" />
<cfoutput>
<ol>
<li>#replaceNoCase(reReplaceNoCase(teststr1, search1, replace1, "all"), search2, replace2, "all")#</li>
<li>#replaceNoCase(reReplaceNoCase(teststr2, search1, replace1, "all"), search2, replace2, "all")#</li>
<li>#replaceNoCase(reReplaceNoCase(teststr3, search1, replace1, "all"), search2, replace2, "all")#</li>
</ol>
</cfoutput>
Result:
1. font-family: BebasNeue;
2. font-family: swiss7
3. font-family: BebasNeue; font-family: swiss7; color: red
So essentially you replace all font families with the chosen font type, in this case swiss7, but by including the group selector in the replace you leave the BebasNeue font in the string. An additional step then cleans up the combined font name left behind.
I would suggest a more maintainable approach:
if (!findNoCase('BebasNeue', styleString)) {
styleString = reReplaceNoCase(styleString, 'font-family:[^";]*', "font-family:swiss7; font-size:12", "ALL");
}
This should get you what you are after: leave the string alone if it contains BebasNeue; otherwise change the font-family to swiss7 and add a font-size. It is also not so complicated that a regex newbie couldn't understand what is going on.
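When a single style string can contain both BebasNeue and other fonts, a negative lookahead can skip only the BebasNeue declarations while still rewriting the rest. This is a sketch of that variant, not the method from either answer; the 12px unit is an assumption, since the question does not say which font-size to add.

```cfml
<cfscript>
// Sample taken from the test strings used earlier in this thread.
styleString = "font-family: BebasNeue; font-family: Times New Roman; color: red";

// (?!\s*BebasNeue) skips declarations whose value starts with BebasNeue;
// [^";]* treats everything up to the next ; or double quote as the old font value.
styleString = reReplaceNoCase(
    styleString,
    'font-family:(?!\s*BebasNeue)[^";]*',
    "font-family: swiss7; font-size: 12px", // 12px is an assumed value
    "all"
);

writeOutput(styleString);
</cfscript>
```

The lookahead sits directly after the colon on purpose: placing it after \s* would let the engine backtrack to zero spaces and match the BebasNeue declaration anyway.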

cffile creating blank lines when storing a table

I've got a cfsavecontent tag that saves a table. Later I use cffile to write the saved content to a file. When I look at that file, I see that there are many blank lines inserted after <td> tags in the table, and a few blank lines inserted after </tr> tags. (Although it doesn't do that where the code says <tr><td> </td></tr> all on one line.)
Presently I have a file which contains two of those tables. The tables are generated in a loop, and the output file is created with cffile append. This file has 915 lines in it of which maybe 30 are non-blank. All my subsequent code works correctly, but this is just test data. In the real world I could have 1000 or more tables, and I am concerned about the file size.
The code:
<cfset head1 = 'from = "moxware" '>
<cfset head2 = 'to = "#hrep.PersonEmail#" '>
<cfset head3 = 'replyto = "#replyto#" '>
<cfset head4 = 'subject = "#subject#" '>
<cfset head5 = 'type = "html" '>
<cfsavecontent variable = "abc">
<cfoutput>
#head1#
#head2#
#head3#
#head4#
#head5# >
#xyz#
</cfoutput>
</cfsavecontent>
<cffile action = "append"
file = "/var/www/reports/moxrep/#reportout#.cfm"
output = "<cfmail"
mode = "777" >
<cffile action = "append"
file = "/var/www/reports/moxrep/#reportout#.cfm"
output = "#abc#"
mode = "777">
<cffile action = "append"
file = "/var/www/reports/moxrep/#reportout#.cfm"
output = "</cfmail>"
mode = "777" >
Re the xyz, I am reading it in from a file:
<cffile action = "read"
file = "/var/www/reports/moxrep/#reportname#.cfm"
variable = "xyz">
and the file looks like this:
<link rel="stylesheet" href="sample.css">
<link rel="stylesheet" type = "text/css" href ="betty.css"/>
<p style="margin-left:40px"><span style="font-size:14px"><span style="font-family:georgia,serif">Dear Customer,</span></span></p>
We were so pleased that you have signed up for one of our programs. Apparently you live in the city of {{1. Additionally we observe that your were referred to us by {{2. Below please find a listing of what you signed up for.</span></span></p>
<p style="margin-left:40px"><span style="font-size:14px"><span style="font-family:georgia,serif">{{r</span></span></p>
<p style="margin-left:40px"><span style="font-size:14px"><span style="font-family:georgia,serif">Sincerely Yours,</span></span></p>
<p style="margin-left:40px"><span style="font-size:14px"><span style="font-family:georgia,serif">John Jones<br />
President<br />
XYZ Corporation</span></span></p>
The file was created by a code generator, not me, so it's a bit cumbersome. Later in the code I replace everything starting with {{ ; in particular {{r gets replaced with a table, and that is where the additional space is coming from.
The append itself is not inserting any extra lines.
Does anyone know what is causing these extra blank lines in the file; and how to get rid of them?
Betty, typically you need to do this carefully if you want to avoid whitespace. In particular the use of cfoutput with a query will generate lines. So this code:
<table>
<cfoutput query="myquery">
<tr><td>#col1#</td><td>#col2#</td></tr>
</cfoutput>
</table>
will produce extra lines... but if you do this:
<cfsetting enablecfoutputonly="yes">
<cfoutput><table></cfoutput>
<cfloop query="myquery"><cfoutput><tr><td>#col1#</td><td>#col2#</td></tr></cfoutput></cfloop>
<cfoutput></table></cfoutput>
This way you carefully control exactly what is allowed to be appended to the buffer. enablecfoutputonly does exactly what it says: it does not allow anything to "go to the buffer" unless it is enclosed in cfoutput tags.
Hope this helps. As Cameron says, you should paste code for questions like this. That's where the answer will typically reside.
(you might also need to experiment with the "addnewline" attribute of cffile - depending on whether your problem is a line at the END of your file).
To answer your question regarding adding cfsetting... in your case you are writing CF code to a file that is then executed later (which by the way is not a great idea usually :). So in your first Append statement:
<cffile action = "append"
file = "/var/www/reports/moxrep/#reportout#.cfm"
output = "<cfmail"
mode = "777" >
Change the "output" to be:
<cffile action = "append"
file = "/var/www/reports/moxrep/#reportout#.cfm"
output = "<cfsetting enablecfoutputonly=""yes""/> <cfmail"
mode = "777" >
But Betty - you will still need to remove the line breaks from your cfsavecontent (if that's where your whitespace is coming from) because they actually ARE inside of a cfoutput. Also, your code that creates the table you are inserting might be at fault - and it is not listed here.
Finally, since this is cfmail take a look at this post regarding line breaks that may or may not have some bearing - but at least gives you one more piece of information :)
http://www.coldfusionmuse.com/index.cfm/2006/4/12/cfmail.linebreak
You may consider using cfprocessingdirective around your cfsavecontent. There is a setting in CF administrator that universally either compresses or retains unnecessary whitespace, "Enable Whitespace Management" - http://help.adobe.com/en_US/ColdFusion/9.0/Admin/WSc3ff6d0ea77859461172e0811cbf3638e6-7ffc.html . Using the suppressWhiteSpace attribute of cfprocessingdirective, you can override this setting for a particular page or part of a page. So in your case:
<cfprocessingdirective suppressWhiteSpace="true">
<cfsavecontent variable="myvar">....
...
...
</cfsavecontent>
</cfprocessingdirective>
may help. Likewise, to ensure the retention of whitespace when building text emails, you'd use suppressWhiteSpace="false".
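If the blank lines cannot be prevented at the source, another option is to collapse them in the saved string just before it is written out. This is a hedged sketch of that fallback and is not taken from the answers above; the abc value is a made-up stand-in for the cfsavecontent variable from the question.

```cfml
<cfscript>
// Hypothetical stand-in for the cfsavecontent variable from the question.
abc = "<table>" & chr(10) & chr(10) & "   " & chr(10)
    & "<tr><td>1</td></tr>" & chr(10) & chr(10) & "</table>";

// Any run of whitespace that contains at least one line break becomes a single
// line feed. Note that this also strips indentation and trailing spaces.
abc = reReplace(abc, "\s*[\r\n]\s*", chr(10), "all");

writeOutput(encodeForHTML(abc));
</cfscript>
```

The cleaned variable can then be passed to cffile as before. Because this removes leading indentation as well, it is only suitable for markup where whitespace between tags is not significant, so it should not be applied to content inside pre blocks or to plain-text emails.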
Cheers,

Coldfusion regex substring YouTube ID

I'm trying to pull a youtube ID from a link like this;
<img src="http://img.youtube.com/vi/OZ3jyvM0jZc/2.jpg" alt="" />
I've only been successful in taking out the ID, but not actually getting it!
<cfset ytID = '<img src="http://img.youtube.com/vi/0Z3jyvM0jZc/2.jpg" alt="" />' />
#reReplace(referer,"(vi=?(\=|\/))([-a-zA-Z0-9_]+)|(vi=\/)([-a-zA-Z0-9_]+)", "\1", "one")#
Output: <img src="http://img.youtube.com/vi//2.jpg" alt="" />
RegEx is not my friend today. What am I missing?
Thanks!
Try with regex:
vi\/([^\/]+) // 0Z3jyvM0jZc
You don't need to escape the forward slashes in CFML regex patterns. So take what The Mask has and use whichever method you prefer (both of these only work if the string is indeed a match):
<cfset ytID = '<img src="http://img.youtube.com/vi/0Z3jyvM0jZc/2.jpg" alt="" />'>
<cfoutput>
<pre>
<cfset sLenPos=REFind("/vi/([^/]+)", ytID, 1, "True")>
#mid(ytID, sLenPos.pos[2], sLenPos.len[2])# == 0Z3jyvM0jZc
#reReplace(ytID,".*/vi/([^/]+)/.*", "\1")# == 0Z3jyvM0jZc
</pre>
</cfoutput>
The key to keeping this simple is using the [^/]+ to match one or more characters that aren't /
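Since both variants only work when the string actually matches, it may be worth wrapping the reFind version in a small function that checks for a miss first. The following is a sketch under that assumption; getYouTubeId is a hypothetical name, not an existing function.

```cfml
<cfscript>
// Hypothetical helper: returns the ID, or an empty string when the pattern is absent.
function getYouTubeId(required string html) {
    // The fourth argument (true) asks reFind for the pos/len arrays of the subexpressions.
    var m = reFind("/vi/([^/]+)", arguments.html, 1, true);
    // pos[1] is 0 when nothing matched; pos[2]/len[2] describe the first subexpression.
    if (m.pos[1] eq 0) {
        return "";
    }
    return mid(arguments.html, m.pos[2], m.len[2]);
}

writeOutput(getYouTubeId('<img src="http://img.youtube.com/vi/0Z3jyvM0jZc/2.jpg" alt="" />'));
</cfscript>
```

Returning an empty string on a miss is a design choice for this sketch; throwing an error or returning a default would work just as well, depending on how the caller wants to handle links that are not YouTube thumbnails.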
I think regex might be the wrong tool for this job. How about using lists?
<cfset ytStr = '<img src="http://img.youtube.com/vi/0Z3jyvM0jZc/2.jpg" alt="" />'>
<cfset ytID = ListGetAt(ytStr, 4, '/')>
<cfset ytID = '<img src="http://img.youtube.com/vi/0Z3jyvM0jZc/2.jpg" alt="" />'>
<cfset sLenPos=REFind("(vi=?(\=|\/))([-a-zA-Z0-9_]+)|(vi=\/)([-a-zA-Z0-9_]+)", ytID, 1, "True")>
<cfoutput>
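<!--- pos[1]/len[1] cover the whole match (vi/ plus the ID); the ID itself is the third subexpression, i.e. index 4 --->
#mid(ytID, sLenPos.pos[4], sLenPos.len[4])#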
</cfoutput>