Find Email Address in a string - ColdFusion 9 - coldfusion

I was wondering if ColdFusion has a built-in function to find email addresses in a string.
I am trying to read through query output, e.g. "John Smith jsmith@example.com", and extract only the email.
In the past I did this by counting the spaces in the string and, after the second space, removing all the characters to the left, which left the email address on its own.
Though this can work in my situation, it is not safe: it almost guarantees bugs and mishandled data when input arrives in a different format such as "John jsmith@example.com", in which case I would wipe away all the information.

Regex is probably the easiest way. There is an "ultimate" regex for email, but it is quite large. The one below should cover most valid emails, though it doesn't cover Unicode, for example. Note that the maximum TLD length is 63 (see this SO question & answer).
<cfset string="some garbae@.ca garbage@ca.a real@email.com another@garbage whatever another@email.com oh my!">
<cfset results = reMatchNoCase("[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,63}", string)>
<cfdump var="#results#">
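For comparison, the same find-all-matches idea can be sketched in Python. This is only an illustration under assumptions: the helper name `find_emails` is made up for the example, addresses are assumed to use the usual `@` separator, and the pattern is the simplified one from this answer rather than a full RFC-compliant matcher.

```python
import re

# Simplified pattern from the answer: local part, "@", domain, then a 2-63 letter TLD.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,63}", re.IGNORECASE)

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)

sample = ("some garbae@.ca garbage@ca.a real@email.com "
          "another@garbage whatever another@email.com oh my!")
print(find_emails(sample))
```

Only the two well-formed addresses should survive; the fragments with an empty domain label, a one-letter TLD, or no dot at all are skipped.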

You can use this UDF by Ray Camden from cflib.org. It works great for me.
<cfscript>
/**
* Searches a string for email addresses.
* Based on the idea by Gerardo Rojas and the isEmail UDF by Jeff Guillaume.
* New TLDs
* v3 fix by Jorge Asch
*
* @param str String to search. (Required)
* @return Returns a list.
* @author Raymond Camden
* @version 3, June 13, 2011
*/
function getEmails(str) {
var email = "(['_a-z0-9-]+(\.['_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.((aero|coop|info|museum|name|jobs|travel)|([a-z]{2,3})))";
var res = "";
var marker = 1;
var matches = "";
matches = reFindNoCase(email,str,marker,marker);
while(matches.len[1] gt 0) {
res = listAppend(res,mid(str,matches.pos[1],matches.len[1]));
marker = matches.pos[1] + matches.len[1];
matches = reFindNoCase(email,str,marker,marker);
}
return res;
}
</cfscript>

Related

Equivalent regex in T-SQL with start/end input markers

I'm trying to get the following RegEx to work:
^[a-zA-Z][a-zA-Z ''.-]+[a-zA-Z]$
It should allow any letters, spaces, apostrophes, full stops and hyphens, as long as the first and last characters are letters.
John - ok
John Smith - ok
John-Smith - ok
John.Smith - ok
.John Smith - not ok
John Smith. - not ok
When I use this in T-SQL it doesn't seem to work, and I'm not sure whether the input start/end markers are incompatible with T-SQL. How do I translate this to valid T-SQL?
CREATE Function [dbo].[IsValidName](@value VarChar(MAX))
RETURNS INT
AS
Begin
DECLARE @temp INT
SET @temp = (
SELECT
CASE WHEN @value LIKE '%^[a-zA-Z][a-zA-Z ''.-]+[a-zA-Z]$%' THEN 1
ELSE 0
END
)
RETURN @temp
End
T-SQL doesn't support regular expressions "out of the box". Depending on your environment there are different solutions, but probably none of them will be "pure T-SQL". In a Microsoft environment you can use CLR procedures to achieve this.
See SQL Server Regular expressions in T-SQL for some options.
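Before porting the pattern, it can help to confirm that it behaves as intended in an engine that does support regular expressions. The sketch below uses Python purely as a test bench; `is_valid_name` is a hypothetical helper, and the doubled apostrophe from the SQL string literal is written as a single one.

```python
import re

# First and last characters must be letters; the middle allows
# letters, space, apostrophe, full stop and hyphen.
NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z '.-]+[a-zA-Z]")

def is_valid_name(value):
    """True when the whole value matches the name pattern."""
    return NAME_RE.fullmatch(value) is not None

for name in ["John", "John Smith", "John-Smith", "John.Smith",
             ".John Smith", "John Smith."]:
    print(name, "->", "ok" if is_valid_name(name) else "not ok")
```

Because of the `+` in the middle, the pattern needs at least three characters, so a two-letter name such as "Al" is rejected; that may or may not be what is wanted.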
I made something like this to scrub data by removing non-alpha characters. I've slightly modified it to fit your needs:
CREATE Function [dbo].Func (@Temp VarChar(1000))
Returns VarChar(1000)
AS
BEGIN
DECLARE @Len INT = LEN(@Temp)
DECLARE @RETURN INT
Declare @KeepValues as varchar(50)
Set @KeepValues = '%[^a-z^ ]%'
IF PatIndex(@KeepValues, @Temp) = 1
BEGIN
Set @RETURN = 0
END
IF PATINDEX(@KeepValues, @Temp) = @Len
BEGIN
SET @RETURN = 0
END
IF PATINDEX(@KeepValues, @Temp) = 0
SET @RETURN = 1
IF @RETURN IS NULL
BEGIN
SET @Return = 1
END
RETURN @RETURN
END
This assumes you do not need any further data scrubbing for restricted characters. If you do, let me know and we can add a little more in there, but based on your dataset this will return the correct answers.

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "@tim bla @sam @joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids, not the actual handles, so this regex seems to return a mention: (.*)@user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
-- filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)@user_(\\S{8})([:| ])(.*)';
-- try to pull out the mentions.
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '((.*)@user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
    REGEX_EXTRACT(tweet, '((.*)@user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
    REGEX_EXTRACT(tweet, '((.*)@user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)',2) as mention1,
    REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)@user_(\\S{8})([:| ])(.*)',5) as mention2,
    REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)@user_(\\S{8})([:| ])(.*)@user_(\\S{8})([:| ])(.*)',8) as mention3;
But, instead of returning each user mentioned, this returned the same mention 3 times. I had expected that by cutting n pasting the expression again I'd get the second match, and pasting it a 3rd time would get the 3rd match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use the REGEX_EXTRACT or REGEX_EXTRACT_ALL UDF, keep in mind that it is just plain regex handled by Java.
It is easier to test the regex with a local Java test. Here is a regex I found to be acceptable:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Later mentions sit in nested optional groups; the lazy .*? is placed inside each group so it can skip ahead to the next '@'.
Pattern p = Pattern.compile("@(\\S+)(?:.*?@(\\S+)(?:.*?@(\\S+))?)?");
String input = "So if a tweet goes something like @tim bla @sam @joe @bill something bla bla";
Matcher m = p.matcher(input);
if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println(i + " -> " + m.group(i));
    }
}
With this regex, if there is at least one mention, it returns three fields, the second and/or third being null if a second/third mention is not found.
Therefore, you may use the following Pig code (the leading .*? and trailing .* let the pattern cover the whole tweet, in case the UDF requires a full match):
d = foreach c generate id, REGEX_EXTRACT_ALL(
    tweet, '.*?@(\\S+)(?:.*?@(\\S+)(?:.*?@(\\S+))?)?.*');
You do not even need to filter the data first.
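The nested optional-group idea can also be sanity-checked in a few lines of Python, whose backtracking regex engine behaves like Java's for this pattern. This is a sketch under assumptions: handles are assumed to start with `@`, and `first_three_mentions` is a hypothetical name, not a Pig or Java function.

```python
import re

# Each later mention lives in a nested optional group. The lazy ".*?" is
# placed inside the group so it can skip ahead to the next "@".
MENTIONS_RE = re.compile(r"@(\S+)(?:.*?@(\S+)(?:.*?@(\S+))?)?")

def first_three_mentions(tweet):
    """Return a 3-tuple of handles (missing ones are None), or None if no mention."""
    m = MENTIONS_RE.search(tweet)
    return m.groups() if m else None

print(first_three_mentions("@tim bla @sam @joe @bill something bla bla"))
print(first_three_mentions("only @tim here"))
```

In plain Python, `re.findall(r"@(\S+)", tweet)[:3]` would be simpler; the single-pattern form is shown because it yields a fixed number of groups, which is what a group-extracting function needs.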

Coldfusion-9 Trim Values

I am trying to write code that takes a URL with 3 parts, (www).(domainname).(com), and trims the first part out completely.
So far I have this code, which checks whether the left side is not 'www' or 'dev' and, if so,
goes in and sets siteDomainName = removechars(CGI.SERVER_NAME,1,2);
if (numHostParts eq 3 and listfindnocase('www,dev',left(CGI.SERVER_NAME,3)) eq 0) {
siteDomainName = removechars(CGI.SERVER_NAME,1,2);
}
The problem with the code above is that it deletes only 2 characters, whereas I need it to delete ALL characters until numHostParts eq 2, or at least up to the first "."
Another example would be:
akjnakdn.example.com, where I need the code to delete the first part of the URL with the dot included (akjnakdn.)
This code will help some of the queries I have on the site to stop crashing, because they depend on the #URL#, and when the #URL# is fake I get a "cform query returned zero records" error that causes my contact forms to stop working.
You can just use listRest. It returns all the elements in a list, except the first one. Documentation is here http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-6d87.html
Example:
<cfscript>
name = cgi.server_name;
if (listlen(name,".") gte 3) {
name = listRest(name,".");
}
</cfscript>
You could do something like this:
<cfscript>
local.nameArr = ListToArray(CGI.SERVER_NAME, '.');
if (ArrayLen(local.nameArr) gt 2) {
ArrayDeleteAt(local.nameArr, 1);
}
siteDomainName = ArrayToList(local.nameArr, '.');
</cfscript>
I've split the server name into array elements with a period as the delimiter. If the number of elements is greater than two, remove the first element. Then convert it back to a list with the period as a delimiter.
UPDATE
As suggested by Robb, this could be more concise and perform better by skipping the array conversion process:
<cfscript>
siteDomainName = CGI.SERVER_NAME;
if (ListLen(siteDomainName, '.') gt 2) {
siteDomainName = ListDeleteAt(siteDomainName, 1, '.');
}
</cfscript>
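The list-based approach is not specific to ColdFusion. As a language-neutral illustration, here is a minimal Python sketch of the same rule (drop the leftmost label only when there are three or more parts); `strip_first_label` is a hypothetical name chosen for the example.

```python
def strip_first_label(host):
    """Drop the leftmost label when the host has three or more dot-separated parts."""
    parts = host.split(".")
    if len(parts) > 2:
        return ".".join(parts[1:])
    return host

print(strip_first_label("akjnakdn.example.com"))
print(strip_first_label("example.com"))
```

Like the list versions above, this removes exactly one label per call, so a four-part host would still have three parts afterwards.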
I would use a regular expression, since you only want to "trim" certain subdomains (www, dev).
<cfset the_domain = REReplaceNoCase(cgi.SERVER_NAME, "^(www|dev)\.", "") />
Just use a |-delimited list of the subdomains you want to trim within the parentheses. The leading ^ anchors the match to the start of the host name, so a label that merely ends in "dev" is left alone.

Validating a Salesforce Id

Is there a way to validate a Salesforce ID, maybe using RegEx? They are normally 15 chars or 18 chars but do they follow a pattern that we can use to check that it's a valid id.
There are two levels of validating a Salesforce Id:
check the format using the regular expression [a-zA-Z0-9]{15}|[a-zA-Z0-9]{18}
for 18-character Ids you can check the 3-character checksum:
Code examples provided in comments:
C#
Go
Javascript
Ruby
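The linked examples are not reproduced here, so the following is a minimal Python sketch of the commonly described checksum scheme: the 15-character Id is split into three chunks of five, each chunk becomes a 5-bit number whose bits mark the uppercase letters, and each number indexes into the alphabet A-Z followed by 0-5. The function names are hypothetical.

```python
import re

# Alphabet for the three suffix characters: A-Z followed by 0-5 (32 symbols).
SUFFIX_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"
ID_RE = re.compile(r"[a-zA-Z0-9]{15}(?:[A-Z0-5]{3})?")

def checksum_suffix(id15):
    """Compute the 3-character suffix for a 15-character Id."""
    suffix = ""
    for start in (0, 5, 10):
        chunk = id15[start:start + 5]
        # Bit i is set when the i-th character of the chunk is an uppercase letter.
        index = sum(1 << i for i, ch in enumerate(chunk) if "A" <= ch <= "Z")
        suffix += SUFFIX_ALPHABET[index]
    return suffix

def is_well_formed_id(value):
    """Format check for 15-char Ids; format plus checksum check for 18-char Ids."""
    if ID_RE.fullmatch(value) is None:
        return False
    if len(value) == 15:
        return True
    return checksum_suffix(value[:15]) == value[15:]

print(checksum_suffix("001D000000IRt53"))
print(is_well_formed_id("001D000000IRt53IAD"))
```

This check is strict about case: an 18-character Id whose first 15 characters have been lowercased will fail, even though the suffix exists precisely so that the original capitalization can be recovered. It also says nothing about whether a record with that Id exists.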
Something like this should work:
[a-zA-Z0-9]{15,18}
It was suggested that the following may be more correct because it rejects Ids with lengths of 16 and 17 characters; it also tries to match an 18-character Id first, with 15 characters as a fallback:
[a-zA-Z0-9]{18}|[a-zA-Z0-9]{15}
Just use instanceOf to check if the string is an instance of Id.
String s = '1234';
if (s instanceOf Id) System.debug('valid id');
else System.debug('invalid id');
The easiest way I've come across, is to create a new ID variable and assign a String to it.
ID MyTestID = null;
try {
    MyTestID = MyTestString;
} catch (Exception ex) {
    // assignment failed, so MyTestID stays null
}
If MyTestID is null after trying to assign it, the ID was invalid.
This regex has given me the optimal results so far.
\b[a-z0-9]\w{4}0\w{12}|[a-z0-9]\w{4}0\w{9}\b
You can also check for 15 chars, and then add an extra 3 chars optional, with an expression similar to:
^[a-z0-9]{15}(?:[a-z0-9]{3})?$
in case-insensitive (i) mode, or without it:
^[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?$
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Javascript: /^(?=.*?\d)(?=.*?[a-z])[a-z\d]{18}$/i
These were the Salesforce Id validation requirements for me.
18 characters only
At least one digit
At least one letter
Case insensitive
Test cases
Should fail
1
a
1234
abgcde
1234aDcde
12345678901234567*
123456789012345678
abcDefghijabcdefgh
Should pass
1234567890abcDeFgh
1234abcd1234abcd12
abcd1234abcd1234ab
1abcDefhijabcdefgf
abcDefghijabcdefg1
12345678901234567a
a12345678901234567
To understand the regex, please refer to this thread.
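The test cases above can be run mechanically. The sketch below ports the pattern to Python as an illustration only; `passes` is a hypothetical helper, `fullmatch` stands in for the `^...$` anchors, and `re.ASCII` keeps `\d` limited to 0-9 as it is in JavaScript.

```python
import re

# Lookaheads require at least one digit and at least one letter;
# the body requires exactly 18 alphanumeric characters.
STRICT_ID_RE = re.compile(r"(?=.*?\d)(?=.*?[a-z])[a-z\d]{18}",
                          re.IGNORECASE | re.ASCII)

def passes(value):
    """True when value satisfies all four requirements listed above."""
    return STRICT_ID_RE.fullmatch(value) is not None

should_fail = ["1", "a", "1234", "abgcde", "1234aDcde", "12345678901234567*",
               "123456789012345678", "abcDefghijabcdefgh"]
should_pass = ["1234567890abcDeFgh", "1234abcd1234abcd12", "abcd1234abcd1234ab",
               "1abcDefhijabcdefgf", "abcDefghijabcdefg1", "12345678901234567a",
               "a12345678901234567"]

print(all(not passes(v) for v in should_fail))
print(all(passes(v) for v in should_pass))
```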
The regex provided by Daniel Sokolowski works perfectly to verify if the id is in the correct format.
If you want to verify if an id corresponds to an actual record in the database, you'll need to first find the object type from the first three characters (commonly known as prefix) and then query the object type:
boolean isValidAndExists(String key) {
Map<String, Schema.SObjectType> objTypes = Schema.getGlobalDescribe();
for (Schema.SObjectType objType : objTypes.values()) {
Schema.DescribeSObjectResult objDesc = objType.getDescribe();
if (objDesc.getKeyPrefix() == key.substring(0,3)) {
String objName = objDesc.getName();
String query = 'SELECT Id FROM ' + objName + ' WHERE Id = \'' + key + '\'';
SObject[] objs = Database.query(query);
return !objs.isEmpty();
}
}
return false;
}
Be aware that Schema.getGlobalDescribe can be an expensive operation and degrade the performance of your application if you use that often.
If you need to check that often, I recommend creating a Custom Setting or Custom Metadata to store the relation between prefixes and object types.
Assuming you want to validate Ids in Apex, there are a few approaches discussed in the other answers. Here is an alternative, with notes on the various approaches.
The try-catch method (credit to @matt_k) certainly works, but some folks worry about overhead, especially if testing many Ids.
I used instanceof Id for a long time (credit to @melani_s), until I discovered that it sometimes gives the wrong answer (e.g., '481D0B74-41CF-47E9').
Multiple answers suggest regexen. As the accepted answer correctly points out (credit to @zacheusz), 18-character Ids are only valid if their checksums are correct, which means the regex solutions can be wrong. That answer also helpfully provides code in several languages to test Id checksums, but not in Apex.
I was going to implement the checksum code in Apex, but then I realized that Salesforce had already done the work. So instead I convert 18-character Ids to 15-character Ids (via .to15(), which uses the checksum to fix capitalization, as opposed to truncating the string) and then back to 18 characters to let Salesforce do the checksum calculation, and then I compare the original checksum with the new one. This is my method:
static Pattern ID_REGEX = Pattern.compile('[a-zA-Z0-9]{15}(?:[A-Z0-5]{3})?');
/**
* @description Determines if a string is a valid Salesforce Id. Confirms checksum of 18 digit Ids.
* Works for cases where `x instanceof id` returns the wrong answer, like '481D0B74-41CF-47E9'.
* Does NOT check for the existence of a record with the given Id.
* @param s a string to validate
*
* @return true if the string `s` is a valid Salesforce Id.
*/
public static Boolean isValidId(String s) {
Matcher m = ID_REGEX.matcher(s);
if (m.matches() == false) return false; // if it doesn't match the regex it cannot be valid
if (s.length() == 15) return true; // if 15 char string matches the regex, assume it must be valid
String check = (Id)((Id)s).to15(); // Convert to 15 char Id, then to Id and back to string, giving correct 18-char Id
return s.right(3) == check.right(3); // if 18 char string matches the regex, valid if checksum correct
}
Additionally checking getSObjectType() != null would be perfect if we are dealing with Salesforce records
public static boolean isRecordId(string recordId){
try{
return string.isNotBlank(recordId) && ((Id)recordId.trim()).getSObjectType() != null;
}catch(Exception ex){
return false;
}
}

Use string function to select all text after specific character

How would I use a string function to select everything in a variable after the last "/"?
http://domain.com/g34/abctest.html
So in this case I would like to select "abctest.html"
Running ColdFusion 8.
Any suggestions?
Um, a bit strange to give a very similar answer within a few days, but ListLast looks like the most compact and straightforward approach:
<cfset filename = ListLast("http://domain.com/g34/abctest.html","/") />
And yes, IMO you should start with this page before asking such questions on SO.
Joe, I'd use the listToArray function, passing "/" as the delimiter, then get the length of the array and read the value in the last slot. See the sample below:
<cfset str = "http://domain.com/g34/abctest.html"/>
<cfset arr = ListToArray(str,"/","false")/>
<cfset val = arr[ArrayLen(arr)]/>
<cfoutput>#str# : #val#</cfoutput>
produces
http://domain.com/g34/abctest.html : abctest.html
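For readers outside ColdFusion, the same "last list element" idea can be sketched in Python. This is an illustration only: `last_segment` is a hypothetical helper, and it skips empty segments on the assumption that this is the behaviour wanted for a trailing delimiter.

```python
def last_segment(path, delimiter="/"):
    """Return the text after the last delimiter, skipping empty segments."""
    segments = [s for s in path.split(delimiter) if s]
    return segments[-1] if segments else ""

print(last_segment("http://domain.com/g34/abctest.html"))
```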