Best way to compare phone numbers using Regex - regex

I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));

Comparing phone numbers is a bit beyond the capability of regex by design. As you've discovered there are many ways to represent a phone number with and without things like area codes and formatting. Regex is for pattern matching so as you've found using the regex to strip out all formatting and then comparing strings is doable but putting logic into regex which is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes it's more work but it keeps the code more understandable going forward and helps cleanup crap data.

Related

How to get the most accurate term in regex?

I have an angular app using the mongodb sdk for js.
I would like to suggest some words on a input field for the user from my words collection, so I did:
getSuggestions(term: string) {
var regex = new stitch.BSON.BSONRegExp('^' +term , 'i');
return from(this.words.find({ 'Noun': { $regex: regex } }).execute());
}
The problem is that if the user type for example Bie, the query returns a lot of documents but the most accurated are the last ones, for example Bier, first it returns the bigger words, like Bieberbach'sche Vermutung. How can I deal to return the closests documents first?
A regular-expression is probably not enough to do what you are intending to do here. They can only do what they're meant to do – match a string. They might be used to give you a candidate entry to present to the user, but can't judge or weigh them. You're going to have to devise that logic yourself.

Rails: efficient way of finding/replacing words in a large collection of records using a hash filter

Just wondering what the best way to approach this would be,
I have a large american to british english filter with about 2000 items
filter = {"authorized"=>"authorised"........}
and a large collection of about 4000 records
posts = Post.all
what would be the most efficient way to do a search and replace across a couple of properties (say post.title and post.description) for each record while maintaining original casing(ie. replace all characters after the first one only)?
edit: updated hash count
I would think about using gsub with a Regexp.union and the hash syntax:
string.gsub(Regexp.union(filter.keys), filter)
To iterate over all posts use find_each to improve memory usage:
FILTER = { "authorized" => "authorised", ... }
FILTER_REGEXP = Regexp.new(Regexp.union(FILTER.keys), Regexp::IGNORECASE)
def translate(string)
string.gsub(FILTER_REGEXP, FILTER)
end
Post.find_each do |post|
post.update(
title: translate(post.title),
description: translate(post.description)
)
end
To support the original casing, I would add both versions to the hash (upcase and lowercase), that makes the whole regexp bigger, but makes the code easier to read and you avoid the extra logic to handle different cases. To generate a hash with both versions from you current hash, just use:
filter = Hash[*filter.map { |k, v| [[k,v], [k.capitalize,v.capitalize]] }.flatten]

Validate phone number with Symfony

I want to validate the phone nummer in a form. I would like to check so number and the "(" and ")" char are valid only. So user can fill in +31(0)600000000. The +31 is already preset in the form. The number only is possible with the code below, only how to add the two chars?
Or is there a standaard better way to validate phone number?
#Assert\Length(min = 8, max = 20, minMessage = "min_lenght", maxMessage = "max_lenght")
#Assert\Regex(pattern="/^[0-9]*$/", message="number_only")
If you need a good and robust validator for numbers, with advanced options to valudate, I will advice to use google lib https://github.com/googlei18n/libphonenumber, there is existed symfony2 bundle https://github.com/misd-service-development/phone-number-bundle and you can see there is a assert annotation:
use Misd\PhoneNumberBundle\Validator\Constraints\PhoneNumber as AssertPhoneNumber;
/**
* #AssertPhoneNumber
*/
private $phoneNumber;
The regex you need is:
/^\(0\)[0-9]*$
or for the entire number
/^\+31\(0\)[0-9]*$
You can test and play around with your regex here (it also includes auto-generated explanations):
https://www.regex101.com/r/gD0hE5/1

How to find a substring anywhere in a string

This should be easy, but I'm finding it difficult.
I just want to find whether a substring exists anywhere in a string. In my case, whether the name of a website exists in the title of a product.
My code is like this:
#FindNoCase("Amazon.com", "Google Chromecast available at Amazon")#
The above returns a 0 which is correct because the entire substring "Amazon.com" doesn't exist in the main string. But some of it does, namely the "Amazon" part.
How could I achieve what I'm trying to do which is just see if ANY of the substring (at least more than 2 character in length) exists in the main string?
So I need something like FindOneOf() but actually "find at least three of". It should then look at the word "Amazon" in the product title and check if at least 3 characters in the sequence of "Amazon.com" exists. When it sees that "Ama" exists, then it just needs to return a true value. Can it be done using the existing built-in functions somehow?
Update: Very simple solution. I used Left("amazon", 3).
There's a lot of danger in false positives, like if someone was buying the Alabama state flag.
Because of store names that contain spaces, this is a little tricky (Wal Mart is often written with a space).
If your string always contains at [store], you can extract the store name by finding the last at in the sentence and creating a string by chopping off everything else.
Because it looks for occurrences of at only as a whole word, there's no danger with store names such as Beats Audio, or Sam's Meat Shop. I can't think of any any stores with the word at in the name. While that would technically trip it up, there's much lower risk, and you can do a pre-replace on such store names.
<cfset mystring = "Google Chromecast available at Amazon">
<cfset SellerName = REReplaceNoCase(mystring,".*\b(?:at)\b(?!.*\b(?:at)\b)\s*","")>
<cfoutput>Seller: #Sellername#</cfoutput>
You can then do your comparisons much more safely.
Per your comment, If you know all possible patterns, you can still obtain the data if you want to (false positives can either be embarrassing or catastrophic, depending on the action). If you know the stores you're working with, you can use a regex to pull out the string like this
<cfset mystring = "Google Chromecast available at Amazon.co.uk">
<cfset SellerName = REReplaceNoCase(mystring,".*\b((Google|Amazon|Wal[\W]*Mart|E[\W]*bay)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #Sellername#</cfoutput>
The only part you need to update is the pipe-delimited list You might add K-Mart as K[\W]*Mart the [\W]* permits any special character or space so it covers kMart, K-Mart, k*Mart, but not Kwik-E-Mart.
Update #2, per more comments
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset SellerNameRE = REReplace(rsProduct.sellername,"[\W]+","[\W]*","ALL")>
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#sellernameRE#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
This replaces any symbols with the wildcard character so that symbols aren't required so that if something says Wal*Mart, it will still match WalMart.
You could also load a seperate column with "Regex Names" so that you're not doing this each time.
So your table would look something like
SellerID SellerName RegexName
1 Wal-Mart Wal[\W]*Mart
2 Toys-R-US Toys[\W]*R[\W]*US
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#rsProduct.RegexName#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
Solved it by doing this
#FindNoCase(left("Amazon.com", 3), "Google Chromecast available at Amazon")#
Yes there is potential it won't do what I need in cases where the seller name less than 3 characters long. But I think its rare enough to be ok.

Validating a Salesforce Id

Is there a way to validate a Salesforce ID, maybe using RegEx? They are normally 15 chars or 18 chars but do they follow a pattern that we can use to check that it's a valid id.
There are two levels of validating salesforce id:
check format using regular expression [a-zA-Z0-9]{15}|[a-zA-Z0-9]{18}
for 18-characted ids you can check the the 3-character checksum:
Code examples provided in comments:
C#
Go
Javascript
Ruby
Something like this should work:
[a-zA-Z0-9]{15,18}
It was suggested that this may be more correct because it prevents Ids with lengths of 16 and 17 characters to be rejected, also we try to match against 18 char length first with 15 length as a fallback:
[a-zA-Z0-9]{18}|[a-zA-Z0-9]{15}
Just use instanceOf to check if the string is an instance of Id.
String s = '1234';
if (s instanceOf Id) System.debug('valid id');
else System.debug('invalid id');
The easiest way I've come across, is to create a new ID variable and assign a String to it.
ID MyTestID = null;
try {
MyTestID = MyTestString; }
catch(Exception ex) { }
If MyTestID is null after trying to assign it, the ID was invalid.
This regex has given me the optimal results so far.
\b[a-z0-9]\w{4}0\w{12}|[a-z0-9]\w{4}0\w{9}\b
You can also check for 15 chars, and then add an extra 3 chars optional, with an expression similar to:
^[a-z0-9]{15}(?:[a-z0-9]{3})?$
on i mode, or not:
^[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?$
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
Javascript: /^(?=.*?\d)(?=.*?[a-z])[a-z\d]{18}$/i
These were the Salesforce Id validation requirements for me.
18 characters only
At least one digit
At least one alphabet
Case insensitive
Test cases
Should fail
1
a
1234
abgcde
1234aDcde
12345678901234567*
123456789012345678
abcDefghijabcdefgh
Should pass
1234567890abcDeFgh
1234abcd1234abcd12
abcd1234abcd1234ab
1abcDefhijabcdefgf
abcDefghijabcdefg1
12345678901234567a
a12345678901234567
For understanding the regex, please refer this thread
The regex provided by Daniel Sokolowski works perfectly to verify if the id is in the correct format.
If you want to verify if an id corresponds to an actual record in the database, you'll need to first find the object type from the first three characters (commonly known as prefix) and then query the object type:
boolean isValidAndExists(String key) {
Map<String, Schema.SObjectType> objTypes = Schema.getGlobalDescribe();
for (Schema.SObjectType objType : objTypes.values()) {
Schema.DescribeSObjectResult objDesc = objType.getDescribe();
if (objDesc.getKeyPrefix() == key.substring(0,3)) {
String objName = objDesc.getName();
String query = 'SELECT Id FROM ' + objName + ' WHERE Id = \'' + key + '\'';
SObject[] objs = Database.query(query);
return !objs.isEmpty();
}
}
return false;
}
Be aware that Schema.getGlobalDescribe can be an expensive operation and degrade the performance of your application if you use that often.
If you need to check that often, I recommend creating a Custom Setting or Custom Metadata to store the relation between prefixes and object types.
Assuming you want to validate Ids in Apex, there are a few approaches discussed in the other answers. Here is an alternative, with notes on the various approaches.
The try-catch method (credit to #matt_k) certainly works, but some folks worry about overhead, especially if testing many Ids.
I used instanceof Id for a long time (credit to #melani_s), until I discovered that it sometimes gives the wrong answer (e.g., '481D0B74-41CF-47E9').
Multiple answers suggest regexen. As the accepted answer correctly points out (credit to #zacheusz), 18 character Ids are only valid if their checksums are correct, which means the regex solutions can be wrong. That answer also helpfully provides code in several languages to test Id checksums. But not in Apex.
I was going to implement the checksum code in Apex, but then I realized the Salesforce had already done the work, so instead I just convert 18 digit Ids to 15 digit Ids (via .to15() which uses the checksum to fix capitalization, as opposed to truncating the string) and then back to 18 digits to let SF do the checksum calc, then I compare the original checksum and the new one. This is my method:
static Pattern ID_REGEX = Pattern.compile('[a-zA-Z0-9]{15}(?:[A-Z0-5]{3})?');
/**
* #description Determines if a string is a valid SalesforceId. Confirms checksum of 18 digit Ids.
* Works for cases where `x instanceof id` returns the wrong answer, like '481D0B74-41CF-47E9'.
* Does NOT check for the existence of a record with the given Id.
* #param s a string to validate
*
* #return true if the string `s` is a valid Salesforce Id.
*/
public static Boolean isValidId(String s) {
Matcher m = ID_REGEX.matcher(s);
if (m.matches() == false) return false; // if it doesn't match the regex it cannot be valid
if (s.length() == 15) return true; // if 15 char string matches the regex, assume it must be valid
String check = (Id)((Id)s).to15(); // Convert to 15 char Id, then to Id and back to string, giving correct 18-char Id
return s.right(3) == check.right(3); // if 18 char string matches the regex, valid if checksum correct
}
Additionally checking getSObjectType() != null would be perfect if we are dealing with Salesforce records
public static boolean isRecordId(string recordId){
try{
return string.isNotBlank(recordId) && ((Id)recordId.trim()).getSObjectType() != null;
}catch(Exception ex){
return false;
}
}