This should be easy, but I'm finding it difficult.
I just want to find whether a substring exists anywhere in a string. In my case, whether the name of a website exists in the title of a product.
My code is like this:
#FindNoCase("Amazon.com", "Google Chromecast available at Amazon")#
The above returns a 0 which is correct because the entire substring "Amazon.com" doesn't exist in the main string. But some of it does, namely the "Amazon" part.
How could I achieve what I'm trying to do which is just see if ANY of the substring (at least more than 2 character in length) exists in the main string?
So I need something like FindOneOf() but actually "find at least three of". It should then look at the word "Amazon" in the product title and check if at least 3 characters in the sequence of "Amazon.com" exists. When it sees that "Ama" exists, then it just needs to return a true value. Can it be done using the existing built-in functions somehow?
Update: Very simple solution. I used Left("amazon", 3).
There's a lot of danger in false positives, like if someone was buying the Alabama state flag.
Because of store names that contain spaces, this is a little tricky (Wal Mart is often written with a space).
If your string always contains at [store], you can extract the store name by finding the last at in the sentence and creating a string by chopping off everything else.
Because it looks for occurrences of at only as a whole word, there's no danger with store names such as Beats Audio, or Sam's Meat Shop. I can't think of any any stores with the word at in the name. While that would technically trip it up, there's much lower risk, and you can do a pre-replace on such store names.
<cfset mystring = "Google Chromecast available at Amazon">
<cfset SellerName = REReplaceNoCase(mystring,".*\b(?:at)\b(?!.*\b(?:at)\b)\s*","")>
<cfoutput>Seller: #Sellername#</cfoutput>
You can then do your comparisons much more safely.
Per your comment, If you know all possible patterns, you can still obtain the data if you want to (false positives can either be embarrassing or catastrophic, depending on the action). If you know the stores you're working with, you can use a regex to pull out the string like this
<cfset mystring = "Google Chromecast available at Amazon.co.uk">
<cfset SellerName = REReplaceNoCase(mystring,".*\b((Google|Amazon|Wal[\W]*Mart|E[\W]*bay)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #Sellername#</cfoutput>
The only part you need to update is the pipe-delimited list You might add K-Mart as K[\W]*Mart the [\W]* permits any special character or space so it covers kMart, K-Mart, k*Mart, but not Kwik-E-Mart.
Update #2, per more comments
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset SellerNameRE = REReplace(rsProduct.sellername,"[\W]+","[\W]*","ALL")>
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#sellernameRE#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
This replaces any symbols with the wildcard character so that symbols aren't required so that if something says Wal*Mart, it will still match WalMart.
You could also load a seperate column with "Regex Names" so that you're not doing this each time.
So your table would look something like
SellerID SellerName RegexName
1 Wal-Mart Wal[\W]*Mart
2 Toys-R-US Toys[\W]*R[\W]*US
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#rsProduct.RegexName#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
Solved it by doing this
#FindNoCase(left("Amazon.com", 3), "Google Chromecast available at Amazon")#
Yes there is potential it won't do what I need in cases where the seller name less than 3 characters long. But I think its rare enough to be ok.
Related
I have an angular app using the mongodb sdk for js.
I would like to suggest some words on a input field for the user from my words collection, so I did:
getSuggestions(term: string) {
var regex = new stitch.BSON.BSONRegExp('^' +term , 'i');
return from(this.words.find({ 'Noun': { $regex: regex } }).execute());
}
The problem is that if the user type for example Bie, the query returns a lot of documents but the most accurated are the last ones, for example Bier, first it returns the bigger words, like Bieberbach'sche Vermutung. How can I deal to return the closests documents first?
A regular-expression is probably not enough to do what you are intending to do here. They can only do what they're meant to do – match a string. They might be used to give you a candidate entry to present to the user, but can't judge or weigh them. You're going to have to devise that logic yourself.
I have been given the daunting task of sifting through a database of over 30,000 registrants and correcting the letter casing of names and addresses where needed. I am trying to write a program that will search for names and addresses in our database that are either all lowercase or all uppercase and output these mishaps in a webpage for me to review and correct more efficiently. I was informed that I could utilize Regular Expressions to find fields that adhere to my criteria, only I am new to programming and I am unfamiliar with the syntax of RegEx.
If anyone could provide me with some pointers as how to use RegEx to query for these inconsistencies, it would be greatly appreciated.
Thank you.
strComp should work
SELECT col
FROM table
WHERE strComp(col, lcase(col), 0) = 0 --all lower case
OR strComp(col, ucase(col), 0) = 0 --all upper case
The first two arguments are the columns to compare. The 3rd argument says to do a binary comparison. If the two strings are equal 0 is returned.
How will you accurately correct the data? If you see a last name of "MACGUYVER" should it change to Macguyver or MacGuyver? If you see a last name of "DE LA HOYA" will it become de la Hoya, De La Hoya, or something else? This task seems a bit dangerous.
If your plan is basically to just do initial capitalization then I suggest that you run an update first before doing any manual review.
You could run something like this to change your name fields to initial capital letters:
update yourTable
set lname = StrConv(lname,3)
where StrComp(lname, StrConv(lname,3), 0) <> 0
and StrComp(mid(lname,2,len(lname)), lcase(mid(lname,2,len(lname))), 0) = 0;
Where "lname" above is your last name column, for example.
The above would have to be run for each name field.
Note that this will not update names that legitimately have multiple capital letters, like MacGuyver or O'Connor, which need manual review.
Also note that it will update last names that start with van, von, de la, and others that may intentionally be lowercase.
You could then query for just the names that need manual review, which I assume will be a much smaller subset:
select *
from yourTable
where StrComp(lname, StrConv(lname,3), 0) <> 0;
Addresses are tougher. To find just those that are either all lowercase or all uppercase you can do this:
select *
from yourTable
where strComp(address1, lcase(address1), 0) = 0;
select *
from yourTable
where strComp(address1, ucase(address1), 0) = 0;
Obviously this won't catch address lines like "123 New YORK AveNUE".
Consider asking for permission to just set all address values to uppercase.
You'll save yourself a lot of trouble.
I can't seem to figure out the VLOOKUP magic needed to make this work as I want it to.
See, what I've got is a column B containing filenames, like this:
[COLUMN B]
./11001 Boogie Oogie Oogie (A Taste Of Honey).wav
./11001 Rescue Me (A Taste Of Honey).wav
./11001 Sukiyaki (A Taste Of Honey).wav
./11002 Memory (Acker Bilk).wav
./11002 Stuck On You (Acker Bilk).wav
./11002 Could I Have This Dance (Acker Bilk).wav
./11002 Do That To Me One More Time (Acker Bilk).wav
./11002 This Masquerade (Acker Bilk).wav
./11002 Just Once (Acker Bilk).wav
And so on for 6220 entries.
I have another column, Column E, which contains a TRACK NAME which is present within the filename. Looks like this:
American Patrol
Artistry In Rhythm
Begin The Beguine
Big John's Special
Cherokee
For example. So what I want to do is, in another column I want to search through Column B using the strings from Column E and then returning the matched string from Column B.
So if we imagine I put this formula in the C Column starting in the same row as the American Patrol track name, it would search through the range in Column B and return this:
./11249 American Patrol (BBC Big Band).wav
./11249 Artistry In Rhythm (BBC Big Band).wav
./11249 Begin The Beguine (BBC Big Band).wav
And so on.
I tried doing this formula
=VLOOKUP(E2;B2:B6235;2;TRUE)
So, this returns a file name, but it seems to have matched all the filenames and are just returning whichever result I specify in the col_index variable, so now it returns the second match (basically, just the second row in Column B) and if I put a 3 instead, it would just return the third hit, again having matched all the file names, it seems..
I'm not that familiar with Excel functions, so I'm not sure where to look for the solution beyond this.
You should not be using TRUE as a VLOOKUP function's range_lookup parameter on unsorted data. You can, however, wrap your track title in wildcards to achieve the search you are looking for.
The formula in C1 is,
=INDEX(B:B, MATCH("*"&E1&"*",B:B, 0))
... or,
=VLOOKUP("*"&E1&"*",B:B, 1, FALSE)
They accomplish the same thing.
I have a project where I have a huge list of email addresses (i.e. johnhahifas#example.com) and I have a list of top 5000 most common First names (i.e. john, jim etc)
I am trying to for each email address, remove any First name if it appears in the email, so that for example:
johnhahifas#example.com becomes hahifas#example.com
benTTTben#something.com becomes TTT#something.com
so far, I came up with a regex solution and a strfind solution, the regexprep is much faster; I even optimized it a little to remove the #example.com so that the operation might be faster; but it still takes forever to run.
you can download the common first names from this address (CSV file)
http://www.quietaffiliate.com/Files/CSV_Database_of_First_Names.csv
The regex code I have:
fileID = fopen('bademails');
emails = textscan(fileID,'%s');
str = emails{1,1}; %//Loaded in the emails
fileID = fopen('CSV_Database_of_First_Names.csv');
names = textscan(fileID,'%s');
Names = lower(names{1,1}); %//Loaded in the Names
K = regexprep(str,Names,''); %Regex on Names
Any faster solutions would be much appreciated, thank you in advance!
I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));
Comparing phone numbers is a bit beyond the capability of regex by design. As you've discovered there are many ways to represent a phone number with and without things like area codes and formatting. Regex is for pattern matching so as you've found using the regex to strip out all formatting and then comparing strings is doable but putting logic into regex which is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes it's more work but it keeps the code more understandable going forward and helps cleanup crap data.