VBA regexp pattern combining match and not matching a pattern

I'm trying to extract valid phone numbers from strings that have no visible delimiters between the different data types. In effect, the data around a potential phone number is random and irrelevant.
What should be matched. I am trying to match either one of the following:
[random garbage][optional '1'][optional '(']###[optional ')'][random or no white space]###-####[random garbage] or
[random garbage]###[optional '-']###,[optional '-']####[random garbage]
Only the 1st phone number is to be used, so I used Global = False in the code below. I could make it even more robust, but I've examined the data and this should be enough.
Working pattern. Here's a code snippet from a function (it returns the matching phone number) that contains the pattern that worked.
With regex
    .Global = False
    .Pattern = "1?(\(?\d{3}\)?(\s+)?|\d{3}-?)\d{3}-?\d{4}"
    'this works, but it also matches a number that is followed by an extension (explained below)
End With
What should not be matched. I realized that I also need to search for an extension next to the phone number (i.e. [phone number][white space]x#) and if that exists, to treat the phone number as invalid (.test should evaluate to false).
Failed attempts. They all failed (even valid phone numbers had .test evaluate to false):
.Pattern = "1?(\(?\d{3}\)?(\s+)?|\d{3}-?)\d{3}-?\d{4}^(\s?x\d)"
'detect not([optional single white space]x#), added "^(\s?x\d)"
'or
.Pattern = "1?(\(?\d{3}\)?(\s+)?|\d{3}-?)\d{3}-?\d{4}^((\s+?)[x]\d)"
'detect not([optional multiple white space]x#), added "^((\s+?)[x]\d)"
I'm not sure how to combine positive match tests and negative (not) match tests in the same pattern.
Work-arounds I've tried. When I couldn't get it to work, I tried VBA's Like operator before calling the function that uses RegExp, and that failed too: every test evaluated to false, even with test strings containing "...1x2", "5 x33" or "...7 x444", and with patterns like "*#x#*" and "*#{ x}#*".
Here is the code snippet to test the Like function:
If Not (OrigNum Like "*#x#" Or OrigNum Like "*#[ x}#" Or OrigNum Like "*#[ x]#*") Then
    Debug.Print "No #x# in string"
    OrigNum = ExtractPhoneNumber(OrigNum)
Else
    Debug.Print "#x# in string"
End If
Every string I tested printed "No #x# in string" (the Like tests evaluated to false), even when the string contained the examples above, which should have matched and printed "#x# in string".
Dazed and confused for so long it's...OK, enough of the Led Zepp reference :-)

Phone number:
[optional '1'][optional '(']###[optional ')'][random or no white space]###-####
###[optional '-']###[optional '-']####
*I removed a comma that I assumed was a typo, and I am also assuming the leading 1 is optional in both cases, judging from your patterns.
Don't match:
[phone number][white space]x#
What you're looking for is a negative lookahead.
(?!subexpression) tries the subexpression at the current position without consuming any characters; if the subexpression matches, the match attempt fails (i.e. "not followed by").
E.g. (?!\s*x\d) fails when the current position is followed by optional whitespace, an "x" and a digit.
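To see the lookahead on its own, here is a minimal sketch. It uses late binding so no library reference is needed, the pattern is deliberately cut down to a 7-digit number, and both the function name and the sample strings are made up for illustration (they are not from the question's data).

```vba
' Hypothetical helper: True when s contains ###-#### (dash optional)
' that is NOT followed by optional whitespace, "x" and a digit.
Function HasNumberWithoutExtension(ByVal s As String) As Boolean
    With CreateObject("VBScript.RegExp")
        .Global = False
        .Pattern = "\d{3}-?\d{4}(?!\s*x\d)"
        HasNumberWithoutExtension = .Test(s)
    End With
End Function

Sub DemoNegativeLookahead()
    Debug.Print HasNumberWithoutExtension("abc555-1234def")      ' True: nothing extension-like follows
    Debug.Print HasNumberWithoutExtension("abc555-1234 x12def")  ' False: " x1" follows the number
End Sub
```

The lookahead only inspects what comes next; it is not part of the returned match, so the extension text never ends up in the extracted number.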
Regex:
1?(?:\(\d{3}\)|\d{3}-?)\s*\d{3}-?\d{4}(?!\s*x\d)
Code:
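' Requires a reference to "Microsoft VBScript Regular Expressions 5.5" (Tools > References)
Public Function getPhoneNumber(strInput As String) As Variant
    Dim regex As New RegExp
    Dim matches As Object
    ' Phone number that is not followed by optional whitespace, "x" and a digit
    regex.Pattern = "1?(?:\(\d{3}\)\s*|\d{3}-?)\d{3}-?\d{4}(?!\s*x\d)"
    regex.Global = False
    Set matches = regex.Execute(strInput)
    If matches.Count = 0 Then
        getPhoneNumber = CVErr(xlErrNA)    ' no valid number: return #N/A
    Else
        getPhoneNumber = matches(0).Value
    End If
End Function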
Results (🎵As it was, then again it will be; though the course may change sometimes🎵):

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
    .Pattern = "\S+"
    .Global = True
    If .test(line) Then
        With .Execute(line)
            ReDim arrStr(.Count - 1)
            For i = 0 To .Count - 1
                arrStr(i) = .Item(i)
            Next
        End With
    End If
End With
IMPORTANT: You will need to set a reference to Microsoft VBScript Regular Expressions 5.5 in Tools > References. Otherwise, see Late Binding below.
Your original pattern, \^S*\$, had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The pattern I used, \S+, matches every run of characters that are NOT white space (\S), one or more times (+).
I also used the .Global so you will continue matching after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
    .Pattern = "\S+"
    .Global = True
    If .test(line) Then
        With .Execute(line)
            ReDim arrStr(.Count - 1)
            For i = 0 To .Count - 1
                arrStr(i) = .Item(i)
            Next
        End With
    End If
End With
Miscellaneous Notes
I would have advised just to use Split(), but you stated that there were cases where more than one consecutive space may have been an issue. If this wasn't the case, you wouldn't need regex at all, something like:
arrStr = Split(line)
would have split on every occurrence of a space.
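If runs of spaces are the only obstacle, one way to stay with Split() is to collapse them first. This sketch assumes the code runs inside Excel (Application.WorksheetFunction.Trim is Excel's TRIM, which strips leading and trailing spaces and reduces internal runs to one) and that the separators are plain spaces rather than tabs; the function name is made up.

```vba
' Hypothetical helper: split on spaces after collapsing repeated spaces.
Function SplitOnSpaces(ByVal line As String) As Variant
    ' Excel's TRIM leaves exactly one space between words.
    SplitOnSpaces = Split(Application.WorksheetFunction.Trim(line), " ")
End Function

Sub DemoSplitOnSpaces()
    Dim parts As Variant
    parts = SplitOnSpaces("  This   is  a    test ")
    Debug.Print UBound(parts) + 1   ' 4
    Debug.Print parts(3)            ' test
End Sub
```

For tabs or mixed white space, the \S+ regex approach above is the safer choice.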

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern to this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning followed by zero or more groups of four letters or numbers, each ending with a comma, followed by one group of four letters or numbers and an end-of-line.
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can check whether it works for all your cases using the following Sub, assuming your test strings are in cells A1 through A5 on the spreadsheet:
Sub findPattern()
    Dim regEx As New RegExp
    Dim mat As MatchCollection
    Dim i As Long
    Dim val As String
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    For i = 1 To 5
        val = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(val)
        If mat.Count = 0 Then
            MsgBox "No match found for " & val
        Else
            MsgBox "Match found for " & val
        End If
    Next
End Sub
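Since the question asks about parameterising the pattern, here is one possible sketch that builds it from a group length. The function name and the groupLen parameter are hypothetical; the character class is limited to letters and digits, as the question prefers, and late binding is used so no reference is needed.

```vba
' Hypothetical helper: True when s is one or more comma-separated groups
' of exactly groupLen letters/digits.
Function IsValidGroupList(ByVal s As String, Optional ByVal groupLen As Long = 4) As Boolean
    Dim grp As String
    grp = "[A-Za-z0-9]{" & groupLen & "}"      ' e.g. [A-Za-z0-9]{4}
    With CreateObject("VBScript.RegExp")
        .Pattern = "^" & grp & "(," & grp & ")*$"
        IsValidGroupList = .Test(s)
    End With
End Function

Sub DemoIsValidGroupList()
    Debug.Print IsValidGroupList("COM1,COM2,1234")   ' True
    Debug.Print IsValidGroupList("COM1,123")         ' False
    Debug.Print IsValidGroupList("COM1.1234,abcd")   ' False
End Sub
```

Other cases with a different group size can then reuse the same function, for example IsValidGroupList(s, 6).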

Regex to determine if a string is a name of a range or a cell's address

I'm struggling to come up with a regular expression pattern that can help me determine if a string is a cell's address or if it is a cell's name.
Here are some examples of cell addresses:
"E5"
"AA55:E5"
"DD5555:DDD55555, E5, F5:AA55"
"$F7:$G$7"
Here are some examples of cell names:
"bis_document_id"
"PCR1MM_YPCVolume"
"sheet_error7"
"blahE5"
"training_A1"
"myNameIsGeorgeJR"
Is there a regex pattern you guys can come up with that will match all of either group and none of the other?
I have been able to think of a couple of ways to determine what a string is not:
If it contains "$" or ":", I know it is not a cell's name and is most likely a cell's address.
If it has more than three consecutive numbers, it is most likely not a cell's address.
A cell's address is extremely unlikely to have more than 2 letters preceding a number, 99.9% of the cell addresses will be in columns A to ZZ.
Alas, these three small tests can hardly prove what this string is.
Thanks for the help!
OK, this one's fun:
^\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*$
Let's break it down, because it's rather nasty. The magic subpattern, really, is this:
\$?[A-Z]+\$?\d+
This little thing will match any single valid cell address, with optional absolute-reference $ signs. The next bit,
(?::\$?[A-Z]+\$?\d+)?
will match the same thing optionally (the ? quantifier at the end), but preceded by a colon (:). That lets us get ranges. The next bit,
(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*
matches the same thing as the first, but zero or more times (using the * quantifier), and preceded by a comma and optional spaces using the special \s token (which means "any whitespace").
Demo on Regex101
If we want to get really fancy (and, mind you, I have no idea whether Excel's regex engine supports this; I just wrote it for fun), we can use recursion to accomplish the same thing:
^((\$?[A-Z]+\$?\d+)(?::(?2))?)(?:,\s*(?1))*$
In this case, the magic \$?[A-Z]+\$?\d+ is inside the second capturing group, which is used recursively by the (?2) token. The entire subpattern for a single address or range of them is contained within the first capture group, and is then used to match additional addresses or ranges in a list.
Demo on Regex101
So here's a regex for VBA which will find any cell reference irrespective where it is.
NOTE: I've assumed you're running this against a Formula string, so the pattern is not anchored to the start or end of the string; a string can contain both cell references and cell names, and only the cell references are picked up:
(?:\W|^)(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)(?!\w)
(?:\W|^) is at the start and ensures that the reference is preceded by a non-word character or by the start of the string (remove |^ if there is always a = at the start, as in Formula objects). I regrettably found out that VBA's RegExp does not support lookbehind.
(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?) finds the actual cell reference and is broken down below:
\$?[A-Z]{1,3}\$?[0-9]{1,7} matches one to three capital letters followed by one to seven digits, each optionally preceded by $ (enough for Excel's current sheet limits);
(:\$?[A-Z]{1,3}\$?[0-9]{1,7})? is the same as above, except that it adds the option of a second cell reference after a colon; the ? makes it optional.
(?!\w) is a negative lookahead and says that the character after it must not be a word character (presumably in functions the only things you can have around a cell reference are parentheses and operators).
I wrote a VBA function in Excel and it returned the following with the above RegEx:
NB: It obviously doesn't check whether the reference itself is valid: $AZO113:A4 is returned even though it is impossible.
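The function used for that test is not shown, so the following is only a sketch of how the pattern above could be applied; the function name and the sample formula are made up. Because the leading (?:\W|^) consumes the character before the reference, the sketch reads the first submatch rather than the whole match value.

```vba
' Hypothetical helper: collect every cell reference found in formulaText.
Function ExtractCellRefs(ByVal formulaText As String) As Collection
    Dim refs As New Collection
    Dim m As Object
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "(?:\W|^)(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)(?!\w)"
        For Each m In .Execute(formulaText)
            ' SubMatches(0) is the reference without the preceding character
            refs.Add m.SubMatches(0)
        Next m
    End With
    Set ExtractCellRefs = refs
End Function

Sub DemoExtractCellRefs()
    Dim ref As Variant
    ' Prints $F$10:$F$11 and B2; the name training_A1 is skipped
    For Each ref In ExtractCellRefs("=SUM($F$10:$F$11)+training_A1*B2")
        Debug.Print ref
    Next ref
End Sub
```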
After trying several solutions I had to modify a regex so that it works for me. My version only supports non-named ranges.
((?![\=,\(\);])(\w+!)|('.+'!))?((\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)|(\$?[A-Z]{1,3}(:\$?[A-Z]{1,3}\$?)))
It will capture ranges in all of the following situations
=FUNCTION(F:F)
=FUNCTION($B22,G$5)
=SUM($F$10:$F$11)
=$J10-$K10
=SUMMARY!D4
I created the following function for RegEx. First tick the reference to "Microsoft VBScript Regular Expressions 5.5" under Tools > References.
Function RegExp(ByVal sText As String, ByVal sPattern As String, Optional bGlobal As Boolean = True, Optional bIgnoreCase As Boolean = False, Optional bArray As Boolean = False) As Variant
    ' Library-qualified type names, so they cannot be confused with this function's own name
    Dim objRegex As New VBScript_RegExp_55.RegExp
    Dim Matches As VBScript_RegExp_55.MatchCollection
    Dim Match As VBScript_RegExp_55.Match
    Dim i As Long
    objRegex.IgnoreCase = bIgnoreCase
    objRegex.Global = bGlobal
    objRegex.Pattern = sPattern
    If objRegex.test(sText) Then
        Set Matches = objRegex.Execute(sText)
        If Matches.count <> 0 Then
            If bArray Then ' if we want to return array instead of MatchCollection
                ReDim aMatches(Matches.count - 1) As Variant
                For Each Match In Matches
                    aMatches(i) = Match.value
                    i = i + 1
                Next
                RegExp = aMatches
            Else
                Set RegExp = Matches
            End If
        End If
    End If
End Function

How to test for specific characters with regex in VBA

I need to test for a string variable to ensure it matches a specific format:
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
...where x can be any alphanumerical character (a - z, 0 - 9).
I've tried the following, but it doesn't seem to work (test values constantly fail)
If val Like "^([A-Za-z0-9_]{8})([-]{1})([A-Za-z0-9_]{4})([-]{1})([A-Za-z0-9_]{4})([-]{1})([A-Za-z0-9_]{4})([-]{1})([A-Za-z0-9_]{12})" Then
    MsgBox "OK"
Else
    MsgBox "FAIL"
End If
fnCheckSubscriptionID "fdda752d-32de-474e-959e-4b5bf7574436"
Any pointers? I don't mind if this can be achieved in vba or with a formula.
You are already using the ^ beginning-of-string anchor, which is terrific. You also need the $ end-of-string anchor, otherwise in the last group of digits, the regex engine is able to match the first 12 digits of a longer group of digits (e.g. 15 digits).
I rewrote your regex in a more compact way:
^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}$
Note these few tweaks:
[-]{1} can just be expressed with -
I removed the underscores as you say you only want letters and digits. If you do want underscores, instead of [A-Z0-9]{8} (for instance), you can just write \w{8} as \w matches letters, digits and underscores.
Removed the lowercase letters. If you do want to allow lowercase letters, we'll turn on case-insensitive mode in the code (see line 3 of the sample code below).
No need for (capturing groups), so removed the parentheses
We have three groups of four letters and a dash, so wrote (?:[A-Z0-9]{4}-) with a {3}
Sample code
Dim myRegExp, FoundMatch
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = "^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}$"
FoundMatch = myRegExp.Test(SubjectString)
You can do this either with a regular expression, or with just native VBA. I am assuming from your code that the underscore character is also valid in the string.
To do this with native VBA, you need to build up the LIKE string since quantifiers are not included. Also using Option Compare Text makes the "like" action case insensitive.
Option Explicit
Option Compare Text
Function TestFormat(S As String) As Boolean
    'Sections
    Dim S1 As String, S2_4 As String, S5 As String
    Dim sLike As String
    With WorksheetFunction
        S1 = .Rept("[A-Z0-9_]", 8)
        S2_4 = .Rept("[A-Z0-9_]", 4)
        S5 = .Rept("[A-Z0-9_]", 12)
        sLike = S1 & .Rept("-" & S2_4, 3) & "-" & S5
    End With
    TestFormat = S Like sLike
End Function
With regular expressions, the pattern is simpler to build, but the execution time may be longer, and that may make a difference if you are processing very large amounts of data.
Function TestFormatRegex(S As String) As Boolean
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = "^\w{8}(?:-\w{4}){3}-\w{12}$"
        TestFormatRegex = .test(S)
    End With
End Function
Sub Test()
    MsgBox fnCheckSubscriptionID("XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX")
End Sub
Function fnCheckSubscriptionID(strCont)
    ' Tools - References - add "Microsoft VBScript Regular Expressions 5.5"
    With New RegExp
        .Pattern = "^\w{8}-\w{4}-\w{4}-\w{4}-\w{12}$"
        .Global = True
        .MultiLine = True
        fnCheckSubscriptionID = .Test(strCont)
    End With
End Function
In case of any problems with early binding you can use late binding With CreateObject("VBScript.RegExp") instead of With New RegExp.
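As a sketch of that late-bound variant: the function name below is made up, and the pattern is the letters-and-digits one from the first answer with case-insensitive matching switched on. The original Like test could not work as written, because Like does not understand regex syntax such as ^, {8} or groups.

```vba
' Hypothetical late-bound check for the 8-4-4-4-12 format (letters and digits only).
Function IsSubscriptionId(ByVal s As String) As Boolean
    With CreateObject("VBScript.RegExp")
        .IgnoreCase = True      ' accept lowercase letters as well
        .Pattern = "^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}$"
        IsSubscriptionId = .Test(s)
    End With
End Function

Sub DemoIsSubscriptionId()
    Debug.Print IsSubscriptionId("fdda752d-32de-474e-959e-4b5bf7574436")   ' True
    Debug.Print IsSubscriptionId("fdda752d-32de-474e-959e-4b5bf75744")     ' False: last group too short
End Sub
```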

VBScript Regular Expression

Need help building a VBScript regex that checks for a valid computer name and returns only invalid characters. String can contain numbers, upper and lower case letters, and (-) sign only. It can't start or end with (-) and cannot be only numbers.
Valid (Returns no match):
computer
Computer8
8Computer
Com8puter
Com-Puter
Computer-123
Invalid (Returns a match to invalid characters):
123
-computer
computer-
com*puter
PC&123
According to this: http://msdn.microsoft.com/en-us/library/ms974570.aspx VBScript has its own regex syntactic flavour. Note that NetBIOS computer names have a length limit of 15.
Then it should be "^\w[\w-]{0,14}$"
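That regex covers the allowed characters and the length, but not the "cannot be all numbers" rule; that can be handled by running a second regex, "^\d+$". Note that \w also admits underscores, and that this pattern does not reject a trailing hyphen.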
In code:
Dim regexValid, regexNumber
Set regexValid = New RegExp
Set regexNumber = New RegExp
regexValid.Global = True
regexValid.IgnoreCase = True
regexNumber.Global = True
regexNumber.IgnoreCase = True
regexValid.Pattern = "^\w[\w\-]{0,14}$"
regexNumber.Pattern = "^\d+$"
Dim inputString
inputString = InputBox("Computer name?")
If regexValid.Test(inputString) And Not regexNumber.Test(inputString) Then
    ' It's a valid computer name string
Else
    ' It's invalid
End If
Hmm, this is the first VBScript I've written this year.
I ended up switching the valid and invalid returns. I also ended up using two different RegEx strings. The first is:
^[0-9a-zA-Z]{1,}[-]*[0-9a-zA-Z]{1,}$
This doesn't allow the (-) at the beginning or end and requires all numbers, letters, or (-). It also requires a string of at least two characters.
The second is:
"[a-zA-Z]"
This makes sure there is at least one letter included.
Something like this /^([0-9]|[a-zA-Z]){1,}[a-zA-Z0-9-]+([0-9]|[a-zA-Z]){1,}$/
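Pulling the rules from the question together, one possible sketch is below. It is written in VBA syntax with late binding (for plain VBScript, drop the As clauses and ByVal type declarations); both function names are made up, and the 15-character cap follows the NetBIOS limit mentioned above. The first function returns the offending characters, which is what the question asks for, and the second applies the structural rules with two simple patterns instead of one complex one.

```vba
' Hypothetical helper: characters other than letters, digits and "-"
' (returns an empty string when there are none).
Function InvalidNameChars(ByVal s As String) As String
    Dim m As Object, found As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "[^A-Za-z0-9-]"
        For Each m In .Execute(s)
            found = found & m.Value
        Next m
    End With
    InvalidNameChars = found
End Function

' Hypothetical helper: 1 to 15 allowed characters, no leading or trailing "-",
' and not made up of digits only.
Function IsValidComputerName(ByVal s As String) As Boolean
    Dim shape As Object, allDigits As Object
    Set shape = CreateObject("VBScript.RegExp")
    Set allDigits = CreateObject("VBScript.RegExp")
    shape.Pattern = "^[A-Za-z0-9](?:[A-Za-z0-9-]{0,13}[A-Za-z0-9])?$"
    allDigits.Pattern = "^\d+$"
    IsValidComputerName = shape.Test(s) And Not allDigits.Test(s)
End Function
```

With the question's examples, IsValidComputerName accepts Computer-123 and rejects 123, -computer and computer-, while InvalidNameChars("PC&123") returns "&".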