RegEx array / list / collection of all matches in VBA

I'm trying to use RegEx to get all instances of varying strings that occur between a particular pair of strings. E.g. in the following string:
"The Start. Hello. Jamie. Bye. The Middle. Hello. Sarah. Bye. The End"
I want to get a collection / array consisting of "Jamie" and "Sarah" by checking in between "Hello. " and ". Bye. "
My RegEx object is working fine and I feel I'm nearly successful:
Sub Reggie()
Dim x As String: x = "The Start. Hello. Jamie. Bye. The Middle. Hello. Sarah. Bye. The End"
Dim regEx As RegExp
Set regEx = New RegExp
Dim rPat1 As String: rPat1 = "Hello. "
Dim rPat2 As String: rPat2 = " Bye."
Dim rPat3 As String: rPat3 = ".*"
With regEx
.Global = True
.ignorecase = True
.Pattern = "(^.*" & rPat1 & ")(" & rPat3 & ")(" & rPat2 & ".*)"
.MultiLine = True
' COMMAND HERE
End With
End Sub
For the last bit (COMMAND HERE), I tried .Replace(x, "$2"), which gives me a string containing only the last match, i.e. Sarah.
I've also tried .Execute(x), which gives me a MatchCollection object, but when inspecting it in the Immediate window I see that it only contains the last match.
Is what I need possible, and if so, how?

That is because .* matches as many characters as possible, so you should not match the whole string by adding .* at both ends of your regular expression.
Besides, you need to escape special characters in the regex pattern. Here, . is special because it matches any character other than a line break.
You need to fix your regex declaration like this:
rPat1 = "Hello\. "
rPat2 = " Bye\."
rPat3 = ".*?"
.Pattern = rPat1 & "(" & rPat3 & ")" & rPat2
Or, to further enhance the regex, you may:
Replace literal spaces with \s* (zero or more whitespaces) or \s+ (one or more whitespaces) to support any whitespace.
Match any non-word chars after the captured string with \W+ or \W*.
rPat1 = "Hello\.\s*"
rPat2 = "\W+Bye\."
rPat3 = ".*?"
.Pattern = rPat1 & "(" & rPat3 & ")" & rPat2
Details:
Hello\. - Hello. string
\s* - zero or more whitespaces
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\W+ - one or more chars other than ASCII letters/digits/_
Bye\. - Bye. string.
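To see the difference between the two patterns, here is a minimal sketch in Python rather than VBA, on the assumption that Python's re module handles these particular constructs the same way VBScript's RegExp does. The variable names are made up for illustration. In VBA, the equivalent would be to loop over the MatchCollection returned by .Execute(x) and read SubMatches(0) from each match.

```python
import re

text = "The Start. Hello. Jamie. Bye. The Middle. Hello. Sarah. Bye. The End"

# The pattern from the question: greedy .* on both ends swallows the whole
# string, so there is only one match and group 2 holds the last name.
greedy = re.compile(r"(^.*Hello. )(.*)( Bye..*)", re.IGNORECASE | re.MULTILINE)
greedy_names = [m.group(2) for m in greedy.finditer(text)]

# The fixed pattern: escaped dots, a lazy group, and no .* on the ends.
lazy = re.compile(r"Hello\.\s*(.*?)\W+Bye\.", re.IGNORECASE)
names = lazy.findall(text)  # findall returns the contents of group 1

print(greedy_names)  # ['Sarah.']
print(names)         # ['Jamie', 'Sarah']
```

Note that the greedy version also keeps the trailing period, because the unescaped pattern " Bye." only starts matching at the space.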

Related

How to write a regular expression that includes numbers [0-9] and a defined word?

I'm using VBA and struggling to make a regex.replace function to clean my string cells
Example: "Foo World 4563"
What I want: "World"
by replacing the numbers and the word "Foo"
Another example: "Hello World 435 Foo", I want "Hello World"
This is what my code looks like so far:
Public Function Replacement(sInput) As String
Dim regex As New RegExp
With regex
.Global = True
.IgnoreCase = True
End With
regex.Pattern = "[0-9,()/-]+\bfoo\b"
Replacement = regex.Replace(sInput, "")
End Function
You can use
Function Replacement(sInput) As String
Dim regex As New regExp
With regex
.Global = True
.IgnoreCase = True
End With
regex.Pattern = "\s*(?:\bfoo\b|\d+)"
Replacement = Trim(regex.Replace(sInput, ""))
End Function
Details:
\s* - zero or more whitespaces
(?:\bfoo\b|\d+) - either a whole word foo or one or more digits.
Note the use of Trim(): it is necessary to remove leading/trailing spaces that may remain after the replacement.
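As a quick check of this pattern outside Excel, here is a small Python sketch, assuming Python's re treats \s, \b and \d the same way for these inputs. The function name remove_foo_and_digits is hypothetical, and str.strip() stands in for VBA's Trim().

```python
import re

def remove_foo_and_digits(s):
    # Drop the whole word "foo" or any run of digits, together with the
    # whitespace in front of it; IGNORECASE mirrors .IgnoreCase = True.
    cleaned = re.sub(r"\s*(?:\bfoo\b|\d+)", "", s, flags=re.IGNORECASE)
    # Removing a match at the start leaves a leading space, hence the strip.
    return cleaned.strip()

print(remove_foo_and_digits("Foo World 4563"))       # World
print(remove_foo_and_digits("Hello World 435 Foo"))  # Hello World
```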
My two cents: capture the preceding whitespace chars, when present, and require whitespace or the end of the string after the match, to prevent possible false positives:
(^|\s+)(?:foo|\d+)(?=\s+|$)
(^|\s+) - 1st Capture group to assert position is preceded by whitespace or directly at start of string;
(?:foo|\d+) - Non-capture group with the alternation between digits or 'foo';
(?=\s+|$) - Positive lookahead to assert position is followed by whitespace or end-line anchor.
Sub Test()
Dim arr As Variant: arr = Array("Foo World 4563", "Hello World 435 Foo", "There is a 99% chance of false positives which is foo-bar!")
Dim el As Variant
For Each el In arr
Debug.Print Replacement(el)
Next
End Sub
Public Function Replacement(sInput) As String
With CreateObject("vbscript.regexp")
.Global = True
.IgnoreCase = True
.Pattern = "(^|\s+)(?:foo|\d+)(?=\s+|$)"
Replacement = Application.Trim(.Replace(sInput, "$1"))
End With
End Function
Print:
World
Hello World
There is a 99% chance of false positives which is foo-bar!
Here, Application.Trim() also collapses any runs of multiple spaces left inside the string.
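The same idea can be sketched in Python to check the three test strings. This assumes Python's re matches the pattern the same way, uses \1 where VBA uses $1, and approximates Application.Trim() with " ".join(s.split()), which collapses all whitespace rather than only spaces. The names below are hypothetical.

```python
import re

# Token must be preceded by start-of-string or whitespace (captured so it
# can be put back) and followed by whitespace or end-of-string.
PATTERN = re.compile(r"(^|\s+)(?:foo|\d+)(?=\s+|$)", re.IGNORECASE)

def remove_standalone_tokens(s):
    kept = PATTERN.sub(r"\1", s)   # keep the captured leading whitespace
    return " ".join(kept.split())  # rough stand-in for Application.Trim

samples = [
    "Foo World 4563",
    "Hello World 435 Foo",
    "There is a 99% chance of false positives which is foo-bar!",
]
for sample in samples:
    print(remove_standalone_tokens(sample))
```

The third sample comes back unchanged, because "99" is followed by "%" and "foo" is followed by "-", so the lookahead rejects both.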

Better way to extract numbers from a string

I have been trying to change a string like this, {X=5, Y=9} to a string like this (5, 9), as it would be used as an on-screen coordinate.
I finally came up with this code:
Dim str As String = String.Empty
Dim regex As Regex = New Regex("\d+")
Dim m As Match = regex.Match("{X=9")
If m.Success Then str = m.Value
Dim s As Match = regex.Match("Y=5}")
If s.Success Then str = "(" & str & ", " & s.Value & ")"
MsgBox(str)
which does work, but surely there must be a better way to do this (I am not familiar with Regex).
I have many of these to convert in my program, and doing each one like the above would be torturous.
You may use
Dim result As String = Regex.Replace(input, ".*?=(\d+).*?=(\d+).*", "($1, $2)")
The regex means
.*? - any 0+ chars other than newline chars as few as possible
= - an equals sign
(\d+) - Group 1: one or more digits
.*?= - any 0+ chars other than newline chars as few as possible and then a = char
(\d+) - Group 2: one or more digits
.* - any 0+ chars other than newline chars as many as possible
The $1 and $2 in the replacement pattern are replacement backreferences that point to the values stored in Group 1 and 2 memory buffer.
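For a quick check of the pattern, here is a minimal Python sketch of the same replacement. Python writes the replacement backreferences as \1 and \2 instead of $1 and $2, and the function name to_point is made up for illustration.

```python
import re

def to_point(s):
    # Same pattern as above: skip to the first "=", capture digits,
    # skip to the next "=", capture digits, then consume the rest.
    return re.sub(r".*?=(\d+).*?=(\d+).*", r"(\1, \2)", s)

print(to_point("{X=5, Y=9}"))     # (5, 9)
print(to_point("{X=120, Y=45}"))  # (120, 45)
```

Because the pattern consumes the whole input, the braces and the X/Y labels disappear without any extra cleanup.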

Exclude some words from regular expression

I have a function that inserts a space after characters like : / -
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(\D:|\D/|\D-)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = Replace(Replace(Replace(newString, ":  ", ": "), "/  ", "/ "), "-  ", "- ")
End Function
The code excludes dates easily. I also want to exclude particular strings such as 'w/d'.
Is there any way to do that?
Before: abc/abc/15/06/2017 ref:123243-11 ref-111 w/d
After: abc/ abc/ 15/06/2017 ref: 123243-11 ref- 111 w/ d
I want to exclude the last w/d.
You may use a (?!w/d) lookahead to avoid matching w/d with your pattern:
Dim oldString As String, newString As String
Dim reg As New RegExp
With reg
.Global = True
.Pattern = "(?!w/d)\D[:/-]"
End With
oldString = "abc/abc/15/06/2017 ref:123243-11 ref-111 w/d"
newString = reg.Replace(oldString, "$& ")
Debug.Print newString
Pattern details
(?!w/d) - a negative lookahead that fails the match if the text starting at the current position is w/d
\D - any non-digit char
[:/-] - a :, / or - char.
In the replacement pattern, the $& backreference refers to the whole match, so there is no need to enclose the whole pattern in capturing parentheses.
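Here is a minimal Python sketch of the same replacement, assuming Python's re handles the lookahead identically. Python has no $&, so the whole match is written as \g<0>. The function name add_space is hypothetical.

```python
import re

def add_space(s):
    # (?!w/d) blocks a match that would start at the "w" of "w/d";
    # \D[:/-] is a non-digit followed by ":", "/" or "-".
    return re.sub(r"(?!w/d)\D[:/-]", r"\g<0> ", s)

old_string = "abc/abc/15/06/2017 ref:123243-11 ref-111 w/d"
print(add_space(old_string))
# abc/ abc/ 15/06/2017 ref: 123243-11 ref- 111 w/d
```

The date and the number range stay intact because the character before each of their separators is a digit, which \D rejects.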
Here is another solution.
^/(?!ignoreme$)(?!ignoreme2$)[a-z0-9]+$

Extracting Parenthetical Data Using Regex

I have a small sub that extracts parenthetical data (including parentheses) from a string and stores it in cells adjacent to the string:
Sub parens()
Dim s As String, i As Long
Dim c As Collection
Set c = New Collection
s = ActiveCell.Value
ary = Split(s, ")")
For i = LBound(ary) To UBound(ary) - 1
bry = Split(ary(i), "(")
c.Add "(" & bry(1) & ")"
Next i
For i = 1 To c.Count
ActiveCell.Offset(0, i).NumberFormat = "#"
ActiveCell.Offset(0, i).Value = c.Item(i)
Next i
End Sub
I am now trying to replace this with some Regex code. I am NOT a regex expert. I want to create a pattern that looks for an open parenthesis followed by zero or more characters of any type followed by a close parenthesis.
I came up with:
\((.+?)\)
My current new code is:
Sub qwerty2()
Dim inpt As String, outpt As String
Dim MColl As MatchCollection, temp2 As String
Dim regex As RegExp, L As Long
inpt = ActiveCell.Value
MsgBox inpt
Set regex = New RegExp
regex.Pattern = "\((.+?)\)"
Set MColl = regex.Execute(inpt)
MsgBox MColl.Count
temp2 = MColl(0).Value
MsgBox temp2
End Sub
The code has at least two problems:
It will only get the first match in the string (MColl.Count is always 1).
It will not recognize zero characters between the parentheses (I think the .+? requires at least one character).
Does anyone have any suggestions?
By default, the Global property of RegExp is False. You need to set it to True to get all matches.
As for the regex, to match zero or more chars as few as possible, you need *?, not +?. Note that both are lazy (match as few as necessary to find a valid match), but + requires at least one char, while * allows matching zero chars (an empty string).
Thus, use
Set regex = New RegExp
regex.Global = True
regex.Pattern = "\((.*?)\)"
Alternatively, you can use
regex.Pattern = "\(([^()]*)\)"
where [^()] is a negated character class matching any char but ( and ), zero or more times (due to * quantifier), matching as many such chars as possible (* is a greedy quantifier).
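The difference between the three patterns shows up on a string that contains an empty pair of parentheses. The following is a minimal Python sketch, assuming Python's re behaves like VBScript's RegExp for these constructs. The sample text is made up, and re.finditer always returns every match, which corresponds to Global = True.

```python
import re

text = "a (1) b () c (xy)"

def all_matches(pattern):
    # group(0) is the whole match, parentheses included.
    return [m.group(0) for m in re.finditer(pattern, text)]

plus_lazy = all_matches(r"\((.+?)\)")    # needs at least one char inside
star_lazy = all_matches(r"\((.*?)\)")    # allows an empty pair
negated = all_matches(r"\(([^()]*)\)")   # never crosses a parenthesis

print(plus_lazy)  # ['(1)', '() c (xy)']
print(star_lazy)  # ['(1)', '()', '(xy)']
print(negated)    # ['(1)', '()', '(xy)']
```

With .+?, the empty pair does not simply get skipped: its closing parenthesis is consumed as the required one character, so the match runs on to the next closing parenthesis.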

VBS script to report AD groups - Regex pattern not working with multiple matches

Having an issue with getting a regex statement to accept two expressions.
The "re.pattern" code here works:
If UserChoice = "" Then WScript.Quit 'Detect Cancel
re.Pattern = "[^(a-z)^(0,4,5,6,7,8,9)]"
re.Global = True
re.IgnoreCase = True
if re.test( UserChoice ) then
Exit Do
End if
MsgBox "Please choose either 1, 2 or 3 ", 48, "Invalid Entry"
While the "regEx.Pattern" code below does not. I want to use it to format the results of a DSQUERY command where groups are collected, but I don't want any of the info after the ",", nor do I want the leading CN= that is normally collected when the following dsquery is run:
"dsquery.exe user forestroot -samid "& strInput &" | dsget user -memberof")
The string I want to format would look something like this before formatting:
CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz
This is the result I want:
APP_GROUP_123
Set regEx = New RegExp
regEx.Pattern = "[,.*]["CN=]"
Result = regEx.Replace(StrLine, "")
I'm only able to get the regex to work when used individually, either
regEx.Pattern = ",."
or
regEx.Pattern = "CN="
code is nested here:
Set InputFile = FSO.OpenTextFile("Temp.txt", 1)
set OutPutFile = FSO.OpenTextFile(StrInput & "-Results.txt", 8, True)
do While InputFile.AtEndOfStream = False
StrLine = InputFile.ReadLine
If inStr(strLine, TaskChoice) then
Set regEx = New RegExp
regEx.Pattern = "[A-Za-z]{2}=(.+?),.*"
Result = regEx.Replace(StrLine, "")
OutputFile.write(Replace(Result,"""","")) & vbCrLf
End if
This should get you started:
str = "CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz"
Set re = New RegExp
re.pattern = "[A-Za-z]{2}=(.+?),.*"
if re.Test(str) then
set matches = re.Execute(str)
matched_str = "Matched: " & matches(0).SubMatches(0)
Wscript.echo matched_str
else
Wscript.echo "Not a match"
end if
Output: Matched: APP_GROUP_123
The regex you need is [A-Za-z]{2}=(.+?),.*
If the match is successful, it captures everything inside the parentheses. .+? matches any characters non-greedily, up to the first comma; the ? in .+? is what makes the expression non-greedy. If you were to omit it, you would capture everything up to the final comma, the one before DC=biz.
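The greedy versus non-greedy difference can be checked with a short Python sketch, assuming Python's re behaves the same as VBScript's RegExp for this pattern. The variable names are made up for illustration.

```python
import re

dn = "CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz"

# Non-greedy: the group stops at the first comma.
lazy = re.search(r"[A-Za-z]{2}=(.+?),.*", dn).group(1)

# Greedy: the group runs up to the last comma in the line.
greedy = re.search(r"[A-Za-z]{2}=(.+),.*", dn).group(1)

print(lazy)    # APP_GROUP_123
print(greedy)  # APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso
```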
Your regular expression "[,.*]["CN=]" doesn't work for two reasons:
It contains an unescaped double quote. Double quotes inside VBScript strings must be escaped by doubling them, otherwise the interpreter would interpret your expression as a string "[,.*][", followed by an (invalid) variable name CN=] (without an operator too) and the beginning of the next string (the 3rd double quote).
You misunderstand regular expression syntax. Square brackets indicate a character class. An expression [,.*] would match any single comma, period or asterisk, not a comma followed by any number of characters.
What you meant to use was an alternation, which is expressed by a pipe symbol (|), and the beginning of a string is matched by a caret (^):
regEx.Pattern = ",.*|^CN="
regEx.Global = True ' replace every match, not just the first one
With that said, in your case a better approach would be using a group and replacing the whole string with just the group match:
regEx.Pattern = "^cn=(.*?),.*"
regEx.IgnoreCase = True
Result = regEx.Replace(strLine, "$1")
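Both approaches can be compared in a minimal Python sketch, assuming Python's re behaves like VBScript's RegExp for these patterns. Python writes the group reference as \1 rather than $1. re.sub replaces every match by default, which corresponds to Global = True in VBScript; count=1 is used here to imitate the VBScript default of Global = False. The variable names are made up for illustration.

```python
import re

line = "CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz"

# Alternation: delete the leading CN= and everything from the first comma on.
by_alternation = re.sub(r",.*|^CN=", "", line)

# Same pattern, but only the first match is replaced.
first_only = re.sub(r",.*|^CN=", "", line, count=1)

# Group: match the whole line and keep only what the group captured.
by_group = re.sub(r"^cn=(.*?),.*", r"\1", line, flags=re.IGNORECASE)

print(by_alternation)  # APP_GROUP_123
print(first_only)      # APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz
print(by_group)        # APP_GROUP_123
```

The group version needs only a single replacement, since one match covers the whole line, which is one reason it is the more robust choice.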