How to extract substring in parentheses using Regex pattern - regex

This is probably a simple problem, but unfortunately I wasn't able to get the results I wanted...
Say, I have the following line:
"Wouldn't It Be Nice" (B. Wilson/Asher/Love)
I would have to look for this pattern:
" (<any string>)
In order to retrieve:
B. Wilson/Asher/Love
I tried something like "" (([^))]*)) but it doesn't seem to work. Also, I'd like to use Match.Submatches(0) so that might complicate things a bit because it relies on brackets...

Edit: After examining your document, the problem is that there are non-breaking spaces before the parentheses, not regular spaces. So this regex should work: ""[ \xA0]*\(([^)]+)\)
"" 'quote (twice to escape)
[ \xA0]* 'zero or more regular or non-breaking (\xA0) spaces
\( 'left parenthesis
( 'open capturing group
[^)]+ 'anything not a right parenthesis
) 'close capturing group
\) 'right parenthesis
In a function:
Public Function GetStringInParens(search_str As String)
Dim regEx As New VBScript_RegExp_55.RegExp
Dim matches
GetStringInParens = ""
regEx.Pattern = """[ \xA0]*\(([^)]+)\)"
regEx.Global = True
If regEx.test(search_str) Then
Set matches = regEx.Execute(search_str)
GetStringInParens = matches(0).SubMatches(0)
End If
End Function
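For readers who want to try the pattern outside VBA, here is a minimal Python sketch of the same idea. It is only an illustration: the function name get_string_in_parens is made up for this example, and Python's re module stands in for VBScript.RegExp (the pattern syntax used here is the same in both, as far as I know).

```python
import re

# Same pattern as above: closing quote, optional regular or non-breaking
# spaces, then a parenthesised part whose contents are captured.
CREDITS = re.compile(r'"[ \xA0]*\(([^)]+)\)')


def get_string_in_parens(text):
    """Return the text inside the parentheses that follow a quote, or ''."""
    match = CREDITS.search(text)
    return match.group(1) if match else ""


# Works with a non-breaking space (\xa0) as well as with a regular space.
print(get_string_in_parens('"Wouldn\'t It Be Nice"\xa0(B. Wilson/Asher/Love)'))
print(get_string_in_parens('"Wouldn\'t It Be Nice" (B. Wilson/Asher/Love)'))
```

Both calls print B. Wilson/Asher/Love. A title with parentheses inside the quotes and no credits after it yields an empty string, because the pattern requires the parenthesis to follow a quote.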

Not strictly an answer to your question, but sometimes, for things this simple, good ol' string functions are less confusing and more concise than Regex.
Function BetweenParentheses(s As String) As String
BetweenParentheses = Mid(s, InStr(s, "(") + 1, _
InStr(s, ")") - InStr(s, "(") - 1)
End Function
Usage:
Debug.Print BetweenParentheses("""Wouldn't It Be Nice"" (B. Wilson/Asher/Love)")
'B. Wilson/Asher/Love
EDIT: @alan points out that this will falsely match the contents of parentheses in the song title. This is easily circumvented with a little modification:
Function BetweenParentheses(s As String) As String
Dim iEndQuote As Long
Dim iLeftParenthesis As Long
Dim iRightParenthesis As Long
iEndQuote = InStrRev(s, """")
iLeftParenthesis = InStr(iEndQuote, s, "(")
iRightParenthesis = InStr(iEndQuote, s, ")")
If iLeftParenthesis <> 0 And iRightParenthesis <> 0 Then
BetweenParentheses = Mid(s, iLeftParenthesis + 1, _
iRightParenthesis - iLeftParenthesis - 1)
End If
End Function
Usage:
Debug.Print BetweenParentheses("""Wouldn't It Be Nice"" (B. Wilson/Asher/Love)")
'B. Wilson/Asher/Love
Debug.Print BetweenParentheses("""Don't talk (yell)""")
' returns empty string
Of course this is less concise than before!

This is a nice regex:
".*\(([^)]*)
In VBA/VBScript:
Dim myRegExp As RegExp
Dim myMatches As MatchCollection, myMatch As Match
Dim ResultString As String
Set myRegExp = New RegExp
myRegExp.Pattern = """.*\(([^)]*)"
Set myMatches = myRegExp.Execute(SubjectString)
ResultString = ""
If myMatches.Count >= 1 Then
Set myMatch = myMatches(0)
' The pattern has exactly one capturing group, so the result is SubMatches(0)
If myMatch.SubMatches.Count >= 1 Then
ResultString = myMatch.SubMatches(0)
End If
End If
This matches
Put Your Head on My Shoulder
in
"Don't Talk (Put Your Head on My Shoulder)"
Update 1
I let the regex loose on your doc file and it matches as requested, so I'm quite sure the regex is fine. I'm not fluent in VBA/VBScript, but my guess is that's where it goes wrong.
If you want to discuss the regex further, that's fine with me. I'm not eager to start digging into this VBScript API, which looks arcane.
Given the new input the regex is tweaked to
".*".*\(([^)]*)
So that it doesn't falsely match (Put Your Head on My Shoulder) which appears inside the quotes.
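A quick way to check that claim is to run the tweaked pattern against a title that contains parentheses. The sketch below uses Python's re module purely as a test bench; the helper name credits_after_title and the sample credits are made up for the example.

```python
import re

# Tweaked pattern: a complete quoted title first, then the last "(" after it.
TWEAKED = re.compile(r'".*".*\(([^)]*)')


def credits_after_title(line):
    """Return the parenthesised text after the quoted title, or None."""
    match = TWEAKED.search(line)
    return match.group(1) if match else None


with_credits = '"Don\'t Talk (Put Your Head on My Shoulder)" (B. Wilson/Asher)'
title_only = '"Don\'t Talk (Put Your Head on My Shoulder)"'

print(credits_after_title(with_credits))  # B. Wilson/Asher
print(credits_after_title(title_only))    # None
```

Because both .* parts are greedy, the pattern picks the last opening parenthesis on the line, so it assumes the credits are the final parenthesised group.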

This function worked on your example string:
Function GetArtist(songMeta As String) As String
Dim parts() As String
Dim artist As String
' split string by "(" and take the last portion
parts = Split(songMeta, "(")
artist = parts(UBound(parts))
' remove closing parenthesis
artist = Replace(artist, ")", "")
' assign to the function name, otherwise nothing is returned
GetArtist = artist
End Function
Ex:
Sub Test()
Dim songMeta As String
songMeta = """Wouldn't It Be Nice"" (B. Wilson/Asher/Love)"
Debug.Print GetArtist(songMeta)
End Sub
prints "B. Wilson/Asher/Love" to the Immediate Window.
It also solves the problem alan mentioned. Ex:
Sub Test()
Dim songMeta As String
songMeta = """Wouldn't (It Be) Nice"" (B. Wilson/Asher/Love)"
Debug.Print GetArtist(songMeta)
End Sub
also prints "B. Wilson/Asher/Love" to the Immediate Window. Unless of course, the artist names also include parentheses.

Here is another regex, tested with VBScript: (?:\()(.*)(?:\))
Data = """Wouldn't It Be Nice"" (B. Wilson/Asher/Love)"
wscript.echo Extract(Data)
'---------------------------------------------------------------
Function Extract(Data)
Dim strPattern,oRegExp,Matches
strPattern = "(?:\()(.*)(?:\))"
Set oRegExp = New RegExp
oRegExp.IgnoreCase = True
oRegExp.Pattern = strPattern
set Matches = oRegExp.Execute(Data)
If Matches.Count > 0 Then Extract = Matches(0).SubMatches(0)
End Function
'---------------------------------------------------------------

I think you need a better data file ;) You might want to consider pre-processing the file to a temp file for modification, so that outliers that don't fit your pattern are modified to where they'll meet your pattern. It's a bit time consuming to do, but it is always difficult when a data file lacks consistency.

Related

Extracting Parenthetical Data Using Regex

I have a small sub that extracts parenthetical data (including parentheses) from a string and stores it in cells adjacent to the string:
Sub parens()
Dim s As String, i As Long
Dim c As Collection
Set c = New Collection
s = ActiveCell.Value
ary = Split(s, ")")
For i = LBound(ary) To UBound(ary) - 1
bry = Split(ary(i), "(")
c.Add "(" & bry(1) & ")"
Next i
For i = 1 To c.Count
ActiveCell.Offset(0, i).NumberFormat = "#"
ActiveCell.Offset(0, i).Value = c.Item(i)
Next i
End Sub
I am now trying to replace this with some Regex code. I am NOT a regex expert. I want to create a pattern that looks for an open parenthesis followed by zero or more characters of any type followed by a close parenthesis.
I came up with:
\((.+?)\)
My current new code is:
Sub qwerty2()
Dim inpt As String, outpt As String
Dim MColl As MatchCollection, temp2 As String
Dim regex As RegExp, L As Long
inpt = ActiveCell.Value
MsgBox inpt
Set regex = New RegExp
regex.Pattern = "\((.+?)\)"
Set MColl = regex.Execute(inpt)
MsgBox MColl.Count
temp2 = MColl(0).Value
MsgBox temp2
End Sub
The code has at least two problems:
It will only get the first match in the string (MColl.Count is always 1).
It will not recognize zero characters between the parentheses. (I think the .+? requires at least one character)
Does anyone have any suggestions ??
By default, RegExp Global property is False. You need to set it to True.
As for the regex, to match zero or more chars as few as possible, you need *?, not +?. Note that both are lazy (match as few as necessary to find a valid match), but + requires at least one char, while * allows matching zero chars (an empty string).
Thus, use
Set regex = New RegExp
regex.Global = True
regex.Pattern = "\((.*?)\)"
As for the regex, you can also use
regex.Pattern = "\(([^()]*)\)"
where [^()] is a negated character class matching any char but ( and ), zero or more times (due to * quantifier), matching as many such chars as possible (* is a greedy quantifier).
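To see the difference between the three patterns side by side, here is a small Python sketch. It is an illustration only: re.findall scans the whole string, which plays the role of Global = True, and the sample text is made up.

```python
import re

text = "a (1) b () c (xyz)"

# Lazy "zero or more": finds every pair, including the empty one.
lazy_star = re.findall(r"\((.*?)\)", text)

# Negated character class: same result here, without relying on laziness.
negated = re.findall(r"\(([^()]*)\)", text)

# Lazy "one or more": cannot match "()", so it swallows ")" and runs on
# until the next closing parenthesis.
lazy_plus = re.findall(r"\((.+?)\)", text)

print(lazy_star)  # ['1', '', 'xyz']
print(negated)    # ['1', '', 'xyz']
print(lazy_plus)  # ['1', ') c (xyz']
```

Note that .+? does not simply skip the empty parentheses; it produces a wrong match that spans into the next group.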

Regex - Quantifier {x,y} following nothing

I'm creating a basic text editor and I'm using regex to achieve a find and replace function. To do this I've gotten this code:
Private Function GetRegExpression() As Regex
Dim result As Regex
Dim regExString As [String]
' Get what the user entered
If TabControl1.SelectedIndex = 0 Then
regExString = txtbx_Find2.Text
ElseIf TabControl1.SelectedIndex = 1 Then
regExString = txtbx_Find.Text
End If
If chkMatchCase.Checked Then
result = New Regex(regExString)
Else
result = New Regex(regExString, RegexOptions.IgnoreCase)
End If
Return result
End Function
And this is the Find method
Private Sub FindText()
''
Dim WpfTest1 As New Spellpad.Tb
Dim ElementHost1 As System.Windows.Forms.Integration.ElementHost = frm_Menu.Controls("ElementHost1")
Dim TheTextBox As System.Windows.Controls.TextBox = CType(ElementHost1.Child, Tb).ctrl_TextBox
''
' Is this the first time find is called?
' Then make instances of RegEx and Match
If isFirstFind Then
regex = GetRegExpression()
match = regex.Match(TheTextBox.Text)
isFirstFind = False
Else
' match.NextMatch() is also ok, except in Replace
' In replace as text is changing, it is necessary to
' find again
'match = match.NextMatch();
match = regex.Match(TheTextBox.Text, match.Index + 1)
End If
' found a match?
If match.Success Then
' then select it
Dim row As Integer = TheTextBox.GetLineIndexFromCharacterIndex(TheTextBox.CaretIndex)
MoveCaretToLine(TheTextBox, row + 1)
TheTextBox.SelectionStart = match.Index
TheTextBox.SelectionLength = match.Length
Else
If TabControl1.SelectedIndex = 0 Then
MessageBox.Show([String].Format("Cannot find ""{0}"" ", txtbx_Find2.Text), Application.ProductName, MessageBoxButtons.OK, MessageBoxIcon.Information)
ElseIf TabControl1.SelectedIndex = 1 Then
MessageBox.Show([String].Format("Cannot find ""{0}"" ", txtbx_Find.Text), Application.ProductName, MessageBoxButtons.OK, MessageBoxIcon.Information)
End If
isFirstFind = True
End If
End Sub
When I run the program I get errors:
For ?, parsing "?" - Quantifier {x,y} following nothing.; and
For *, parsing "*" - Quantifier {x,y} following nothing.
It's as if I can't use these but I really need to. How can I solve this problem?
? and * are quantifiers in regular expressions:
? is used to specify that something is optional, for instance b?au can match both bau and au.
* means the item it binds to can be repeated zero, one or multiple times: for instance ba*u can match bu, bau, baau, baaaaaaaau,...
Most regular expression flavors also offer {l,u} as a third form, with l the lower bound on the number of repetitions and u the upper bound on the number of occurrences. So ? is equivalent to {0,1} and * to {0,}.
Now if you provide them without any character before them, the regex parser has nothing to apply them to and doesn't know what you mean. A quantifier that follows a character works fine (shown in the C# shell, but the idea applies generally):
$ csharp
Mono C# Shell, type "help;" for help
Enter statements below.
csharp> Regex r = new Regex("fo*bar");
csharp> r.Replace("Fooobar fooobar fbar fobar","<MATCH>");
"Fooobar <MATCH> <MATCH> <MATCH>"
csharp> r.Replace("fooobar far qux fooobar quux fbar echo fobar","<MATCH>");
"<MATCH> far qux <MATCH> quux <MATCH> echo <MATCH>"
If you wish to do a "raw text find and replace", you should use string.Replace.
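The same behaviour can be reproduced in other engines. Below is a small Python sketch (Python's re module, not .NET, so the exact error wording differs) showing that ? and * behave like {0,1} and {0,}, and that a quantifier with nothing in front of it is rejected at compile time. The helper names are made up for the example.

```python
import re


def same_matches(pattern_a, pattern_b, samples):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )


def compiles(pattern):
    """True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(same_matches(r"b?au", r"b{0,1}au", ["au", "bau", "bbau"]))         # True
print(same_matches(r"ba*u", r"ba{0,}u", ["bu", "bau", "baaau", "bxu"]))  # True
print(compiles("fo*bar"))  # True: * has something to repeat
print(compiles("*"))       # False: quantifier following nothing
print(compiles("?"))       # False
```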
EDIT:
Another way to process them is by escaping special regex characters. Ironically enough, you can do this by replacing them by a regex ;).
Private Function GetRegExpression() As Regex
Dim result As Regex
Dim regExString As [String]
' Get what the user entered
If TabControl1.SelectedIndex = 0 Then
regExString = txtbx_Find2.Text
ElseIf TabControl1.SelectedIndex = 1 Then
regExString = txtbx_Find.Text
End If
'Added code
Dim baseRegex As Regex = new Regex("[\\.$^{\[(|)*+?]")
regExString = baseRegex.Replace(regExString,"\$0")
'End added code
If chkMatchCase.Checked Then
result = New Regex(regExString)
Else
result = New Regex(regExString, RegexOptions.IgnoreCase)
End If
Return result
End Function
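Most regex libraries ship a helper that does this escaping for you, so a hand-written character class is not strictly needed: .NET has Regex.Escape, and Python has re.escape. The sketch below uses the Python one to illustrate the idea; the function name literal_find_regex and its match_case parameter are made up to mirror the checkbox in the question.

```python
import re


def literal_find_regex(user_text, match_case=False):
    """Build a regex that searches for user_text literally."""
    flags = 0 if match_case else re.IGNORECASE
    # re.escape backslash-escapes every regex metacharacter in the input,
    # so "?" and "*" are treated as plain characters.
    return re.compile(re.escape(user_text), flags)


print(literal_find_regex("what?").search("So WHAT? *shrug*").group(0))  # WHAT?
print(literal_find_regex("*").findall("a*b*c"))                         # ['*', '*']
```

This only makes sense when the search box is meant for plain text. If users are supposed to type real regular expressions, the input should be left unescaped and the parse error reported to them instead.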

VBA: REGEX LOOKBEHIND MS ACCESS 2010

I have a function that was written so that VBA can be used in MS Access
I wish to do the following
I have set up my code below. Extracting everything before the product works perfectly, but trying to get the information after it just returns "", which is strange because the same pattern works perfectly fine when I execute it in Notepad++.
So it looks for the letters MIP and one of the 3 letter codes (any of them)
StringToCheck = "MADHUBESOMIPTDTLTRCOYORGLEJ"
' PART 1
' If MIP appears in the string, then delete any of the following codes if they exist - DOM, DOX, DDI, ECX, LOW, WPX, SDX, DD6, DES, BDX, CMX,
' WMX, TDX, TDT, BSA, EPA, EPP, ACP, ACA, ACE, ACS, GMB, MAL, USP, NWP.
' EXAMPLE 1. Flagged as: MADHUBESOMIPTDTLTRCOYORGLEJ, should be MADHUBESOMIPLTRCOYORGLEJ
Do While regexp(StringToCheck, "MIP(DOM|DOX|DDI|ECX|LOW|WPX|SDX|DD6|DES|BDX|CMX|WMX|TDX|TDT|BSA|EPA|EPP|ACP|ACA|ACE|ACS|GMB|MAL|USP|NWP|BBX)", False) <> ""
' SELECT EVERYTHING BEFORE THE THREE LETTER CODES
strPart1 = regexp(StringToCheck, ".*^[^_]+(?=DOM|DOX|DDI|ECX|LOW|WPX|SDX|DD6|DES|BDX|CMX|WMX|TDX|TDT|BSA|EPA|EPP|ACP|ACA|ACE|ACS|GMB|MAL|USP|NWP|BBX)", False)
' SELECT EVERYTHING AFTER THE THREE LETTER CODES
strPart2 = regexp(StringToCheck, "(?<=(DOM|DOX|DDI|ECX|LOW|WPX|SDX|DD6|DES|BDX|CMX|WMX|TDX|TDT|BSA|EPA|EPP|ACP|ACA|ACE|ACS|GMB|MAL|USP|NWP|BBX).*", False)
StringToCheck = strPart1 & strPart2
Loop
The function I am using, which I took from the internet, is below:
Function regexp(StringToCheck As Variant, PatternToUse As String, Optional CaseSensitive As Boolean = True) As String
On Error GoTo RefErr:
Dim re As New regexp
re.Pattern = PatternToUse
re.Global = False
re.IgnoreCase = Not CaseSensitive
Dim m
For Each m In re.Execute(StringToCheck)
regexp = UCase(m.Value)
Next
RefErr:
On Error Resume Next
End Function
Just do it in two steps:
Check if MIP is in the string
If it is, remove the other codes.
Like this:
Sub Test()
Dim StringToCheck As String
StringToCheck = "MADHUBESOMIPTDTLTRCOYORGLEJ"
Debug.Print StringToCheck
Debug.Print CleanupString(StringToCheck)
End Sub
Function CleanupString(str As String) As String
Dim reCheck As New RegExp
Dim reCodes As New RegExp
reCheck.Pattern = "^(?:...)*?MIP"
reCodes.Pattern = "^((?:...)*?)(?:DOM|DOX|DDI|ECX|LOW|WPX|SDX|DD6|DES|BDX|CMX|WMX|TDX|TDT|BSA|EPA|EPP|ACP|ACA|ACE|ACS|GMB|MAL|USP|NWP|BBX)"
reCodes.Global = True
If reCheck.Test(str) Then
While reCodes.Test(str)
str = reCodes.Replace(str, "$1")
Wend
End If
CleanupString = str
End Function
Note that the purpose of (?:...)*? is to group the letters in threes.
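To make the triplet alignment easier to see, here is the same check/remove idea as a Python sketch. It is an illustration rather than a drop-in replacement: the names are made up, and Python's re module is used instead of VBScript.RegExp.

```python
import re

CODES = ("DOM|DOX|DDI|ECX|LOW|WPX|SDX|DD6|DES|BDX|CMX|WMX|TDX|TDT|BSA|"
         "EPA|EPP|ACP|ACA|ACE|ACS|GMB|MAL|USP|NWP|BBX")

# (?:...)*? consumes whole triplets only, so MIP and the codes are found
# only at positions 0, 3, 6, ... and never across a triplet boundary.
HAS_MIP = re.compile(r"^(?:...)*?MIP")
FIRST_CODE = re.compile(r"^((?:...)*?)(?:" + CODES + ")")


def cleanup_string(s):
    """Remove the listed codes, but only when MIP is one of the triplets."""
    if HAS_MIP.search(s):
        while FIRST_CODE.search(s):
            # Keep the triplets before the code (group 1), drop the code.
            s = FIRST_CODE.sub(r"\1", s, count=1)
    return s


print(cleanup_string("MADHUBESOMIPTDTLTRCOYORGLEJ"))  # MADHUBESOMIPLTRCOYORGLEJ
print(cleanup_string("MIPXACPYZ"))                    # MIPXACPYZ
```

The second call shows why the grouping matters: ACP does occur in MIPXACPYZ, but it straddles two triplets (XAC and PYZ), so it is left alone.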
Since the VBScript regular expression engine does support look-aheads, you can of course also do it in a single regex:
Function CleanupString(str As String) As String
Dim reClean As New RegExp
reClean.Pattern = "^(?=(?:...)*?MIP)((?:...)*?)(?:DOM|DOX|DDI|ECX|LOW|WPX|SDX|DD6|DES|BDX|CMX|WMX|TDX|TDT|BSA|EPA|EPP|ACP|ACA|ACE|ACS|GMB|MAL|USP|NWP|BBX)"
While reClean.Test(str)
str = reClean.Replace(str, "$1")
Wend
CleanupString = str
End Function
Personally, I like the two-step check/remove pattern better because it is a lot more obvious and therefore more maintainable.
Non RE option:
Function DeMIPString(StringToCheck As String) As String
' InStr returns a position (0 when absent); "Not InStr(...)" is a bitwise Not and is never False
If InStr(StringToCheck, "MIP") = 0 Then
DeMIPString = StringToCheck
Else
Dim i As Long
For i = 1 To Len(StringToCheck) Step 3
Select Case Mid$(StringToCheck, i, 3)
' codes to drop; MIP itself is not listed because the expected output keeps it
Case "DOM", "DOX", "DDI", "ECX", "LOW", "WPX", "SDX", "DD6", "DES", "BDX", "CMX", "WMX", "TDX", "TDT", "BSA", "EPA", "EPP", "ACP", "ACA", "ACE", "ACS", "GMB", "MAL", "USP", "NWP"
Case Else
DeMIPString = DeMIPString & Mid$(StringToCheck, i, 3)
End Select
Next
End If
End Function

VB.Net Regular Expressions - Extracting Wildcard Value

I need help extracting the value of a wildcard from a Regular Expressions match. For example:
Regex: "I like *"
Input: "I like chocolate"
I would like to be able to extract the string "chocolate" from the Regex match (or whatever else is there). If possible, I also want to be able to retrieve several wildcard values from a single wildcard match. For example:
Regex: "I play the * and the *"
Input: "I play the guitar and the bass"
I want to be able to extract both "guitar" and "bass". Is there a way to do it?
In general, regexes use the concept of groups. Groups are indicated by parentheses.
So I like *
would be I like (.*), where . matches any character and * means zero or more of the preceding character.
Sub Main()
Dim s As String = "I Like hats"
Dim rxstr As String = "I Like(.*)"
Dim m As Match = Regex.Match(s, rxstr)
Console.WriteLine(m.Groups(1))
End Sub
The above code will work for any string that contains I Like and will print out all characters after it, including the leading ' ', as . matches even whitespace.
Your second case is more interesting: because the first regex would match the entire rest of the string, you need something more restrictive.
I Like (\w+) and (\w+): this will match I Like, then a space and one or more word characters, then a space, the word and, another space, and one or more word characters.
Sub Main()
Dim s2 As String = "I Like hats and dogs"
Dim rxstr2 As String = "I Like (\w+) and (\w+)"
Dim m As Match = Regex.Match(s2, rxstr2)
Console.WriteLine("{0} : {1}", m.Groups(1), m.Groups(2))
End Sub
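If the templates really are written with * as in the question, the wildcard can be translated into a capture group mechanically. Here is a hedged Python sketch of that translation; the function names wildcard_to_regex and extract_wildcards are made up for this example, and the same approach should carry over to .NET with Regex.Escape and Match.Groups.

```python
import re


def wildcard_to_regex(template):
    """Turn a template such as 'I like *' into a regex with one group per *."""
    # Escape everything first so the rest of the template is matched
    # literally, then turn each escaped * into a capturing group.
    escaped = re.escape(template)
    return re.compile(escaped.replace(r"\*", "(.*)"))


def extract_wildcards(template, text):
    """Return the wildcard values, or an empty list if text does not match."""
    match = wildcard_to_regex(template).fullmatch(text)
    return list(match.groups()) if match else []


print(extract_wildcards("I like *", "I like chocolate"))
# ['chocolate']
print(extract_wildcards("I play the * and the *", "I play the guitar and the bass"))
# ['guitar', 'bass']
```

Since (.*) is greedy, a text in which the literal part between two wildcards occurs more than once will be split at the last occurrence; use (.*?) in the replacement if the first occurrence is wanted.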
For a more complete treatment of regex take a look at this site which has a great tutorial.
Here is my RegexExtract Function in VBA. It will return just the sub match you specify (only the stuff in parenthesis). So in your case, you'd write:
=RegexExtract(A1, "I like (.*)")
Here is the code.
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String) As String
Application.ScreenUpdating = False
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)
RegexExtract = allMatches.Item(0).submatches.Item(0)
Application.ScreenUpdating = True
End Function
Here is a version that will allow you to use multiple groups to extract multiple parts at once:
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String) As String
Application.ScreenUpdating = False
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long
Dim result As String
RE.Pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.Item(0).submatches.count - 1
result = result & allMatches.Item(0).submatches.Item(i)
Next
RegexExtract = result
Application.ScreenUpdating = True
End Function

Split a string according to a regexp in VBScript

I would like to split a string into an array according to a regular expression similar to what can be done with preg_split in PHP or VBScript Split function but with a regex in place of delimiter.
Using VBScript Regexp object, I can execute a regex but it returns the matches (so I get a collection of my splitters... that's not what I want)
Is there a way to do so ?
Thank you
If you can reserve a special delimiter string, i.e. a string that you can choose that will never be a part of the real input string (perhaps something like "###"), then you can use regex replacement to replace all matches of your pattern to "###", and then split on "###".
Another possibility is to use a capturing group. If your delimiter regex is, say, \d+, then you search for (.*?)\d+, and then extract what the group captured in each match (see before and after on rubular.com).
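The reserved-delimiter idea can be sketched in a few lines. The example is in Python only to show the mechanics (Python already has re.split built in, so you would not need the workaround there); the function name split_via_sentinel and the choice of a NUL character as the reserved string are assumptions of this sketch.

```python
import re

# Reserved delimiter: assumed never to occur in real input.
SENTINEL = "\u0000"


def split_via_sentinel(pattern, text):
    """Split text on regex matches by replacing them with a sentinel first."""
    if SENTINEL in text:
        raise ValueError("input already contains the reserved delimiter")
    # Step 1: replace every match of the delimiter regex with the sentinel.
    marked = re.sub(pattern, SENTINEL, text)
    # Step 2: an ordinary string split on the sentinel.
    return marked.split(SENTINEL)


print(split_via_sentinel(r"[,;#]", "bob,;joe;tony#bill"))
# ['bob', '', 'joe', 'tony', 'bill']
print(split_via_sentinel(r"\d+", "a1b22c"))
# ['a', 'b', 'c']
```

Two adjacent delimiters produce an empty entry, as in the first call. Use a pattern such as [,;#]+ if runs of delimiters should count as one.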
You can always use the returned array of matches as input to the split function. You split the original string using the first match: the first part of the string is the first split, then you split the remainder of the string (minus the first part and the first match), and continue until done.
I wrote this for my use. Might be what you're looking for.
Function RegSplit(szPattern, szStr)
Dim oAl, oRe, oMatches
Set oRe = New RegExp
oRe.Pattern = "^(.*)(" & szPattern & ")(.*)$"
oRe.IgnoreCase = True
oRe.Global = True
Set oAl = CreateObject("System.Collections.ArrayList")
Do
Set oMatches = oRe.Execute(szStr)
If oMatches.Count > 0 Then
oAl.Add oMatches(0).SubMatches(2)
szStr = oMatches(0).SubMatches(0)
Else
oAl.Add szStr
Exit Do
End If
Loop
oAl.Reverse
RegSplit = oAl.ToArray
End Function
'**************************************************************
Dim A
A = RegSplit("[,|;|#]", "bob,;joe;tony#bill")
WScript.Echo Join(A, vbCrLf)
Returns:
bob
joe
tony
bill
I think you can achieve this by using Execute to match on the required splitter string, but capturing all the preceding characters (after the previous match) as a group. Here is some code that could do what you want.
'// Function splits a string on matches
'// against a given string
Function SplitText(strInput,sFind)
Dim ArrOut()
'// Don't do anything if no string to be found
If len(sFind) = 0 then
redim ArrOut(0)
ArrOut(0) = strInput
SplitText = ArrOut
Exit Function
end If
'// Define regexp
Dim re
Set re = New RegExp
'// Pattern to be found - i.e. the given
'// match or the end of the string, preceded
'// by any number of characters
re.Pattern="(.*?)(?:" & sFind & "|$)"
re.IgnoreCase = True
re.Global = True
'// find all the matches >> match collection
Dim oMatches: Set oMatches = re.Execute( strInput )
'// Prepare to process
Dim oMatch
Dim ix
Dim iMax
'// Initialize the output array
iMax = oMatches.Count - 1
redim arrOut( iMax)
'// Process each match
For ix = 0 to iMax
'// get the match
Set oMatch = oMatches(ix)
'// Get the captured string that precedes the match
arrOut( ix ) = oMatch.SubMatches(0)
Next
Set re = nothing
'// Check if the last entry was empty - this
'// removes one entry if the string ended on a match
if arrOut(iMax) = "" then Redim Preserve ArrOut(iMax-1)
'// Return the processed output
SplitText = arrOut
End Function