How to get the position of submatches in VBA? - regex

I need to get the index position value of submatched string. As per documentation, I have read through this Regular expression and got to know FirstIndex property to get the position of matched string.
But this works only for one dimensional matched string. I couldn't apply FirstIndex for submatches.
Pls refer sample matches
I tried this format,
Dim myRegExp As Object, match As MatchCollection
Dim matched As String
Set myRegExp = CreateObject("VBScript.RegExp")
myRegExp.pattern = find
If myRegExp.test(text) = True Then
Set match = myRegExp.Execute(text)
Debug.Print match(0).submatches(0) '' this is matched string
Where should I call FirstIndex to get position of submatched string
output:
match(0)=>Berry, Brent. (2006). What accounts for race and ethnic differences in Berry,
Brent. parental financial transfers to adult children in the United States? Journal of Family
Issues 37:1583-1604.
submatches(0)=>Berry, Brent.
submatches(6)=>2006
EXPECTED OUTPUT:
submatches(0) at 0th position
submatches(6) at 16th position and so on

You can't apply .FirstIndex to SubMatches(x) because it returns a String, not a Match. If the groups will return unique matches, you can find its location by simply using the Instr function:
With CreateObject("VBScript.RegExp")
.Pattern = Find
If .Test(text) Then
Set match = .Execute(text)
Debug.Print InStr(1, text, match(0).SubMatches(0)) '0
Debug.Print InStr(1, text, match(0).SubMatches(5)) '16
'and so on
End If
End With
If the groups will not return unique results, you can track the position of the last match and loop through the results. Note that VBScript.RegExp doesn't support look-behinds, so you don't have to take the length of the matches into account:
With CreateObject("VBScript.RegExp")
.Pattern = find
If .Test(text) Then
Set match = .Execute(text)
Dim i As Long, pos As Long, found As String
pos = 1
For i = 0 To match(0).SubMatches.Count - 1
found = match(0).SubMatches(i)
pos = InStr(pos, text, match(0).SubMatches(i))
Debug.Print found, pos
Next
End If
End With

The Submatches collection contains strings:
A SubMatches collection contains individual submatch strings, ...
Each item in the SubMatches collection is the string found and
captured by the regular expression.
So you can't get the positions/indices.

Related

How to save SubMatches as array and print not empty submatches?

When I try the following Regex code and add a "Add Watch" (Shift + F9) to Matches
Sub TestRegEx1()
Dim regex As Object, Matches As Object
Dim str As String
str = "This is text for the submatches"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "Th(is).+(for).+(submatches)|.+(\d)|([A-Z]{3})"
regex.IgnoreCase = True
Set Matches = regex.Execute(str)
End Sub
I see that Matches is structured like this (with 2 empty submatches):
2 questions:
How can I save in an array variable the SubMatches?
How can I Debug.Print only elements that are not empty?
I've tried doing like below but is not working
Set Arr = Matches.SubMatches
Set Arr = Matches(1).SubMatches
Set Arr = Matches.Item(1).SubMatches
Thanks in advance
Is the following what you intended? Oversize an array at the start and redim at the end. First version prints only non-empty but stores all. Second version prints and stores only non-empty.
You probably want to .Test to ensure there are matches.
Option Explicit
Sub TestRegEx1()
Dim regex As Object, matches As Object, match As Object, subMatch As Variant
Dim str As String, subMatches(), i As Long
ReDim subMatches(0 To 1000)
str = "This is text for the submatches"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "Th(is).+(for).+(submatches)|.+(\d)|([A-Z]{3})"
regex.IgnoreCase = True
Set matches = regex.Execute(str)
For Each match In matches
For Each subMatch In match.subMatches
subMatches(i) = match.subMatches(i)
If Not IsEmpty(subMatches(i)) Then Debug.Print subMatches(i)
i = i + 1
Next
Next
ReDim Preserve subMatches(0 To i)
End Sub
If you only want to store non-empty then
Option Explicit
Sub TestRegEx1()
Dim regex As Object, matches As Object, match As Object, subMatch As Variant
Dim str As String, subMatches(), i As Long
ReDim subMatches(0 To 1000)
str = "This is text for the submatches"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "Th(is).+(for).+(submatches)|.+(\d)|([A-Z]{3})"
regex.IgnoreCase = True
Set matches = regex.Execute(str)
For Each match In matches
For Each subMatch In match.subMatches
subMatches(i) = match.subMatches(i)
If Not IsEmpty(subMatches(i)) Then
Debug.Print subMatches(i)
i = i + 1
End If
Next
Next
ReDim Preserve subMatches(0 To i)
End Sub
You may use a Collection and fill it on the go.
Add
Dim m, coll As Collection
Initialize the collection:
Set coll = New Collection
Then, once you get the matches, use
If Matches.Count > 0 Then ' if there are matches
For Each m In Matches(0).SubMatches ' you need the first match submatches
If Len(m) > 0 Then coll.Add (m) ' if not 0 length, save value to collection
Next
End If
Result of the code with changes:

Excel VBA - Looking up a string with wildcards

Im trying to look up a string which contains wildcards. I need to find where in a specific row the string occurs. The string all take form of "IP##W## XX" where XX are the 2 letters by which I look up the value and the ## are the number wildcards that can be any random number. Hence this is what my look up string looks like :
FullLookUpString = "IP##W## " & LookUpString
I tried using the Find Command to find the column where this first occurs but I keep on getting with errors. Here's what I had so far but it doesn't work :L if anyone has an easy way of doing. Quite new to VBA -.-
Dim GatewayColumn As Variant
Dim GatewayDateColumn As Variant
Dim FirstLookUpRange As Range
Dim SecondLookUpRange As Range
FullLookUpString = "IP##W## " & LookUpString
Set FirstLookUpRange = wsMPNT.Range(wsMPNT.Cells(3, 26), wsMPNT.Cells(3, lcolumnMPNT))
Debug.Print FullLookUpString
GatewayColumn = FirstLookUpRange.Find(What:=FullLookUpString, After:=Range("O3")).Column
Debug.Print GatewayColumn
Per the comment by #SJR you can do this two ways. Using LIKE the pattern is:
IP##W## [A-Z][A-Z]
Using regular expressions, the pattern is:
IP\d{2}W\d{2} [A-Z]{2}
Example code:
Option Explicit
Sub FindString()
Dim ws As Worksheet
Dim rngData As Range
Dim rngCell As Range
Set ws = ThisWorkbook.Worksheets("Sheet1") '<-- set your sheet
Set rngData = ws.Range("A1:A4")
' with LIKE operator
For Each rngCell In rngData
If rngCell.Value Like "IP##W## [A-Z][A-Z]" Then
Debug.Print rngCell.Address
End If
Next rngCell
' with regular expression
Dim objRegex As Object
Dim objMatch As Object
Set objRegex = CreateObject("VBScript.RegExp")
objRegex.Pattern = "IP\d{2}W\d{2} [A-Z]{2}"
For Each rngCell In rngData
If objRegex.Test(rngCell.Value) Then
Debug.Print rngCell.Address
End If
Next rngCell
End Sub
If we can assume that ALL the strings in the row match the given pattern, then we can examine only the last three characters:
Sub FindAA()
Dim rng As Range, r As Range, Gold As String
Set rng = Range(Range("A1"), Cells(1, Columns.Count))
Gold = " AA"
For Each r In rng
If Right(r.Value, 3) = Gold Then
MsgBox r.Address(0, 0)
Exit Sub
End If
Next r
End Sub
Try this:
If FullLookUpString Like "*IP##W##[a-zA-Z][a-zA-Z]*" Then
MsgBox "Match is found"
End If
It will find your pattern (pattern can be surrounded by any characters - that's allowed by *).

Regex extract data before certain text

I have large text documents that has some data I want to be extracted.
As you can see in a screenshot , I want to extract A040 to excel column next to the filename.
Before the A040 there is always three empty spaces and than text Sheet (also in screenshot)
Every file has different number and there is always letter A with three digits and text Sheet. --> example file uploaded:
I has something already in VB with Excel but it is not working.
Dim cell As Range
Dim rng As Range
Dim output As String
Set rng = ws.Range("A1", ws.Range("A1").SpecialCells(xlLastCell).Address)
For Each cell In rng
On Error Resume Next
output = ExtA(cell.Value)
If Len(output) > 0 Then
Range("B" & j) = output
Exit For
End If
Next
j = j + 1
ws.Cells.ClearContents
'Call DelConns
strFileName = Dir 'next file
Loop
End Sub
Function ExtA(ByVal text As String) As String
'REGEX Match VBA in excel
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(?<=Sheet)[^Sheet]*\ Sheet"
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
If allMatches.Count <> 0 Then
result = allMatches.Item(0).submatches.Item(0)
End If
ExtA = result
End Function
This seems to work on your sample.
Option Explicit
Function AthreeDigits(str As String)
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
Else
Set cmat = Nothing
End If
AthreeDigits = vbNullString
With rgx
.Global = False
.MultiLine = True
.Pattern = "\A[0-9]{3}[\s]{3}Sheet"
If .Test(str) Then
Set cmat = .Execute(str)
AthreeDigits = Left(cmat.Item(0), 4)
End If
End With
End Function
Did you mean to say that there are 4 spaces after the A040 and before the "Sheet"? If so, try this pattern:
.pattern = "(A\d\d\d)\s{3}Sheet"
EDIT: I thought you said 4 spaces, but you said 3. My pattern now reflects that.
EDIT 2: (I need more coffee!) Change the \b to \s.
See Example here
"\s+[Aa]\d*\s+Sheet"
Or
\s+[Aa]\d*\s+(Sheet)
Or
[Aa]\d*\s+(Sheet)
Demo
https://regex101.com/r/Qo8iUf/3
\s+ Matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible
Aa Matches a single character in the list Aa (case sensitive)
\d* Matches a digit (equal to [0-9])
* Quantifier — Matches between zero and unlimited times, as many times as possible

Extract four numbers without brackets from a bracketed entry, if entry exists

What I have:
A list of about 1000 titles of reports in column B.
Some of these titles have a four digit number surrounded by brackets (eg: (3672)) somewhere in a string of text and numbers.
I want to extract these four numbers - without brackets - in column C in the same row.
If there is no four digit number with brackets in column B, then to return "" in column C.
What I have so far:
I can successfully identify the cells in column B which have four digits surrounded by brackets. The problem is it returns the whole title including the four numbers.
Taken from: VBA RegEx extracting data from within a string
NB: I am Using Excel Professional Plus 2010, have checked the box next to "Microsoft VBScript Regular Expressions 5.5".
Sub ExtractTicker()
Dim regEx
Dim i As Long
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = True
regEx.pattern = "(\()([0-9]{4})(\))"
For i = 2 To ActiveSheet.UsedRange.Rows.Count
If (regEx.Test(Cells(i, 2).Value)) Then
Cells(i, 3).Value = regEx.Replace(Cells(i, 2).Value, "$2")
End If
Next i
End Sub
Try
regEx.pattern = "(.*\()([0-9]{4})(\).*)"
the .* and the start and end of the string ensure you capture the entire string, then this is fully substituted by the 2nd submatch ([0-9]{4})
To fully optimise the code
use variant arrays rather than ranges
setting Global and IgnoreCase is redundant when you are running a case insensitive match on the full string
you are using late binding so you dont need the Reference
code
Sub ExtractTicker()
Dim regEx As Object
Dim pattern As String
Dim X
Dim lngCNt As Long
X = Range([b1], Cells(Rows.Count, "B").End(xlUp)).Value2
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.pattern = "(.*\()([0-9]{4})(\).*)"
For lngCNt = 1 To UBound(X)
If .Test(X(lngCNt, 1)) Then
X(lngCNt, 1) = .Replace(X(lngCNt, 1), "$2")
Else
X(lngCNt, 1) = vbNullString
End If
Next
End With
[c1].Resize(UBound(X, 1), 1).Value2 = X
End Sub

Split a string according to a regexp in VBScript

I would like to split a string into an array according to a regular expression similar to what can be done with preg_split in PHP or VBScript Split function but with a regex in place of delimiter.
Using VBScript Regexp object, I can execute a regex but it returns the matches (so I get a collection of my splitters... that's not what I want)
Is there a way to do so ?
Thank you
If you can reserve a special delimiter string, i.e. a string that you can choose that will never be a part of the real input string (perhaps something like "###"), then you can use regex replacement to replace all matches of your pattern to "###", and then split on "###".
Another possibility is to use a capturing group. If your delimiter regex is, say, \d+, then you search for (.*?)\d+, and then extract what the group captured in each match (see before and after on rubular.com).
You can alway use the returned array of matches as input to the split function. You split the original string using the first match - the first part of the string is the first split, then split the remainder of the string (minus the first part and the first match)... continue until done.
I wrote this for my use. Might be what you're looking for.
Function RegSplit(szPattern, szStr)
Dim oAl, oRe, oMatches
Set oRe = New RegExp
oRe.Pattern = "^(.*)(" & szPattern & ")(.*)$"
oRe.IgnoreCase = True
oRe.Global = True
Set oAl = CreateObject("System.Collections.ArrayList")
Do
Set oMatches = oRe.Execute(szStr)
If oMatches.Count > 0 Then
oAl.Add oMatches(0).SubMatches(2)
szStr = oMatches(0).SubMatches(0)
Else
oAl.Add szStr
Exit Do
End If
Loop
oAl.Reverse
RegSplit = oAl.ToArray
End Function
'**************************************************************
Dim A
A = RegSplit("[,|;|#]", "bob,;joe;tony#bill")
WScript.Echo Join(A, vbCrLf)
Returns:
bob
joe
tony
bill
I think you can achieve this by using Execute to match on the required splitter string, but capturing all the preceding characters (after the previous match) as a group. Here is some code that could do what you want.
'// Function splits a string on matches
'// against a given string
Function SplitText(strInput,sFind)
Dim ArrOut()
'// Don't do anything if no string to be found
If len(sFind) = 0 then
redim ArrOut(0)
ArrOut(0) = strInput
SplitText = ArrOut
Exit Function
end If
'// Define regexp
Dim re
Set re = New RegExp
'// Pattern to be found - i.e. the given
'// match or the end of the string, preceded
'// by any number of characters
re.Pattern="(.*?)(?:" & sFind & "|$)"
re.IgnoreCase = True
re.Global = True
'// find all the matches >> match collection
Dim oMatches: Set oMatches = re.Execute( strInput )
'// Prepare to process
Dim oMatch
Dim ix
Dim iMax
'// Initialize the output array
iMax = oMatches.Count - 1
redim arrOut( iMax)
'// Process each match
For ix = 0 to iMax
'// get the match
Set oMatch = oMatches(ix)
'// Get the captured string that precedes the match
arrOut( ix ) = oMatch.SubMatches(0)
Next
Set re = nothing
'// Check if the last entry was empty - this
'// removes one entry if the string ended on a match
if arrOut(iMax) = "" then Redim Preserve ArrOut(iMax-1)
'// Return the processed output
SplitText = arrOut
End Function