RegEx VBA Excel complex string - regex

I have a function pulled from here. My problem is that I don't know what RegEx pattern I need to use to split out the following data:
+1 vorpal unholy longsword +31/+26/+21/+16 (2d6+13)
+1 vorpal flaming whip +30/+25/+20 (1d4+7 plus 1d6 fire and entangle)
2 slams +31 (1d10+12)
I want it to look like:
+1 vorpal unholy longsword, 31
+1 vorpal flaming whip, 30
2 slams, 31
Here is the VBA code that does the RegExp validation:
Public Function RXGET(ByRef find_pattern As Variant, _
ByRef within_text As Variant, _
Optional ByVal submatch As Long = 0, _
Optional ByVal start_num As Long = 0, _
Optional ByVal case_sensitive As Boolean = True) As Variant
' RXGET - Looks for a match for regular expression pattern find_pattern
' in the string within_text and returns it if found, error otherwise.
' Optional long submatch may be used to return the corresponding submatch
' if specified - otherwise the entire match is returned.
' Optional long start_num specifies the number of the character to start
' searching for in within_text. Default=0.
' Optional boolean case_sensitive makes the regex pattern case sensitive
' if true, insensitive otherwise. Default=true.
Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim vbsMatch As VBScript_RegExp_55.Match
Dim colSubMatch As VBScript_RegExp_55.SubMatches
Dim sMatchString As String
Set objRegex = New VBScript_RegExp_55.RegExp
' Initialise Regex object
With objRegex
.Global = False
' Default is case sensitive
If case_sensitive Then
.IgnoreCase = False
Else: .IgnoreCase = True
End If
.pattern = find_pattern
End With
' Return out of bounds error
If start_num >= Len(within_text) Then
RXGET = CVErr(xlErrNum)
Exit Function
End If
sMatchString = Right$(within_text, Len(within_text) - start_num)
' Create Match collection
Set colMatch = objRegex.Execute(sMatchString)
If colMatch.Count = 0 Then ' No match
RXGET = CVErr(xlErrNA)
Else
Set vbsMatch = colMatch(0)
If submatch = 0 Then ' Return match value
RXGET = vbsMatch.Value
Else
Set colSubMatch = vbsMatch.SubMatches ' Use the submatch collection
If colSubMatch.Count < submatch Then
RXGET = CVErr(xlErrNum)
Else
RXGET = CStr(colSubMatch(submatch - 1))
End If
End If
End If
End Function

I don't know about Excel but this should get you started on the RegEx:
/(?:^|, |and |or )(\+?\d?\s?[^\+]*?) (?:\+|-)(\d+)/
NOTE: There is a slight caveat here. This will also match if an element begins with + only (not being followed by a digit).
Capture groups 1 and 2 contain the strings that go left and right of your comma (if the whole pattern has index 0). So you can something like capture[1] + ', ' + capture[2] (whatever your syntax for that is).
Here is an explanation of the regex:
/(?:^|, |and |or ) # make sure that we only start looking after
# the beginning of the string, after a comma, after an
# and or after an or; the "?:" makes sure that this
# subpattern is not capturing
(\+? # a literal "+"
\d+ # at least one digit
# a literal space
[^+]*?) # arbitrarily many non-plus characters; the ? makes it
# non-greedy, otherwise it might span multiple lines
# a literal space
\+ # a literal "+"
(\d+)/ # at least one digit (and the brakets are for capturing)

Related

Extract an 8 digits number from a string with additional conditions

I need to extract a number from a string with several conditions.
It has to start with 1-9, not with 0, and it will have 8 digits. Like 23242526 or 65478932
There will be either an empty space or a text variable before it. Like MMX: 23242526 or bgr65478932
It could have come in rare cases: 23,242,526
It ends with an emty space or a text variable.
Here are several examples:
From RE: Markitwire: 120432889: Mx: 24,693,059 i need to get 24693059
From Automatic reply: Auftrag zur Übertragung IRD Ref-Nr. MMX_23497152 need to get 23497152
From FW: CGMSE 2019-2X A1AN XS2022418672 Contract 24663537 need to get 24663537
From RE: BBVA-MAD MMX_24644644 + MMX_24644645 need to get 24644644, 24644645
Right now I'm using the regexextract function(found it on this web-site), which extracts any number with 8 digits starting with 2. However it would also extract a number from, let's say, this expression TGF00023242526, which is incorrect. Moreover, I don't know how to add additional conditions to the code.
=RegexExtract(A11, ""(2\d{7})\b"", ", ")
Thank you in advance.
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional seperator As String = "") As String
Dim i As Long, j As Long
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = extract_what
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.Count - 1
For j = 0 To allMatches.Item(i).SubMatches.Count - 1
result = result & seperator & allMatches.Item(i).SubMatches.Item(j)
Next
Next
If Len(result) <> 0 Then
result = Right(result, Len(result) - Len(seperator))
End If
RegexExtract = result
End Function
You may create a custom boundary using a non-capturing group before the pattern you have:
(?:[\D0]|^)(2\d{7})\b
^^^^^^^^^^^
The (?:[\D0]|^) part matches either a non-digit (\D) or 0 or (|) start of string (^).
As an alternative to also match 8 digits in values like 23,242,526 and start with a digit 1-9 you might use
\b[1-9](?:,?\d){7}\b
\b Word boundary
[1-9] Match the firstdigit 1-9
(?:,?\d){7} Repeat 7 times matching an optional comma and a digit
\b Word boundary
Regex demo
Then you could afterwards replace the comma's with an empty string.

Excel VBA RegEx that extracts numbers from price values in range (has commas, $ and -)

I have a field data extracted from a database which represents a range of values, but it's coming in Excel as a String format $86,000 - $162,000.
I need to extract the minimum value and the maximum value from each cell, so I need to extract the numeric portion of it, and ignore the $, - and the ,.
I've attached an image of the data I have, and the values I want to extract from it.
This is the closest pattern I got with RegEx, but I'ts not what I'm looking for.
Pattern = (\d+)(?:\.(\d{1,2}))?
Can anyone assist ?
Just wondering why Regex?
Function GetParts(priceRange As String) As Double()
Dim arr() As String
Dim parts() As Double
If InStr(1, priceRange, "-") > 0 Then
arr = Split(priceRange, "-")
ReDim parts(0 To UBound(arr))
Dim i As Long
For i = 0 To UBound(arr)
parts(i) = CDbl(Replace$(Replace$(Trim$(arr(i)), "$", ""), ",", ""))
Next i
End If
GetParts = parts
End Function
Sub test()
MsgBox GetParts("$14,000 - $1,234,567")(0) 'Minimum
End Sub
EDIT
Yet you could do this with regex to match the data string into the parts:
Function GetPartsRegEx(priceRange As String) As Variant
Dim arr() As Double
Dim pricePattern As String
pricePattern = "(\$?\d+[\,\.\d]*)"
'START EDIT
Static re As RegExp
If re Is Nothing Then
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = pricePattern & "\s*[\-]\s*" & pricePattern 'look for the pattern first
End If
Static nums As RegExp
If nums Is Nothing Then
Set nums = New RegExp
'to remove all non digits, except decimal point in case you have pennies
nums.Pattern = "[^0-9.]"
nums.Global = True
End If
'END EDIT
If re.test(priceRange) Then
ReDim arr(0 To 1) ' fill return array
arr(0) = CDbl(nums.Replace(re.Replace(priceRange, "$1"), ""))
arr(1) = CDbl(nums.Replace(re.Replace(priceRange, "$2"), ""))
Else
'do some error handling here
Exit Function
End If 'maybe throw error if no +ve test or
GetPartsRegEx = arr
End Function
Sub test()
MsgBox GetPartsRegEx("$1,005.45 - $1,234,567.88")(1)
End Sub
Here is quick Example Demo https://regex101.com/r/RTNlVF/1
Pattern "^\$(\d+\,\d+)\s\-\s\$(\d+\,\d+)"
Option Explicit
Private Sub Example()
Dim RegExp As New RegExp
Dim Pattern As String
Dim CelValue As String
Dim rng As Range
Dim Cel As Range
Set rng = ActiveWorkbook.Sheets("Sheet1" _
).Range("A2", Range("A9999" _
).End(xlUp))
For Each Cel In rng
DoEvents
Pattern = "^\$(\d+\,\d+)\s\-\s\$(\d+\,\d+)"
If Pattern <> "" Then
With RegExp
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = Pattern
End With
If RegExp.Test(Cel.Value) Then
' Debug.Print Cel.Value
Debug.Print RegExp.Replace(CStr(Cel), "$1")
Debug.Print RegExp.Replace(CStr(Cel), "$2")
End If
End If
Next
End Sub
Without a loop (but still no regex):
Sub Split()
With Columns("B:B")
.Replace What:="$", Replacement:=""
Application.CutCopyMode = False
.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, Other:=True, OtherChar:="-", FieldInfo:=Array(Array(1, 1), Array(2, 1))
End With
Columns("B:C").Insert Shift:=xlToRight
Columns("D:E").NumberFormat = "0"
Range("D1").FormulaR1C1 = "Min Value"
Range("E1").FormulaR1C1 = "Max Value"
With Range("D1:E1").Interior
.Pattern = xlSolid
.PatternColorIndex = xlAutomatic
.Color = 12611584
End With
With Range("D1:E1").Font
.ThemeColor = xlThemeColorDark1
.TintAndShade = 0
End With
End Sub
I made this function:
Hope it helps.
Code:
Function ExtractNumber(ByVal TextInput As String, _
Optional ByVal Position As Byte = 0, _
Optional ByVal Delimiter As String = "-") As Variant
' You can use this function in a subprocess that
' writes the values in the cells you want, or
' you can use it directly in the ouput cells
' Variables
Dim RemoveItems(2) As String
Dim Aux As Variant
' The variable RemoveItems is an array
' containing the characters you want to remove
RemoveItems(0) = "."
RemoveItems(1) = ","
RemoveItems(2) = " "
' STEP 1 - The variable Aux will store the text
' given as input
Aux = TextInput
' STEP 2 - Characters stored in the variable
' RemoveItems will be removed from Aux
For i = 0 To UBound(RemoveItems)
Aux = Replace(Aux, RemoveItems(i), "")
Next i
' STEP 3 - Once Aux is "clean", it will be
' transformed into an array containing the
' values separated by the delimiter
' As you can see at the function's header,
' Delimiter default value is "-". You can change
' it depending on the situation
Aux = Split(Aux, Delimiter)
' STEP 4 - The result of this function will be
' a numeric value. So, if the value of the
' selected position in Aux is not numeric it will
' remove the first character assuming it is a
' currency symbol.
' If something fails in the process the function
' will return "ERROR", so you can know you may
' verify the inputs or adjust this code for
' your needs.
On Error GoTo ErrHndl
If Not IsNumeric(Aux(Position)) Then
ExtractNumber = CLng(Mid(Aux(Position), 2))
Else
ExtractNumber = CLng(Aux(Position))
End If
Exit Function
ErrHndl:
ExtractNumber = "ERROR"
End Function
You can even do this with just worksheet formulas. Under certain circumstances, Excel will ignore the $ and ,. The double unary converts the returned string to a numeric value.
First Value: =--LEFT(A1,FIND("-",A1)-1)
Second Value: =--MID(A1,FIND("-",A1)+1,99)

VBA - Modify sheet naming from source file

I received help in the past for an issue regarding grabbing a source file name and naming a newly created worksheet the date from said source file name, i.e. "010117Siemens Hot - Cold Report.xls" and outputting "010117".
However the code only works for file names with this exact format, for example, file named "Siemens Hot - Cold Report 010117.xls", an error occurs because the newly created sheet does not find the date in the source file.
CODE
Application.ScreenUpdating = False
Dim n As Double
Dim wksNew As Excel.Worksheet
Dim src As Workbook
Set src = Workbooks.Open(filePath, False, False)
Dim srcRng As Range
With src.Worksheets("Sheet1")
Set srcRng = .Range(.Range("A1"), .Range("A1").End(xlDown).End(xlToRight))
End With
With ThisWorkbook
Set wksNew = .Worksheets.Add(After:=.Worksheets(.Sheets.Count))
n = .Sheets.Count
.Worksheets(n).Range("A1").Resize(srcRng.Rows.Count, srcRng.Columns.Count).Value = srcRng.Value
End With
' ======= get the digits part from src.Name using a RegEx object =====
' RegEx variables
Dim Reg As Object
Dim RegMatches As Variant
Set Reg = CreateObject("VBScript.RegExp")
With Reg
.Global = True
.IgnoreCase = True
.Pattern = "\d{0,9}" ' Match any set of 0 to 9 digits
End With
Set RegMatches = Reg.Execute(src.Name)
On Error GoTo CloseIt
If RegMatches.Count >= 1 Then ' make sure there is at least 1 match
ThisWorkbook.Worksheets(n).Name = RegMatches(0) ' rename "Sheet2" to the numeric part of the filename
End If
src.Close False
Set src = Nothing
So, my question is, how can I get my code to recognize the string of digits no matter its position in the file name?
Code
^\d{0,9}\B|\b\d{0,9}(?=\.)
Usage
I decided to make a function that can be called inside a cell as such: =GetMyNum(x) where x is a pointer to a cell (i.e. A1).
To get the code below to work:
Open Microsoft Visual Basic for Applications (ALT + F11)
Insert a new module (right click in the Project Pane and select Insert -> Module).
Click Tools -> References and find Microsoft VBScript Regular Expressions 5.5, enable it and click OK
Now copy/paste the following code into the new module:
Option Explicit
Function GetMyNum(Myrange As Range) As String
Dim regEx As New RegExp
Dim strPattern As String
Dim strInput As String
Dim strReplace As String
Dim strOutput As String
Dim match As Object
strPattern = "^\d{0,9}\B|\b\d{0,9}(?=\.)"
If strPattern <> "" Then
strInput = Myrange.Value
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern
End With
If regEx.test(strInput) Then
Set match = regEx.Execute(strInput)
GetMyNum = match.Item(0)
Else
GetMyNum = ""
End If
End If
End Function
Results
Input
A1: Siemens Hot - Cold Report 010117.xls
A2: 010117Siemens Hot - Cold Report.xls
B1: =GetMyNum(A1)
B2: =GetMyNum(A1)
Output
010117 # Contents of B1
010117 # Contents of B2
Explanation
I will explain each regex option separately. You can reorder the options in terms of importance in such a way that the most important option is first and least important is last.
^\d{0,9}\B Match the following
^ Assert position at the start of the line
\d{0,9} Match any digit 0-9 times
\B Ensure position does not match where a word boundary matches (this is used but may be dropped depending on usage - I added it because it seems the number you're trying to get is immediately followed by a word character and not followed by a space - if that's not always the case just remove this token)
\b\d{0,9}(?=\.) Match the following
\b Assert position as a word boundary
\d{0,9} Match any digit 0-9 times
(?=\.) Positive lookahead ensuring a literal dot . follows
Just my alternative solution to RegEx :)
This finds the first occurence of 6 consecutive digits, omitting blanks and periods... although there are probably some more issues with using IsNumeric as I believe a lowercase e is considered acceptable by it...
Sub FindTheNumber()
For i = 1 To Len(Range("A1").Value)
If IsNumeric(Mid(Range("A1").Value, i, 6)) = True And InStr(Mid(Range("A1").Value, i, 6), " ") = 0 And InStr(Mid(Range("A1").Value, i, 6), ".") = 0 Then
MyNumber = Mid(Range("A1").Value, i, 6)
Debug.Print MyNumber
Exit For
End If
Next i
For i = 1 To Len(Range("A2").Value)
If IsNumeric(Mid(Range("A2").Value, i, 6)) = True And InStr(Mid(Range("A2").Value, i, 6), " ") = 0 And InStr(Mid(Range("A2").Value, i, 6), ".") = 0 Then
MyNumber = Mid(Range("A2").Value, i, 6)
Debug.Print MyNumber
Exit For
End If
Next i
End Sub
Examples:
Immediate window:

Regex extract data before certain text

I have large text documents that has some data I want to be extracted.
As you can see in a screenshot , I want to extract A040 to excel column next to the filename.
Before the A040 there is always three empty spaces and than text Sheet (also in screenshot)
Every file has different number and there is always letter A with three digits and text Sheet. --> example file uploaded:
I has something already in VB with Excel but it is not working.
Dim cell As Range
Dim rng As Range
Dim output As String
Set rng = ws.Range("A1", ws.Range("A1").SpecialCells(xlLastCell).Address)
For Each cell In rng
On Error Resume Next
output = ExtA(cell.Value)
If Len(output) > 0 Then
Range("B" & j) = output
Exit For
End If
Next
j = j + 1
ws.Cells.ClearContents
'Call DelConns
strFileName = Dir 'next file
Loop
End Sub
Function ExtA(ByVal text As String) As String
'REGEX Match VBA in excel
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(?<=Sheet)[^Sheet]*\ Sheet"
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
If allMatches.Count <> 0 Then
result = allMatches.Item(0).submatches.Item(0)
End If
ExtA = result
End Function
This seems to work on your sample.
Option Explicit
Function AthreeDigits(str As String)
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
Else
Set cmat = Nothing
End If
AthreeDigits = vbNullString
With rgx
.Global = False
.MultiLine = True
.Pattern = "\A[0-9]{3}[\s]{3}Sheet"
If .Test(str) Then
Set cmat = .Execute(str)
AthreeDigits = Left(cmat.Item(0), 4)
End If
End With
End Function
Did you mean to say that there are 4 spaces after the A040 and before the "Sheet"? If so, try this pattern:
.pattern = "(A\d\d\d)\s{3}Sheet"
EDIT: I thought you said 4 spaces, but you said 3. My pattern now reflects that.
EDIT 2: (I need more coffee!) Change the \b to \s.
See Example here
"\s+[Aa]\d*\s+Sheet"
Or
\s+[Aa]\d*\s+(Sheet)
Or
[Aa]\d*\s+(Sheet)
Demo
https://regex101.com/r/Qo8iUf/3
\s+ Matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible
Aa Matches a single character in the list Aa (case sensitive)
\d* Matches a digit (equal to [0-9])
* Quantifier — Matches between zero and unlimited times, as many times as possible

Extracting Parenthetical Data Using Regex

I have a small sub that extracts parenthetical data (including parentheses) from a string and stores it in cells adjacent to the string:
Sub parens()
Dim s As String, i As Long
Dim c As Collection
Set c = New Collection
s = ActiveCell.Value
ary = Split(s, ")")
For i = LBound(ary) To UBound(ary) - 1
bry = Split(ary(i), "(")
c.Add "(" & bry(1) & ")"
Next i
For i = 1 To c.Count
ActiveCell.Offset(0, i).NumberFormat = "#"
ActiveCell.Offset(0, i).Value = c.Item(i)
Next i
End Sub
For example:
I am now trying to replace this with some Regex code. I am NOT a regex expert. I want to create a pattern that looks for an open parenthesis followed by zero or more characters of any type followed by a close parenthesis.
I came up with:
\((.+?)\)
My current new code is:
Sub qwerty2()
Dim inpt As String, outpt As String
Dim MColl As MatchCollection, temp2 As String
Dim regex As RegExp, L As Long
inpt = ActiveCell.Value
MsgBox inpt
Set regex = New RegExp
regex.Pattern = "\((.+?)\)"
Set MColl = regex.Execute(inpt)
MsgBox MColl.Count
temp2 = MColl(0).Value
MsgBox temp2
End Sub
The code has at least two problems:
It will only get the first match in the string.(Mcoll.Count is always 1)
It will not recognize zero characters between the parentheses. (I think the .+? requires at least one character)
Does anyone have any suggestions ??
By default, RegExp Global property is False. You need to set it to True.
As for the regex, to match zero or more chars as few as possible, you need *?, not +?. Note that both are lazy (match as few as necessary to find a valid match), but + requires at least one char, while * allows matching zero chars (an empty string).
Thus, use
Set regex = New RegExp
regex.Global = True
regex.Pattern = "\((.*?)\)"
As for the regex, you can also use
regex.Pattern = "\(([^()]*)\)"
where [^()] is a negated character class matching any char but ( and ), zero or more times (due to * quantifier), matching as many such chars as possible (* is a greedy quantifier).