Extract an 8 digits number from a string with additional conditions - regex

I need to extract a number from a string with several conditions.
It has to start with 1-9, not with 0, and it will have 8 digits. Like 23242526 or 65478932
There will be either an empty space or a text variable before it. Like MMX: 23242526 or bgr65478932
It could have come in rare cases: 23,242,526
It ends with an emty space or a text variable.
Here are several examples:
From RE: Markitwire: 120432889: Mx: 24,693,059 i need to get 24693059
From Automatic reply: Auftrag zur Übertragung IRD Ref-Nr. MMX_23497152 need to get 23497152
From FW: CGMSE 2019-2X A1AN XS2022418672 Contract 24663537 need to get 24663537
From RE: BBVA-MAD MMX_24644644 + MMX_24644645 need to get 24644644, 24644645
Right now I'm using the regexextract function(found it on this web-site), which extracts any number with 8 digits starting with 2. However it would also extract a number from, let's say, this expression TGF00023242526, which is incorrect. Moreover, I don't know how to add additional conditions to the code.
=RegexExtract(A11, ""(2\d{7})\b"", ", ")
Thank you in advance.
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional seperator As String = "") As String
Dim i As Long, j As Long
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = extract_what
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.Count - 1
For j = 0 To allMatches.Item(i).SubMatches.Count - 1
result = result & seperator & allMatches.Item(i).SubMatches.Item(j)
Next
Next
If Len(result) <> 0 Then
result = Right(result, Len(result) - Len(seperator))
End If
RegexExtract = result
End Function

You may create a custom boundary using a non-capturing group before the pattern you have:
(?:[\D0]|^)(2\d{7})\b
^^^^^^^^^^^
The (?:[\D0]|^) part matches either a non-digit (\D) or 0 or (|) start of string (^).

As an alternative to also match 8 digits in values like 23,242,526 and start with a digit 1-9 you might use
\b[1-9](?:,?\d){7}\b
\b Word boundary
[1-9] Match the firstdigit 1-9
(?:,?\d){7} Repeat 7 times matching an optional comma and a digit
\b Word boundary
Regex demo
Then you could afterwards replace the comma's with an empty string.

Related

VBA Excel Replace last 2 digits of number if occurs at beginning of string

I'm trying to replace the last two digits of number with "XX BLOCK" if it occurs at the start of the string and has more than 2 digits.
I'm using the Microsoft VBScript Regular Expressions 5.5 reference.
Dim regEx As New RegExp
With regEx
.Global = True 'Matches whole string, not just first occurrence
.IgnoreCase = True 'Matches upper or lowercase
.MultiLine = True 'Checks each line in case cell has multiple lines
.pattern = "^(\d{2,})" 'Checks beginning of string for at least 2 digits
End With
'cell is set earlier in code not shown, refers to an Excel cell
regEx.replace(cell.Value, "XX BLOCK")
Desired results:
"1091 foo address" --> "10XX BLOCK foo address"
"1016 foo 1010 address" --> "10XX BLOCK foo 1010 address"
"foo 1081 address" --> "foo 1081 address"
"10 bar address" --> "XX BLOCK bar address"
"8 baz address" --> "8 baz address"
I'm new to regex and not sure where to start. I tried using ^(\d{2,}) but then it replaces the entire number.
There is also a guarantee that the number (if exists) will always be followed with a white space.
You may use
^(\d*)\d{2}\b
Or, if you cannot rely on a word boundary, you may also use
^(\d*)\d{2}(?!\d) ' no digit is allowed after the 2-digit sequence
^(\d*)\d{2}(?!\S) ' a whitespace boundary
And replace with $1XX BLOCK.
See the regex demo.
Details
^ - start of string
(\d*) - Group 1: zero or more digits
\d{2} - two digits
\b - a word boundary, no digit, letter or _ is allowed right after the two digits
(?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current location
(?!\S) - a negative lookahead that fails the match if there is a non-whitespace char immediately to the right of the current location.
https://regex101.com/r/M1QrPZ/1
Pattern = "^\d{2}(\d{2})"
Try the following
Option Explicit
Private Sub Example()
Dim RegExp As New RegExp
Dim Pattern As String
Dim rng As Range
Dim Cel As Range
Set rng = ActiveWorkbook.Sheets("Sheet1" _
).Range("A1", Range("A9999" _
).End(xlUp))
Dim Matches As Variant
For Each Cel In rng
DoEvents
Pattern = "^\d{2}(\d{2})"
If Pattern <> "" Then
With RegExp
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = Pattern
Set Matches = .Execute(Cel.Value)
End With
If Matches.Count > 0 Then
Debug.Print Matches(0) ' full mach
Debug.Print Matches(0).SubMatches(0) ' sub match
Cel.Value = Replace(CStr(Cel), Matches(0).SubMatches(0), "XX BLOCK")
End If
End If
Next
End Sub

Extract largest numeric sequence from string (regex, or?)

I have strings similar to the following:
4123499-TESCO45-123
every99999994_54
And I want to extract the largest numeric sequence in each string, respectively:
4123499
99999994
I have previously tried regex (I am using VB6)
Set rx = New RegExp
rx.Pattern = "[^\d]"
rx.Global = True
StringText = rx.Replace(StringText, "")
Which gets me partway there, but it only removes the non-numeric values, and I end up with the first string looking like:
412349945123
Can I find a regex that will give me what I require, or will I have to try another method? Essentially, my pattern would have to be anything that isn't the longest numeric sequence. But I'm not actually sure if that is even a reasonable pattern. Could anyone with a better handle of regex tell me if I am going down a rabbit hole? I appreciate any help!
You cannot get the result by just a regex. You will have to extract all numeric chunks and get the longest one using other programming means.
Here is an example:
Dim strPattern As String: strPattern = "\d+"
Dim str As String: str = "4123499-TESCO45-123"
Dim regEx As New RegExp
Dim matches As MatchCollection
Dim match As Match
Dim result As String
With regEx
.Global = True
.MultiLine = False
.IgnoreCase = False
.Pattern = strPattern
End With
Set matches = regEx.Execute(str)
For Each m In matches
If result < Len(m.Value) Then result = m.Value
Next
Debug.Print result
The \d+ with RegExp.Global=True will find all digit chunks and then only the longest will be printed after all matches are processed in a loop.
That's not solvable with an RE on its own.
Instead you can simply walk along the string tracking the longest consecutive digit group:
For i = 1 To Len(StringText)
If IsNumeric(Mid$(StringText, i, 1)) Then
a = a & Mid$(StringText, i, 1)
Else
a = ""
End If
If Len(a) > Len(longest) Then longest = a
Next
MsgBox longest
(first result wins a tie)
If the two examples you gave, are of a standard where:
<long_number>-<some_other_data>-<short_number>
<text><long_number>_<short_number>
Are the two formats that the strings come in, there are some solutions.
However, if you are searching any string in any format for the longest number, these will not work.
Solution 1
([0-9]+)[_-].*
See the demo
In the first capture group, you should have the longest number for those 2 formats.
Note: This assumes that the longest number will be the first number it encounters with an underscore or a hyphen next to it, matching those two examples given.
Solution 2
\d{6,}
See the demo
Note: This assumes that the shortest number will never exceed 5 characters in length, and the longest number will never be shorter than 6 characters in length
Please, try.
Pure VB. No external libs or objects.
No brain-breaking regexp's patterns.
No string manipulations, so - speed. Superspeed. ~30 times faster than regexp :)
Easy transform on variouse needs.
For example, concatenate all digits from the source string to a single string.
Moreover, if target string is only intermediate step,
so it's possible to manipulate with numbers only.
Public Sub sb_BigNmb()
Dim sSrc$, sTgt$
Dim taSrc() As Byte, taTgt() As Byte, tLB As Byte, tUB As Byte
Dim s As Byte, t As Byte, tLenMin As Byte
tLenMin = 4
sSrc = "every99999994_54"
sTgt = vbNullString
taSrc = StrConv(sSrc, vbFromUnicode)
tLB = LBound(taSrc)
tUB = UBound(taSrc)
ReDim taTgt(tLB To tUB)
t = 0
For s = tLB To tUB
Select Case taSrc(s)
Case 48 To 57
taTgt(t) = taSrc(s)
t = t + 1
Case Else
If CBool(t) Then Exit For ' *** EXIT FOR ***
End Select
Next
If (t > tLenMin) Then
ReDim Preserve taTgt(tLB To (t - 1))
sTgt = StrConv(taTgt, vbUnicode)
End If
Debug.Print "'" & sTgt & "'"
Stop
End Sub
How to handle sSrc = "ev_1_ery99999994_54", please, make by yourself :)
.

Extracting Parenthetical Data Using Regex

I have a small sub that extracts parenthetical data (including parentheses) from a string and stores it in cells adjacent to the string:
Sub parens()
Dim s As String, i As Long
Dim c As Collection
Set c = New Collection
s = ActiveCell.Value
ary = Split(s, ")")
For i = LBound(ary) To UBound(ary) - 1
bry = Split(ary(i), "(")
c.Add "(" & bry(1) & ")"
Next i
For i = 1 To c.Count
ActiveCell.Offset(0, i).NumberFormat = "#"
ActiveCell.Offset(0, i).Value = c.Item(i)
Next i
End Sub
For example:
I am now trying to replace this with some Regex code. I am NOT a regex expert. I want to create a pattern that looks for an open parenthesis followed by zero or more characters of any type followed by a close parenthesis.
I came up with:
\((.+?)\)
My current new code is:
Sub qwerty2()
Dim inpt As String, outpt As String
Dim MColl As MatchCollection, temp2 As String
Dim regex As RegExp, L As Long
inpt = ActiveCell.Value
MsgBox inpt
Set regex = New RegExp
regex.Pattern = "\((.+?)\)"
Set MColl = regex.Execute(inpt)
MsgBox MColl.Count
temp2 = MColl(0).Value
MsgBox temp2
End Sub
The code has at least two problems:
It will only get the first match in the string.(Mcoll.Count is always 1)
It will not recognize zero characters between the parentheses. (I think the .+? requires at least one character)
Does anyone have any suggestions ??
By default, RegExp Global property is False. You need to set it to True.
As for the regex, to match zero or more chars as few as possible, you need *?, not +?. Note that both are lazy (match as few as necessary to find a valid match), but + requires at least one char, while * allows matching zero chars (an empty string).
Thus, use
Set regex = New RegExp
regex.Global = True
regex.Pattern = "\((.*?)\)"
As for the regex, you can also use
regex.Pattern = "\(([^()]*)\)"
where [^()] is a negated character class matching any char but ( and ), zero or more times (due to * quantifier), matching as many such chars as possible (* is a greedy quantifier).

VBA regexp to check special symbols

I have tried to use what I've learnt in this post,
and now I want to compose a RegExp which checks whether a string contains digits and commas. For example, "1,2,55,2" should be ok, whereas "a,2,55,2" or "1.2,55,2" should fail test. My code:
Private Function testRegExp(str, pattern) As Boolean
Dim regEx As New RegExp
If pattern <> "" Then
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.pattern = pattern
End With
If regEx.Test(str) Then
testRegExp = True
Else
testRegExp = False
End If
Else
testRegExp = True
End If
End Function
Public Sub foo()
MsgBox testRegExp("2.d", "[0-9]+")
End Sub
MsgBox yields true instead of false. What's the problem ?
Your regex matches a partial string, it matches a digit in all 55,2, a,2,55,2, 1.2,55,2 input strings.
Use anchors ^ and $ to enforce a full string match and add a comma to the character class as you say you want to match strings that only contain digits and commas:
MsgBox testRegExp("2.d", "^[0-9,]*$")
^ ^ ^
I also suggest using * quantifier to match 0 or more occurrences, rather than + (1 or more occurrences), but it is something you need to decide for yourself (whether you want to allow an empty string match or not).
Here is the regex demo. Note it is for PCRE regex flavor, but this regex will perform similarly in VBA.
Yes, as #Chaz suggests, if you do not need to match the string/line itself, the alternative is to match an inverse character class:
MsgBox testRegExp("2.d", "[^0-9,]")
This way, the negated character class [^0-9,] will match any character but a comma / digit, invalidating the string. If the result is True, it will mean the string contains some characters other than digits and a comma.
You can use the limited built in pattern matching for that:
function isOk(str) As boolean
for i = 1 To len(str)
if Mid$(str, i, 1) Like "[!0-9,]" then exit function
next
g = True and Len(str) > 0
end function

RegEx VBA Excel complex string

I have a function pulled from here. My problem is that I don't know what RegEx pattern I need to use to split out the following data:
+1 vorpal unholy longsword +31/+26/+21/+16 (2d6+13)
+1 vorpal flaming whip +30/+25/+20 (1d4+7 plus 1d6 fire and entangle)
2 slams +31 (1d10+12)
I want it to look like:
+1 vorpal unholy longsword, 31
+1 vorpal flaming whip, 30
2 slams, 31
Here is the VBA code that does the RegExp validation:
Public Function RXGET(ByRef find_pattern As Variant, _
ByRef within_text As Variant, _
Optional ByVal submatch As Long = 0, _
Optional ByVal start_num As Long = 0, _
Optional ByVal case_sensitive As Boolean = True) As Variant
' RXGET - Looks for a match for regular expression pattern find_pattern
' in the string within_text and returns it if found, error otherwise.
' Optional long submatch may be used to return the corresponding submatch
' if specified - otherwise the entire match is returned.
' Optional long start_num specifies the number of the character to start
' searching for in within_text. Default=0.
' Optional boolean case_sensitive makes the regex pattern case sensitive
' if true, insensitive otherwise. Default=true.
Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim vbsMatch As VBScript_RegExp_55.Match
Dim colSubMatch As VBScript_RegExp_55.SubMatches
Dim sMatchString As String
Set objRegex = New VBScript_RegExp_55.RegExp
' Initialise Regex object
With objRegex
.Global = False
' Default is case sensitive
If case_sensitive Then
.IgnoreCase = False
Else: .IgnoreCase = True
End If
.pattern = find_pattern
End With
' Return out of bounds error
If start_num >= Len(within_text) Then
RXGET = CVErr(xlErrNum)
Exit Function
End If
sMatchString = Right$(within_text, Len(within_text) - start_num)
' Create Match collection
Set colMatch = objRegex.Execute(sMatchString)
If colMatch.Count = 0 Then ' No match
RXGET = CVErr(xlErrNA)
Else
Set vbsMatch = colMatch(0)
If submatch = 0 Then ' Return match value
RXGET = vbsMatch.Value
Else
Set colSubMatch = vbsMatch.SubMatches ' Use the submatch collection
If colSubMatch.Count < submatch Then
RXGET = CVErr(xlErrNum)
Else
RXGET = CStr(colSubMatch(submatch - 1))
End If
End If
End If
End Function
I don't know about Excel but this should get you started on the RegEx:
/(?:^|, |and |or )(\+?\d?\s?[^\+]*?) (?:\+|-)(\d+)/
NOTE: There is a slight caveat here. This will also match if an element begins with + only (not being followed by a digit).
Capture groups 1 and 2 contain the strings that go left and right of your comma (if the whole pattern has index 0). So you can something like capture[1] + ', ' + capture[2] (whatever your syntax for that is).
Here is an explanation of the regex:
/(?:^|, |and |or ) # make sure that we only start looking after
# the beginning of the string, after a comma, after an
# and or after an or; the "?:" makes sure that this
# subpattern is not capturing
(\+? # a literal "+"
\d+ # at least one digit
# a literal space
[^+]*?) # arbitrarily many non-plus characters; the ? makes it
# non-greedy, otherwise it might span multiple lines
# a literal space
\+ # a literal "+"
(\d+)/ # at least one digit (and the brakets are for capturing)