Extract largest numeric sequence from string (regex, or?) - regex

I have strings similar to the following:
4123499-TESCO45-123
every99999994_54
And I want to extract the largest numeric sequence in each string, respectively:
4123499
99999994
I have previously tried regex (I am using VB6)
Set rx = New RegExp
rx.Pattern = "[^\d]"
rx.Global = True
StringText = rx.Replace(StringText, "")
Which gets me partway there, but it only removes the non-numeric values, and I end up with the first string looking like:
412349945123
Can I find a regex that will give me what I require, or will I have to try another method? Essentially, my pattern would have to be anything that isn't the longest numeric sequence. But I'm not actually sure if that is even a reasonable pattern. Could anyone with a better handle of regex tell me if I am going down a rabbit hole? I appreciate any help!

You cannot get the result by just a regex. You will have to extract all numeric chunks and get the longest one using other programming means.
Here is an example:
Dim strPattern As String: strPattern = "\d+"
Dim str As String: str = "4123499-TESCO45-123"
Dim regEx As New RegExp
Dim matches As MatchCollection
Dim match As Match
Dim result As String
With regEx
.Global = True
.MultiLine = False
.IgnoreCase = False
.Pattern = strPattern
End With
Set matches = regEx.Execute(str)
For Each m In matches
If result < Len(m.Value) Then result = m.Value
Next
Debug.Print result
The \d+ with RegExp.Global=True will find all digit chunks and then only the longest will be printed after all matches are processed in a loop.

That's not solvable with an RE on its own.
Instead you can simply walk along the string tracking the longest consecutive digit group:
For i = 1 To Len(StringText)
If IsNumeric(Mid$(StringText, i, 1)) Then
a = a & Mid$(StringText, i, 1)
Else
a = ""
End If
If Len(a) > Len(longest) Then longest = a
Next
MsgBox longest
(first result wins a tie)

If the two examples you gave, are of a standard where:
<long_number>-<some_other_data>-<short_number>
<text><long_number>_<short_number>
Are the two formats that the strings come in, there are some solutions.
However, if you are searching any string in any format for the longest number, these will not work.
Solution 1
([0-9]+)[_-].*
See the demo
In the first capture group, you should have the longest number for those 2 formats.
Note: This assumes that the longest number will be the first number it encounters with an underscore or a hyphen next to it, matching those two examples given.
Solution 2
\d{6,}
See the demo
Note: This assumes that the shortest number will never exceed 5 characters in length, and the longest number will never be shorter than 6 characters in length

Please, try.
Pure VB. No external libs or objects.
No brain-breaking regexp's patterns.
No string manipulations, so - speed. Superspeed. ~30 times faster than regexp :)
Easy transform on variouse needs.
For example, concatenate all digits from the source string to a single string.
Moreover, if target string is only intermediate step,
so it's possible to manipulate with numbers only.
Public Sub sb_BigNmb()
Dim sSrc$, sTgt$
Dim taSrc() As Byte, taTgt() As Byte, tLB As Byte, tUB As Byte
Dim s As Byte, t As Byte, tLenMin As Byte
tLenMin = 4
sSrc = "every99999994_54"
sTgt = vbNullString
taSrc = StrConv(sSrc, vbFromUnicode)
tLB = LBound(taSrc)
tUB = UBound(taSrc)
ReDim taTgt(tLB To tUB)
t = 0
For s = tLB To tUB
Select Case taSrc(s)
Case 48 To 57
taTgt(t) = taSrc(s)
t = t + 1
Case Else
If CBool(t) Then Exit For ' *** EXIT FOR ***
End Select
Next
If (t > tLenMin) Then
ReDim Preserve taTgt(tLB To (t - 1))
sTgt = StrConv(taTgt, vbUnicode)
End If
Debug.Print "'" & sTgt & "'"
Stop
End Sub
How to handle sSrc = "ev_1_ery99999994_54", please, make by yourself :)
.

Related

Extract an 8 digits number from a string with additional conditions

I need to extract a number from a string with several conditions.
It has to start with 1-9, not with 0, and it will have 8 digits. Like 23242526 or 65478932
There will be either an empty space or a text variable before it. Like MMX: 23242526 or bgr65478932
It could have come in rare cases: 23,242,526
It ends with an emty space or a text variable.
Here are several examples:
From RE: Markitwire: 120432889: Mx: 24,693,059 i need to get 24693059
From Automatic reply: Auftrag zur Übertragung IRD Ref-Nr. MMX_23497152 need to get 23497152
From FW: CGMSE 2019-2X A1AN XS2022418672 Contract 24663537 need to get 24663537
From RE: BBVA-MAD MMX_24644644 + MMX_24644645 need to get 24644644, 24644645
Right now I'm using the regexextract function(found it on this web-site), which extracts any number with 8 digits starting with 2. However it would also extract a number from, let's say, this expression TGF00023242526, which is incorrect. Moreover, I don't know how to add additional conditions to the code.
=RegexExtract(A11, ""(2\d{7})\b"", ", ")
Thank you in advance.
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional seperator As String = "") As String
Dim i As Long, j As Long
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = extract_what
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.Count - 1
For j = 0 To allMatches.Item(i).SubMatches.Count - 1
result = result & seperator & allMatches.Item(i).SubMatches.Item(j)
Next
Next
If Len(result) <> 0 Then
result = Right(result, Len(result) - Len(seperator))
End If
RegexExtract = result
End Function
You may create a custom boundary using a non-capturing group before the pattern you have:
(?:[\D0]|^)(2\d{7})\b
^^^^^^^^^^^
The (?:[\D0]|^) part matches either a non-digit (\D) or 0 or (|) start of string (^).
As an alternative to also match 8 digits in values like 23,242,526 and start with a digit 1-9 you might use
\b[1-9](?:,?\d){7}\b
\b Word boundary
[1-9] Match the firstdigit 1-9
(?:,?\d){7} Repeat 7 times matching an optional comma and a digit
\b Word boundary
Regex demo
Then you could afterwards replace the comma's with an empty string.

VBA - Modify sheet naming from source file

I received help in the past for an issue regarding grabbing a source file name and naming a newly created worksheet the date from said source file name, i.e. "010117Siemens Hot - Cold Report.xls" and outputting "010117".
However the code only works for file names with this exact format, for example, file named "Siemens Hot - Cold Report 010117.xls", an error occurs because the newly created sheet does not find the date in the source file.
CODE
Application.ScreenUpdating = False
Dim n As Double
Dim wksNew As Excel.Worksheet
Dim src As Workbook
Set src = Workbooks.Open(filePath, False, False)
Dim srcRng As Range
With src.Worksheets("Sheet1")
Set srcRng = .Range(.Range("A1"), .Range("A1").End(xlDown).End(xlToRight))
End With
With ThisWorkbook
Set wksNew = .Worksheets.Add(After:=.Worksheets(.Sheets.Count))
n = .Sheets.Count
.Worksheets(n).Range("A1").Resize(srcRng.Rows.Count, srcRng.Columns.Count).Value = srcRng.Value
End With
' ======= get the digits part from src.Name using a RegEx object =====
' RegEx variables
Dim Reg As Object
Dim RegMatches As Variant
Set Reg = CreateObject("VBScript.RegExp")
With Reg
.Global = True
.IgnoreCase = True
.Pattern = "\d{0,9}" ' Match any set of 0 to 9 digits
End With
Set RegMatches = Reg.Execute(src.Name)
On Error GoTo CloseIt
If RegMatches.Count >= 1 Then ' make sure there is at least 1 match
ThisWorkbook.Worksheets(n).Name = RegMatches(0) ' rename "Sheet2" to the numeric part of the filename
End If
src.Close False
Set src = Nothing
So, my question is, how can I get my code to recognize the string of digits no matter its position in the file name?
Code
^\d{0,9}\B|\b\d{0,9}(?=\.)
Usage
I decided to make a function that can be called inside a cell as such: =GetMyNum(x) where x is a pointer to a cell (i.e. A1).
To get the code below to work:
Open Microsoft Visual Basic for Applications (ALT + F11)
Insert a new module (right click in the Project Pane and select Insert -> Module).
Click Tools -> References and find Microsoft VBScript Regular Expressions 5.5, enable it and click OK
Now copy/paste the following code into the new module:
Option Explicit
Function GetMyNum(Myrange As Range) As String
Dim regEx As New RegExp
Dim strPattern As String
Dim strInput As String
Dim strReplace As String
Dim strOutput As String
Dim match As Object
strPattern = "^\d{0,9}\B|\b\d{0,9}(?=\.)"
If strPattern <> "" Then
strInput = Myrange.Value
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern
End With
If regEx.test(strInput) Then
Set match = regEx.Execute(strInput)
GetMyNum = match.Item(0)
Else
GetMyNum = ""
End If
End If
End Function
Results
Input
A1: Siemens Hot - Cold Report 010117.xls
A2: 010117Siemens Hot - Cold Report.xls
B1: =GetMyNum(A1)
B2: =GetMyNum(A1)
Output
010117 # Contents of B1
010117 # Contents of B2
Explanation
I will explain each regex option separately. You can reorder the options in terms of importance in such a way that the most important option is first and least important is last.
^\d{0,9}\B Match the following
^ Assert position at the start of the line
\d{0,9} Match any digit 0-9 times
\B Ensure position does not match where a word boundary matches (this is used but may be dropped depending on usage - I added it because it seems the number you're trying to get is immediately followed by a word character and not followed by a space - if that's not always the case just remove this token)
\b\d{0,9}(?=\.) Match the following
\b Assert position as a word boundary
\d{0,9} Match any digit 0-9 times
(?=\.) Positive lookahead ensuring a literal dot . follows
Just my alternative solution to RegEx :)
This finds the first occurence of 6 consecutive digits, omitting blanks and periods... although there are probably some more issues with using IsNumeric as I believe a lowercase e is considered acceptable by it...
Sub FindTheNumber()
For i = 1 To Len(Range("A1").Value)
If IsNumeric(Mid(Range("A1").Value, i, 6)) = True And InStr(Mid(Range("A1").Value, i, 6), " ") = 0 And InStr(Mid(Range("A1").Value, i, 6), ".") = 0 Then
MyNumber = Mid(Range("A1").Value, i, 6)
Debug.Print MyNumber
Exit For
End If
Next i
For i = 1 To Len(Range("A2").Value)
If IsNumeric(Mid(Range("A2").Value, i, 6)) = True And InStr(Mid(Range("A2").Value, i, 6), " ") = 0 And InStr(Mid(Range("A2").Value, i, 6), ".") = 0 Then
MyNumber = Mid(Range("A2").Value, i, 6)
Debug.Print MyNumber
Exit For
End If
Next i
End Sub
Examples:
Immediate window:

Extracting Parenthetical Data Using Regex

I have a small sub that extracts parenthetical data (including parentheses) from a string and stores it in cells adjacent to the string:
Sub parens()
Dim s As String, i As Long
Dim c As Collection
Set c = New Collection
s = ActiveCell.Value
ary = Split(s, ")")
For i = LBound(ary) To UBound(ary) - 1
bry = Split(ary(i), "(")
c.Add "(" & bry(1) & ")"
Next i
For i = 1 To c.Count
ActiveCell.Offset(0, i).NumberFormat = "#"
ActiveCell.Offset(0, i).Value = c.Item(i)
Next i
End Sub
For example:
I am now trying to replace this with some Regex code. I am NOT a regex expert. I want to create a pattern that looks for an open parenthesis followed by zero or more characters of any type followed by a close parenthesis.
I came up with:
\((.+?)\)
My current new code is:
Sub qwerty2()
Dim inpt As String, outpt As String
Dim MColl As MatchCollection, temp2 As String
Dim regex As RegExp, L As Long
inpt = ActiveCell.Value
MsgBox inpt
Set regex = New RegExp
regex.Pattern = "\((.+?)\)"
Set MColl = regex.Execute(inpt)
MsgBox MColl.Count
temp2 = MColl(0).Value
MsgBox temp2
End Sub
The code has at least two problems:
It will only get the first match in the string.(Mcoll.Count is always 1)
It will not recognize zero characters between the parentheses. (I think the .+? requires at least one character)
Does anyone have any suggestions ??
By default, RegExp Global property is False. You need to set it to True.
As for the regex, to match zero or more chars as few as possible, you need *?, not +?. Note that both are lazy (match as few as necessary to find a valid match), but + requires at least one char, while * allows matching zero chars (an empty string).
Thus, use
Set regex = New RegExp
regex.Global = True
regex.Pattern = "\((.*?)\)"
As for the regex, you can also use
regex.Pattern = "\(([^()]*)\)"
where [^()] is a negated character class matching any char but ( and ), zero or more times (due to * quantifier), matching as many such chars as possible (* is a greedy quantifier).

Extract four numbers without brackets from a bracketed entry, if entry exists

What I have:
A list of about 1000 titles of reports in column B.
Some of these titles have a four digit number surrounded by brackets (eg: (3672)) somewhere in a string of text and numbers.
I want to extract these four numbers - without brackets - in column C in the same row.
If there is no four digit number with brackets in column B, then to return "" in column C.
What I have so far:
I can successfully identify the cells in column B which have four digits surrounded by brackets. The problem is it returns the whole title including the four numbers.
Taken from: VBA RegEx extracting data from within a string
NB: I am Using Excel Professional Plus 2010, have checked the box next to "Microsoft VBScript Regular Expressions 5.5".
Sub ExtractTicker()
Dim regEx
Dim i As Long
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = True
regEx.pattern = "(\()([0-9]{4})(\))"
For i = 2 To ActiveSheet.UsedRange.Rows.Count
If (regEx.Test(Cells(i, 2).Value)) Then
Cells(i, 3).Value = regEx.Replace(Cells(i, 2).Value, "$2")
End If
Next i
End Sub
Try
regEx.pattern = "(.*\()([0-9]{4})(\).*)"
the .* and the start and end of the string ensure you capture the entire string, then this is fully substituted by the 2nd submatch ([0-9]{4})
To fully optimise the code
use variant arrays rather than ranges
setting Global and IgnoreCase is redundant when you are running a case insensitive match on the full string
you are using late binding so you dont need the Reference
code
Sub ExtractTicker()
Dim regEx As Object
Dim pattern As String
Dim X
Dim lngCNt As Long
X = Range([b1], Cells(Rows.Count, "B").End(xlUp)).Value2
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.pattern = "(.*\()([0-9]{4})(\).*)"
For lngCNt = 1 To UBound(X)
If .Test(X(lngCNt, 1)) Then
X(lngCNt, 1) = .Replace(X(lngCNt, 1), "$2")
Else
X(lngCNt, 1) = vbNullString
End If
Next
End With
[c1].Resize(UBound(X, 1), 1).Value2 = X
End Sub

Split a string according to a regexp in VBScript

I would like to split a string into an array according to a regular expression similar to what can be done with preg_split in PHP or VBScript Split function but with a regex in place of delimiter.
Using VBScript Regexp object, I can execute a regex but it returns the matches (so I get a collection of my splitters... that's not what I want)
Is there a way to do so ?
Thank you
If you can reserve a special delimiter string, i.e. a string that you can choose that will never be a part of the real input string (perhaps something like "###"), then you can use regex replacement to replace all matches of your pattern to "###", and then split on "###".
Another possibility is to use a capturing group. If your delimiter regex is, say, \d+, then you search for (.*?)\d+, and then extract what the group captured in each match (see before and after on rubular.com).
You can alway use the returned array of matches as input to the split function. You split the original string using the first match - the first part of the string is the first split, then split the remainder of the string (minus the first part and the first match)... continue until done.
I wrote this for my use. Might be what you're looking for.
Function RegSplit(szPattern, szStr)
Dim oAl, oRe, oMatches
Set oRe = New RegExp
oRe.Pattern = "^(.*)(" & szPattern & ")(.*)$"
oRe.IgnoreCase = True
oRe.Global = True
Set oAl = CreateObject("System.Collections.ArrayList")
Do
Set oMatches = oRe.Execute(szStr)
If oMatches.Count > 0 Then
oAl.Add oMatches(0).SubMatches(2)
szStr = oMatches(0).SubMatches(0)
Else
oAl.Add szStr
Exit Do
End If
Loop
oAl.Reverse
RegSplit = oAl.ToArray
End Function
'**************************************************************
Dim A
A = RegSplit("[,|;|#]", "bob,;joe;tony#bill")
WScript.Echo Join(A, vbCrLf)
Returns:
bob
joe
tony
bill
I think you can achieve this by using Execute to match on the required splitter string, but capturing all the preceding characters (after the previous match) as a group. Here is some code that could do what you want.
'// Function splits a string on matches
'// against a given string
Function SplitText(strInput,sFind)
Dim ArrOut()
'// Don't do anything if no string to be found
If len(sFind) = 0 then
redim ArrOut(0)
ArrOut(0) = strInput
SplitText = ArrOut
Exit Function
end If
'// Define regexp
Dim re
Set re = New RegExp
'// Pattern to be found - i.e. the given
'// match or the end of the string, preceded
'// by any number of characters
re.Pattern="(.*?)(?:" & sFind & "|$)"
re.IgnoreCase = True
re.Global = True
'// find all the matches >> match collection
Dim oMatches: Set oMatches = re.Execute( strInput )
'// Prepare to process
Dim oMatch
Dim ix
Dim iMax
'// Initialize the output array
iMax = oMatches.Count - 1
redim arrOut( iMax)
'// Process each match
For ix = 0 to iMax
'// get the match
Set oMatch = oMatches(ix)
'// Get the captured string that precedes the match
arrOut( ix ) = oMatch.SubMatches(0)
Next
Set re = nothing
'// Check if the last entry was empty - this
'// removes one entry if the string ended on a match
if arrOut(iMax) = "" then Redim Preserve ArrOut(iMax-1)
'// Return the processed output
SplitText = arrOut
End Function