Regex extract data before certain text


I have large text documents that contain some data I want to extract.
As you can see in the screenshot, I want to extract A040 into the Excel column next to the filename.
A040 is always followed by three spaces and then the text Sheet (also shown in the screenshot).
Every file has a different number, but it is always the letter A followed by three digits and then the text Sheet. An example file was uploaded.
I already have something in VBA with Excel, but it is not working.
Dim cell As Range
Dim rng As Range
Dim output As String
Set rng = ws.Range("A1", ws.Range("A1").SpecialCells(xlLastCell).Address)
For Each cell In rng
On Error Resume Next
output = ExtA(cell.Value)
If Len(output) > 0 Then
Range("B" & j) = output
Exit For
End If
Next
j = j + 1
ws.Cells.ClearContents
'Call DelConns
strFileName = Dir 'next file
Loop
End Sub
Function ExtA(ByVal text As String) As String
'REGEX Match VBA in excel
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(?<=Sheet)[^Sheet]*\ Sheet"
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
If allMatches.Count <> 0 Then
result = allMatches.Item(0).submatches.Item(0)
End If
ExtA = result
End Function

This seems to work on your sample.
Option Explicit
Function AthreeDigits(str As String)
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
Else
Set cmat = Nothing
End If
AthreeDigits = vbNullString
With rgx
.Global = False
.MultiLine = True
'Note: VBScript regex has no \A anchor; the escape is read as a literal A here
.Pattern = "\A[0-9]{3}[\s]{3}Sheet"
If .Test(str) Then
Set cmat = .Execute(str)
AthreeDigits = Left(cmat.Item(0), 4)
End If
End With
End Function

Did you mean to say that there are 4 spaces after the A040 and before the "Sheet"? If so, try this pattern:
.pattern = "(A\d\d\d)\s{3}Sheet"
EDIT: I thought you said 4 spaces, but you said 3. My pattern now reflects that.
EDIT 2: (I need more coffee!) Change the \b to \s.
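A minimal sketch of how the pattern above might be wrapped in a function, assuming the code is always a capital A plus three digits followed by exactly three whitespace characters and Sheet. The function name ExtractSheetCode is hypothetical and not part of the original post.

```vba
Option Explicit

'Hypothetical helper: returns the A### code that precedes "   Sheet",
'or an empty string when the text contains no such code.
Function ExtractSheetCode(ByVal text As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "(A\d{3})\s{3}Sheet"   'group 1 holds the code itself
    re.IgnoreCase = False
    re.Global = False                   'the first hit is enough

    Set matches = re.Execute(text)
    If matches.Count > 0 Then
        ExtractSheetCode = matches(0).SubMatches(0)
    End If
End Function
```

Because the code sits in a capturing group, SubMatches(0) returns only A040 and not the trailing spaces or the word Sheet. From a worksheet it could be called as =ExtractSheetCode(A1).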

See Example here
"\s+[Aa]\d*\s+Sheet"
Or
\s+[Aa]\d*\s+(Sheet)
Or
[Aa]\d*\s+(Sheet)
Demo
https://regex101.com/r/Qo8iUf/3
\s+ Matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier: matches between one and unlimited times, as many times as possible
[Aa] Matches a single character in the list Aa (case sensitive)
\d* Matches a digit (equal to [0-9])
* Quantifier: matches between zero and unlimited times, as many times as possible

Related

How to save SubMatches as array and print not empty submatches?

When I run the following regex code and add a watch ("Add Watch", Shift + F9) on Matches:
Sub TestRegEx1()
Dim regex As Object, Matches As Object
Dim str As String
str = "This is text for the submatches"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "Th(is).+(for).+(submatches)|.+(\d)|([A-Z]{3})"
regex.IgnoreCase = True
Set Matches = regex.Execute(str)
End Sub
I see that Matches is structured like this (with 2 empty submatches):
2 questions:
How can I save in an array variable the SubMatches?
How can I Debug.Print only elements that are not empty?
I've tried the following, but it is not working:
Set Arr = Matches.SubMatches
Set Arr = Matches(1).SubMatches
Set Arr = Matches.Item(1).SubMatches
Thanks in advance

Is the following what you intended? Oversize an array at the start and ReDim it at the end. The first version prints only the non-empty submatches but stores all of them. The second version prints and stores only the non-empty ones.
You probably want to call .Test first to ensure there are matches.
Option Explicit
Sub TestRegEx1()
Dim regex As Object, matches As Object, match As Object, subMatch As Variant
Dim str As String, subMatches(), i As Long
ReDim subMatches(0 To 1000)
str = "This is text for the submatches"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "Th(is).+(for).+(submatches)|.+(\d)|([A-Z]{3})"
regex.IgnoreCase = True
Set matches = regex.Execute(str)
For Each match In matches
For Each subMatch In match.subMatches
subMatches(i) = subMatch 'store every submatch, empty or not
If Not IsEmpty(subMatch) Then Debug.Print subMatch
i = i + 1
Next
Next
If i > 0 Then ReDim Preserve subMatches(0 To i - 1) 'trim the unused slots
End Sub
If you only want to store non-empty then
Option Explicit
Sub TestRegEx1()
Dim regex As Object, matches As Object, match As Object, subMatch As Variant
Dim str As String, subMatches(), i As Long
ReDim subMatches(0 To 1000)
str = "This is text for the submatches"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "Th(is).+(for).+(submatches)|.+(\d)|([A-Z]{3})"
regex.IgnoreCase = True
Set matches = regex.Execute(str)
For Each match In matches
For Each subMatch In match.subMatches
If Not IsEmpty(subMatch) Then
subMatches(i) = subMatch 'store only non-empty submatches
Debug.Print subMatch
i = i + 1
End If
Next
Next
If i > 0 Then ReDim Preserve subMatches(0 To i - 1) 'trim the unused slots
End Sub

You may use a Collection and fill it on the go.
Add
Dim m, coll As Collection
Initialize the collection:
Set coll = New Collection
Then, once you get the matches, use
If Matches.Count > 0 Then ' if there are matches
For Each m In Matches(0).SubMatches ' you need the first match submatches
If Len(m) > 0 Then coll.Add (m) ' if not 0 length, save value to collection
Next
End If
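Put together, the Collection approach might look like the sketch below. The function name NonEmptySubMatches and its two parameters are hypothetical; it only looks at the first match, as in the snippet above.

```vba
Option Explicit

'Hypothetical helper: returns the non-empty submatches of the first match
'as a Collection (empty Collection when nothing matches).
Function NonEmptySubMatches(ByVal text As String, ByVal pattern As String) As Collection
    Dim regex As Object, matches As Object
    Dim m As Variant
    Dim coll As Collection

    Set coll = New Collection
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = pattern
    regex.IgnoreCase = True

    Set matches = regex.Execute(text)
    If matches.Count > 0 Then
        For Each m In matches(0).SubMatches
            If Len(m) > 0 Then coll.Add m   'skip groups that did not take part in the match
        Next m
    End If

    Set NonEmptySubMatches = coll
End Function
```

A Collection grows as items are added, so no oversizing or ReDim is needed. Its items are 1-based, so the first stored submatch is coll(1).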

VBA regex - Value used in formula is of the wrong data type

I can't seem to figure out why this function which includes a regex keeps returning an error of wrong data type? I'm trying to return a match to the identified pattern from a file path string in an excel document. An example of the pattern I'm looking for is "02 Package_2018-1011" from a sample string "H:\H1801100 MLK Middle School Hartford\2-Archive! Issued Bid Packages\01 Package_2018-0905 Demolition and Abatement Bid Set_Drawings - PDF\00 HazMat\HM-1.pdf". Copy of the VBA code is listed below.
Function textpart(Myrange As Range) As Variant
Dim strInput As String
Dim regex As Object
Set regex = CreateObject("VBScript.RegExp")
strInput = Myrange.Value
With regex
.Pattern = "\D{2}\sPackage_\d{4}-\d{4}"
.Global = True
End With
Set textpart = regex.Execute(strInput)
End Function

You need to use \d{2} to match 2-digit chunk, not \D{2}. Besides, you are trying to assign the whole match collection to the function result, while you should extract the first match value and assign that value to the function result:
Function textpart(Myrange As Range) As Variant
Dim strInput As String
Dim regex As Object
Dim matches As Object
Set regex = CreateObject("VBScript.RegExp")
strInput = Myrange.Value
With regex
.Pattern = "\d{2}\sPackage_\d{4}-\d{4}"
End With
Set matches = regex.Execute(strInput)
If matches.Count > 0 Then
textpart = matches(0).Value
End If
End Function
Note that to match it as a whole word you may add word boundaries (\b at the start and at the end of the pattern):
.Pattern = "\b\d{2}\sPackage_\d{4}-\d{4}\b"
To only match it after \, you may use a capturing group:
.Pattern = "\\(\d{2}\sPackage_\d{4}-\d{4})\b"
' ...
' and then
' ...
textpart = matches(0).Submatches(0)

VBA - Modify sheet naming from source file

I received help in the past for an issue regarding grabbing a source file name and naming a newly created worksheet the date from said source file name, i.e. "010117Siemens Hot - Cold Report.xls" and outputting "010117".
However, the code only works for file names in this exact format. For a file named "Siemens Hot - Cold Report 010117.xls", for example, an error occurs because the code does not find the date in the source file name.
CODE
Application.ScreenUpdating = False
Dim n As Double
Dim wksNew As Excel.Worksheet
Dim src As Workbook
Set src = Workbooks.Open(filePath, False, False)
Dim srcRng As Range
With src.Worksheets("Sheet1")
Set srcRng = .Range(.Range("A1"), .Range("A1").End(xlDown).End(xlToRight))
End With
With ThisWorkbook
Set wksNew = .Worksheets.Add(After:=.Worksheets(.Sheets.Count))
n = .Sheets.Count
.Worksheets(n).Range("A1").Resize(srcRng.Rows.Count, srcRng.Columns.Count).Value = srcRng.Value
End With
' ======= get the digits part from src.Name using a RegEx object =====
' RegEx variables
Dim Reg As Object
Dim RegMatches As Variant
Set Reg = CreateObject("VBScript.RegExp")
With Reg
.Global = True
.IgnoreCase = True
.Pattern = "\d{0,9}" ' Match any set of 0 to 9 digits
End With
Set RegMatches = Reg.Execute(src.Name)
On Error GoTo CloseIt
If RegMatches.Count >= 1 Then ' make sure there is at least 1 match
ThisWorkbook.Worksheets(n).Name = RegMatches(0) ' rename "Sheet2" to the numeric part of the filename
End If
src.Close False
Set src = Nothing
So, my question is, how can I get my code to recognize the string of digits no matter its position in the file name?

Code
^\d{0,9}\B|\b\d{0,9}(?=\.)
Usage
I decided to make a function that can be called inside a cell like this: =GetMyNum(x), where x is a reference to a cell (e.g. A1).
To get the code below to work:
Open Microsoft Visual Basic for Applications (ALT + F11)
Insert a new module (right click in the Project Pane and select Insert -> Module).
Click Tools -> References and find Microsoft VBScript Regular Expressions 5.5, enable it and click OK
Now copy/paste the following code into the new module:
Option Explicit
Function GetMyNum(Myrange As Range) As String
Dim regEx As New RegExp
Dim strPattern As String
Dim strInput As String
Dim strReplace As String
Dim strOutput As String
Dim match As Object
strPattern = "^\d{0,9}\B|\b\d{0,9}(?=\.)"
If strPattern <> "" Then
strInput = Myrange.Value
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern
End With
If regEx.test(strInput) Then
Set match = regEx.Execute(strInput)
GetMyNum = match.Item(0)
Else
GetMyNum = ""
End If
End If
End Function
Results
Input
A1: Siemens Hot - Cold Report 010117.xls
A2: 010117Siemens Hot - Cold Report.xls
B1: =GetMyNum(A1)
B2: =GetMyNum(A2)
Output
010117 # Contents of B1
010117 # Contents of B2
Explanation
I will explain each regex option separately. You can reorder the options in terms of importance in such a way that the most important option is first and least important is last.
^\d{0,9}\B Match the following
^ Assert position at the start of the line
\d{0,9} Match a digit between zero and nine times
\B Ensure the position is not a word boundary (this may be dropped depending on usage; I added it because the number you're trying to get seems to be immediately followed by a word character rather than a space, so if that's not always the case just remove this token)
\b\d{0,9}(?=\.) Match the following
\b Assert position at a word boundary
\d{0,9} Match a digit between zero and nine times
(?=\.) Positive lookahead ensuring a literal dot . follows
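If the date is always exactly six digits (an assumption based on the two sample file names), a simpler pattern can find it at any position. A bare \d{0,9} is a risky choice because it is allowed to match zero digits, so the first match for a name that starts with a letter is an empty string at position 0. The sketch below uses a hypothetical function name, SixDigitDate.

```vba
Option Explicit

'Hypothetical helper: returns the first run of six digits in a file name,
'or an empty string when there is none.
Function SixDigitDate(ByVal fileName As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d{6}"    'exactly six digits, so it can never match an empty string
    re.Global = False       'only the first hit is needed

    Set matches = re.Execute(fileName)
    If matches.Count > 0 Then
        SixDigitDate = matches(0).Value
    End If
End Function
```

The caller can then test for an empty result before renaming the sheet, for example by checking Len(SixDigitDate(src.Name)) > 0.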

Just my alternative solution to RegEx :)
This finds the first occurrence of 6 consecutive digits, omitting blanks and periods. There are probably some more issues with using IsNumeric, though, as I believe it also accepts a lowercase e.
Sub FindTheNumber()
For i = 1 To Len(Range("A1").Value)
If IsNumeric(Mid(Range("A1").Value, i, 6)) = True And InStr(Mid(Range("A1").Value, i, 6), " ") = 0 And InStr(Mid(Range("A1").Value, i, 6), ".") = 0 Then
MyNumber = Mid(Range("A1").Value, i, 6)
Debug.Print MyNumber
Exit For
End If
Next i
For i = 1 To Len(Range("A2").Value)
If IsNumeric(Mid(Range("A2").Value, i, 6)) = True And InStr(Mid(Range("A2").Value, i, 6), " ") = 0 And InStr(Mid(Range("A2").Value, i, 6), ".") = 0 Then
MyNumber = Mid(Range("A2").Value, i, 6)
Debug.Print MyNumber
Exit For
End If
Next i
End Sub

How to replace Numbers in Parentheses with some calculations in MS Word

I need to renumber a series of citation numbers such as [30] [31] [32]... to [31] [32] [33]... in MS Word whenever I insert a new reference in the middle of an article. I have not found a way to do this in the GUI, so I am trying to use VBA for the replacement. I found a similar problem on Stack Overflow:
MS Word Macro to increment all numbers in word document
However, that approach is a bit inconvenient because it has to build a replacement array somewhere else. Can I do the replacement with a regex and some function in MS Word VBA, like the code below?
Sub replaceWithregExp()
Dim regExp As Object
Dim regx, S$, Strnew$
Set regExp = CreateObject("vbscript.regexp")
With regExp
.Pattern = "\[([0-9]{2})\]"
.Global = True
End With
'How to do some calculations with $1?
Selection.Text = regExp.Replace(Selection.Text, "[$1]")
End Sub
But I don't know how to do calculations with $1 in regExp. I have tried using "[$1+1]", but it returns [31+1] [32+1] [33+1]. Can anyone help? Thanks!

It is impossible to pass a callback function to RegExp.Replace, so the only option is to use RegExp.Execute and process the matches in a loop.
Here is an example code for your case (I took a shortcut since you only have the value to modify inside known delimiters, [ and ].)
Sub replaceWithregExp()
Dim regExp As Object
Dim regx, S$, Strnew$
Set regExp = CreateObject("vbscript.regexp")
With regExp
.Pattern = "\[([0-9]{2})]"
.Global = True
End With
'How to do some calculations with $1?
' Removing regExp.Replace(Selection.Text, "[$1]")
For Each m In regExp.Execute(Selection.Text)
Selection.Text = Left(Selection.Text, m.FirstIndex+1) _
& Replace(m.Value, m.Value, CStr(CInt(m.Submatches(0)) + 10)) _
& Mid(Selection.Text, m.FirstIndex + Len(m.Value))
Next m
End Sub
Here,
Selection.Text = Left(Selection.Text, m.FirstIndex+1) - Get what is before
& Replace(m.Value, m.Value, CStr(CInt(m.Submatches(0)) + 10)) - Add 10 to the captured number
& Mid(Selection.Text, m.FirstIndex + Len(m.Value)) - Append what is after the capture
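The same loop-over-matches idea can be sketched as a plain string function that is independent of the Word Selection object. This is only an illustration under a few assumptions: the name IncrementBracketedNumbers and the delta parameter are hypothetical, the pattern uses \d+ instead of exactly two digits, and the default step is 1 (as asked in the question) rather than 10.

```vba
Option Explicit

'Hypothetical helper: adds delta to every number written as [n] in text.
Function IncrementBracketedNumbers(ByVal text As String, _
                                   Optional ByVal delta As Long = 1) As String
    Dim re As Object, matches As Object, m As Object
    Dim i As Long

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\[(\d+)\]"
    re.Global = True

    Set matches = re.Execute(text)
    'Walk the matches from last to first: positions of earlier matches
    'stay valid even if a replacement changes the length (e.g. [99] -> [100]).
    For i = matches.Count - 1 To 0 Step -1
        Set m = matches(i)
        'FirstIndex is 0-based, Left$/Mid$ are 1-based
        text = Left$(text, m.FirstIndex) _
             & "[" & CStr(CLng(m.SubMatches(0)) + delta) & "]" _
             & Mid$(text, m.FirstIndex + m.Length + 1)
    Next i

    IncrementBracketedNumbers = text
End Function
```

In Word it could then be applied in a single assignment, for example Selection.Text = IncrementBracketedNumbers(Selection.Text), which avoids rewriting the selection once per match.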

That should do the trick :
Sub IncrementWithRegex()
Dim Para As Paragraph
Set Para = ThisDocument.Paragraphs.First
Dim ParaNext As Paragraph
Dim oRange As Range
Set oRange = Para.Range
Dim regEx As New RegExp
Dim regMatch As Object
Dim regMatches As Object
Dim i As Long
With regEx
.Global = True
.MultiLine = False
.IgnoreCase = False
.Pattern = "[\[]([0-9]{2})[\]]"
End With
Do While Not Para Is Nothing
Set ParaNext = Para.Next
Set oRange = Para.Range
'Debug.Print oRange.Text
If regEx.test(oRange.Text) Then
Set regMatches = regEx.Execute(oRange.Text)
'Work backwards so that earlier match positions stay valid after each replacement
For i = regMatches.Count - 1 To 0 Step -1
Set regMatch = regMatches(i)
'Keep everything up to and including "[", insert the incremented number, then append from "]" onwards
oRange.Text = _
Left(oRange.Text, regMatch.FirstIndex + 1) & _
CStr(CLng(regMatch.SubMatches(0)) + 1) & _
Mid(oRange.Text, regMatch.FirstIndex + regMatch.Length)
Next i
Else
End If
Set Para = ParaNext
Loop
End Sub
To use this, remember to add the reference :
Description: Microsoft VBScript Regular Expressions 5.5
FullPath: C:\windows\SysWOW64\vbscript.dll\3
Major.Minor: 5.5
Name: VBScript_RegExp_55
GUID: {3F4DACA7-160D-11D2-A8E9-00104B365C9F}

Here is a simple VBA macro you can use to achieve this :
Sub IncrementNumbers()
Dim regExp As Object
Dim i As Integer
Dim fullMatch As String
Dim subMatch As Integer
Dim replacement As String
Dim matches As Object
Const TMP_PREFIX As String = "$$$"
Set regExp = CreateObject("vbscript.regexp")
With regExp
.Pattern = "\[([0-9]{2})\]"
.Global = True
.MultiLine = True
End With
'Ensure selected text match our regex
If regExp.test(Selection.Text) Then
'Find all matches
Set matches = regExp.Execute(Selection.Text)
' Process every match; the temporary prefix makes the order irrelevant
For i = 0 To (matches.Count - 1)
fullMatch = matches(i).Value
subMatch = CInt(matches(i).SubMatches(0))
'Increment by 1
subMatch = subMatch + 1
'Create replacement. Add a temporary prefix so we ensure [30] replaced with [31]
'will not be replaced with [32] when [31] will be replaced
replacement = "[" & TMP_PREFIX & subMatch & "]"
'Replace full match with [subMatch]
Selection.Text = Replace(Selection.Text, fullMatch, replacement)
Next
End If
'Now replacements are complete, we have to remove replacement prefix
Selection.Text = Replace(Selection.Text, TMP_PREFIX, "")
End Sub

Extract four numbers without brackets from a bracketed entry, if entry exists

What I have:
A list of about 1000 titles of reports in column B.
Some of these titles have a four digit number surrounded by brackets (eg: (3672)) somewhere in a string of text and numbers.
I want to extract these four numbers - without brackets - in column C in the same row.
If there is no four digit number with brackets in column B, then to return "" in column C.
What I have so far:
I can successfully identify the cells in column B which have four digits surrounded by brackets. The problem is it returns the whole title including the four numbers.
Taken from: VBA RegEx extracting data from within a string
NB: I am Using Excel Professional Plus 2010, have checked the box next to "Microsoft VBScript Regular Expressions 5.5".
Sub ExtractTicker()
Dim regEx
Dim i As Long
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = True
regEx.pattern = "(\()([0-9]{4})(\))"
For i = 2 To ActiveSheet.UsedRange.Rows.Count
If (regEx.Test(Cells(i, 2).Value)) Then
Cells(i, 3).Value = regEx.Replace(Cells(i, 2).Value, "$2")
End If
Next i
End Sub

Try
regEx.pattern = "(.*\()([0-9]{4})(\).*)"
The .* at the start and end of the pattern ensure that you capture the entire string, which is then replaced in full by the 2nd submatch ([0-9]{4}).
To fully optimise the code:
use variant arrays rather than ranges
setting Global and IgnoreCase is redundant when you are running a case-insensitive match on the full string
you are using late binding, so you don't need the Reference
Code:
Sub ExtractTicker()
Dim regEx As Object
Dim pattern As String
Dim X
Dim lngCNt As Long
X = Range([b1], Cells(Rows.Count, "B").End(xlUp)).Value2
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.pattern = "(.*\()([0-9]{4})(\).*)"
For lngCNt = 1 To UBound(X)
If .Test(X(lngCNt, 1)) Then
X(lngCNt, 1) = .Replace(X(lngCNt, 1), "$2")
Else
X(lngCNt, 1) = vbNullString
End If
Next
End With
[c1].Resize(UBound(X, 1), 1).Value2 = X
End Sub
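As an alternative to replacing the whole string, the capture group can be read directly from the match. The sketch below assumes only the first bracketed four-digit number in a title matters; the function name FourDigitsInBrackets is hypothetical.

```vba
Option Explicit

'Hypothetical helper: returns the four digits found inside round brackets,
'or an empty string when the title has no such entry.
Function FourDigitsInBrackets(ByVal title As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\((\d{4})\)"   'only the digits are captured, not the brackets

    Set matches = re.Execute(title)
    If matches.Count > 0 Then
        FourDigitsInBrackets = matches(0).SubMatches(0)
    End If
End Function
```

Because the function returns an empty string when nothing matches, it covers the "return "" in column C" requirement without a separate Test call. It could also be used as a worksheet formula, for example =FourDigitsInBrackets(B2).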
