Using regex to find paragraphs in VBA Excel - regex

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to regex each paragraph out place that in a string then run other regex to get the information that I need. Some paragraphs wont contain the data that I need so the idea was to loop through each individual paragraph and then error handle better if the data I need wasn't in that paragraph (ie get what I can and drop the rest with an error message)
Here is a screenshot:

This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
Dim s As String
Dim ary
With Application.WorksheetFunction
s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
End With
ary = Split(s, "-")
i = 1
For Each a In ary
Cells(i, 2) = a
i = i + 1
Next a
End Sub

Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need dashes -, then just include them into capture group (the outer parenthesis).

This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As added benefit leading and trailing whitespace is cut off ("\s*" outside the group).

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original implementation of your original pattern \^S*\$ had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The Pattern I used: (\S+) is placed in a capturing group (...) that will capture all cases of \S+ (all characters that are NOT a white space, + one or more times.
I also used the .Global so you will continue matching after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
Miscellaneous Notes
I would have advised just to use Split(), but you stated that there were cases where more than one consecutive space may have been an issue. If this wasn't the case, you wouldn't need regex at all, something like:
arrStr = Split(line)
Would have split on every occurance of a space

Extract text using word VBA regex then save to a variable as string

I am trying to create code in Word VBA that will automatically save (as PDF) and name a document based on it's content, which is in text and not fields. Luckily the formatting is standardized and I already know how to save it. I tested my regex elsewhere to make sure it pulls what I am looking for. The trouble is I need to extract the matched statement, convert it to a string, and save it to an object (so I have something to pass on to the code where it names the document).
The part of the document I need to match is below, from the start of "Program" through the end of the line and looks like:
Program: Program Name (abr)
and the regex I worked out for this is "Program:[^\n]"
The code I have so far is below, but I don't know how to execute the regex in the active document, convert the output to a string and save to an object:
Sub RegExProgram()
Dim regEx
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = False
regEx.pattern = "Program\:[^\n]"
(missing code here)
End Sub
Any ideas are welcome, and I am sorry if this is simple and I am just overlooking something obvious. This is my first VBA project, and most of the resources I can find suggest replacing using regex, not saving extracted text as string. Thank you!
Try this:
You can find documentation for the RegExp class here.
Dim regEx as Object
Dim matchCollection As Object
Dim extractedString As String
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.IgnoreCase = True
.Global = False ' Only look for 1 match; False is actually the default.
.Pattern = "Program: ([^\r]+)" ' Word separates lines with CR (\r)
End With
' Pass the text of your document as the text to search through to regEx.Execute().
' For a quick test of this statement, pass "Program: Program Name (abr)"
set matchCollection = regEx.Execute(ActiveDocument.Content.Text)
' Extract the first submatch's (capture group's) value -
' e.g., "Program Name (abr)" - and assign it to variable extractedString.
extractedString = matchCollection(0).SubMatches(0)
I've modified your regex based on the assumption that you want to capture everything after Program: through the end of the line; your original regex would only have captured Program:<space>.
Enclosing [^\r]+ (all chars. through the end of the line) in (...) defines a so-called subexpression (a.k.a. capture group), which allows selective extraction of only the substring of interest from what the overall pattern captures.
The .Execute() method, to which you pass the string to search in, always returns a collection of matches (Match objects).
Since the .Global property is set to False in your code, the output collection has (at most) 1 entry (at index 0) in this case.
If the regular expression has subexpressions (1 in our case), then each entry of the match collection has a nonempty .SubMatches collection, with one entry for each subexpression, but note that the .SubMatches entries are strings, not Match objects.
Match objects have properties .FirstIndex, .Length, and Value (the captured string). Since the .Value property is the default property, it is sufficient to access the object itself, without needing to reference the .Value property (e.g., instead of the more verbose matchCollection(0).Value to access the captured string (in full), you can use shortcut matchCollection(0) (again, by contrast, .SubMatches entries are strings only).
If you're just looking for a string that starts with "Program:" and want to go to the end of the line from there, you don't need a regular expression:
Public Sub ReadDocument()
Dim aLine As Paragraph
Dim aLineText As String
Dim start As Long
For Each aLine In ActiveDocument.Paragraphs
aLineText = aLine.Range.Text
start = InStr(aLineText, "Program:")
If start > 0 Then
my_str = Mid(aLineText, start)
End If
Next aLine
End Sub
This reads through the document line by line, and stores your match in the variable "my_str" when it encounters a line that has the match.
Lazier version:
a = Split(ActiveDocument.Range.Text, "Program:")
If UBound(a) > 0 Then
extractedString = Trim(Split(a(1), vbCr)(0))
End If
If I remember correctly, paragraphs in Word end with vbCr ( \r not \n )

Remove tweet regular expressions from string of text

I have an excel sheet filled with tweets. There are several entries which contain #blah type of strings among other. I need to keep the rest of the text and remove the #blah part. For example: "#villos hey dude" needs to be transformed into : "hey dude". This is what i ve done so far.
Sub Macro1()
'
' Macro1 Macro
'
Dim counter As Integer
Dim strIN As String
Dim newstring As String
For counter = 1 To 46
Cells(counter, "E").Select
ActiveCell.FormulaR1C1 = strIN
StripChars (strIN)
newstring = StripChars(strIN)
ActiveCell.FormulaR1C1 = StripChars(strIN)
Next counter
End Sub
Function StripChars(strIN As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "^#?(\w){1,15}$"
.ignorecase = True
StripChars = .Replace(strIN, vbNullString)
End With
End Function
Moreover there are also entries like this one: Ÿ³é‡ï¼Ÿã€€åˆã‚ã¦çŸ¥ã‚Šã¾ã—ãŸã€‚ shiftã—ãªãŒã‚‰ã‚¨ã‚¯ã‚¹ãƒ
I need them gone too! Ideas?
For every line in the spreadsheet run the following regex on it: ^(#.+?)\s+?(.*)$
If the line matches the regex, the information you will be interested in will be in the second capturing group. (Usually zero indexed but position 0 will contain the entire match). The first capturing group will contain the twitter handle if you need that too.
Regex demo here.
However, this will not match tweets that are not replies (starting with #). In this situation the only way to distinguish between regular tweets and the junk you are not interested in is to restrict the tweet to alphanumerics - but this may mean some tweets are missed if they contain any non-alphanumerical characters. The following regex will work if that is not an issue for you:
^(?:(#.+?)\s+?)?([\w\t ]+)$
Demo 2.

how to do a replace within a replace match using vba regex

I need to replace some characters, but only when they are within brackets. So assume following example
this is a string (with comment), this is another string without comment, and this is a string (with one comment, and another one)
I need to be able to split this sentence based on the comma value. Which would work out fine apart from the annoying fact the last comment also contains a comma so my split is a bit limited. The desired result would have to be as follows
this is a string (with comment),
this is another string without comment,
and this is a string (with one comment, and another one)
I'm using access VBA, and my approach was to first isolate all the comments (content within brackets), replace the comma with a different character (say the pipe symbol) and than use the split or replace options to split the whole sentence.
What I tried is something as below, but I fail to deal with the regex match like I like to. Any alternative, or insight on how I can tacklle it best ?
Function commentFixer(s As String, t As String) As String
't = token to be replaced, eg a comma
Dim regEx As New RegExp
Dim match As String
regEx.Global = True
p = "(\([^()]*\)*)"
'match all commented substrings
regEx.Pattern = p
'below obviously doesn't work, as the match itself is not accepted as a character. Any way to deal with this ?
match = "$1" 'How can I store this in a variable to perform a replacement on the result ?
dim r as string 'replacement value
r = Replace(match, t, "|")
commentFixer = regEx.Replace(s, r)
End Function
Sub TestMe()
s = commentFixer("this is a string (with comment), this is another string without comment, and this is a string (with one comment, and another one)", ",")
Debug.Print s
'expected result : this is a string (with comment), this is another string without comment, and this is a string (with one comment| and another one)
End Sub
Here you go,
(.*?,)\s*(?![^()]*\))|(.+)$
Group index 1 and 2 contains the strings you want.
DEMO

Regex cheating in csv-parsing delimited at comma, ignore in quotes

all
So, I'm trying to figure out how to make a simple regex code for Visual Basic.net, but am not getting anywhere.
I'm parsing csv files into a list of array, but the source csv's are anything but pristine. There are extra/rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.
I've written in a bunch of error-checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of csv files. Crash. To get that last 0.01% parsed properly, I've created an If statement that will pull out lines that have odd numbers of quotes and remove ALL of them, which triggers a manual error-check If there are zero quotes, the field processes as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.
Could someone help me figure out a regex string that will ignore fields enclosed in quotes?
Here's the code I've been able to think up up to this point.
Thank you in advance
Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
System.Text.Encoding.Default) 'system text decoding adds odd characters
filereader1.TextFieldType = FieldType.Delimited
'filereader1.Delimiters = New String() {","}
filereader1.SetDelimiters(",")
filereader1.HasFieldsEnclosedInQuotes = True
For Each c As Char In whole_string
If c = """" Then cnt = cnt + 1
Next
If cnt = 0 Then 'no quotes
split_string = Split(whole_string, ",") 'split by commas
ElseIf cnt Mod 2 = 0 Then 'even number of quotes
split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
ElseIf cnt <> 0 Then 'odd number of quotes
whole_string = whole_string.Replace("""", " ") 'delete all quotes
split_string = Split(whole_string, ",") 'split by commas
End If
In VB.NET, there are several ways to proceed.
Option 1
You can use this regex: ,(?![^",]*")
It matches commas that are not inside quotes: a comma , that is not followed (as asserted by the negative lookahead (?![^",]*") ) by characters that are neither a comma nor a quote then a quote.
In VB.NET, something like:
Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
Option 2
This uses this beautifully simple regex: "[^"]*"|(,)
This is a more general solution and easy to tweak solution. For a full description, I recommend you have a look at this question about of Regex-matching or replacing... except when.... It can make a very tidy solution that is easy to maintain if you find other cases to tweak.
The left side of the alternation | matches complete "quotes". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This code should work:
Imports System
Imports System.Text.RegularExpressions
Imports System.Collections.Specialized
Module Module1
Sub Main()
Dim MyRegex As New Regex("""[^""]*""|(,)")
Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
Dim Replaced As String = myRegex.Replace(Subject,
Function(m As Match)
If (m.Groups(1).Value = "") Then
Return ""
Else
Return m.Groups(0).Value
End If
End Function)
Console.WriteLine(Replaced)
Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
Console.ReadKey()
End Sub
End Module
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...