how to do a replace within a replace match using vba regex - regex

I need to replace some characters, but only when they are within brackets. So assume following example
this is a string (with comment), this is another string without comment, and this is a string (with one comment, and another one)
I need to be able to split this sentence based on the comma value. Which would work out fine apart from the annoying fact the last comment also contains a comma so my split is a bit limited. The desired result would have to be as follows
this is a string (with comment),
this is another string without comment,
and this is a string (with one comment, and another one)
I'm using access VBA, and my approach was to first isolate all the comments (content within brackets), replace the comma with a different character (say the pipe symbol) and than use the split or replace options to split the whole sentence.
What I tried is something as below, but I fail to deal with the regex match like I like to. Any alternative, or insight on how I can tacklle it best ?
Function commentFixer(s As String, t As String) As String
't = token to be replaced, eg a comma
Dim regEx As New RegExp
Dim match As String
regEx.Global = True
p = "(\([^()]*\)*)"
'match all commented substrings
regEx.Pattern = p
'below obviously doesn't work, as the match itself is not accepted as a character. Any way to deal with this ?
match = "$1" 'How can I store this in a variable to perform a replacement on the result ?
dim r as string 'replacement value
r = Replace(match, t, "|")
commentFixer = regEx.Replace(s, r)
End Function
Sub TestMe()
s = commentFixer("this is a string (with comment), this is another string without comment, and this is a string (with one comment, and another one)", ",")
Debug.Print s
'expected result : this is a string (with comment), this is another string without comment, and this is a string (with one comment| and another one)
End Sub

Here you go,
(.*?,)\s*(?![^()]*\))|(.+)$
Group index 1 and 2 contains the strings you want.
DEMO

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original implementation of your original pattern \^S*\$ had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The Pattern I used: (\S+) is placed in a capturing group (...) that will capture all cases of \S+ (all characters that are NOT a white space, + one or more times.
I also used the .Global so you will continue matching after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
Miscellaneous Notes
I would have advised just to use Split(), but you stated that there were cases where more than one consecutive space may have been an issue. If this wasn't the case, you wouldn't need regex at all, something like:
arrStr = Split(line)
Would have split on every occurance of a space

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to regex each paragraph out place that in a string then run other regex to get the information that I need. Some paragraphs wont contain the data that I need so the idea was to loop through each individual paragraph and then error handle better if the data I need wasn't in that paragraph (ie get what I can and drop the rest with an error message)
Here is a screenshot:
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
Dim s As String
Dim ary
With Application.WorksheetFunction
s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
End With
ary = Split(s, "-")
i = 1
For Each a In ary
Cells(i, 2) = a
i = i + 1
Next a
End Sub
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need dashes -, then just include them into capture group (the outer parenthesis).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As added benefit leading and trailing whitespace is cut off ("\s*" outside the group).

Extract text using word VBA regex then save to a variable as string

I am trying to create code in Word VBA that will automatically save (as PDF) and name a document based on it's content, which is in text and not fields. Luckily the formatting is standardized and I already know how to save it. I tested my regex elsewhere to make sure it pulls what I am looking for. The trouble is I need to extract the matched statement, convert it to a string, and save it to an object (so I have something to pass on to the code where it names the document).
The part of the document I need to match is below, from the start of "Program" through the end of the line and looks like:
Program: Program Name (abr)
and the regex I worked out for this is "Program:[^\n]"
The code I have so far is below, but I don't know how to execute the regex in the active document, convert the output to a string and save to an object:
Sub RegExProgram()
Dim regEx
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = False
regEx.pattern = "Program\:[^\n]"
(missing code here)
End Sub
Any ideas are welcome, and I am sorry if this is simple and I am just overlooking something obvious. This is my first VBA project, and most of the resources I can find suggest replacing using regex, not saving extracted text as string. Thank you!
Try this:
You can find documentation for the RegExp class here.
Dim regEx as Object
Dim matchCollection As Object
Dim extractedString As String
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.IgnoreCase = True
.Global = False ' Only look for 1 match; False is actually the default.
.Pattern = "Program: ([^\r]+)" ' Word separates lines with CR (\r)
End With
' Pass the text of your document as the text to search through to regEx.Execute().
' For a quick test of this statement, pass "Program: Program Name (abr)"
set matchCollection = regEx.Execute(ActiveDocument.Content.Text)
' Extract the first submatch's (capture group's) value -
' e.g., "Program Name (abr)" - and assign it to variable extractedString.
extractedString = matchCollection(0).SubMatches(0)
I've modified your regex based on the assumption that you want to capture everything after Program: through the end of the line; your original regex would only have captured Program:<space>.
Enclosing [^\r]+ (all chars. through the end of the line) in (...) defines a so-called subexpression (a.k.a. capture group), which allows selective extraction of only the substring of interest from what the overall pattern captures.
The .Execute() method, to which you pass the string to search in, always returns a collection of matches (Match objects).
Since the .Global property is set to False in your code, the output collection has (at most) 1 entry (at index 0) in this case.
If the regular expression has subexpressions (1 in our case), then each entry of the match collection has a nonempty .SubMatches collection, with one entry for each subexpression, but note that the .SubMatches entries are strings, not Match objects.
Match objects have properties .FirstIndex, .Length, and Value (the captured string). Since the .Value property is the default property, it is sufficient to access the object itself, without needing to reference the .Value property (e.g., instead of the more verbose matchCollection(0).Value to access the captured string (in full), you can use shortcut matchCollection(0) (again, by contrast, .SubMatches entries are strings only).
If you're just looking for a string that starts with "Program:" and want to go to the end of the line from there, you don't need a regular expression:
Public Sub ReadDocument()
Dim aLine As Paragraph
Dim aLineText As String
Dim start As Long
For Each aLine In ActiveDocument.Paragraphs
aLineText = aLine.Range.Text
start = InStr(aLineText, "Program:")
If start > 0 Then
my_str = Mid(aLineText, start)
End If
Next aLine
End Sub
This reads through the document line by line, and stores your match in the variable "my_str" when it encounters a line that has the match.
Lazier version:
a = Split(ActiveDocument.Range.Text, "Program:")
If UBound(a) > 0 Then
extractedString = Trim(Split(a(1), vbCr)(0))
End If
If I remember correctly, paragraphs in Word end with vbCr ( \r not \n )

Regular Expression to split by comma + ignores comma within double quotes. VB.NET

I'm trying to parse csv file with VB.NET.
csv files contains value like 0,"1,2,3",4 which splits in 5 instead of 3. There are many examples with other languages in Stockoverflow but I can't implement it in VB.NET.
Here is my code so far but it doesn't work...
Dim t As String() = Regex.Split(str(i), ",(?=([^\""]*\""[^\""]*\"")*[^\""]*$)")
Assuming your csv is well-formed (ie no " besides those used to delimit string fields, or besides ones escaped like \"), you can split on a comma that's followed by an even number of non-escaped "-marks. (If you're inside a set of "" there's only an odd number left in the line).
Your regex you've tried looks like you're almost there.
The following looks for a comma followed by an even number of any sort of quote marks:
,(?=([^"]*"[^"]*")*[^"]*$)
To modify it to look for an even number of non-escaped quote marks (assuming quote marks are escaped with backslash like \"), I replace each [^"] with ([^"\\]|\\.). This means "match a character that isn't a " and isn't a blackslash, OR match a backslash and the character immediately following it".
,(?=(([^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)
See it in action here.
(The reason the backslash is doubled is I want to match a literal backslash).
Now to get it into vb.net you just need to double all your quote marks:
splitRegex = ",(?=(([^""\\]|\\.)*""([^""\\]|\\.)*"")*([^""\\]|\\.)*$)"
Instead of a regular expression, try using the TextFieldParser class for reading .csv files. It handles your situation exactly.
TextFieldParserClass
Especially look at the HasFieldsEnclosedInQuotes property.
Example:
Note: I used a string instead of a file, but the result would be the same.
Dim theString As String = "1,""2,3,4"",5"
Using rdr As New StringReader(theString)
Using parser As New TextFieldParser(rdr)
parser.TextFieldType = FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
Dim fields() As String = parser.ReadFields()
For i As Integer = 0 To fields.Length - 1
Console.WriteLine("Field {0}: {1}", i, fields(i))
Next
End Using
End Using
Output:
Field 0: 1
Field 1: 2,3,4
Field 2: 5
This worked great for parsing a Shipping Notice .csv file we receive. Thanks for keeping this solution here.
This is my version of the code:
Try
Using rdr As New IO.StringReader(Row.FlatFile)
Using parser As New FileIO.TextFieldParser(rdr)
parser.TextFieldType = FileIO.FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
Dim fields() As String = parser.ReadFields()
Row.Account = fields(0).ToString().Trim()
Row.AccountName = fields.GetValue(1).ToString().Trim()
Row.Status = fields.GetValue(2).ToString().Trim()
Row.PONumber = fields.GetValue(3).ToString().Trim()
Row.ErrorMessage = ""
End Using
End Using
Catch ex As Exception
Row.ErrorMessage = ex.Message
End Try
It's possible to do it with regex VB.NET in the following way:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
The positive lookahead ((?= ... )) ensures that there is an even number of quotes ahead of the comma to split on (i.e. either they occur in pairs, or there are none).
[^"]* matches non-quote characters.
Given below is a VB.NET example to apply the regex.
Imports System
Imports System.Text.RegularExpressions
Public Class Test
Public Shared Sub Main()
Dim theString As String = "1,""2,3,4"",5"
Dim theStringArray As String() = Regex.Split(theString, ",(?=(?:[^""\\]*""[^""\\]*"")*[^""\\]*$)")
For i As Integer = 0 To theStringArray.Length - 1
Console.WriteLine("theStringArray {0}: {1}", i, theStringArray(i))
Next
End Sub
End Class
'Output:
'theStringArray 0: 1
'theStringArray 1: "2,3,4"
'theStringArray 2: 5

Quick Regex Matches Question

(Yes I am using regex to parse HTML, its the only solution I know)
Im having trouble creating the regex for the below piece of code, there are about 10 matches per page.
Inner Text
this is the regex ive been trying
below is the code I usually use to get a match collection
Private Function Extract(ByVal source As String) As String()
Dim mc As MatchCollection
Dim i As Integer
mc = Regex.Matches(source, _
"<A href=" & Chr(34) & "viewmessage.aspx?message_id *.</A>")
Dim results(mc.Count - 1) As String
For i = 0 To results.Length - 1
results(i) = mc(i).Value
Next
Return results
End Function
Dim str1 As String()
Dim str2 As String
Dim results As New StringBuilder
str1 = Extract(result)
For Each str2 In str1
results.Append(str2 & vbNewLine)
Next
RTBlinks.Text = results.ToString
Could anyone point out what im doing wrong ? I have spent a few hours trying different things.
I try to program mainly as a hobby, so apologies if ive made any glaring errors.
You've got *. where you'd need .*. Right now, the quantifier * is applied to the space before it, and the dot matches exactly one character. Switch the two, remove the space (it matters, and there is no space in your test string at this point) and try again.
Be aware that .* matches greedily, i. e. as many characters as possible (except newlines). So if you have no more than one <A> tag per line, it should still work. A bit safer would be .*? instead, making the dot match as few characters as possible; even safer [^<]* which would match anything except opening angle brackets, making sure we don't cross tag boundaries.
However, all of those measures fail in certain, not uncommon situations (think comments, attribute strings, nested tags, invalid markup) which is why you should let regexes loose on markup languages only if you can exactly control your inputs and know your limitations.
Also, I think that in VB.NET you can escape quotes inside a string by doubling it, so you can simply write
"<A href=""viewmessage.aspx?message_id=.*?</A>"