RegEx vb.net MatchCollection is full of Empty Strings

I am very new to using Regular Expressions. (I have read Ben Forta's book and have learned from asking a previous question on here. I am trying to put my learning into practice.)
So, given this string: "[Class 4C] Physics 101 [~2] [#14 Worthington 5] FW"
I'd like these results:
"[Class 4C]"
"Physics 101"
"[~2]"
"[#14 Worthington 5]"
"FW"
I'm using this vb.net code:
Imports System.Text.RegularExpressions
---
Dim txt As String = "[Class 4C] Physics 101 [~2][#14 Worthington 5] FW"
Dim mc As MatchCollection = Regex.Matches(txt, "((?<=\])|(?=\[))")
Dim m As Match
For Each m In mc
    Debug.Print(m.Value)
Next m
The result is a System.Text.RegularExpressions.MatchCollection containing 6 empty strings.
Using RegEx Storm, I see something I can work with in the 'Split List' but what I'm getting in the MatchCollection is the data in the "Table" view.
How do I access the array shown in the Split List please? (Or do I need to use a different pattern?)

Thank you to the fourth bird who (in the comments) helped me reach the solution. In case it helps anyone else, here is the answer and updated code:
Regex.Matches returns a collection containing every match produced by applying the specified pattern. That normally yields only part(s) of a string and, as that is often what's needed, example code tends to show this way of using Regular Expressions. My pattern consists only of lookarounds, so every match is zero-width, which is why the collection held nothing but empty strings.
However, my requirement was to return ALL of the string, split into sections as defined by the RegEx. This is done using the Regex.Split method.
The result is what is referred to in RegEx Storm as the 'Split List'.
Here is the modified vb.net code. It splits the string, filters out empty strings and trims off any leading and trailing whitespace:
Imports System.Text.RegularExpressions
---
Dim text As String = "[Class 4C] Physics 101 [~2][#14 Worthington 5] FW"
Dim pattern As String = "((?<=\])|(?=\[))"
Dim matches() As String = Regex.Split(text, pattern)
For Each match As String In matches
    Dim trimmedMatch As String = Trim(match)
    If trimmedMatch.Length > 0 Then
        ' Do things here
        Debug.Print(trimmedMatch)
    End If
Next
Thanks again, the fourth bird. Really appreciated.
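For comparison, here is a minimal Python sketch of the same matches-versus-split distinction (Python rather than vb.net, purely as an illustration). It assumes Python 3.7 or later, where re.split also splits on zero-width matches. It also drops the outer capturing group from the pattern, because a capturing group in a split pattern adds its captures to the result. The variable names are only illustrative.

```python
import re

text = "[Class 4C] Physics 101 [~2][#14 Worthington 5] FW"
# Same lookarounds as above, minus the outer capturing group (an assumption
# made for this sketch: a capturing group would add its captures to the split).
pattern = r"(?<=\])|(?=\[)"

# Every match is zero-width, so listing the matches yields only empty strings.
matches = re.findall(pattern, text)

# Splitting at those positions keeps the text between them instead.
pieces = re.split(pattern, text)

# Trim each piece and drop the ones that end up empty.
sections = [p.strip() for p in pieces if p.strip()]

for s in sections:
    print(s)
```

The trimming and filtering step plays the same role as the Trim and Length check in the vb.net loop.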

Related

Match number between 2 date

I have a text and I need to extract a number that sits between 2 dates. I can't show the full text, so I will only use the part I need, but keep in mind it's part of a bigger text.
12/14/2020 355345 12/14/2020
From that, I need to get '355345'. I currently don't have anything to show of what I was doing, because I was working on getting the text before a sentence until I realized this is the only place where the number is between 2 dates.
Thanks!
Here's a snippet that might help:
Suppose the input is this:
Imports System.Text
Imports System.Text.RegularExpressions
'...
Dim input As New StringBuilder
input.AppendLine("12/14/2020 355345 12/14/2020")
input.AppendLine("12/13/2020 425345 12/13/2020")
input.AppendLine("12/20/2020 93488557 12/20/2020")
input.AppendLine("12/21/2020 4 12/21/2020")
input.AppendLine("12/20/2020 3443 12/20/2020")
'...
Use RegEx to extract the numbers between the two dates as follows:
Dim patt = "(\d+\/\d+\/\d+)\s?(\d+)\s?(\d+\/\d+\/\d+)"
For Each m In Regex.
Matches(input.ToString, patt, RegexOptions.Multiline).
Cast(Of Match)
Console.WriteLine(m.Groups(2).Value)
Next
This will capture three groups. Example for the first match:
m.Groups(1).Value : 12/14/2020 the first date.
m.Groups(2).Value : 355345 the number in between.
m.Groups(3).Value : 12/14/2020 the second date.
If you have no use for the captured dates, there is no need to capture them; use the following pattern instead:
Dim patt = "\d+\/\d+\/\d+\s?(\d+)\s?\d+\/\d+\/\d+"
For Each m In Regex.
Matches(input.ToString, patt, RegexOptions.Multiline).
Cast(Of Match)
Console.WriteLine(m.Groups(1).Value)
Next
And you will get the number between the two dates in Group 1.
The output of both is:
355345
425345
93488557
4
3443
regex101
Also, using explicit quantifiers in the pattern is a good idea, as @AndrewMorton mentioned in his appreciated comments, in order to skip things like 1234/239994/2293 in the input:
Dim patt = "\d{1,2}/\d{1,2}/\d{4}\s(\d{1,})\s\d{1,2}/\d{1,2}/\d{4}"
For Each m In Regex.
Matches(input.ToString, patt, RegexOptions.Multiline).
Cast(Of Match)
Console.WriteLine(m.Groups(1).Value)
Next
The quantifiers-way test is here.
If you can safely check for numbers and slashes, then a pattern like this should work:
\d\d/\d\d/\d\d\d\d +(\d+) +\d\d/\d\d/\d\d\d\d
...where capture group 1 would hold the number being sought. If you need to validate that the values are actually dates, well... you can do it with regex to a degree, but the pattern becomes very difficult to read.
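As a rough cross-check of the quantified pattern, here is a small Python sketch (Python rather than VB.NET, for illustration only). The last input line is a hypothetical junk line added to show that the stricter date shape skips it; the names `date`, `pattern` and `numbers` are made up for this example.

```python
import re

lines = [
    "12/14/2020 355345 12/14/2020",
    "12/13/2020 425345 12/13/2020",
    "12/20/2020 93488557 12/20/2020",
    "12/21/2020 4 12/21/2020",
    "12/20/2020 3443 12/20/2020",
    "1234/239994/2293 77 1234/239994/2293",  # hypothetical junk line
]
text = "\n".join(lines)

# One or two digits, slash, one or two digits, slash, four digits.
date = r"\d{1,2}/\d{1,2}/\d{4}"

# Only the number between the two dates is captured.
pattern = re.compile(date + r"\s+(\d+)\s+" + date)

# With a single capturing group, findall returns that group for each match.
numbers = pattern.findall(text)
print(numbers)
```

Note that this checks only the shape of the dates, not whether they are valid calendar dates.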

Replacing everything but the matching regex string

I've searched for this answer but haven't found an answer that exactly works.
I have the following pattern where the hashes are any digit: 102###-###:#####-### or 102###-###:#####-####
It must start with 102 and the last set in the pattern can either be 3 or 4 digits.
The problem is that I can have a string with between 1-5 of these patterns in it, with any sort of characters in between (spaces, letters, etc.). The regex I posted below matches the patterns well, but I am trying to select everything that is NOT this pattern so I can remove it. The end goal is to extract all the patterns and have them comma delimited as the output (Pattern, Pattern, Pattern). How do I accomplish this with regex? Perhaps there is a better way than this approach? Thanks. This is using VBA.
Regex For Pattern:(\D102\d{3}-\d{3}:\d{5}-\d{3,4}\D)
String Example: type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff
There's no need to match everything you don't want just so you can remove it; that's more difficult. Just match what you do need and do whatever you want with it.
See regex in use here
(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)
See code in use here
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim sourcestring As String = "type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff"
        Dim re As Regex = New Regex("(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)")
        Dim mc As MatchCollection = re.Matches(sourcestring)
        For Each m As Match In mc
            Console.WriteLine(m.Groups(0).Value)
        Next
    End Sub
End Module
Result:
102456-345:56746-234
102456-345:56746-2343
102456-345:56746-234
102456-345:56746-2345
I am trying to select everything that is NOT this pattern so I can remove it. The end goal is to extract all the patterns and just have all the patterns comma delimited as the output
If you want to extract the patterns, then just do that, without removing everything around them. Example in Python: (Posted before the question's language was specified, but I'm sure the same can be done in VBA.)
>>> import re
>>> p = r"102\d{3}-\d{3}:\d{5}-\d{3,4}"
>>> text = "type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff"
>>> ",".join(re.findall(p, text))
'102456-345:56746-234,102456-345:56746-2343,102456-345:56746-234,102456-345:56746-2345'
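The Python pattern above has no digit guards, while the VB.NET answer wraps the pattern in (?<!\d) and (?!\d). A small sketch of what those lookarounds change, staying in Python; the `embedded` string is a hypothetical input invented for this example, as are the names `core` and `guarded`.

```python
import re

text = ("type:102456-345:56746-234 102456-345:56746-2343 "
        "FollowingCell#:102456-345:56746-234 exampletext##$% "
        "102456-345:56746-2345 stuff")

core = r"102\d{3}-\d{3}:\d{5}-\d{3,4}"
# Lookarounds assert "no digit on either side" without consuming characters,
# so matches at the very start or end of the string still work.
guarded = re.compile(r"(?<!\d)" + core + r"(?!\d)")

# Comma-delimited output, as asked for in the question.
result = ", ".join(guarded.findall(text))
print(result)

# Hypothetical input where the pattern sits inside a longer run of digits.
embedded = "9102456-345:56746-23456"
print(re.findall(core, embedded))   # the unguarded pattern picks up a fragment
print(guarded.findall(embedded))    # the guarded pattern finds nothing
```

By contrast, the \D at each end of the question's pattern must consume a real non-digit character, so it cannot match at the very start or end of the string, and the delimiters become part of the match.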

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to use regex to pull each paragraph out, place it in a string, then run other regexes to get the information that I need. Some paragraphs won't contain the data that I need, so the idea was to loop through each individual paragraph and handle errors better if the data I need wasn't in that paragraph (i.e. get what I can and drop the rest with an error message).
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
    Dim s As String
    Dim ary As Variant
    Dim a As Variant
    Dim i As Long
    With Application.WorksheetFunction
        ' Join all constant cells in column A into one string
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub
Sub F()
    Dim re As New RegExp
    Dim sMatch As String
    Dim document As String
    re.Pattern = "-\n((.|\n)+?)\n-"
    'Getting document
    document = ...
    sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
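To see why the original pattern returned only a single '-', and how the "any character or newline" alternative fixes it, here is a minimal Python sketch (Python instead of VBA, for illustration). The sample document is made up. The sketch also makes one change that is my own assumption and not part of the answer above: the closing dash is checked with a lookahead instead of being consumed, so that one dash line can close a paragraph and open the next.

```python
import re

document = (
    "-\n"
    "1. This is a paragraph\n"
    "It may go over multiple lines\n"
    "-\n"
    "2. Another paragraph\n"
    "-\n"
)

# The question's pattern: '.' stops at a newline, so the optional group is
# skipped and only a lone dash is matched.
first_try = re.search(r"-?(\d.*?)?-", document).group(0)

# (?:.|\n) lets the lazy group run across lines. The closing dash is only
# looked ahead at (an assumption of this sketch), so a shared dash line is
# still available to start the next paragraph.
para = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*(?=\n-)")
paragraphs = para.findall(document)

for p in paragraphs:
    print(repr(p))
```

If every paragraph has its own pair of dash lines, consuming the closing dash as in the pattern above works just as well.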

How to get Opposite result of Regex.Split VB.NET?

I have some string, like this one: [H]GOODYEAR[/H] [H]TIRE[/H] & RUBBER COMPANY
I need to get the words that are inside the [H] [/H] nodes in this string.
I created this Regex pattern: \[H](.*?)\[\/H]
I've tried to use the Regex.Split method to get these words. Here's my code:
Dim pattern As String = "\[H](.*?)\[\/H]"
Dim input As String = "[H]GOODYEAR[/H] [H]TIRE[/H] & RUBBER COMPANY"
Dim SearchedResult() As String = Regex.Split(input, pattern, RegexOptions.IgnoreCase)
But then I realized that this Split gives me everything, not just the words I need.
My question: how do I get the correct words? Is there any way to REVERSE a Regex pattern? Or is there a better way to get my result?
Instead of splitting the string, you should use the Regex.Matches method.
Note: I used the inline modifiers (?si): the s (dotAll) modifier forces the dot . to match newline characters in case the nodes span multiple lines, and the i modifier makes matching case-insensitive.
Dim input As String = "[H]GOODYEAR[/H] [H]TIRE[/H] & RUBBER COMPANY"
For Each m As Match In Regex.Matches(input, "(?si)\[H](.*?)\[/H]")
    Console.WriteLine(m.Groups(1).Value)
Next
Output
GOODYEAR
TIRE
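The difference between splitting and matching with a capturing group can also be sketched in Python (not VB.NET; shown only as an illustration, with flags passed as arguments instead of the inline (?si) form). The variable names are illustrative.

```python
import re

text = "[H]GOODYEAR[/H] [H]TIRE[/H] & RUBBER COMPANY"
pattern = re.compile(r"\[H](.*?)\[/H]", re.IGNORECASE | re.DOTALL)

# findall returns the contents of capture group 1 for every match.
words = pattern.findall(text)

# Splitting on the same pattern keeps the text around the matches and, because
# of the capturing group, interleaves the captured words with it.
split_result = pattern.split(text)

print(words)
print(split_result)
```

The split result mixes the wanted words with the leftover text, which is the "everything" the question describes.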

VBA Regular Expressions - Run-Time Error 91 when trying to replace characters in string

I am doing this task as part of a larger sub in order to massively reduce the workload for a different team.
I am trying to read in a string and use Regular Expressions to replace one-to-many spaces with a single space (or another character). At the moment I am using a local string; however, in the main sub this data will come from an external .txt file. The number of spaces between elements in this .txt can vary depending on the row.
I am using the code below, replacing the spaces with a dash. I have tried different variations and different logic, but I always get "Run-time error '91': Object variable or With block variable not set" on the line "c = re.Replace(s, replacement)".
After using breakpoints, I have found out that my RegExp object (re) is empty, but I can't quite figure out how to progress from here. How do I replace my spaces with dashes? I have been at this problem for hours and spent most of that time on Google to see if someone has had a similar issue.
Sub testWC()
    Dim s As String
    Dim c As String
    Dim re As RegExp
    s = "hello World"
    Dim pattern As String
    pattern = "\s+"
    Dim replacement As String
    replacement = "-"
    c = re.Replace(s, replacement)
    Debug.Print (c)
End Sub
Extra information: Using Excel 2010. I have successfully linked all my references ("Microsoft VBScript Regular Expressions 5.5"). I was successfully able to replace the spaces using the vanilla "Replace" function; however, as the number of spaces between elements varies, I am unable to use that to solve my issue.
Edit: My .txt file is not fixed either; there are a number of rows of different lengths, so I am unable to use the MID function in Excel to dissect the string either.
Please help
Thanks,
J.H.
You're not setting up the RegExp object correctly.
Dim pattern As String
pattern = "\s+" ' pattern is just a local string, not bound to the RegExp object!
You need to do this:
Dim re As RegExp
Set re = New RegExp
re.Pattern = "\s+" ' Now the pattern is bound to the RegExp object
re.Global = True ' Assuming you want to replace *all* matches
s = "hello World"
Dim replacement As String
replacement = "-"
c = re.Replace(s, replacement)
Try setting the pattern inside your RegExp object. Right now, re is declared but never created, and the pattern string is never assigned to it. Add re.Pattern = pattern after you initialize your pattern string.
You initialized the pattern but didn't actually hook it into the RegExp. More importantly, re was still Nothing when you called Replace, and that is what threw error 91.
So also set re to a New RegExp.
Sub testWC()
    Dim s As String
    Dim c As String
    Dim re As RegExp
    Set re = New RegExp ' create the object; without this, re is Nothing
    s = "hello World"
    Dim pattern As String
    pattern = "\s+"
    re.Pattern = pattern
    re.Global = True ' replace every run of whitespace, not only the first
    Dim replacement As String
    replacement = "-"
    c = re.Replace(s, replacement)
    Debug.Print (c)
End Sub
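For readers comparing with other regex libraries, here is a minimal Python sketch of the same replacement (Python rather than VBA, for illustration; the sample string is made up). It shows the two points from the answers: the pattern has to live in the regex object, and "replace all" versus "replace first" is a separate setting. In Python, sub replaces every match by default, whereas the VBScript RegExp object's Global property defaults to False.

```python
import re

s = "hello   World  again"

# The compiled object carries the pattern, much like re.Pattern in VBA.
whitespace = re.compile(r"\s+")

# Replace every run of whitespace (comparable to Global = True).
all_replaced = whitespace.sub("-", s)

# Replace only the first run (comparable to Global = False).
first_only = whitespace.sub("-", s, count=1)

print(all_replaced)
print(first_only)
```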