Quick Regex Matches Question - regex

(Yes, I am using regex to parse HTML; it's the only solution I know.)
I'm having trouble creating the regex for the piece of code below; there are about 10 matches per page.
Inner Text
This is the regex I've been trying.
Below is the code I usually use to get a match collection:
Private Function Extract(ByVal source As String) As String()
    Dim mc As MatchCollection
    Dim i As Integer
    mc = Regex.Matches(source, _
        "<A href=" & Chr(34) & "viewmessage.aspx?message_id *.</A>")
    Dim results(mc.Count - 1) As String
    For i = 0 To results.Length - 1
        results(i) = mc(i).Value
    Next
    Return results
End Function

Dim str1 As String()
Dim str2 As String
Dim results As New StringBuilder
str1 = Extract(result)
For Each str2 In str1
    results.Append(str2 & vbNewLine)
Next
RTBlinks.Text = results.ToString
Could anyone point out what I'm doing wrong? I have spent a few hours trying different things.
I program mainly as a hobby, so apologies if I've made any glaring errors.

You've got *. where you'd need .*. Right now, the quantifier * is applied to the space before it, and the dot matches exactly one character. Switch the two, remove the space (it matters, and there is no space in your test string at this point) and try again.
Be aware that .* matches greedily, i.e. as many characters as possible (except newlines). So if you have no more than one <A> tag per line, it should still work. A bit safer would be .*? instead, which makes the dot match as few characters as possible; safer still is [^<]*, which matches anything except opening angle brackets and so cannot cross tag boundaries.
However, all of those measures fail in certain, not uncommon situations (think comments, attribute strings, nested tags, invalid markup) which is why you should let regexes loose on markup languages only if you can exactly control your inputs and know your limitations.
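The difference between the three quantifier choices can be seen in a small sketch. It is written in Python's `re` module purely for illustration (the .NET engine behaves the same way for these constructs), and the sample markup is hypothetical, since the real page is not shown in the question. The literal `.` and `?` of the URL are escaped here; left unescaped, `x?` would mean "an optional x" and the literal question mark would never match.

```python
import re

# Hypothetical sample input; the real page markup is not shown in the question.
html = (
    '<A href="viewmessage.aspx?message_id=1">First</A> '
    '<A href="viewmessage.aspx?message_id=2">Second</A>'
)

# Literal '.' and '?' in the URL are escaped so they are not metacharacters.
prefix = r'<A href="viewmessage\.aspx\?message_id='

greedy = re.findall(prefix + r'.*</A>', html)       # runs on to the LAST </A>
lazy = re.findall(prefix + r'.*?</A>', html)        # stops at the first </A>
bounded = re.findall(prefix + r'[^<]*</A>', html)   # cannot cross a '<'

# Same pattern with '.' and '?' left unescaped: finds nothing at all.
unescaped = re.findall(r'<A href="viewmessage.aspx?message_id=.*?</A>', html)

print(len(greedy), len(lazy), len(bounded), len(unescaped))  # prints: 1 2 2 0
```

With two links on one line, the greedy version returns a single match that swallows both, while the lazy and bounded versions return one match per link.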
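Also, in VB.NET you can escape quotes inside a string by doubling them. Note too that the ? after aspx and the dot before it are regex metacharacters: left unescaped, x? makes the x optional and never matches the literal question mark in the URL, so both should be escaped with a backslash. You can then simply write
"<A href=""viewmessage\.aspx\?message_id=.*?</A>"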

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
    .Pattern = "\S+"
    .Global = True
    If .test(line) Then
        With .Execute(line)
            ReDim arrStr(.Count - 1)
            For i = 0 To .Count - 1
                arrStr(i) = .Item(i)
            Next
        End With
    End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original pattern \^S*\$ had some issues:
S* was matching a literal uppercase S rather than a character class, because the S was not escaped with a backslash.
Even if it had been escaped, the pattern could match an empty string at any position because of your quantifier: * means zero or more of \S. You were probably looking for the + quantifier (one or more).
You were right to keep it greedy (not using *?), since you want to consume as much as possible.
The pattern I used, \S+, matches runs of one or more characters that are NOT whitespace.
I also set .Global = True so that matching continues after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
    .Pattern = "\S+"
    .Global = True
    If .test(line) Then
        With .Execute(line)
            ReDim arrStr(.Count - 1)
            For i = 0 To .Count - 1
                arrStr(i) = .Item(i)
            Next
        End With
    End If
End With
Miscellaneous Notes
I would have advised just using Split(), but you stated that there may be more than one consecutive space between values. If that weren't the case, you wouldn't need regex at all; something like:
arrStr = Split(line)
would split on every occurrence of a space.
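To see why runs of spaces matter, here is a minimal sketch in Python (used only for illustration; the input string is made up). Collecting runs of non-whitespace with \S+ yields clean tokens, whereas splitting on a single space leaves an empty entry for every extra space.

```python
import re

line = "This   is a    test"  # hypothetical input with runs of spaces

# Collect runs of non-whitespace, the same idea as the \S+ pattern above.
tokens = re.findall(r"\S+", line)

# Splitting on a single space leaves empty strings where spaces repeat.
naive = line.split(" ")

print(tokens)  # ['This', 'is', 'a', 'test']
print(naive)   # ['This', '', '', 'is', 'a', '', '', '', 'test']
```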

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if it was being cut off in the message box, but both theories were proved wrong.
I am at a complete loss now, and I am beginning to suspect it is a VBA thing rather than a regex thing. There is no requirement to use regex; I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. The idea was to pull each paragraph out with a regex, place it in a string, and then run other regexes to get the information that I need. Some paragraphs won't contain the data that I need, so I want to loop through each individual paragraph and handle errors gracefully if the data isn't there (i.e. get what I can and drop the rest with an error message).
This simple approach does not use regex. It assumes the data is in column A and places the paragraphs in column B:
Sub paragraph_no_regex()
    Dim s As String
    Dim ary, a
    Dim i As Long
    With Application.WorksheetFunction
        ' Join all constant cells in column A into one string
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With
    ' Each "-" separator marks a paragraph boundary
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub
Sub F()
    Dim re As New RegExp
    Dim sMatch As String
    Dim document As String
    re.Pattern = "-\n((.|\n)+?)\n-"
    'Getting document
    document = ...
    sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag, because line ends are matched explicitly with \n. The main point is that the innermost group matches any character including the end-of-line (given as an alternative), since the dot alone does not match newlines, but does so in a non-greedy way ("+?"). Word boundaries are not needed here. Also, "-" is not a special character outside a character class, so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
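The pattern can be tried out in a short sketch, written here in Python for illustration (like the VBScript engine, Python's dot does not match newlines by default). The sample document is hypothetical and assumes that consecutive paragraphs share a single "-" line. In that layout the closing "\n-" is consumed by each match and is no longer available to open the next paragraph; the lookahead variant is my own adjustment, not part of the answer above.

```python
import re

# Hypothetical document in which consecutive paragraphs share one "-" line.
document = (
    "-\n"
    "1. This is a paragraph\n"
    "It may go over multiple lines\n"
    "-\n"
    "2. Second paragraph\n"
    "-\n"
)

# The pattern from the answer: the closing "\n-" is consumed by each match.
consuming = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*\n-")

# Variant (an assumption, not from the answer): a lookahead leaves the
# closing "-" in place so it can also open the next paragraph.
lookahead = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*(?=\n-)")

print(consuming.findall(document))  # only the first paragraph
print(lookahead.findall(document))  # both paragraphs
```

If every paragraph has its own opening and closing "-" line, the original pattern finds all of them and the lookahead is unnecessary.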

Regex to determine if a string is a name of a range or a cell's address

I'm struggling to come up with a regular expression pattern that can help me determine if a string is a cell's address or if it is a cell's name.
Here are some examples of cell addresses:
"E5"
"AA55:E5"
"DD5555:DDD55555, E5, F5:AA55"
"$F7:$G$7"
Here are some examples of cell names:
"bis_document_id"
"PCR1MM_YPCVolume"
"sheet_error7"
"blahE5"
"training_A1"
"myNameIsGeorgeJR"
Is there a regex pattern you guys can come up with that will match all of either group and none of the other?
I have been able to think of a couple of ways to determine what a string is not:
If it contains a "$" or ":", I know it is not a cell's name and is most likely a cell's address.
If it has more than three consecutive numbers, it is most likely not a cell's address.
A cell's address is extremely unlikely to have more than 2 letters preceding a number, 99.9% of the cell addresses will be in columns A to ZZ.
Alas, these three small tests can hardly prove what this string is.
Thanks for the help!
OK, this one's fun:
^\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*$
Let's break it down, because it's rather nasty. The magic subpattern, really, is this:
\$?[A-Z]+\$?\d+
This little thing will match any single cell address, with optional absolute-reference $ signs. The next bit,
(?::\$?[A-Z]+\$?\d+)?
will match the same thing optionally (the ? quantifier at the end), but preceded by a colon (:). That lets us get ranges. The next bit,
(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*
matches the same thing as the first, but zero or more times (using the * quantifier), and preceded by a comma and optional spaces using the special \s token (which means "any whitespace").
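As a quick check, the full pattern can be run against the example strings from the question. This is a sketch in Python's `re` (chosen for illustration; the pattern uses no engine-specific features), and the helper name is_address is hypothetical.

```python
import re

# The full pattern from above, split over two lines for readability.
ADDRESS = re.compile(
    r"^\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?"
    r"(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*$"
)

# Example strings taken from the question.
addresses = ["E5", "AA55:E5", "DD5555:DDD55555, E5, F5:AA55", "$F7:$G$7"]
names = ["bis_document_id", "PCR1MM_YPCVolume", "sheet_error7",
         "blahE5", "training_A1", "myNameIsGeorgeJR"]


def is_address(text):
    """Return True if text looks like a cell address or a list of ranges."""
    return ADDRESS.match(text) is not None


for text in addresses + names:
    print(text, is_address(text))
```

All four address examples are accepted and all six name examples are rejected. Note that the match is case-sensitive, so a lowercase address such as "e5" would be treated as a name unless the ignore-case option is switched on.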
If we want to get really fancy (and, mind you, I have no idea whether Excel's regex engine supports this; I just wrote it for fun), we can use recursion to accomplish the same thing:
^((\$?[A-Z]+\$?\d+)(?::(?2))?)(?:,\s*(?1))*$
In this case, the magic \$?[A-Z]+\$?\d+ is inside the second capturing group, which is used recursively by the (?2) token. The entire subpattern for a single address or range of them is contained within the first capture group, and is then used to match additional addresses or ranges in a list.
So here's a regex for VBA which will find any cell reference irrespective where it is.
NOTE: I've assumed you're running this against a cell's Formula string, so the pattern is not anchored to the start or end of the string; a string can contain both cell references and cell names, and only the cell references are picked up:
(?:\W|^)(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)(?!\w)
(?:\W|^) is at the start and ensures that the reference is preceded by a non-word character or the start of the string (remove |^ if there is always a = at the start, as in Formula strings). Regrettably, I found that VBA does not have a functioning negative lookbehind.
(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?) finds the actual cell reference and is broken down below:
\$?[A-Z]{1,3}\$?[0-9]{1,7} matches one to three capital letters followed by one to seven digits (covering Excel's current column and row limits), each optionally preceded by a $;
(:\$?[A-Z]{1,3}\$?[0-9]{1,7})? is the same as above, except that it allows a second cell reference after a colon; the trailing ? makes it optional.
(?!\w) is a negative lookahead and says that the character after the reference must not be a word character (presumably, in formulas the only things you can have around a cell reference are parentheses and operators).
I wrote a VBA function in Excel and it returned the following with the above RegEx:
NB: Obviously, it doesn't check whether the characters are in the right order, as the reference $AZO113:A4 is returned despite being impossible.
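The behaviour described above can be sketched as follows, again in Python for illustration only. The helper name cell_refs and the sample formula are hypothetical. Because the leading (?:\W|^) consumes the character before the reference, the reference itself is read from capture group 1 rather than from the whole match.

```python
import re

CELL_REF = re.compile(
    r"(?:\W|^)(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)(?!\w)"
)


def cell_refs(formula):
    """Return every cell reference found in a formula string."""
    # Group 1 holds the reference without the preceding non-word character.
    return [m.group(1) for m in CELL_REF.finditer(formula)]


formula = "=SUM($F$10:$F$11)+training_A1*B2"  # hypothetical formula text
print(cell_refs(formula))  # ['$F$10:$F$11', 'B2']
```

The name training_A1 is skipped because the underscore before A1 is a word character, so neither \W nor ^ can match in front of it.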
After trying several solutions, I had to modify a regex so that it works for me. My version only supports non-named ranges.
((?![\=,\(\);])(\w+!)|('.+'!))?((\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)|(\$?[A-Z]{1,3}(:\$?[A-Z]{1,3}\$?)))
It will capture ranges in all of the following situations
=FUNCTION(F:F)
=FUNCTION($B22,G$5)
=SUM($F$10:$F$11)
=$J10-$K10
=SUMMARY!D4
I created the following function for RegEx. First, though, tick the reference to "Microsoft VBScript Regular Expressions 5.5" under Tools > References.
Function RegExp(ByVal sText As String, ByVal sPattern, Optional bGlobal As Boolean = True, Optional bIgnoreCase As Boolean = False, Optional bArray As Boolean = False) As Variant
    Dim objRegex As New RegExp
    Dim Matches As MatchCollection
    Dim Match As Match
    Dim i As Integer
    objRegex.IgnoreCase = bIgnoreCase
    objRegex.Global = bGlobal
    objRegex.Pattern = sPattern
    If objRegex.test(sText) Then
        Set Matches = objRegex.Execute(sText)
        If Matches.count <> 0 Then
            If bArray Then ' if we want to return array instead of MatchCollection
                ReDim aMatches(Matches.count - 1) As Variant
                For Each Match In Matches
                    aMatches(i) = Match.value
                    i = i + 1
                Next
                RegExp = aMatches
            Else
                Set RegExp = Matches
            End If
        End If
    End If
End Function

how to do a replace within a replace match using vba regex

I need to replace some characters, but only when they are within brackets. So assume following example
this is a string (with comment), this is another string without comment, and this is a string (with one comment, and another one)
I need to be able to split this sentence based on the comma value. Which would work out fine apart from the annoying fact the last comment also contains a comma so my split is a bit limited. The desired result would have to be as follows
this is a string (with comment),
this is another string without comment,
and this is a string (with one comment, and another one)
I'm using Access VBA, and my approach was to first isolate all the comments (content within brackets), replace the comma with a different character (say, the pipe symbol), and then use the split or replace options to split the whole sentence.
What I tried is shown below, but I can't get the regex match to behave the way I want. Any alternative, or insight on how best to tackle it?
Function commentFixer(s As String, t As String) As String
    't = token to be replaced, eg a comma
    Dim regEx As New RegExp
    Dim match As String
    regEx.Global = True
    p = "(\([^()]*\)*)"
    'match all commented substrings
    regEx.Pattern = p
    'below obviously doesn't work, as the match itself is not accepted as a character. Any way to deal with this ?
    match = "$1" 'How can I store this in a variable to perform a replacement on the result ?
    Dim r As String 'replacement value
    r = Replace(match, t, "|")
    commentFixer = regEx.Replace(s, r)
End Function
Sub TestMe()
    s = commentFixer("this is a string (with comment), this is another string without comment, and this is a string (with one comment, and another one)", ",")
    Debug.Print s
    'expected result : this is a string (with comment), this is another string without comment, and this is a string (with one comment| and another one)
End Sub
Here you go,
(.*?,)\s*(?![^()]*\))|(.+)$
Group indexes 1 and 2 contain the strings you want.
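Both routes can be sketched side by side. The code below is Python, used only to illustrate the regex logic; the callback form of re.sub is a Python feature, and in VBA you would more likely loop over the matches returned by Execute and rebuild the string yourself (that part is my assumption, not something shown in this thread).

```python
import re

s = ("this is a string (with comment), this is another string without comment, "
     "and this is a string (with one comment, and another one)")

# Approach from the question: rewrite commas only inside each bracketed comment.
protected = re.sub(r"\([^()]*\)", lambda m: m.group(0).replace(",", "|"), s)

# Approach from the answer: each match fills either group 1 or group 2.
pattern = re.compile(r"(.*?,)\s*(?![^()]*\))|(.+)$")
parts = [m.group(1) or m.group(2) for m in pattern.finditer(s)]

print(protected)
for part in parts:
    print(part)
```

The first result matches the expected output in TestMe; the second yields the three pieces listed in the question without any intermediate placeholder character.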

Regex cheating in csv-parsing delimited at comma, ignore in quotes

all
So, I'm trying to figure out how to make a simple regex code for Visual Basic.net, but am not getting anywhere.
I'm parsing csv files into a list of arrays, but the source csv's are anything but pristine. There are extra/rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.
I've written in a bunch of error-checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of csv files. Crash. To get that last 0.01% parsed properly, I've created an If statement that pulls out lines that have an odd number of quotes and removes ALL of them, which triggers a manual error-check. If there are zero quotes, the line is processed as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.
Could someone help me figure out a regex string that will ignore fields enclosed in quotes?
Here's the code I've been able to come up with so far.
Thank you in advance
Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
System.Text.Encoding.Default) 'system text decoding adds odd characters
filereader1.TextFieldType = FieldType.Delimited
'filereader1.Delimiters = New String() {","}
filereader1.SetDelimiters(",")
filereader1.HasFieldsEnclosedInQuotes = True
For Each c As Char In whole_string
If c = """" Then cnt = cnt + 1
Next
If cnt = 0 Then 'no quotes
split_string = Split(whole_string, ",") 'split by commas
ElseIf cnt Mod 2 = 0 Then 'even number of quotes
split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
ElseIf cnt <> 0 Then 'odd number of quotes
whole_string = whole_string.Replace("""", " ") 'delete all quotes
split_string = Split(whole_string, ",") 'split by commas
End If
In VB.NET, there are several ways to proceed.
Option 1
You can use this regex: ,(?![^",]*")
It matches commas that are not inside quotes: a comma , that is not followed (as asserted by the negative lookahead (?![^",]*") ) by characters that are neither a comma nor a quote then a quote.
In VB.NET, something like:
Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
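A quick way to see what this lookahead does and does not catch is to run it on a sample line. The sketch below uses Python's `re` for illustration; the sample line is the one used in the Option 2 code further down.

```python
import re

subject = 'LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,'

# A comma not followed by (non-quote, non-comma characters, then a quote).
outside_quotes = re.compile(r',(?![^",]*")')

replaced = outside_quotes.sub("|", subject)
print(replaced)  # LIST|410210|2-4,"PUMP, HYDRAULIC PISTON - MAIN"|1|||
```

The comma inside the quoted field is left alone, as intended. One caveat shows up in this output: the comma directly before the opening quote is also left alone, because it is immediately followed by a quote and the lookahead cannot tell an opening quote from a closing one. If quoted fields can follow a comma directly, Option 2 is the safer choice.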
Option 2
This uses this beautifully simple regex: "[^"]*"|(,)
This is a more general solution that is easy to tweak. For a full description, I recommend you have a look at the question about regex-matching or replacing a pattern except when... (see Reference below). It makes for a tidy solution that is easy to maintain if you find other cases to handle.
The left side of the alternation | matches complete "quotes". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This code should work:
Imports System
Imports System.Text.RegularExpressions
Imports System.Collections.Specialized
Module Module1
    Sub Main()
        Dim MyRegex As New Regex("""[^""]*""|(,)")
        Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
        Dim Replaced As String = MyRegex.Replace(Subject,
            Function(m As Match)
                If (m.Groups(1).Value = "") Then
                    ' Quoted field matched by the left side: keep it unchanged
                    Return m.Groups(0).Value
                Else
                    ' Comma outside quotes: swap in the split marker
                    Return "|"
                End If
            End Function)
        Console.WriteLine(Replaced)
        Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
        Console.ReadKey()
    End Sub
End Module
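The same match-or-capture idea can be sketched in Python (for illustration only): quoted fields are matched by the left side of the alternation and returned unchanged, while commas captured in group 1 are replaced by a marker and then used to split. The "|" marker is an assumption and must not occur in the data.

```python
import re

subject = 'LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,'

# Left side: a complete quoted field. Right side: a comma, captured in group 1.
quoted_or_comma = re.compile(r'"[^"]*"|(,)')


def mark_commas(match):
    # Group 1 is set only when the right-hand side (a bare comma) matched.
    return "|" if match.group(1) else match.group(0)


replaced = quoted_or_comma.sub(mark_commas, subject)
fields = replaced.split("|")

print(replaced)  # LIST|410210|2-4|"PUMP, HYDRAULIC PISTON - MAIN"|1|||
print(fields)
```

Unlike the lookahead in Option 1, this also splits at the comma directly before the quoted field, because the quoted field is consumed as a whole before any of its commas can be considered.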
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...