Regex cheating in csv-parsing delimited at comma, ignore in quotes - regex

So, I'm trying to figure out how to make a simple regex code for Visual Basic.net, but am not getting anywhere.
I'm parsing csv files into a list of array, but the source csv's are anything but pristine. There are extra/rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.
I've written in a bunch of error-checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of csv files. Crash. To get that last 0.01% parsed properly, I've created an If statement that pulls out lines that have an odd number of quotes and removes ALL of them, which triggers a manual error-check. If there are zero quotes, the line is processed as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.
Could someone help me figure out a regex string that will ignore fields enclosed in quotes?
Here's the code I've been able to come up with so far.
Thank you in advance
Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
System.Text.Encoding.Default) 'system text decoding adds odd characters
filereader1.TextFieldType = FieldType.Delimited
'filereader1.Delimiters = New String() {","}
filereader1.SetDelimiters(",")
filereader1.HasFieldsEnclosedInQuotes = True
For Each c As Char In whole_string
If c = """" Then cnt = cnt + 1
Next
If cnt = 0 Then 'no quotes
split_string = Split(whole_string, ",") 'split by commas
ElseIf cnt Mod 2 = 0 Then 'even number of quotes
split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
ElseIf cnt <> 0 Then 'odd number of quotes
whole_string = whole_string.Replace("""", " ") 'delete all quotes
split_string = Split(whole_string, ",") 'split by commas
End If

In VB.NET, there are several ways to proceed.
Option 1
You can use this regex: ,(?![^",]*")
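It is meant to match commas that are not inside quotes: a comma , that is not followed (as asserted by the negative lookahead (?![^",]*") ) by characters that are neither a comma nor a quote and then a quote. Be aware that this is only a heuristic: it also skips a comma that sits immediately before an opening quote, and it still matches the earlier commas of a quoted field that contains more than one comma. Option 2 does not have these problems.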
In VB.NET, something like:
Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
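To see what Option 1 does on a few inputs, here is a minimal sketch. It is written in Python rather than VB.NET purely so it can be run quickly; the pattern is the one from Option 1, while the function name and sample strings are made up for illustration.

```python
import re

# Option 1 pattern: a comma NOT followed by
# "characters that are neither comma nor quote, then a quote".
OPTION1 = re.compile(r',(?![^",]*")')

def replace_unquoted_commas(line, replacement="|"):
    """Replace every comma matched by the Option 1 pattern."""
    return OPTION1.sub(replacement, line)

# No quotes at all: every comma is replaced.
print(replace_unquoted_commas('a,b,c'))
# One comma inside quotes: it is protected, but so is the comma
# directly in front of the opening quote.
print(replace_unquoted_commas('a,"b,c",d'))
# Two commas inside quotes: the first one is replaced by mistake.
print(replace_unquoted_commas('a,"b,c,d",e'))
```

The last two calls show why this pattern is best treated as a quick heuristic rather than a CSV splitter.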
Option 2
This uses this beautifully simple regex: "[^"]*"|(,)
This is a more general solution that is easy to tweak. For a full description, I recommend you have a look at the question about regex-matching or replacing a pattern except in certain situations (see Reference below). It can make a very tidy solution that is easy to maintain if you find other cases to tweak.
The left side of the alternation | matches complete "quotes". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This code should work:
Imports System
Imports System.Text.RegularExpressions
Imports System.Collections.Specialized
Module Module1
Sub Main()
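Dim MyRegex As New Regex("""[^""]*""|(,)")
Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
' Replace only the commas captured in Group 1; quoted strings are put back as they were
Dim Replaced As String = MyRegex.Replace(Subject,
Function(m As Match)
If m.Groups(1).Success Then
Return "|" ' a comma outside quotes: swap in the new delimiter
Else
Return m.Value ' a complete quoted string: leave it unchanged
End If
End Function)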
Console.WriteLine(Replaced)
Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
Console.ReadKey()
End Sub
End Module
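The same "match what you want to skip, capture what you want to keep" idea can also be used to split directly instead of replacing. The following is a rough sketch in Python (not VB.NET), assuming quoted fields never contain an embedded quote; the function name is hypothetical.

```python
import re

# Left side: a complete quoted string (matched, then ignored).
# Right side: a comma outside quotes, captured in group 1.
QUOTED_OR_COMMA = re.compile(r'"[^"]*"|(,)')

def split_outside_quotes(line):
    """Split on commas that were captured by group 1 only."""
    fields = []
    start = 0
    for m in QUOTED_OR_COMMA.finditer(line):
        if m.group(1) is not None:  # a real delimiter
            fields.append(line[start:m.start()])
            start = m.end()
    fields.append(line[start:])  # text after the last delimiter
    return fields

subject = 'LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,'
print(split_outside_quotes(subject))
```

The quotes stay attached to the quoted field; strip them afterwards if they are not wanted.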
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original pattern \^S*\$ had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The pattern I used, \S+, matches every run of one or more (+) characters that are NOT white space, so each match is one word.
I also used the .Global so you will continue matching after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
Miscellaneous Notes
I would have advised just to use Split(), but you stated that there were cases where more than one consecutive space may have been an issue. If this wasn't the case, you wouldn't need regex at all, something like:
arrStr = Split(line)
would have split on every occurrence of a space.
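For comparison, the difference between matching runs of non-whitespace and splitting on every single space can be sketched like this. The example is in Python rather than VBA, and the helper name and sample line are invented for illustration.

```python
import re

def split_on_whitespace(line):
    """Collect every run of non-whitespace characters (the \\S+ approach)."""
    return re.findall(r"\S+", line)

line = "This   is a\ttest"

# \S+ skips over any amount of whitespace between words.
print(split_on_whitespace(line))

# Splitting on a single space keeps empty strings for consecutive spaces,
# which is the problem a plain Split(line) runs into.
print(line.split(" "))
```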

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to regex each paragraph out place that in a string then run other regex to get the information that I need. Some paragraphs wont contain the data that I need so the idea was to loop through each individual paragraph and then error handle better if the data I need wasn't in that paragraph (ie get what I can and drop the rest with an error message)
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
Dim s As String
Dim ary
With Application.WorksheetFunction
s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
End With
ary = Split(s, "-")
i = 1
For Each a In ary
Cells(i, 2) = a
i = i + 1
Next a
End Sub
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need dashes -, then just include them into capture group (the outer parenthesis).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
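A small sketch of that pattern in action, written in Python instead of VBA. Like the VBScript engine, Python's . does not match a newline by default, which is why the (?:.|\n) alternative is needed. The second pattern is an assumption on my part: it handles the case where neighbouring paragraphs share a single '-' line by checking for the closing dash with a lookahead instead of consuming it.

```python
import re

# Pattern from the answer above: the closing "-" is consumed by the match.
PARAGRAPH = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*\n-")

# Hypothetical variant: the closing "-" is only looked at, not consumed,
# so it can also serve as the opening "-" of the next paragraph.
PARAGRAPH_SHARED = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*(?=\n-)")

def extract_paragraphs(document, pattern=PARAGRAPH):
    """Return the text of each numbered paragraph without number or dashes."""
    return pattern.findall(document)

single = "-\n1. This is a paragraph\nIt may go over multiple lines\n-\n"
shared = "-\n1. First paragraph\nsecond line\n-\n2. Second paragraph\n-\n"

print(extract_paragraphs(single))
print(extract_paragraphs(shared, PARAGRAPH_SHARED))
```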

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern to this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases than have slightly different patterns, so parameterising the regex for that would be better design.
TIA
I believe this does what you want:
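^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning followed by zero or more groups of four letters or numbers, each ending with a comma, followed by one group of four letters or numbers, followed by an end-of-line. (Inside a character class, | is a literal pipe rather than "or", so it is not used as a separator there.)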
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can try this to see whether it works for all your cases using the following function. Assuming your test strings are in cells A1 thru A5 on the spreadsheet:
Sub findPattern()
Dim regEx As New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
Dim i As Integer
Dim val As String
Dim mat As Object ' the match collection returned by Execute
For i = 1 To 5
val = Trim(Cells(i, 1).Value)
Set mat = regEx.Execute(val)
If mat.Count = 0 Then
MsgBox ("No match found for " & val)
Else
MsgBox ("Match found for " & val)
End If
Next
End Sub
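To check the validation rule against the sample inputs from the question, a quick sketch along these lines may help. It is in Python rather than VBA, uses letters and digits only (the stated preference), and the function name is hypothetical.

```python
import re

# One group of exactly four letters/digits, then zero or more
# ",XXXX" groups. fullmatch() anchors at both ends, like ^...$.
VALID_INPUT = re.compile(r"[A-Za-z0-9]{4}(?:,[A-Za-z0-9]{4})*")

def is_valid(text):
    """True if text is one or more 4-character sets separated by commas."""
    return VALID_INPUT.fullmatch(text.strip()) is not None

for sample in ["COM1", "COM1,COM2,1234", "COM", "COM1,123", "COM1.1234,abcd"]:
    print(sample, is_valid(sample))
```

The set length could be passed in as a parameter and formatted into the pattern if other cases need a different size.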

Regular Expression to split by comma + ignores comma within double quotes. VB.NET

I'm trying to parse csv file with VB.NET.
The csv file contains values like 0,"1,2,3",4, which gets split into 5 fields instead of 3. There are many examples for other languages on Stack Overflow, but I can't implement it in VB.NET.
Here is my code so far but it doesn't work...
Dim t As String() = Regex.Split(str(i), ",(?=([^\""]*\""[^\""]*\"")*[^\""]*$)")
Assuming your csv is well-formed (ie no " besides those used to delimit string fields, or besides ones escaped like \"), you can split on a comma that's followed by an even number of non-escaped "-marks. (If you're inside a set of "" there's only an odd number left in the line).
Your regex you've tried looks like you're almost there.
The following looks for a comma followed by an even number of any sort of quote marks:
,(?=([^"]*"[^"]*")*[^"]*$)
To modify it to look for an even number of non-escaped quote marks (assuming quote marks are escaped with backslash like \"), I replace each [^"] with ([^"\\]|\\.). This means "match a character that isn't a " and isn't a backslash, OR match a backslash and the character immediately following it".
,(?=(([^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)
See it in action here.
(The reason the backslash is doubled is I want to match a literal backslash).
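Now to get it into VB.NET you just need to double all your quote marks. If you pass the pattern to Regex.Split, also make the groups non-capturing with (?:...), because Regex.Split adds the text captured by any capturing group to the returned array:
splitRegex = ",(?=(?:(?:[^""\\]|\\.)*""(?:[^""\\]|\\.)*"")*(?:[^""\\]|\\.)*$)"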
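As a rough check of the escaped-quote version, here is a sketch in Python rather than VB.NET. Python's re.split, like .NET's Regex.Split, includes the text of capturing groups in its result, so the groups are written as non-capturing. The function name and test lines are assumptions for illustration.

```python
import re

# A comma followed by an even number of non-escaped quotes up to end of line.
# (?:[^"\\]|\\.) = one character that is neither quote nor backslash,
#                  OR a backslash plus the character it escapes.
SPLIT_RE = re.compile(
    r',(?=(?:(?:[^"\\]|\\.)*"(?:[^"\\]|\\.)*")*(?:[^"\\]|\\.)*$)'
)

def split_csv_line(line):
    """Split on commas outside quotes; backslash-escaped quotes do not count."""
    return SPLIT_RE.split(line)

print(split_csv_line('0,"1,2,3",4'))
# The \" inside the quoted field is skipped when counting quotes.
print(split_csv_line(r'a,"b \" c,d",e'))
```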
Instead of a regular expression, try using the TextFieldParser class for reading .csv files. It handles your situation exactly.
TextFieldParserClass
Especially look at the HasFieldsEnclosedInQuotes property.
Example:
Note: I used a string instead of a file, but the result would be the same. The snippet needs Imports System.IO and Imports Microsoft.VisualBasic.FileIO.
Dim theString As String = "1,""2,3,4"",5"
Using rdr As New StringReader(theString)
Using parser As New TextFieldParser(rdr)
parser.TextFieldType = FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
Dim fields() As String = parser.ReadFields()
For i As Integer = 0 To fields.Length - 1
Console.WriteLine("Field {0}: {1}", i, fields(i))
Next
End Using
End Using
Output:
Field 0: 1
Field 1: 2,3,4
Field 2: 5
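The same "use a real CSV parser instead of a regex" approach exists in most languages. As an illustration only, this is how it might look with Python's standard csv module, which is a different library from TextFieldParser but plays the same role; the helper name is made up.

```python
import csv
import io

def parse_line(line):
    """Parse one CSV line, honouring fields enclosed in double quotes."""
    reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"')
    return next(reader)

fields = parse_line('1,"2,3,4",5')
for i, field in enumerate(fields):
    print("Field {0}: {1}".format(i, field))
```

As with TextFieldParser, the enclosing quotes are removed from the parsed field.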
This worked great for parsing a Shipping Notice .csv file we receive. Thanks for keeping this solution here.
This is my version of the code:
Try
Using rdr As New IO.StringReader(Row.FlatFile)
Using parser As New FileIO.TextFieldParser(rdr)
parser.TextFieldType = FileIO.FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
Dim fields() As String = parser.ReadFields()
Row.Account = fields(0).ToString().Trim()
Row.AccountName = fields.GetValue(1).ToString().Trim()
Row.Status = fields.GetValue(2).ToString().Trim()
Row.PONumber = fields.GetValue(3).ToString().Trim()
Row.ErrorMessage = ""
End Using
End Using
Catch ex As Exception
Row.ErrorMessage = ex.Message
End Try
It's also possible to do it in VB.NET with a regex in the following way:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
The positive lookahead ((?= ... )) ensures that there is an even number of quotes ahead of the comma to split on (i.e. either they occur in pairs, or there are none).
[^"]* matches non-quote characters.
Given below is a VB.NET example to apply the regex.
Imports System
Imports System.Text.RegularExpressions
Public Class Test
Public Shared Sub Main()
Dim theString As String = "1,""2,3,4"",5"
Dim theStringArray As String() = Regex.Split(theString, ",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
For i As Integer = 0 To theStringArray.Length - 1
Console.WriteLine("theStringArray {0}: {1}", i, theStringArray(i))
Next
End Sub
End Class
'Output:
'theStringArray 0: 1
'theStringArray 1: "2,3,4"
'theStringArray 2: 5

Quick Regex Matches Question

(Yes, I am using regex to parse HTML; it's the only solution I know.)
I'm having trouble creating the regex for the below piece of code; there are about 10 matches per page.
Inner Text
This is the regex I've been trying.
Below is the code I usually use to get a match collection:
Private Function Extract(ByVal source As String) As String()
Dim mc As MatchCollection
Dim i As Integer
mc = Regex.Matches(source, _
"<A href=" & Chr(34) & "viewmessage.aspx?message_id *.</A>")
Dim results(mc.Count - 1) As String
For i = 0 To results.Length - 1
results(i) = mc(i).Value
Next
Return results
End Function
Dim str1 As String()
Dim str2 As String
Dim results As New StringBuilder
str1 = Extract(result)
For Each str2 In str1
results.Append(str2 & vbNewLine)
Next
RTBlinks.Text = results.ToString
Could anyone point out what I'm doing wrong? I have spent a few hours trying different things.
I program mainly as a hobby, so apologies if I've made any glaring errors.
You've got *. where you'd need .*. Right now, the quantifier * is applied to the space before it, and the dot matches exactly one character. Switch the two, remove the space (it matters, and there is no space in your test string at this point) and try again.
Be aware that .* matches greedily, i. e. as many characters as possible (except newlines). So if you have no more than one <A> tag per line, it should still work. A bit safer would be .*? instead, making the dot match as few characters as possible; even safer [^<]* which would match anything except opening angle brackets, making sure we don't cross tag boundaries.
However, all of those measures fail in certain, not uncommon situations (think comments, attribute strings, nested tags, invalid markup) which is why you should let regexes loose on markup languages only if you can exactly control your inputs and know your limitations.
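To make the difference concrete, here is a small sketch in Python (not VB.NET) using the [^<]* variant. The HTML snippet is invented, since the original markup is not shown, and escaping the . and ? in the URL is my own addition: unescaped, the ? would make the preceding x optional instead of matching a literal question mark.

```python
import re

# [^<]* cannot cross into the next tag; \. and \? match the literal
# characters in "viewmessage.aspx?message_id=".
LINK = re.compile(r'<A href="viewmessage\.aspx\?message_id=[^<]*</A>')

def extract_links(source):
    """Return every matching <A ...>...</A> element as a string."""
    return LINK.findall(source)

# Hypothetical page fragment with two links on one line.
html = ('<td><A href="viewmessage.aspx?message_id=123">Inner Text</A></td>'
        '<td><A href="viewmessage.aspx?message_id=456">Other</A></td>')

for link in extract_links(html):
    print(link)

# Without the escapes the literal "?" in the URL is never matched.
print(re.search(r'viewmessage.aspx?message_id', html))
```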
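Also, in VB.NET you can escape quotes inside a string by doubling them. Note that . and ? are regex metacharacters, so they need a backslash to match the literal characters in the URL. You can simply write
"<A href=""viewmessage\.aspx\?message_id=.*?</A>"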