Check if line matches regex

I have a file that has been generated by a server - I have no control over how this file is generated or formatted. I need to check each line begins with a string of set length (in this case 21 numerical chars). If a line doesn't match that condition, I need to join it to the previous line and, after reading and correcting the whole file, save it. I am doing this for a lot of files in a directory.
So far I have:
Dim rgx As New Regex("^[0-9]{21}$")
Dim linesList As New List(Of String)(File.ReadAllLines(finfo.FullName))
If linesList(0).Contains("BlackBerry Messenger") Then
linesList.RemoveAt(0)
For i As Integer = 0 To linesList.Count
If Not rgx.IsMatch(i.ToString) Then
linesList.Concat(linesList(i-1))
End If
Next
End If
File.WriteAllLines(finfo.FullName, linesList.ToArray())
There's a for statement before and after that code block to loop over all files in the source directory, which works fine.
Hope this isn't too bad to read :/

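Your loop never actually joins anything: rgx.IsMatch(i.ToString) tests the loop index rather than the line, and Concat returns a new sequence instead of changing the list. The trailing $ in your pattern also restricts matches to lines that consist of exactly 21 digits and nothing else. Here's a different approach: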
' Needs Imports System.IO and Imports System.Text.RegularExpressions
' No trailing $: a line only has to begin with 21 digits
Dim rgx As New Regex("^[0-9]{21}")
Dim linesList As New List(Of String)(File.ReadAllLines(finfo.FullName))
' We will create a new list to store the new lines data
Dim newLinesList As New List(Of String)()
If linesList.Count > 0 AndAlso linesList(0).Contains("BlackBerry Messenger") Then
    ' Start at 1 to skip the header line
    Dim i As Integer = 1
    Dim newLine As String
    While i < linesList.Count
        newLine = linesList(i)
        i += 1
        ' Keep going until the "real" line is over
        While i < linesList.Count AndAlso Not rgx.IsMatch(linesList(i))
            newLine += linesList(i)
            i += 1
        End While
        newLinesList.Add(newLine)
    End While
    ' Write inside the If: otherwise a file without the header would be overwritten with an empty list
    File.WriteAllLines(finfo.FullName, newLinesList.ToArray())
End If
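If it helps to test the joining logic without touching any files, the same idea can be sketched as a small function. This is only a sketch: the names LineJoiner and JoinContinuationLines and the sample lines are made up for illustration, and it assumes the fragments are joined with no separator, as in the code above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LineJoiner
    ' A record line begins with 21 digits (no trailing $, so more text may follow).
    Private ReadOnly RecordStart As New Regex("^[0-9]{21}")

    ' Returns a new list in which every line that does not start a record
    ' has been appended to the record line before it.
    Public Function JoinContinuationLines(lines As IEnumerable(Of String)) As List(Of String)
        Dim result As New List(Of String)()
        For Each line As String In lines
            If result.Count = 0 OrElse RecordStart.IsMatch(line) Then
                result.Add(line)
            Else
                ' Assumption: fragments are joined with no separator.
                result(result.Count - 1) = result(result.Count - 1) & line
            End If
        Next
        Return result
    End Function

    Sub Main()
        ' Hypothetical sample data: two records, the first one broken across two lines.
        Dim sample As String() = {
            "123456789012345678901 first message",
            " continued here",
            "123456789012345678902 second message"
        }
        For Each line As String In JoinContinuationLines(sample)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

Inside the file loop you would pass it the lines that remain after dropping the header line and hand the result to File.WriteAllLines.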

Related

VB.Net Search for files matching REGEX

Hi, I have a really basic question whose answer completely escapes me. I want to search a given directory for files whose names match a REGEX. I've tried all kinds of variations, but nothing is working for me. My REGEX is "*_Ch[0-9]+.sgm" and it should work. My files are named like "Bld1_Ch1.sgm", with the chapter number incrementing.
The error I get is "System.IO.DirectoryNotFoundException: 'Could not find a part of the path 'C:\Test\06-GCS Bursting Script\TO 33D1-8-2-2-2 RAMTS FI\Bld1'.'"
Thank you for your patience and help.
Maxine
Private Sub btnImport_Click(sender As Object, e As EventArgs) Handles btnImport.Click
Dim searchDir As String = txtSGMFile.Text & "\" & txtUnique.Text
Dim searchFolder As String = "\" & txtUnique.Text
Dim searchPattern = "*_Ch[0-9]+.sgm"
Dim files = Directory.GetFiles(searchDir, searchPattern)
For Each file In files
MsgBox(file)
Next
End Sub
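I was able to get it working using this code! Directory.GetFiles only accepts the * and ? wildcards in its search pattern, not a regular expression, so I fetch all the .sgm files first and filter them with Regex afterwards. Thank you everyone for your help.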
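' Plain wildcard search first; the regex filtering happens in the loop
Dim files = Directory.GetFiles(path, "*.sgm")
' \d+ (rather than \d) also accepts Ch10, Ch11, and so on
Dim rx = New Regex(".*_Ch\d+\.sgm") ' or Dim rx = New Regex(".*_v[0-9]\.pdf")
For Each file In files
    If rx.IsMatch(file) Then
        ' do something with the file
        MsgBox(file)
    End If
Next file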
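For reference, here is a rough sketch of the same filtering as a reusable function. It assumes you only want to test the file name rather than the whole path, and it checks that the folder exists first, since Directory.GetFiles throws DirectoryNotFoundException when it does not. The names SgmFileFilter and FilterChapterFiles and the C:\Test folder are placeholders, not anything from the original code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module SgmFileFilter
    ' Anchored with ^ and $ so the whole file name must match, e.g. "Bld1_Ch12.sgm".
    Private ReadOnly ChapterFile As New Regex("^.+_Ch[0-9]+\.sgm$", RegexOptions.IgnoreCase)

    ' Keeps only the paths whose file name (not the directory part) matches.
    Public Function FilterChapterFiles(paths As IEnumerable(Of String)) As List(Of String)
        Dim kept As New List(Of String)()
        For Each p As String In paths
            If ChapterFile.IsMatch(Path.GetFileName(p)) Then
                kept.Add(p)
            End If
        Next
        Return kept
    End Function

    Sub Main()
        Dim folder As String = "C:\Test"   ' hypothetical folder
        If Directory.Exists(folder) Then
            For Each f As String In FilterChapterFiles(Directory.GetFiles(folder, "*.sgm"))
                Console.WriteLine(f)
            Next
        Else
            Console.WriteLine("Folder not found: " & folder)
        End If
    End Sub
End Module
```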

Changing a pipe delimited file to comma delimited in VB.net

So I have a set of pipe delimited inputs which are something like this:
"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682
| 82524 | 6846419 | 68247
and I am converting them to comma delimited using the code given below:
Dim line As String
Dim fields As String()
Using sw As New StreamWriter("c:\test\output.txt")
Using tfp As New FileIO.TextFieldParser("c:\test\test.txt")
tfp.TextFieldType = FileIO.FieldType.Delimited
tfp.Delimiters = New String() {"|"}
tfp.HasFieldsEnclosedInQuotes = True
While Not tfp.EndOfData
fields = tfp.ReadFields
line = String.Join(",", fields)
sw.WriteLine(line)
End While
End Using
End Using
So far so good. It only considers the delimiters that are present outside the quotes and changes them to the comma delimiter. But trouble starts when I have input with a stray quotation like below:
"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682
| "82524 | 6846419 | 68247
Here the code throws
MalformedLineException
which I realize is due to the stray quotation mark in my input. I am a total beginner with RegEx, so I have not been able to use it to fix this. If anyone has any idea, it would be much appreciated.
Here is the coded procedure described in the comments:
Read all the lines of the original input file,
fix the faulty lines (with Regex or anything else that fits),
use TextFieldParser to perform the parsing of the correct input
Join() the input parts created by TextFieldParser using , as separator
save the fixed, reconstructed input lines to the final output file
I'm using Wiktor Stribiżew's Regex pattern: it looks like it should work given the description of the problem.
Note:
Of course I don't know whether a specific Encoding should be used.
Here, the Encoding is the default UTF-8 no-BOM, in and out.
"FaultyInput.txt" is the corrupted source file.
"FixedInput.txt" is the file containing the input lines fixed (hopefully) by the Regex. You could also use a MemoryStream.
"FixedOutput.txt" is the final CSV file, containing comma separated fields and the correct values.
These files are all read/written in the executable startup path.
' Needs Imports System.IO, System.Linq and System.Text.RegularExpressions
Dim input As List(Of String) = File.ReadAllLines("FaultyInput.txt").ToList()
' Keep every balanced "..." field ($1) and drop any quote that is left over
For line As Integer = 0 To input.Count - 1
    input(line) = Regex.Replace(input(line), "(""\b.*?\b"")|""", "$1")
Next
File.WriteAllLines("FixedInput.txt", input)
Dim output As List(Of String) = New List(Of String)
' Parse the repaired file; pipes inside quoted fields are not treated as delimiters
Using tfp As New FileIO.TextFieldParser("FixedInput.txt")
    tfp.TextFieldType = FileIO.FieldType.Delimited
    tfp.Delimiters = New String() {"|"}
    tfp.HasFieldsEnclosedInQuotes = True
    While Not tfp.EndOfData
        Dim fields As String() = tfp.ReadFields
        output.Add(String.Join(",", fields))
    End While
End Using
File.WriteAllLines("FixedOutput.txt", output)
'Eventually...
'File.Delete("FixedInput.txt")
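To see what the pattern does on its own, here is a minimal sketch that applies it to two fragments of the sample input. The module and function names (StrayQuoteDemo, RemoveStrayQuotes) are invented for the example. It relies on the fact that in .NET a $1 reference to a group that did not take part in the match is replaced by an empty string.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StrayQuoteDemo
    ' Group 1 captures a balanced "..." field; the second alternative is a lone quote.
    Private ReadOnly QuotePattern As New Regex("(""\b.*?\b"")|""")

    ' Balanced fields are put back via $1; a lone quote is replaced by nothing.
    Public Function RemoveStrayQuotes(line As String) As String
        Return QuotePattern.Replace(line, "$1")
    End Function

    Sub Main()
        ' Stray opening quote: prints | 82524 | 6846419 | 68247
        Console.WriteLine(RemoveStrayQuotes("| ""82524 | 6846419 | 68247"))
        ' Balanced field is left alone: prints "46284729|46246" | 24682
        Console.WriteLine(RemoveStrayQuotes("""46284729|46246"" | 24682"))
    End Sub
End Module
```

One limitation worth knowing: if a stray quote is followed later on the same line by a properly quoted field, the lazy match can run from the stray quote to the closing quote of that field, so such lines would still need checking by hand.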
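Sub ReadMalformedCSV()
    ' Alternative: skip the parser and pull every run of digits out of each line.
    ' Note: this also splits a quoted field such as "46284729|46246" into two values.
    Dim s$
    Dim pattern$ = "(?x)" + vbCrLf +
        "\b #word boundary" + vbCrLf +
        "(?'num'\d+) #one or more digits" + vbCrLf +
        "\b #word boundary"
    '// Use "ReadLines" as it will lazily read one line at a time
    For Each line In File.ReadLines("c:\test\output.txt")
        ' Cast(Of Match) keeps this working on .NET Framework, where MatchCollection is not generic
        s = String.Join(",", Regex.Matches(line, pattern).Cast(Of Match)().
            Select(Function(e) e.Groups("num").Value))
        Console.WriteLine(s)
    Next
End Sub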

vb.net regex parse paragraph from report

I am given a report in plain text from which a coworker typically has to manually remove various headers. I know the top line and bottom line of the header: they do not differ throughout the document, but the lines of text between them do.
Formatting looks like this:
BEGIN REPORT FOR CLIENT XXYYZZ
RANDOM BODY TEXT
RANDOM BODY TEXT
RANDOM BODY TEXT
RANDOM BODY TEXT
RANDOM BODY TEXT
FINAL REPORT
I am attempting to use regular expressions to highlight this text within a rich text box. If I use the below code I can highlight every occurrence of the top line without issue:
Dim mystring As String = "(BEGIN)(.+?)(XXYYZZ)"
Dim regHeader As New Regex(mystring)
Dim regMatch As Match = regHeader.Match(rtbMain.Text)
While regMatch.Success
rtbMain.Select(regMatch.Index, regMatch.Length)
rtbMain.SelectionColor = Color.Blue
regMatch = regMatch.NextMatch()
End While
However, once I change the code to find the entire paragraph, it no longer highlights anything. Below is what I expected to work, but it does not seem to like it for whatever reason:
Dim mystring As String = "(BEGIN REPORT FOR CLIENT XXYYZZ)(.+?)(FINAL REPORT)"
Dim regHeader As New Regex(mystring)
Dim regMatch As Match = regHeader.Match(rtbMain.Text)
While regMatch.Success
rtbMain.Select(regMatch.Index, regMatch.Length)
rtbMain.SelectionColor = Color.Blue
regMatch = regMatch.NextMatch()
End While
Any help would be greatly appreciated.
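What you need is singleline mode. By default . matches every character except a line feed, so the pattern can never get from the first line of the header to the last; singleline mode lets . match newlines as well.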
Try this:
Dim mystring As String = "(BEGIN REPORT FOR CLIENT XXYYZZ)(.+?)(FINAL REPORT)"
Dim regHeader As New Regex(mystring, RegexOptions.Singleline)
Dim regMatch As Match = regHeader.Match(rtbMain.Text)
While regMatch.Success
rtbMain.Select(regMatch.Index, regMatch.Length)
rtbMain.SelectionColor = Color.Blue
regMatch = regMatch.NextMatch()
End While
Notice RegexOptions.Singleline.
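A quick way to convince yourself of the difference is to count the matches with and without the option. The sketch below uses an invented two-header sample text and made-up names (SinglelineDemo, CountHeaders); only the pattern and the RegexOptions.Singleline flag come from the code above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineDemo
    Private Const HeaderPattern As String = "(BEGIN REPORT FOR CLIENT XXYYZZ)(.+?)(FINAL REPORT)"

    ' Counts header blocks; with singleLine = False the dot stops at line feeds.
    Public Function CountHeaders(text As String, singleLine As Boolean) As Integer
        Dim options As RegexOptions = If(singleLine, RegexOptions.Singleline, RegexOptions.None)
        Return Regex.Matches(text, HeaderPattern, options).Count
    End Function

    Sub Main()
        ' Hypothetical report containing two header blocks.
        Dim lines As String() = {
            "BEGIN REPORT FOR CLIENT XXYYZZ",
            "RANDOM BODY TEXT",
            "FINAL REPORT",
            "actual report content",
            "BEGIN REPORT FOR CLIENT XXYYZZ",
            "RANDOM BODY TEXT",
            "FINAL REPORT"
        }
        Dim report As String = String.Join(Environment.NewLine, lines)
        Console.WriteLine(CountHeaders(report, False))   ' prints 0
        Console.WriteLine(CountHeaders(report, True))    ' prints 2
    End Sub
End Module
```

Because .+? is lazy, each match stops at the first FINAL REPORT it reaches, so the two headers are found separately rather than as one long match.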

How to capture value between two strings in VB.NET

I'm trying to capture a value between two strings using VB.NET.
Each line of the file I'm reading can contain many different parameters, in any order, and I'd like to store the values of these parameters in their own variables. Two sample lines would be:
identifier="121" messagecount="112358" timestamp="11:31:41.622" column="5" row="98" colour="ORANGE" value="Hello"
or it could be:
identifier="1121" messagecount="1123488" timestamp="19:14:41.568" valid="true" state="running"
Also, this may not be the sole text in the string; there may be other values before, after, and in between the parameters I would like to capture.
So essentially I'd need to store everything between 'identifier="' and its closing '"' in an identifier variable, and so on. As the order of these parameters within each line can change, I can't simply put the first value in the same variable each time; I have to refer to them specifically by name (identifier, messagecount, etc.).
Can anyone help? Thanks. I guess it would be done with a regular expression, but I'm not too hot on those. I'd prefer to have a separate expression for each parameter in its own statement, rather than everything in one.
Here is a sample of how you can go about that. It converts one line into a dictionary.
This will capture any string consisting of a-z-characters (case-insensitive) as the attribute name, and then catch any character other than " in the value string. (If " can occur in the string as "" you need to add some treatment for that.)
Imports System.Text.RegularExpressions
[...]
Dim s As String =
"identifier=""121"" messagecount=""112358"" " &
"timestamp=""11:31:41.622"" column=""5"" row=""98"" " &
"colour=""ORANGE"" value=""Hello"""
Dim d As New Dictionary(Of String, String)
Dim rx As New Regex("([a-z]+)=""(.*?)""", RegexOptions.IgnoreCase)
Dim rxM As MatchCollection = rx.Matches(s)
For Each M As Match In rxM
d.Add(M.Groups(1).Value, M.Groups(2).Value)
Next
' Dictionary is ready
' test output
For Each k As String In d.Keys
MsgBox(String.Format("{0} => {1}", k, d(k)))
Next
You just need to split the data into manageable clumps, and then go through it. Something like this to start you off.
Private Sub ProcessMyData(LineOfData As String)
    ' NOTE! This assumes neither your 'names' nor your values have spaces in!
    Dim vElements = LineOfData.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
    For Each vElement In vElements
        ' Split on the first "=" only, and skip stray text that is not a name=value pair
        Dim vPair = vElement.Split({"="c}, 2)
        If vPair.Length < 2 Then Continue For
        ' Strip the surrounding double quotes (character code 34)
        Dim vResult = vPair(1).Trim(Convert.ToChar(34))
        Select Case vPair(0).ToLower
            Case "identifier"
                MyIDVariable = CInt(vResult)
            Case "colour"
                MyColourVariable = vResult
                ' etc., etc.
        End Select
    Next
End Sub
You can define the variables you want locally in the sub [function], and then return a list/dictionary/custom class of the things you're interested in.
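As a rough illustration of the "return a dictionary" idea, the sketch below parses one line into a case-insensitive dictionary and then looks parameters up by name. The names AttributeParser and ParseLine are hypothetical, and it assumes a value never contains a double quote. Unlike the space-splitting version, it tolerates spaces inside values and unrelated text around the parameters.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module AttributeParser
    ' name="value" pairs; the value may contain spaces but not a double quote.
    Private ReadOnly PairPattern As New Regex("(\w+)=""([^""]*)""")

    ' Returns the parameters of one line, keyed by name (case-insensitive).
    Public Function ParseLine(lineOfData As String) As Dictionary(Of String, String)
        Dim result As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        For Each m As Match In PairPattern.Matches(lineOfData)
            ' Indexer assignment: a repeated name overwrites instead of throwing.
            result(m.Groups(1).Value) = m.Groups(2).Value
        Next
        Return result
    End Function

    Sub Main()
        Dim values = ParseLine("noise identifier=""121"" colour=""ORANGE"" value=""Hello world"" more noise")
        Dim identifier As String = Nothing
        If values.TryGetValue("identifier", identifier) Then
            Console.WriteLine("identifier = " & identifier)
        End If
        ' "state" is absent from this line, so TryGetValue returns False.
        Dim state As String = Nothing
        Console.WriteLine(values.TryGetValue("state", state))
    End Sub
End Module
```

TryGetValue is used for the lookups because, as the question says, not every parameter appears on every line.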

.txt filename iteration vb.net

I've got a little problem with regards to iterating the filename of the txt files. I've got a filename format that goes like this: <date>-<year>_filename-<number>.txt. The problem is that when <number> reaches 9, the filename stops iterating.
The filenames go like this:
31-2014_filename-1
31-2014_filename-2
31-2014_filename-3
31-2014_filename-4
31-2014_filename-5
31-2014_filename-6
31-2014_filename-7
31-2014_filename-8
31-2014_filename-9
31-2014_filename-10
The function only detects up to 9. Anything beyond that number is ignored.
Below is the code
Dim lastreport As Integer = 1
Public Sub GetLastNo(ByVal filePath As String)
Dim lastFile As String = 1
Dim files() As String = Directory.GetFiles(filePath, "*.txt")
For Each File As String In files
File = Path.GetFileNameWithoutExtension(File)
Dim numbers As MatchCollection = Regex.Matches(File, "(?<num>[\d]+)")
For Each number In numbers
number = CInt(number.ToString())
If number > 0 And number < 1000 And number > lastFile Then
lastFile = number
End If
lastreport = number
Next
Next
End Sub
Here it is:
(?<num>\d+(?=$))
This makes sure that the digits are immediately followed by $ (the end of the string), so only the last set of digits in the name is captured.
It would really help to see some real filenames, including some that fail to match (your description is not completely unambiguous: for example what is <date> if it does not include the year?).
But assuming files like:
30May-2014_Stuff-1.txt
30May-2014_Stuff-3.txt
30May-2014_Stuff-5.txt
30May-2014_Stuff-7.txt
30May-2014_Stuff-9.txt
30May-2014_Stuff-11.txt
then using the .NET regex engine (from PowerShell (PSH) here as quicker to test with):
(?<num>\d+)$
should match the final digits ($ matches the end of the string) of the filename without extension (BaseName in PSH):
dir | foreach { if ($_.BaseName -match '(?<num>\d+)$') { $matches['num'] } }
gives:
1
11
3
5
7
9
So all filenames are matched, and the final number of their basenames is matched by group "num" of the regex.
I think there is something else going on in your approach: I would suggest changing to only get a single match per filename (and use Regex.Match rather than Matches to be consistent).
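A sketch of that suggestion in VB.NET follows. It is not a drop-in replacement for the original Sub: it is written as a function that takes the file names and returns the highest trailing number, and the module name ReportNumbering is made up. One likely reason the original stops at 9 (an assumption, since it depends on the order in which the files are returned) is that lastreport is overwritten with whatever number was seen last, and in a name-sorted listing "filename-9" comes after "filename-10". Taking the maximum avoids depending on the order at all.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module ReportNumbering
    ' One match per name: the digits that end it, e.g. "10" in "31-2014_filename-10".
    Private ReadOnly TrailingNumber As New Regex("(?<num>\d+)$")

    ' Returns the highest trailing number among the file names, or 0 if none has one.
    Public Function GetLastNo(fileNames As IEnumerable(Of String)) As Integer
        Dim highest As Integer = 0
        For Each name As String In fileNames
            Dim m As Match = TrailingNumber.Match(Path.GetFileNameWithoutExtension(name))
            If m.Success Then
                Dim number As Integer
                If Integer.TryParse(m.Groups("num").Value, number) Then
                    highest = Math.Max(highest, number)
                End If
            End If
        Next
        Return highest
    End Function

    Sub Main()
        ' Names listed in the order a name-sorted directory listing would give.
        Dim names As String() = {
            "31-2014_filename-1.txt",
            "31-2014_filename-10.txt",
            "31-2014_filename-2.txt",
            "31-2014_filename-9.txt"
        }
        Console.WriteLine(GetLastNo(names))   ' prints 10
    End Sub
End Module
```

With real files you would call it as GetLastNo(Directory.GetFiles(filePath, "*.txt")).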