So I have a set of pipe delimited inputs which are something like this:
"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682
| 82524 | 6846419 | 68247
and I am converting them to comma delimited using the code given below:
Dim line As String
Dim fields As String()
Using sw As New StreamWriter("c:\test\output.txt")
    Using tfp As New FileIO.TextFieldParser("c:\test\test.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {"|"}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            fields = tfp.ReadFields
            line = String.Join(",", fields)
            sw.WriteLine(line)
        End While
    End Using
End Using
So far so good. It only considers the delimiters that are present outside the quotes and changes them to the comma delimiter. But trouble starts when I have input with a stray quotation like below:
"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682
| "82524 | 6846419 | 68247
Here the code throws
MalformedLineException
which I realize is due to the stray quotation mark in my input. Since I am a total beginner with RegEx, I have not been able to use it to fix this. If anyone has any idea, it would be much appreciated.
Here is the coded procedure described in the comments:
Read all the lines of the original input file,
fix the faulty lines (with Regex or anything else that fits),
use TextFieldParser to perform the parsing of the correct input
Join() the input parts created by TextFieldParser using , as separator
save the fixed, reconstructed input lines to the final output file
I'm using Wiktor Stribiżew's Regex pattern: it looks like it should work given the description of the problem.
Note:
Of course I don't know whether a specific Encoding should be used.
Here, the Encoding is the default UTF-8 no-BOM, in and out.
"FaultyInput.txt" is the corrupted source file.
"FixedInput.txt" is the file containing the input lines fixed (hopefully) by the Regex. You could also use a MemoryStream.
"FixedOutput.txt" is the final CSV file, containing comma separated fields and the correct values.
These files are all read/written in the executable startup path.
' Requires: Imports System.IO and Imports System.Text.RegularExpressions
Dim input As List(Of String) = File.ReadAllLines("FaultyInput.txt").ToList()
' Keep well-formed "..." pairs (group 1) and drop every other double quote
For line As Integer = 0 To input.Count - 1
    input(line) = Regex.Replace(input(line), "(""\b.*?\b"")|""", "$1")
Next
File.WriteAllLines("FixedInput.txt", input)
Dim output As List(Of String) = New List(Of String)
Using tfp As New FileIO.TextFieldParser("FixedInput.txt")
    tfp.TextFieldType = FileIO.FieldType.Delimited
    tfp.Delimiters = New String() {"|"}
    tfp.HasFieldsEnclosedInQuotes = True
    While Not tfp.EndOfData
        Dim fields As String() = tfp.ReadFields
        output.Add(String.Join(",", fields))
    End While
End Using
File.WriteAllLines("FixedOutput.txt", output)
'Eventually...
'File.Delete("FixedInput.txt")
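To see what the Regex.Replace step does to a line before TextFieldParser reads it, here is a minimal sketch of the same pattern in Python (used only because it is quick to run; fix_line is a hypothetical helper name, and the assumption is that Python's re module handles this particular pattern the same way .NET does). The last call shows a limitation worth knowing about: because of the \b anchors, a quoted field whose content starts or ends with a space loses both of its quotes.

```python
import re

# Same pattern as in the VB.NET code: keep "..." pairs whose content starts
# and ends on a word character (group 1), drop any other double quote.
STRAY_QUOTE = re.compile(r'("\b.*?\b")|"')

def fix_line(line: str) -> str:
    # Group 1 is None when only the lone-quote alternative matched.
    return STRAY_QUOTE.sub(lambda m: m.group(1) or "", line)

good = '"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682'
bad = '| "82524 | 6846419 | 68247'

print(fix_line(good))           # well-formed quoted fields are left alone
print(fix_line(bad))            # the stray quote is removed
print(fix_line('" a|b " | 1'))  # quotes next to spaces are stripped as well
```

If quoted fields with leading or trailing spaces can occur in the real input, the pattern would need adjusting before relying on it.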
Sub ReadMalformedCSV()
    Dim s$
    Dim pattern$ = "(?x)" + vbCrLf +
                   "\b #word boundary" + vbCrLf +
                   "(?'num'\d+) #any number of digits" + vbCrLf +
                   "\b #word boundary"
    '// Use "ReadLines" as it will lazily read one line at time
    For Each line In File.ReadLines("c:\test\output.txt")
        s = String.Join(",", Regex.Matches(line, pattern).
                             Select(Function(e) e.Groups("num").Value))
        WriteLine(s)
    Next
End Sub
I have an app which returns data in the form of a table copied to the clipboard.
The table takes the form of:
table name
other info
-------------------------------
|heading 1|heading 2|heading 3|
-------------------------------
|data|date|other Data|
|data|date|other Data|
-------------------------------
time stamp
etc
I'm looking to pull back only the heading and data rows, minus the horizontal rows which are represented by dashes (---) in my data.
I need the pipes (|) as they are used to split the rows for passing back to excel.
I've used the following regex attempts
strPattern = "(?<=\|)[^|]++(?=\|)"
strPattern = "(\|[^|]++(\|)"
strPattern = "(^\s\|[\d\D]+?\|\s$)"
strPattern = "(^\s\|[\d\D]*\|\s$)"
strReplace = "$1"
My thinking was that the patterns above use the pipes as bookends and return any digit or non-digit character between them. None of these work; at best they return the entire string (I know I don't have anything removing the dashes yet).
I'm looking for:
|heading 1|heading 2|heading 3|
|data|date|other Data|
|data|date|other Data|
Thanks in advance for any help
To answer your question, for a regex that will take your text as a block (multi-line variable) and only return the desired lines, try:
^(?:(?:(?:(?=-).)+)|(?:[^|]+))\n?
There may be better ways to accomplish your overall goal, but this accomplishes what you requested.
Option Explicit
Function PipedLines(S As String)
    Dim RE As Object
    Const sPat As String = "^(?:(?:(?:(?=-).)+)|(?:[^|]+))\n?"
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = sPat
        PipedLines = .Replace(S, "")
    End With
End Function
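As a cross-check of what that pattern removes, here is a small sketch in Python (not VBA; the function names are hypothetical, and the sample text is assumed to use plain \n line endings). It runs the same pattern with the multiline flag and compares the result with the simpler idea of keeping only the lines that start with a pipe.

```python
import re

SAMPLE = "\n".join([
    "table name",
    "other info",
    "-" * 31,
    "|heading 1|heading 2|heading 3|",
    "-" * 31,
    "|data|date|other Data|",
    "|data|date|other Data|",
    "-" * 31,
    "time stamp",
    "etc",
])

# Same pattern as in the VBA function, with ^ matching at every line start.
PATTERN = re.compile(r"^(?:(?:(?:(?=-).)+)|(?:[^|]+))\n?", re.MULTILINE)

def piped_lines_regex(text: str) -> str:
    # Delete dash-only lines and any run of text that contains no pipe.
    return PATTERN.sub("", text)

def piped_lines_filter(text: str) -> str:
    # Alternative without a regex: keep lines that begin with a pipe.
    return "".join(line + "\n" for line in text.splitlines()
                   if line.startswith("|"))

print(piped_lines_regex(SAMPLE))
```

If the clipboard text uses \r\n line endings, it is probably safest to normalise them to \n first, since the pattern only consumes \n after a row of dashes.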
Hi #tsuimark, have you tried copying the clipboard data directly into Excel?
I tried this (screenshot attached) and then removed the unwanted rows in the sheet.
Thanks.
I am relatively new to R and struggling with file extraction.
I have a list of CSV files (i.e. 001.csv, 002.csv, ....) in my directory xyz and need to extract a specific file based on the input given by user.
User input is in the form of 1, 2 ... (stored in y), which I tried to convert by padding it with leading zeros.
When I run the code
filename = as.character(formatC(y, width=3, flag=0))
list.files(directory,pattern = "^",filename,"\\.csv$")
I get the result
character(0)
which implies my pattern code is incorrect. I want the matching file, e.g. 001.csv, to be extracted.
Can anybody help me out?
It seems you are missing a single pattern that matches any file name that starts with filename, continues with any 0+ characters, and ends with .csv.
To build it, use paste0:
files <- list.files(directory, pattern = paste0("^", filename, ".*\\.csv$"))
Where:
"^" - start of the file name string
filename - the filename you pass
".*\\.csv$" - any 0+ characters (.*) followed by .csv (\\.csv) at the end of the string ($).
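The same pattern construction can be tried out quickly outside R. The sketch below uses Python purely for illustration (csv_pattern and the list of names are made up for the example): it zero-pads the user input to three digits, builds the anchored pattern, and filters some candidate file names.

```python
import re

def csv_pattern(y) -> "re.Pattern[str]":
    # Zero-pad the user input to three digits, e.g. "1" -> "001".
    filename = f"{int(y):03d}"
    # ^ anchors the start, .* allows extra characters, \.csv$ anchors the end.
    return re.compile(rf"^{filename}.*\.csv$")

# Hypothetical directory listing used only for this example.
names = ["001.csv", "002.csv", "1001.csv", "001.csv.bak", "001_old.csv"]

pattern = csv_pattern("1")
matches = [name for name in names if pattern.search(name)]
print(matches)
```

Note that the .* part also lets a name such as 001_old.csv through; dropping it restricts the match to exactly 001.csv.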
filename = as.character(formatC(y, width=3, flag=0))
The formatC flag 0 seems to work only for numerical objects; if you read the user input y with e.g. y = readline(), y is of type "character". You get the desired formatting with
filename = formatC(as.integer(y), width=3, flag=0)
(as.character() isn't needed because the formatC() value already has that type).
list.files(directory,pattern = "^",filename,"\\.csv$")
This isn't a correct usage of
list.files(path = ".", pattern = NULL, all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE)
- surely you meant to concatenate "^", filename and "\\.csv$".
All told, I suggest building the whole filename pattern with sprintf(), anchored at both ends so that names such as 1001.csv do not match, i.e.:
filename = sprintf("^%03d\\.csv$", as.integer(y))
list.files(directory, filename)
I have a list like this
"Boring makes sense!"
"http://www.someurl.com/listsolo.php?username=fgt&id=46229&code="
"http://www.someurl2.com/members/listearn.php?username=mprogram&id=465301"
"All is there?"
"http://www.someurl.com/listsolo.php?username=loopa&id=46228&code="
"http://www.someurl3.com/members/mem.php?&mprogram"
"http://someurl4.com/members/mem.php?&loop"
I need to remove the plain-text lines, including their double quotes, using RegEx in VB.NET.
Dim fileName As String = "C:\Downloads\Links.txt"
Dim sr As New StreamReader(fileName)
While Not sr.EndOfStream
    Dim re As String = sr.ReadLine()
    If Not re.StartsWith("http") Then
        re = Regex.Replace(re, "(^[A-Za-z]+)", "", RegexOptions.Multiline)
        lblTest.Text += re.ToString()
    End If
End While
sr.Close()
How can I do this in a simple way?
Using LINQ: read the lines from the file, filter them, and write the result back to it:
File.WriteAllLines("some path", From line In File.ReadAllLines("some path")
Where line.StartsWith("http"))
I figured it out :-), this regex
.[A-Za-z]\w+ .*
removes the whole line of text, double quotes included. I tested the regex here. Anyway, thanks for the help.
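A quick way to check which lines that regex touches is sketched below in Python (for illustration only; the variable names are made up, and the sample lines are taken from the question with their double quotes kept). The pattern needs a letter, more word characters and then a space, so the URL lines, which contain no spaces, are left alone.

```python
import re

# Same pattern as above: any character, a letter, word characters, a space,
# then the rest of the line.
TEXT_LINE = re.compile(r'.[A-Za-z]\w+ .*')

lines = [
    '"Boring makes sense!"',
    '"http://www.someurl.com/listsolo.php?username=fgt&id=46229&code="',
    '"All is there?"',
    '"http://someurl4.com/members/mem.php?&loop"',
]

# Replacing a match with "" empties the text lines but keeps them as blank
# entries, so filter those out afterwards.
cleaned = [TEXT_LINE.sub("", line) for line in lines]
kept = [line for line in cleaned if line]
print(kept)
```

One caveat: a text line without any space, such as a single quoted word, does not match the pattern and would be kept.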
I've got a little problem with regards to iterating the filename of the txt files. I've got a filename format that goes like this: <date>-<year>_filename-<number>.txt. The problem is that when <number> reaches 9, the filename stops iterating.
The filenames go like this:
31-2014_filename-1
31-2014_filename-2
31-2014_filename-3
31-2014_filename-4
31-2014_filename-5
31-2014_filename-6
31-2014_filename-7
31-2014_filename-8
31-2014_filename-9
31-2014_filename-10
The function only detects up to 9. Anything beyond that number is ignored.
Below is the code
Dim lastreport As Integer = 1
Public Sub GetLastNo(ByVal filePath As String)
    Dim lastFile As String = 1
    Dim files() As String = Directory.GetFiles(filePath, "*.txt")
    For Each File As String In files
        File = Path.GetFileNameWithoutExtension(File)
        Dim numbers As MatchCollection = Regex.Matches(File, "(?<num>[\d]+)")
        For Each number In numbers
            number = CInt(number.ToString())
            If number > 0 And number < 1000 And number > lastFile Then
                lastFile = number
            End If
            lastreport = number
        Next
    Next
End Sub
Here it is:
(?<num>\d+(?=$))
The lookahead makes sure that the digits are followed by $ (the end of the string), so only the last run of digits in the name is matched.
It would really help to see some real filenames, including some that fail to match (your description is not completely unambiguous: for example what is <date> if it does not include the year?).
But assuming files like:
30May-2014_Stuff-1.txt
30May-2014_Stuff-3.txt
30May-2014_Stuff-5.txt
30May-2014_Stuff-7.txt
30May-2014_Stuff-9.txt
30May-2014_Stuff-11.txt
then using the .NET regex engine (from PowerShell (PSH) here, as it is quicker to test with):
(?<num>\d+)$
should match the final digits of the filename without its extension ($ matches the end of the string; the extension-less name is BaseName in PSH):
dir | foreach { if ($_.BaseName -match '(?<num>\d+)$') { $matches['num'] } }
gives:
1
11
3
5
7
9
So all filenames are matched, and the final number of their basenames is matched by group "num" of the regex.
I think there is something else going on in your approach: I would suggest changing to only get a single match per filename (and use Regex.Match rather than Matches to be consistent).
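The "single match per filename" idea can be sketched as follows, here in Python rather than VB.NET, with hypothetical names (last_report_number, the sample list). Each name is reduced to its base name, the trailing digits are matched once, and the values are compared as integers so that 10 ranks above 9.

```python
import re
from pathlib import PurePath

# Only the digits at the very end of the base name are of interest.
TRAILING_NUM = re.compile(r"(?P<num>\d+)$")

def last_report_number(filenames) -> int:
    highest = 0
    for name in filenames:
        # stem drops the final extension, e.g. "...-10.txt" -> "...-10"
        match = TRAILING_NUM.search(PurePath(name).stem)
        if match:
            # Compare as integers, never as strings.
            highest = max(highest, int(match.group("num")))
    return highest

names = [f"31-2014_filename-{n}.txt" for n in range(1, 11)]
print(last_report_number(names))
```

Because only the trailing group is matched, the 31 and 2014 parts of the name never enter the comparison.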
I have a file that has been generated by a server - I have no control over how this file is generated or formatted. I need to check each line begins with a string of set length (in this case 21 numerical chars). If a line doesn't match that condition, I need to join it to the previous line and, after reading and correcting the whole file, save it. I am doing this for a lot of files in a directory.
So far I have:
Dim rgx As New Regex("^[0-9]{21}$")
Dim linesList As New List(Of String)(File.ReadAllLines(finfo.FullName))
If linesList(0).Contains("BlackBerry Messenger") Then
    linesList.RemoveAt(0)
    For i As Integer = 0 To linesList.Count
        If Not rgx.IsMatch(i.ToString) Then
            linesList.Concat(linesList(i-1))
        End If
    Next
End If
File.WriteAllLines(finfo.FullName, linesList.ToArray())
There's a for statement before and after that code block to loop over all files in the source directory, which works fine.
Hope this isn't too bad to read :/
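Your loop never actually joins the lines: rgx.IsMatch(i.ToString) tests the loop index rather than the line itself, and Concat returns a new sequence that is thrown away instead of changing the list. Note also that with the trailing $, "^[0-9]{21}$" only matches lines that consist of exactly 21 digits, so I dropped it. Here's a different approach: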
Dim rgx As New Regex("^[0-9]{21}")
Dim linesList As New List(Of String)(File.ReadAllLines(finfo.FullName))
' We will create a new list to store the new lines data
Dim newLinesList As New List(Of String)()
If linesList.Count > 0 AndAlso linesList(0).Contains("BlackBerry Messenger") Then
    ' Start at 1 to skip the header line
    Dim i As Integer = 1
    Dim newLine As String
    While i < linesList.Count
        newLine = linesList(i)
        i += 1
        ' Keep going until the "real" line is over
        While i < linesList.Count AndAlso Not rgx.IsMatch(linesList(i))
            newLine &= linesList(i)
            i += 1
        End While
        newLinesList.Add(newLine)
    End While
    ' Write only when the file was processed, so files without the header stay untouched
    File.WriteAllLines(finfo.FullName, newLinesList.ToArray())
End If
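The joining logic itself is small enough to sketch on its own. The version below is in Python, for illustration only (join_continuations and the sample lines are made up): a line that starts with 21 digits opens a new record, and any other line is appended to the record before it.

```python
import re

# A "real" line starts with exactly the 21-digit prefix described above.
RECORD_START = re.compile(r"^[0-9]{21}")

def join_continuations(lines):
    merged = []
    for line in lines:
        if RECORD_START.match(line) or not merged:
            # New record (or a first line with nothing before it to join to).
            merged.append(line)
        else:
            # Continuation: glue it onto the previous record.
            merged[-1] += line
    return merged

sample = [
    "1" * 21 + " hello",
    "world",
    "2" * 21 + " second",
]
print(join_continuations(sample))
```

Like the VB.NET code, this joins the pieces without adding a separator; whether a space belongs between them depends on how the server broke the lines.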