Create data frame from patterns in a string from a file - regex

I have a very large file that has some sort of titles at the beginning, then a lot of data in eight columns. This data is not separated in a regular way: the columns were written out separated by spaces, but if some column exceeds the "normal" size, the columns end up separated by more or fewer space characters.
What I did is read the file through a connection, line by line, and apply a regular expression with gsub, something like this:
conn <- file("my_file.dat", open="rt")
y <- gsub("a_ver_large_regexp",
"\\1, \\2, \\3, \\4, \\5, \\6, \\7, \\8", #the columns I want csv'd
perl = TRUE,
readLines(conn, n=-1L))
I then end up with y, a character vector in which each element is still a single string, but now comma-separated.
Now I want to convert y to a data frame. I suppose it should be fairly easy, given that each element is a string whose fields are separated by commas. Any idea how to do this?

It's somewhat difficult to try to write a solution when we cannot see for example y or the original data. However, I think that
as.data.frame(do.call("rbind", strsplit(y, ",")))
might get you what you are after.
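For comparison, the same split-and-stack idea can be sketched in Python. This is only an illustration: the values in y are made up, since the original data is not shown, and split_rows is a hypothetical helper name.

```python
# Hypothetical sample of the comma-separated strings in y (made-up values).
y = ["1, 2.5, a, b, 3, 4, 5, 6",
     "10, 20.5, c, d, 30, 40, 50, 60"]

def split_rows(lines, sep=","):
    """Split each line on sep and strip surrounding whitespace from every field."""
    return [[field.strip() for field in line.split(sep)] for line in lines]

rows = split_rows(y)
# Transpose the rows into columns, the layout a data frame is built from.
columns = [list(col) for col in zip(*rows)]
```

Stripping each field matters here because the replacement string in the gsub call puts a space after every comma.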


How to read CSV file with newline and comma characters inside cells in C++

I've got a CSV file containing cells with break lines ("\n") and/or commas which are enclosed with double quotes.
When I use the getline() function to get each row, it considers each line inside a cell as a new row of the csv file. In addition, when I use splitIntoVec to get the vector for each row, it considers a comma inside a cell as the start of a new vector element.
I want to store the content of the csv file in a vector of vectors, in which each row is a vector of strings holding its cells.
for instance, for the following csv file content
"Row 1 cell 1
With break line","Row1 cell2, with comma"
"Row 2 cell 1
With break line","Row2 cell2, with comma"
Row 3 cell 1,Row3 cell 2
I get a result vector of 4 string vectors, in which the first has only one element and the second has 3 elements.
Here is my code :
vector<vector<string>> readFromCsv(string &fileName, char rowDelimiter = '\n', char colDelimiter = ',') {
    ifstream file(fileName); // declare file stream
    string value;
    vector<vector<string>> contentVec;
    vector<string> rowVec;
    string rowStr;
    while (getline(file, rowStr, rowDelimiter)) {
        rowVec = splitIntoVec(rowStr, colDelimiter);
        contentVec.push_back(rowVec);
    }
    return contentVec;
}
Is there any other function (in libraries like boost) available to resolve these issues? Any help would be appreciated.
In PHP, fgetcsv() reads the content of this csv file correctly. Is there an equivalent function in C++?
@Simone already said in his comment that it is not a CSV file. But seeing your problem, you will need to get your hands dirty and do some text processing to separate it. You can read the complete file into a string and then break it up using loops or whichever way you see fit. To do this, you need to keep track of each " encountered while traversing, and break only when you are not inside double quotes.
For example,
(opening quote)"Row 1 cell 1
With break line"(closing quote),(opening quote)"Row1 cell2, with comma"(closing quote)
You will have to keep track of opening and closing double quotes using an index or a counter, and break into rows only if '\n' is found outside the opening and closing quotes.
You can use regex also if you are sure there are no " in the cells.
Thanks @Alex. Useful link if someone else faces the same issue: http://mybyteofcode.blogspot.nl/2010/11/parse-csv-file-with-embedded-new-lines.html
You have to scan for " yourself, keeping 2 states: inside quotes and outside. The , and EOL characters have different meanings depending on the state.
You can use getline(file, rowStr, '"') to read in everything up to the next ", but your logic to separate the records will be a bit more complex. If numbers are allowed without quotation marks, it becomes even more complex.
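The two-state idea described in these answers can be sketched as follows. This is a minimal illustration in Python rather than C++. The name parse_csv and its parameters are hypothetical. It assumes "\n" row endings and treats a doubled quote inside a quoted cell as a literal quote, which is an assumption that the question does not confirm.

```python
def parse_csv(text, col_delim=",", row_delim="\n", quote='"'):
    """Parse CSV text into a list of rows, honouring quoted cells."""
    rows, row, cell = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == quote:
                # Assumption: a doubled quote inside a quoted cell is a literal quote.
                if i + 1 < len(text) and text[i + 1] == quote:
                    cell.append(quote)
                    i += 1
                else:
                    in_quotes = False
            else:
                cell.append(ch)  # delimiters are ordinary text inside quotes
        elif ch == quote:
            in_quotes = True
        elif ch == col_delim:
            row.append("".join(cell))
            cell = []
        elif ch == row_delim:
            row.append("".join(cell))
            rows.append(row)
            row, cell = [], []
        else:
            cell.append(ch)
        i += 1
    # Flush the last row if the text does not end with a row delimiter.
    if cell or row:
        row.append("".join(cell))
        rows.append(row)
    return rows

sample = ('"Row 1 cell 1\nWith break line","Row1 cell2, with comma"\n'
          '"Row 2 cell 1\nWith break line","Row2 cell2, with comma"\n'
          'Row 3 cell 1,Row3 cell 2')
result = parse_csv(sample)
```

With the sample from the question, result holds three rows of two cells each, and the embedded newlines and commas stay inside their cells.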

Applying a regular expression to a text file Python 3

# returns same result, i.e. only the first line as many times as 'draws'
infile = open("results_from_url.txt", 'r')
file = infile.read()  # essential to get correct formatting
for line in islice(file, 0, draws):  # allows you to limit number of draws
    for line in re.split(r"Wins", file)[1].split('\n'):
        mains.append(line[23:38])  # slices first five numbers from line
        stars.append(line[39:44])  # slices last two numbers from line
infile.close()
I am trying to use the above code to iterate through a list of numbers to extract the bits of interest. In this attempt to learn how to use regular expressions in Python 3, I am using lottery results opened from the internet. All this does is to read one line and return it as many times as I instruct in the value of 'draws'. Could someone tell me what I have done incorrectly, please. Does re 'terminate' somehow? The strange thing is if I copy the file into a string and run this routine, it works. I am at a loss - problem 'reading' a file or in my use of the regular expression?
I can't tell you why your code doesn't work, because I cannot reproduce the result you're getting. I'm also not sure what the purpose of
for line in islice(file, 0, draws):
is, because you never use the line variable after that, you immediately overwrite it with
for line in re.split(r"Wins",file)[1].split('\n'):
Plus, you could have used file.split('Wins') instead of re.split(r"Wins",file), so you aren't really using regex at all.
Regex is a tool to find data of a certain format. Why do you use it to split the input text, when you could use it to find the data you're looking for?
What is it you're looking for? A sequence of seven numbers, separated by commas. Translated into regex:
(?:\d+,){7}
However, we want to group the first 5 numbers - the "mains" - and the last 2 numbers - the "stars". So we'll add two named capture groups, named "mains" and "stars":
(?P<mains>(?:\d+,){5})(?P<stars>(?:\d+,){2})
This pattern will find all numbers you're looking for.
import re
data = open("infile.txt", 'r').read()
mains = []
stars = []
pattern = r'(?P<mains>(?:\d+,){5})(?P<stars>(?:\d+,){2})'
iterator = re.finditer(pattern, data)
for count in range(int(input('Enter number of draws to examine: '))):
    try:
        match = next(iterator)
    except StopIteration:
        print('no more matches')
        break
    mains.append(match.group('mains'))
    stars.append(match.group('stars'))
print(mains, stars)
This will print something like ['01,03,31,42,46,'] ['04,11,']. You may want to remove the commas and convert the numbers to ints, but in essence, this is how you would use regex.
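A minimal sketch of that last step, assuming mains and stars hold strings in the shape shown above (to_ints is a hypothetical helper name):

```python
def to_ints(group):
    """Turn a matched string such as '01,03,31,42,46,' into a list of ints."""
    # The trailing comma produces an empty last piece, which is skipped.
    return [int(n) for n in group.split(",") if n]

mains = ['01,03,31,42,46,']
stars = ['04,11,']
main_numbers = [to_ints(m) for m in mains]
star_numbers = [to_ints(s) for s in stars]
```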

Match anything except character unless it's followed by some other character

I've got this odd string:
firstName:Paul Henry,retired:true,message:A, B & more,title:mr
which needs to be split into its <key>:<value> pairs. Unfortunately, key/value pairs are separated by , which itself can be part of the value. Hence, a simple string-split at , would not produce the correct result.
Keys contain only word characters and values can contain :.
What I need (I think) is something like
\w*:match-anything-but-comma-unless-comma-is-followed-by-space
What kind of works is
\w*:[\w ?!&%,]*(?![^,])
but of course I wouldn't want to explicitly list all characters in the character class (just listed a few for this example).
If you want to split on a comma, unless the comma is followed by a space, why not just:
,(?=\S)
Not sure what language you are using, but in C# the line might look like:
splitArray = Regex.Split(subjectString, @",(?=\S)");
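The same lookahead works with Python's re.split. The sketch below uses the string from the question. Splitting each pair at the first colon only is an added assumption, based on the statement that values can contain ":".

```python
import re

subject = "firstName:Paul Henry,retired:true,message:A, B & more,title:mr"
# Split on commas that are immediately followed by a non-whitespace character.
pairs = re.split(r",(?=\S)", subject)
# Split each pair at the first colon only, since values may contain ':'.
result = dict(pair.split(":", 1) for pair in pairs)
```

Note that this relies on every comma inside a value being followed by whitespace, as in the example string.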
You are trying to do something complicated with a regular expression that would be simple (and easy to understand) with a little code. That's usually a mistake. Just write a little code.
In your case, you want to split the input on commas. If you get a chunk that doesn't contain a colon, you want to treat it as part of the previous chunk. So just write that. For example, in Python, I'd do it like this:
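chunks = text.split(',')  # text is the input string
associations = []
for chunk in chunks:
    if ':' in chunk:
        # A chunk with a colon starts a new key:value pair.
        associations.append(chunk)
    else:
        # No colon: this chunk belongs to the previous value.
        associations[-1] += ',' + chunk
# Split at the first colon only, because values may contain ':'.
pairs = dict(association.split(':', 1) for association in associations)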

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
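As an alternative to the regular expression, Python's standard csv module already understands double-quoted fields. A minimal sketch using the line from the question:

```python
import csv

s = '10,"nikhil,khandare","sachin","rahul",viru'
# csv.reader accepts any iterable of lines, so wrap the single line in a list.
fields = next(csv.reader([s]))
```

Here fields comes back with the quotes already removed and the comma inside the quoted field preserved, so no extra stripping step is needed.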
If your data always has such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses the input from left to right. It has two states: inside quotes and outside. Regular expressions are also a good (and simpler) way if you already know them, as they are basically equivalent to a state machine, just in another form.

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by their first letter (P, N, L, S). Each package must have at least two dedicated lines beginning with *|, which are then read as comments. Blank lines between different letters are filled with the characters *|. Where the lines between different letters do not begin with *|, it has to be added. Blank lines and 'random' characters between identical letters are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string up by splitting on the new lines. (The newlines characters depend on your operating system.)
lines = regexp(text, '\r\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
You'll be needing to know your way around regular expressions. There's a good guide to them at regular-expressions.info.
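For readers more at home in Python, the two MATLAB steps above (fill blank lines, then prefix lines that start with a pipe) can be sketched as below. The function name clean_lines is hypothetical, and splitting on an optional "\r" is an assumption made so that both Windows and Unix line endings work. Like the MATLAB snippet, this is only a starting point and does not implement the full set of rules in the question.

```python
import re

def clean_lines(text, comment="*|comment"):
    """Replace blank lines with a comment and prefix lines that start with '|'."""
    cleaned = []
    for line in re.split(r"\r?\n", text):
        if line == "":
            cleaned.append(comment)
        elif line.startswith("|"):
            # Same effect as replacing the leading '|' with '*|comment|'.
            cleaned.append(comment + "|" + line[1:])
        else:
            cleaned.append(line)
    return cleaned

sample = "N|I|01\r\n\r\n||asas\nS|SH01"
result = clean_lines(sample)
```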