Regex searching of a CSV file

I have a huge task to do: separating voltage data from recorded .csv files of the following format.
13/03/2014 18:48,71.556671,71.651062,71.639755,72.130692,71.961441,72.646423,72.262756,72.334511,7.812012
I am new to regular expressions. How do I get the data from column 10, repeatedly?
I have over 10,000,000 files to reduce and average to 32,000 for Excel to graph. Any advice is greatly welcome; I am trying to use PowerGrep to get up to speed.

Not that I would say that regex is the tool for it, but here goes:
(?:[^,]*,){9}([^,]*)
I.e. nine "columns" of non-commas, separated by commas, then capture the tenth in group 1.
E.g. use it with a Perl one-liner:
perl -ne 'chomp; /(?:[^,]*,){9}([^,]*)/ and print "$1\n"'
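For the reduce-and-average step, a minimal Python sketch follows. It assumes the 10,000,000 figure refers to rows rather than separate files, in which case blocks of roughly 313 rows (10,000,000 / 32,000 = 312.5) would give about 32,000 averaged points. The function names tenth_column and block_average are made up for this illustration.

```python
import csv

def tenth_column(lines):
    """Yield the 10th comma-separated field of each row as a float."""
    for row in csv.reader(lines):
        if len(row) >= 10:          # skip short or empty rows
            yield float(row[9])     # index 9 is column 10

def block_average(values, block_size):
    """Average consecutive blocks of block_size values.

    The final block may be shorter than block_size.
    """
    total, count = 0.0, 0
    for value in values:
        total += value
        count += 1
        if count == block_size:
            yield total / count
            total, count = 0.0, 0
    if count:
        yield total / count

sample = [
    "13/03/2014 18:48,71.556671,71.651062,71.639755,72.130692,"
    "71.961441,72.646423,72.262756,72.334511,7.812012",
]
print(list(tenth_column(sample)))
```

Both functions are generators, so a large file can be streamed with open(...) passed straight to tenth_column without loading everything into memory.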

Related

Issues while processing zeroes found in CSV input file with Perl

Friends:
I have to process a CSV file using Perl and produce an Excel file as output, using the Excel::Writer::XLSX module. This is not homework but a real-life problem, where I cannot install just any Perl version (I actually have to use Perl 5.6) or any Perl module (I have a limited set of them). My OS is UNIX. I can also use ksh and csh embedded in Perl (with some limitations, as I have found so far). Please limit your answers to the tools I have available. Thanks in advance!
Although I am not a Perl developer and come from other languages, I have already done the main work. However, the customer is asking for extra processing, and that is where I am stuck.
1) The obstacles I found come from two sides: Perl, and Excel's particular way of processing data. I have already found a workaround for Excel but, as mentioned in the subject, I have difficulties processing zeroes found in the CSV input file. To handle Excel, I am using the '0 approach, which seems to be the final form of data representation that Excel uses with the # formatting style.
2) Scenario:
I need to catch standalone zeroes which might be present in any line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question to avoid wasting your valuable time. I am providing more details after my question:
Research and question:
I tried to use a Perl regex to find standalone "0" values and replace them with some placeholder string, planning to replace them back with "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
These work, but only from the command line. They do not work when I call them from the Perl script as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
I do not know why... I have already tried using exec and eval instead of system, with the same results.
Note that I have a ton of regex that work perfectly with the same structure, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried using backticks and qx//, without success. Note that qx// and backticks do not behave the same, since qx// complains about the boundaries \b because of the forward slash.
I have tried using sed -i, but my system rejects -i as an invalid flag (I do not know if this happens on every UNIX, but it does on the one at work; perl -i, however, is accepted).
I have tried embedding awk (which is working from command line), in this way:
system `awk -F ',' -v OFS=',' '$1 == \"0\" { $1 = "string" }1' myfile.csv > myfile_copy.csv
But this works only for the first column (on the command line) and, besides the disadvantage of needing an extra copy of the file, Perl complains about the > redirection, treating it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk works from the command line, but only for 5 columns, and it does not work from Perl using #.
All the combinations of exec and eval have also been tested without success.
I have also tried passing each of the awk components to system as separate arguments, separated by commas, but did not find any valid way to pass the redirection operator (>), since Perl rejects it for the reason mentioned above.
Using another approach, I noticed that the "standalone zeroes" seem to be "swallowed" by the Text::CSV module, so I got rid of it and went back to traditional looping over the CSV line by line with a split on commas, preserving the zeroes that way. However, I ran into the "mystery" of isdual in Perl and, because of the limited set of modules I have, I cannot use the Dumper. Then I also explored the guts of binaries in Perl and tried $x ^ $x, which has been deprecated since version 5.22 but was valid until that version (as I said, mine is 5.6). This is useful for telling numbers from strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) does not return TRUE when $x = 0. [UPDATE: I tried this in a dedicated Perl script, just for this purpose, and it works. I believe my probably wrong conclusion ("not returning TRUE") was reached before I realized that Text::CSV was swallowing my zeroes. Doing new tests...].
I will appreciate very much your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database, which is handed over to me and which I pick up programmatically from a folder. Dynamic means that it might have any number of tables, any number of columns in each table, any names as column headers, and any number of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So, I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. The first row is the header.
b) I call "standalone zeroes" those "clean" zeroes which come in the CSV file, from the second row onwards, between commas, like 0, (if it is in the first position of the row) or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row: 0,J5,alfa,0, which in this particular case are "words" or "strings". In this case there are 4 names (note that two of them are zeroes, which need to be treated as strings). Thus, we have an example with 4 name columns (Alfa,Alfa1,Beta,Gamma are the headers for those columns, but only in this scenario). From that point onwards, in the second row, you can see floating point (*.00) numbers and, among them, 2 zeroes, which are numbers. Finally, in the third line, you can read M1,0,X888,ZZ, which are the names for the first 4 columns. Please note that the 4th column in the second row has 0 as its name, while the 4th column in the third row has ZZ as its name.
Summary: as a general picture, I have a table-report divided in 2 parts, from left to right: 4 columns for names, and 8 columns for numbers.
Always the first M columns are names and the last N columns are numbers.
- M is unknown: I do not know how many columns devoted to words / strings I will receive.
- N is unknown: I do not know how many columns devoted to numbers I will receive.
- It is KNOWN that, after the M columns end, the N columns always start, and this is constant for all the rows.
I have done some quick research on Perl regex word boundaries ( \b ), and I have not found any relevant information on whether they apply in Perl 5.6.
However, since you are using an old Perl version, try the traditional UNIX / Linux style (I mean, what Perl inherits from the shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the work, making the change at the start of each line in your CSV file if it matches.
Or, maybe better (if you have those "standalone" zeroes and want to avoid any unwanted change to a string with "leading zeroes"):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added the comma, after the zero; and, of course, after the string].
Note that the first regex should work; the second one is just a "caveat", to be cautious.
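One likely reason the \b version fails inside system("...") is that, in a double-quoted Perl string, \b is interpolated as a backspace character before the shell ever sees it; writing \\b or using a single-quoted string such as q{...} avoids that. Non-raw Python strings behave analogously, which the sketch below uses to illustrate the point. It also shows an alternative that avoids regexes altogether by comparing whole fields. The function name replace_standalone_zeros and the placeholder value are assumptions made for this illustration, and the naive split on commas assumes no quoted fields.

```python
import re

# In a regular (non-raw) Python string, "\b" is a backspace character,
# much like "\b" inside a double-quoted Perl string.
assert "\b" == "\x08"

def replace_standalone_zeros(line, placeholder="string"):
    """Replace fields that are exactly "0"; values such as 0.5 are untouched."""
    fields = line.split(",")
    return ",".join(placeholder if field == "0" else field for field in fields)

row = "0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00"
print(replace_standalone_zeros(row))

# A word-boundary regex is looser: it also hits the 0 in "0.5".
print(re.sub(r"\b0\b", "string", "0,0.5,10"))
```

Comparing whole fields handles zeroes in any column, not only the first one, and does not depend on how the shell or the host language treats backslashes.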

How to write generic regex to extract the data in ExtractText?

My present data looks like the sample below. It contains 100 rows.
1,Ads,,12,CDMA,,12
2,,12,14,CDMA,,12
..
...
100,DVS,13,,CDMA,12,22
I am using GetFile-->SplitText-->ExtractText to split the data into rows, with 10 regex attributes for my present data.
For example, one of my input regexes is (.+),(.+),,(.+),(.+),(.+). It splits the row into regex.1, regex.2, up to regex.5.
For this data I have given the ExtractText processor 10 regex attributes to match all values in the present data.
In the future another 100 rows will be added to the present data, so I would have to write regex attributes for those 100 lines as well.
I also need to add expression language support for all columns of the extracted data in the processor.
Is it possible to give one common regex for all the data in the ExtractText processor?
Is there any other way to extract the data by a delimiter, such as a comma or pipe symbol, in NiFi?
Any help appreciated.
I just found a common regex to extract my data from the CSV file:
([^,]*?),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)
It might be expensive, but it may work better than (.+),(.+),,(.+),(.+),(.+)
It may be helpful for someone.
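To see why the [^,]* form copes with empty fields, here is a small Python sketch that builds the same kind of pattern for a given column count and applies it to the sample rows. It only illustrates the regex behaviour; it is not NiFi configuration, and the helper name generic_pattern is made up for this example.

```python
import re

def generic_pattern(columns):
    """Build a regex with one ([^,]*) capture group per column."""
    return re.compile(",".join(["([^,]*)"] * columns))

rows = [
    "1,Ads,,12,CDMA,,12",
    "2,,12,14,CDMA,,12",
    "100,DVS,13,,CDMA,12,22",
]

pattern = generic_pattern(7)
for row in rows:
    match = pattern.fullmatch(row)
    # Empty fields come back as empty strings instead of breaking the match.
    print(match.groups() if match else None)
```

Because [^,]* can match zero characters and can never run across a comma, one pattern covers every row with the same number of columns, whichever fields happen to be empty.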

On Cygwin (or windows 7), match a word, look backwards, skip a word and print x number of comma separated words

I have a headache from trying to understand squiggly awks and greps, but have not gotten far.
I have 100 thousand files from which I'm trying to extract a single line.
A sample set of lines of the file is:
Revenue,876.08,,9361.000,444.000,333.000,222.000,111.00,485.000,"\t\t",178.90,9008.98
EV to Revenue,6.170,0.65,3.600,2.60,1.520,1.7,"\t\t",190.9,9008.98,80.9,87
(there are two tabs between the double quotes. I'm representing them with \t here. They are actual whitespace tabs)
I'm trying to output just this line that starts with Revenue:
Revenue,444.000,333.000,222.000,111.00
This output line contains the first word of the line and the comma (i.e. Revenue,). It then finds the two tabs enclosed in double quotes, looks backwards, skipping the first comma-separated number (also assume that instead of a number there could be nothing, i.e. just an empty field between commas), and then outputs the 4 comma-separated numbers before that.
Is this doable in a simple grep or awk or cut or tr command on cygwin that won't be a bear to run on 100K files ?
To clarify, there are 100K files that look very similar. Each file will contain lots of lines (separated by new line/carriage return). Some lines will contain the word Revenue at the start, some in the middle (as in the 2nd sample line I pasted above), etc. I'm only interested in those lines that start with Revenue followed by the comma and then the sequence above. Each file will contain that specific line.
To round off this kind of task (because working on 100K files would require this too), what would have to be added to sed to also print the name of the file currently being processed?
ie: output like this:
FileName1: Revenue,444.000,333.000,222.000,111.00
[I'll post the answer here if I find it]
Thank you!
Thanks to Sputnick for editing my question so it looks neat and thanks to shellter for responding.
Ed, your solution looks really good. I'm testing it out and will reply back with info plus my understanding of how that regex works. Thank you very much for taking time to write this out!
Since this is just a simple substitution on a single line, it's really most appropriate for sed:
$ sed -n -r 's/(^Revenue)(,[^,]*){3}(.*),[^,]*,"\t\t".*/\1\3/p' file
Revenue,444.000,333.000,222.000,111.00
but you can do the same in awk with gensub() (gawk) or match()/substr() or similar. It will run in the blink of an eye no matter what tool you use.
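For the follow-up about prefixing each result with the file name, a hedged Python sketch is shown below. Unlike the sed command, it follows the question's wording literally: it locates the quoted two-tab field, skips the one field before it, and keeps the four fields before that. The names extract_revenue and scan are hypothetical, and the sketch assumes the first matching line per file is the one wanted.

```python
import csv

MARKER = "\t\t"  # the quoted field that holds two literal tabs

def extract_revenue(line):
    """Return 'Revenue,' plus the four fields described in the question, or None."""
    fields = next(csv.reader([line]), [])
    if not fields or fields[0] != "Revenue" or MARKER not in fields:
        return None
    marker = fields.index(MARKER)
    if marker < 6:  # need one skipped field and four kept fields after column 1
        return None
    # Skip the field right before the marker, keep the four before that.
    return ",".join([fields[0]] + fields[marker - 5:marker - 1])

def scan(paths):
    """Yield 'path: result' for the first matching line of each file."""
    for path in paths:
        with open(path, newline="") as handle:
            for line in handle:
                result = extract_revenue(line.rstrip("\r\n"))
                if result is not None:
                    yield f"{path}: {result}"
                    break

sample = 'Revenue,876.08,,9361.000,444.000,333.000,222.000,111.00,485.000,"\t\t",178.90,9008.98'
print(extract_revenue(sample))
```

A call such as scan(glob.glob("*.csv")) would then produce one prefixed line per file; the glob pattern here is only an example.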

Parsing a CSV quoted string CSV file in Perl using split

I have a CSV file in the format shown below, and I'm using the Perl split command as shown, with a comma as the delimiter. The problem is that I have a quoted string "HTTP Large, GMS, ZMS: Large Files" with embedded commas, and the split fails on it. The resulting array does not contain the expected elements. How can I modify the split command?
my @values = split('\,', $line);
CSV File
10852,800 Mob to Int'l,235341739,573047,84475.40,0.0003,Inbound,Ber unit
10880,"HTTP Large, GMS, ZMS: Large Files",52852810,128,13712.68,0.0002,,Rer unit
13506,Presence National,2716766818,2447643,309116.40,0.0001,Presence,per Cnit
Issues like embedded commas are precisely why modules such as Text::CSV were created. If, but only if, the data does not have embedded commas, then you can make regular expressions work. When the data has embedded commas, it is time to move to a tool designed to handle CSV with embedded commas, and that would be Text::CSV in Perl (and its relatives Text::CSV_PP and Text::CSV_XS).
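The difference between a naive split and a CSV-aware parser can be shown with Python's standard csv module, used here only as a stand-in for Text::CSV to illustrate the same idea. The line is the second row of the sample data.

```python
import csv

line = '10880,"HTTP Large, GMS, ZMS: Large Files",52852810,128,13712.68,0.0002,,Rer unit'

# A plain split cuts the quoted field apart at its embedded commas.
naive = line.split(",")

# A CSV parser treats the quoted string as a single field.
parsed = next(csv.reader([line]))

print(len(naive), len(parsed))
print(parsed[1])
```

The parser also drops the surrounding double quotes, so the second field comes back as plain text.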
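I have used a similar split-based approach and it works fine for me, though only when every field is wrapped in double quotes, because it splits only on commas that sit between two quote characters. Try this code.
my @values = split(/(?<="),(?=")/, $line);
Hope it helps.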

Automatically finding numbering patterns in filenames

Intro
I work in a facility where we have microscopes. These guys can be asked to generate 4D movies of a sample: they take e.g. 10 pictures at different Z positions, then wait a certain amount of time (next timepoint) and take 10 slices again.
They can be asked to save a file for each slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point.
Question
Once you have all these file names, you can use a regex pattern to extract the Z and T value, if you know the backbone pattern of the file name. This I know how to do.
The question I have is: do you know a way to automatically generate a regex pattern from the list of file names? For instance, there is an awesome tool on the net that does a similar thing: txt2re.
What algorithm would you use to parse the whole list of file names and generate the most likely regex pattern?
There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is
my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";
outputs:
this\ is\ (?:Perl|Ruby)
Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, this wouldn't give you capturing of numbers etc. so it wouldn't be completely automatic. After getting the diff you would have to hand-edit or do some kind of substitution to get a working final regex.
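One simple form of the "substitution" step can be sketched in Python under a strong assumption: all file names share the same text and differ only in some of their digit runs. The sketch keeps constant digit runs literal and turns the varying ones into capture groups. It is a heuristic, not what String::Diff does, and infer_pattern is a name invented for this example.

```python
import re

def infer_pattern(names):
    """Return a regex string capturing the digit runs that vary across names.

    Returns None when the names do not share one non-digit skeleton.
    """
    skeletons = {re.sub(r"\d+", "\0", name) for name in names}
    if len(skeletons) != 1:
        return None
    # Column k holds the k-th digit run of every name.
    columns = list(zip(*(re.findall(r"\d+", name) for name in names)))
    # With a capturing split, odd indices are digit runs, even ones are text.
    pieces = re.split(r"(\d+)", names[0])
    parts = []
    for index, piece in enumerate(pieces):
        is_digits = index % 2 == 1
        if is_digits and len(set(columns[index // 2])) > 1:
            parts.append(r"(\d+)")
        else:
            parts.append(re.escape(piece))
    return "^" + "".join(parts) + "$"

names = [
    "2009-11-03-experiment1-Z07-T42.tif",
    "2009-11-03-experiment1-Z08-T42.tif",
    "2009-11-03-experiment1-Z07-T43.tif",
]
pattern = infer_pattern(names)
print(re.match(pattern, names[0]).groups())
```

Names that differ in anything other than digit runs make the function return None, which is the cue to fall back to hand editing.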
First of all, you are trying to do this the hard way. I suspect that this may not be impossible but you would have to apply some artificial intelligence techniques and it would be far more complicated than it is worth. Either neural networks or a genetic algorithm system could be trained to recognize the Z numbers and T numbers, assuming that the format of Z[0-9]+ and T[0-9]+ is always used somewhere in the regex.
What I would do with this problem is to write a Python script to process all of the filenames. In this script, I would match twice against the filename, one time looking for Z[0-9]+ and one time looking for T[0-9]+. Each time I would count the matches for Z-numbers and T-numbers.
I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would represent the count of filenames with 1 match, and the ones with multiple matches. And I would count the total number of filenames processed.
At the end, I would report as follows:
nnnnnnnnnn filenames processed
Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.
T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.
If you are lucky, there will be no multiple matches at all, and you could use the regexes above to extract your numbers. However, if there are any significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. This would tell you whether or not a simple adjustment to the regex might work.
For instance, if you have 23,768 multiple matches on T-numbers, then make the script print every 500th filename with multiple matches, which would give you 47 samples to examine.
Probably something like [ -/.=]T[0-9]+[ -/.=] would be enough to get the multiple matches down to zero, while also giving a one-time match for every filename. Or at worst, [0-9][ -/.=]T[0-9]+[ -/.=]
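The counting script described above could look roughly like the following Python sketch. It uses the Z[0-9]+ and T[0-9]+ patterns from the answer; the function name tally and the dictionary keys are choices made for this illustration.

```python
import re

Z_RE = re.compile(r"Z[0-9]+")
T_RE = re.compile(r"T[0-9]+")

def tally(filenames):
    """Count file names with exactly one, or more than one, Z and T match."""
    counts = {"total": 0, "Z_once": 0, "Z_multi": 0, "T_once": 0, "T_multi": 0}
    for name in filenames:
        counts["total"] += 1
        for label, regex in (("Z", Z_RE), ("T", T_RE)):
            hits = len(regex.findall(name))
            if hits == 1:
                counts[label + "_once"] += 1
            elif hits > 1:
                counts[label + "_multi"] += 1
    return counts

def report(counts):
    """Format the counts in the layout suggested in the answer."""
    return "\n".join([
        f"{counts['total']} filenames processed",
        f"Z-numbers matched only once in {counts['Z_once']} filenames.",
        f"Z-numbers matched multiple times in {counts['Z_multi']} filenames.",
        f"T-numbers matched only once in {counts['T_once']} filenames.",
        f"T-numbers matched multiple times in {counts['T_multi']} filenames.",
    ])

examples = ["2009-11-03-experiment1-Z07-T42.tif", "T1-run-Z03-T05.tif"]
print(report(tally(examples)))
```

If the multiple-match counts are not zero, printing a sample of the offending names is a small addition inside the loop.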
For Python, see this question about TemplateMaker.