Issues while processing zeroes found in CSV input file with Perl - regex

Friends:
I have to process a CSV file, using Perl, and produce an Excel file as output, using the Excel::Writer::XLSX module. This is not homework but a real-life problem, where I cannot download whatever Perl version I want (I actually need to use Perl 5.6) or whatever Perl modules I want (I have a limited set of them). My OS is UNIX. I can also use (embedded in Perl) ksh and csh (with some limitations, as I have found so far). Please limit your answers to the tools I have available. Thanks in advance!
Even though I am not a Perl developer but come from other languages, I have already done my work. However, the customer is asking for extra processing, and that is where I am getting stuck.
1) The obstacles I have found come from two sides: Perl's and Excel's particular styles of processing data. I have already found a workaround to handle the Excel side but, as mentioned in the subject, I have difficulties while processing zeroes found in the CSV input file. To handle Excel, I am using the '0 approach (a leading apostrophe), which is the final representation that Excel seems to give the data when the # formatting style is used.
2) Scenario:
I need to catch standalone zeroes which might be present in any line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question, to avoid wasting your valuable time. I provide more details after the question:
Research and question:
I tried to use a Perl regex to find standalone "0"s and replace them with some placeholder string, planning to replace them back with "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
Both work, but only from the command line. They do not work when I call them from the Perl script, as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
I do not know why... I have already tried using exec and eval instead of system, with the same results.
Note that I have a ton of regexes that work perfectly with the same structure, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried using backticks and qx//, without success. Note that qx// and backticks do not have the same behavior, since qx// complains about the \b boundaries because of the forward slash.
I have tried using sed -i, but my system rejects -i as an invalid flag (I do not know whether this happens on every UNIX, but at least it happens on the one at work; it does, however, accept perl -i).
I have tried embedding awk (which works from the command line) in this way:
system("awk -F ',' -v OFS=',' '\$1 == \"0\" { \$1 = \"string\" }1' myfile.csv > myfile_copy.csv");
But this works only for the first column (on the command line) and, besides the disadvantage of the extra copy file, Perl complains about the > redirection, taking it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk works from the command line, but only for 5 columns; and it does not work from Perl with the q# quoting.
All the combinations of exec and eval have also been tested, without success.
I have also tried passing each of the awk components to system as separate arguments, separated by commas, but did not find any valid way to pass the redirection operator (>): with the list form of system no shell is involved, so > reaches awk as a literal argument instead of performing redirection.
Using another approach, I noticed that the standalone zeroes seemed to be "swallowed" by the Text::CSV module; thus, I got rid of it and went back to a traditional line-by-line loop over the CSV with a split on commas, preserving the zeroes that way. However, I then found the "mystery" of isdual in Perl, and because of the limited set of modules I have, I cannot use Data::Dumper. I also explored the guts of Perl scalars and tried $x ^ $x, which was deprecated in version 5.22 but is valid up to that version (as I said, mine is 5.6). This is useful for telling numbers from strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) did not return TRUE when $x = 0. [UPDATE: I tried this in a devoted Perl script, written just for this purpose, and it works. I believe my probably wrong conclusion ("not returning TRUE") was reached before I realized that Text::CSV was swallowing my zeroes. Doing new tests...]
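For illustration, a minimal sketch of that probe (my own variable names): xor-ing a scalar with itself yields the number 0 when the value was last used as a number, and a string of NUL bytes when it has only ever been used as a string, so the result is true only for non-empty strings. Values freshly read from a file start life as strings, including a standalone "0".
my $num  = 123;    # used as a number: $num ^ $num is the number 0
my $str  = "abc";  # only a string:    $str ^ $str is "\0\0\0"
my $zero = "0";    # e.g. read from the CSV file
print "num looks numeric\n"  unless $num ^ $num;    # 0 is false
print "str is a string\n"    if $str ^ $str;        # "\0\0\0" is true
print "zero is a string\n"   if $zero ^ $zero;      # "\0" is also true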
I will very much appreciate your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database, which is handed over to me and which I pick up programmatically from a folder. Dynamic means that it might have any number of tables, any number of columns in each table, any names as column headers, and any number of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So, I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. The first row is the header.
b) I call "standalone zeroes" those "clean" zeroes which are coming in the CSV file, from second row onwards, between commas, like 0, (if the case is the first position in the row) or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row: 0,J5,alfa,0, which in this particular case are "words" or "strings": in this case, 4 names (note that two of them are zeroes, which are required to be treated as strings). Thus we have an example with 4 name columns (Alfa,Alfa1,Beta,Gamma are the headers for those columns, but only in this scenario). From that point onwards in the second row you can see floating point (*.00) numbers and, among them, 2 zeroes, which are numbers. Finally, in the third row, you can read M1,0,X888,ZZ, which are the names for the first 4 columns. Note, please, that the 4th column in the second row has 0 as its name, while the 4th column in the third row has ZZ as its name.
Summary: as a general picture, I have a table-report divided into 2 parts, from left to right: 4 columns for names, and 8 columns for numbers.
The first M columns are always names and the last N columns are always numbers.
- It is unknown what M is: how many columns devoted to words / strings I will receive.
- It is unknown what N is: how many columns devoted to numbers I will receive.
- It is KNOWN that the N columns always start right after the M columns end, and that this split is constant for all the rows (see the sketch below).
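Once M is known, a plain split-based loop can route each cell to Excel as text or as a number. The following is only a sketch under the constraints above: $csv_fh, $worksheet, and $m are illustrative names ($m being the number of leading name columns, detected elsewhere, for instance by testing which cells of one data row look numeric); write_string and write_number are the Excel::Writer::XLSX worksheet methods, and the CSV is assumed to have no quoted fields containing commas.
my $row = 0;
while (my $line = <$csv_fh>) {
    chomp $line;
    my @cells = split /,/, $line, -1;    # -1 keeps trailing empty fields
    for my $col (0 .. $#cells) {
        if ($row == 0 or $col < $m) {
            # header row and the first M columns are names:
            # write_string() keeps a standalone 0 as the text "0"
            $worksheet->write_string($row, $col, $cells[$col]);
        } else {
            # the remaining N columns are numbers, zeroes included
            $worksheet->write_number($row, $col, $cells[$col]);
        }
    }
    $row++;
}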

I have done some quick research on Perl regex boundaries ( \b ), and I have not found any relevant information about whether or not they apply in Perl 5.6.
However, since you are using an old Perl version, try the traditional UNIX / Linux style (I mean, what Perl inherits from the shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the work, making the change at the start of each line in your CSV file, if it matches.
Or, maybe better (if you have those "standalone" zeroes and want to avoid any unwanted change to some "leading zeroes" string):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added a comma after the zero and, of course, after the string.]
Note that the first regex should work; the second one is just a safeguard, to be cautious.
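As an aside, the reason the original \b version fails from system() may simply be quoting: inside a double-quoted Perl string, \b is interpolated as a backspace character before the shell ever sees it. Escaping the backslashes, or using single-quote-style q{} quoting, should let the word-boundary regex through unchanged (a sketch, not tested on 5.6 itself):
system("perl -i -ple 's/\\b0\\b/string/g' myfile.csv");   # \\b reaches the shell as \b
system(q{perl -i -ple 's/\b0\b/string/g' myfile.csv});    # q{} does not interpolate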

Related

Is there a way to match strings:numbers with variable positioning within the string?

We are using a simple curl call to get metrics via an API. The problem is that the output is fixed in the number of fields, but not in their position within the output.
We need to do this with a "simple" regex since the tool only accepts this.
/"name":"(.*)".*?"memory":(\d+).*?"consumer_utilisation":(\w+|\d+).*?"messages_unacknowledged":(\d+).*?"messages_ready":(\d+).*?"messages":(\d+)/s
It works fine for:
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
However if the output order is changed, then it doesn't match any more:
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
I need a relative definition of the strings to match, since I never know at which position they will appear. In total there are 9 different queue-metric groups.
The simple option is to use a regex for each key-value pair instead of one large regex.
/"name":"((?:[^\\"]|\\.)*)"/
/"memory":(\d+)/
The other option is not a regex at all, but it might be sufficient. Instead of using a regex, you could simply transform the response before reading it. Since you say "We are using a simple curl", I'm guessing you're talking about the curl command-line tool. You could pipe the result into a short Perl command.
perl -ne 'use JSON; use Text::CSV qw(csv); $hash = decode_json $_; csv (sep_char=> ";", out => *STDOUT, in => [[$hash->{name}, $hash->{memory}, $hash->{consumer_utilisation}, $hash->{messages_unacknowledged}, $hash->{messages_ready}, $hash->{messages}]]);'
This will keep the order the same, making it easier to use a regex to read out the data.
input
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
output
queue1;89048;;0;0;0
queue2;21944;;0;0;0
queue3;21944;;0;0;0
For this to work you need Perl and the modules JSON and Text::CSV installed. On my system they are provided by the packages perl, libjson-perl and libtext-csv-perl.
Note: I'm currently using ; as the separator. If it appears inside one of the values, that value will be surrounded by double quotes: "name":"que;ue1" => "que;ue1";89048;;0;0;0. If the value includes both a ; and a ", the " will be escaped by placing another one before it: "name":"q\"ue;ue1" => "q""ue;ue1";89048;;0;0;0

How to improve regexp based replace on cells to operate per line instead of needing to loop through each cell?

If you need to clean content in the "cells" of a TSV (tab-separated values) file, where the cleaning operation needs to include removing certain characters at the beginning and end of each column per line, as well as, in some cases, characters in the middle of the content, then this approach works (ubuntu/linux):
(the example below uses awk, but the goal is any tool/utility that can be run from a bash script: sed, tr, core bash functionality, etc.)
awk 'BEGIN {FS="\t"; OFS="\t"}
{
  for (i = 1; i <= NF; i++) {
    gsub(/^[[:space:]]+|[[:space:]]+$|[[:cntrl:]]+|(\\|\/)+$/, "", $i)
    gsub(/(\+$|^\+)/, "", $i)
    # and so on (could add more replacement conditions here)
  }
  print $0
}' ${file} > ${file}.cleaned
However, when these files are very large it can take a long time to process, because each line is broken up into cells and each cell is processed separately. I want to see if the processing time for the same cleanup operations can be improved.
Can this be rewritten so that, instead of breaking up each line into columns and doing the replacement per cell, a regular expression does the same cleanup operations (more could be added later to the above example) on all columns between the column separators (tabs) for the entire line at once?
The approach should assume that the files to be cleaned have a dynamic/varying number of columns (i.e. it's not just 2 or 3 columns; it could be 10 or more), so building a fixed regular expression based on a known number of columns isn't an option. However, the script could build the regular expression dynamically, based on first determining how many columns there are, and then execute the result. I am curious, though, whether a statically constructed regular expression could be built that does this for the entire line, one that knows how to remove certain characters between the content and the tabs of each cell without removing the tabs themselves.
Update, to clarify: in the above example there are some characters that need to be removed anywhere in the cell content (defined as the text between the tabs) and others that need to be removed only at the beginning or end of the content (i.e. [[:space:]]), and some of the characters to replace would include a tab (i.e. [[:cntrl:]] and [[:space:]]). This update is to clear up any suggestion I may have made that all removals happen only at the beginning or end of the cell content.
Sed has a rather more cryptic syntax than Awk, but for simple substitutions where field splitting is not required, it's not that cryptic. Certainly the regex syntax itself is the hardest part of it, and that does not change appreciably among the various regex-based tools. For example,
sed \
-e 's,^[ \r\v\f]\+\|\([ \r\v\f]\+\|[\\/]+\)$,,g' \
-e 's,\([ \r\v\f]*\|[\\/]*\)\t[ \r\v\f]*,\t,g' \
-e 's,[\x00-\x08\x0a-\x1f\x7f]\+,,g' \
-e 's,^+\|+$,,g' \
-e 's,+\?\t+\?,\t,g' \
${file} > ${file}.cleaned
That performs the same transformation that your Awk script does (for real now, I think). There are some catches, however, chief among them:
because you get no field splitting, you have to match not only at the beginning and end of each input, but also around the field delimiters;
yet you need to watch out for matching the field delimiter elsewhere (the first version of this answer did not do an adequate job of that, because the tab is included in both [[:space:]] and [[:cntrl:]]);
you will want to perform at least some of your substitutions with the g (global) flag to replace all matches instead of just the first. In Awk, that's the difference between gsub and sub. In Sed and several other languages it's the difference between supplying the 'g' flag or not.
You could of course package up exactly the same thing with Awk, Perl, etc. without much substantive change. Whether any variation is a bona fide improvement over your original per-field Awk approach is a matter of style preference and performance testing.
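For example, an equivalent whole-line pass in Perl might look like this sketch (untested; -l chomps the newline so the control-character class cannot eat it, and \x0B stands in for \v; ${file} as in your original):
perl -lpe '
    s/^[ \r\x0B\f]+|(?:[ \r\x0B\f]+|[\\\/]+)$//g;    # trim at line ends
    s/(?:[ \r\x0B\f]*|[\\\/]*)\t[ \r\x0B\f]*/\t/g;   # trim around each tab
    s/[\x00-\x08\x0A-\x1F\x7F]+//g;                  # control chars except tab
    s/^\+|\+$//g;                                    # leading/trailing plus signs
    s/\+?\t\+?/\t/g;                                 # plus signs around tabs
' ${file} > ${file}.cleaned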

Bash: Regex for SVN Conflicts

So I'm trying to write a regex to use with a grep command on svn status output. I want only files with conflicts to be displayed and, if it's a tree conflict, the extra information SVN provides about it (which is on a line with a > character).
So, here's my description of how SVN outputs lines with conflicts, and then I'll show my regex:
[Single Char Code][Spaces][Letter "C"][Space]Filename
[Spaces][Letter "C"][Space]Filename
[Letter "C"][Space]Filename
This is what I have so far in trying to get the proper regex. The second part, after the OR condition, works fine to get the tree-conflict extra line. It's the first part, where I'm trying to match lines with the letter C under very specific conditions, that is giving me trouble.
Anyway, I'm not exactly the greatest with Regex, so some help here (plus an explanation of what I'm doing wrong, so I can learn from this) would be great.
CONFLICTS=($(svn status | grep "^(.)*C\s\|>"))
Thanks.
This regex should match your lines:
CONFLICTS=$(svn status | grep '^[ADMRCXI?!~ ]\? *C')
^[ADMRCXI?!~ ]\? : zero or one (\?) status character from the set [ADMRCXI?!~ ] at the start of the line
 * : zero or more spaces
C : the character C
I removed the extra parentheses surrounding the command substitution.
You have to read the description of svn st output more closely and try to get at least one tree conflict.
I'll start it for you:
> The first seven columns in the output are each one character wide:
>...
> Seventh column: Whether the item is the victim of a tree conflict
>...
> 'C' tree-Conflicted
and note: theoretically any of these 7 columns can be non-empty
status for a tree conflict:
M       wc/bar.c
!     C wc/qaz.c
      >   local missing, incoming edit upon update
D       wc/qax.c
Dirty lazy draft of regexp
^[enumerate_all_chars_here]{6}C\s
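Made concrete (a sketch, using . instead of enumerating every legal column character, and also keeping the > detail lines), that draft could be run as a Perl filter:
svn status | perl -ne 'print if /^.{6}C\s/ or /^ +>/'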

Using Sed or Script to Inline Edit Values in Data Files With Variable Spacing

I have a number of scripts that replace variables separated by white space.
e.g.
sed -i 's/old/new/g' filename.conf
But say I have
#NAME  Weight   Age  Name
Boss   160.000  43   BOB
The data below is more readable if it stays within the current alignment, so to speak. So if I'm writing a new double, I'd like to overwrite only the width of each field.
My questions are:
1. How do I capture the patterns between values to preserve spaces?
2. Does sed feature a way to force a shell variable say ${FOOBAR} to be a certain width?
3a. If so how do I define this replace field width?
3b. If not what program in Linux is best suited for this truncation assuming I use a mix of number and string data?
EDIT 1
Let me give a couple more examples.
Let's say my file is:
#informative info on this config var.
VAR1    131      comment  second_comment
#more informative info
VAR2    3.4      13132    yet_another_comment
#FOO THE VALUE WARNING
Foo     5.6      donteditthis_comment
#BAR ANOTHER VALUE WARNING
Bar     6.5      donteditthis_comment
#Yet another informative comment
VAR3    321
in my bash script I have:
#!/bin/bash
#Vars -- real script will have vals in arrays as
#multiple identically tagged config files will be altered
FOO='Foo'
BAR='Bar'
FOO_VAL_NEW='33.3333'
BAR_VAL_NEW='22.1111'
FILENAME='file.conf'
#Define sed patterns
#These could be inline, but are defined here for readability as they're long.
FOO_MATCH=${FOO}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
FOO_REPLACE=${FOO}<...whatever special characters to output captured whitespace...>${FOO_VAL_NEW}
BAR_MATCH=${BAR}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
BAR_REPLACE=${BAR}<...whatever special characters to output captured whitespace...>${BAR_VAL_NEW}
#Do the inline edit ... will be in a loop to handle multiple
#identically tagged config files in full-fledged script.
sed -i "s/${FOO_MATCH}/${FOO_REPLACE}/g" ${FILENAME}
sed -i "s/${BAR_MATCH}/${BAR_REPLACE}/g" ${FILENAME}
My expected output is:
#informative info on this config var.
VAR1    131      comment  second_comment
#more informative info
VAR2    3.4      13132    yet_another_comment
#FOO THE VALUE WARNING
Foo     33.3333  donteditthis_comment
#BAR ANOTHER VALUE WARNING
Bar     22.1111  donteditthis_comment
#Yet another informative comment
VAR3    321
Currently my script works... but there are a couple of annoyances/dangers.
PROBLEM 1
Currently, to match the tag, I include the exact whitespace characters after it. E.g., for the given example I would define
FOO='Foo '
...as I'm unsure how to capture whitespace characters and then output them in the replace field.
This is nice for me, as I know I'm going to keep the spacing up to the first field the same, to maintain readability. But if one of my users (this is for a public project) writes their own file like this:
#FOO THE VALUE WARNING
Foo 22.0
Now my script is broken for them. I need to capture the whitespace characters in my match pattern and then output them in my replace pattern. That way it will play nicely with my file (optimally spaced for readability), but if someone wants to muck things up and not space things nicely, it will still work for them as well, preserving their current spacing.
PROBLEM 2
Okay, so we've read a tag and, for the replace, injected the same amount of space after it, based on what we found with a regex in the match.
But now I need to replace fields within the string.
Currently my script does this. However, the result isn't the clean style I show above in my desired output. For the above script, for example, I'd get:
#informative info on this config var.
VAR1    131      comment  second_comment
#more informative info
VAR2    3.4      13132    yet_another_comment
#FOO THE VALUE WARNING
Foo     33.3333      donteditthis_comment
#BAR ANOTHER VALUE WARNING
Bar     22.1111      donteditthis_comment
#Yet another informative comment
VAR3    321
Well, the values are right, but all that work for readability is ruined... argghhh. Now, if I opened the file in emacs and pressed the insert key, I would be able to arrow over to the '3' in the Foo-tagged value, start typing the new value, and get the output file I listed as desired. I want my sed in-place edit to do the same thing... (Maybe, as Kent showed, this is possible with column?)
I want it to overwrite only on the trailing end. Further, I want the next field (let's say I do end up editing the warning) to start at the same column it started at in the old file.
Put more simply, I want a variant of sed -i "s/${MATCH}/${REPLACE}/g" ${FILENAME} that writes replacement values into a tagged line, starting at the same column that entry occupies in the CURRENT version of the config file.
This requires both preserving the spaces and somehow writing only on the trailing end, padding the output so that the next entry stays in the same starting column if my new value's string is shorter than the old one.
To improve upon my current solution it is crucial both to maintain the starting column of each piece of data in a tagged entry and to be able to match a tag with an arbitrary amount of trailing whitespace (which must be preserved). These are trivial operations in a text editor (see the emacs example above) with the help of the insert key, but more complicated in a script.
This way:
1. I make sure the values can be written no matter how other users space their file.
2. If users (like myself) do bother to match the fields column-wise to the comment above to improve readability, then the script won't mess this up, as it only writes on the trailing side.
Let me know if this is unclear at all.
If this can't be done, or is overly onerous with sed alone, I'd be open to an efficient Perl or Python subscript that my bash script would call, although obviously an inline solution (if concise and understandable) is preferable, if possible.
column may help you; see the example below, if this is what you are looking for:
kent$ cat f
#NAME  Weight   Age  Name
Boss   160.000  43   BOB
kent$ sed 's/160.000/7.0/' f|column -t
#NAME  Weight  Age  Name
Boss   7.0     43   BOB
kent$ sed 's/160.000/7.7777777777/' f|column -t
#NAME  Weight        Age  Name
Boss   7.7777777777  43   BOB
Using one of your sample datasets, you can get
$ doit Weight 160 7.555555 <<\EOD
#NAME  Weight   Age  Name
Boss   160.000  43   BOB
Me     180      25   JAKE
EOD
#NAME  Weight    Age  Name
Boss   7.555555  43   BOB
Me     180       25   JAKE
$
with this function:
$ doit ()
{
    awk -v tag=$1 -v old=$2 -v new=$3 '
        NR==1 { for (i=0; i++ < NF;) field[$i] = i }  # read column headers
        $field[tag] == old { $field[tag] = new }
        { print }
    ' | column -t
}
the useful part being the loading of the column headers into the field name->column map. With tag being "Weight", field[tag] evaluates to 2 for this input, so $field[tag] is $2, i.e. the second field, the Weight column.
To answer your questions as asked:
My questions are:
How do I capture the patterns between values to preserve spaces?
Because of what Kent pointed out, it's probably best to regenerate spacing that is correct for the new data. If you really must preserve the exact input spacing wherever possible, even though that forces lines with replacement values out of alignment for some values, I'd say ask that again as a separate "no, really, help me here" question.
Does sed feature a way to force a shell variable say ${FOOBAR} to be a certain width?
sed is Turing-complete, but that's as close as it gets to such a feature. Sardonic humor aside, the only correct answer here is "no".
3b. If not what program in Linux is best suited for this truncation assuming I use a mix of number and string data?
Kent got that one. I didn't know about column; I get questions answered here that I didn't even know to ask. For the value location and substitution, awk should do you just fine.
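That said, the "write only on the trailing end" idea itself is easy to sketch in Perl: capture the old value and its trailing spaces, then pad the new value to that same total width. The following is only an illustration, hard-coding the Foo / 5.6 / 33.3333 example from the question; it keeps the following field at its starting column as long as the new value still fits within the old slot:
perl -i -pe '
    s{ (Foo[ \t]+) (5\.6) ([ \t]+) }
     { $1 . sprintf("%-*s", length($2) + length($3), "33.3333") }xe
' file.conf
The captured groups give the slot width (old value plus trailing spaces), and sprintf("%-*s", ...) left-pads the new value to exactly that width, so users' own spacing is preserved whatever it is.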

Is there a c++ library that reads named columns from files?

I regularly deal with files that look like this (for compatibility with R):
# comments
# more comments
col1 col2 col3
1 a hi
2 b there
. . .
Very often, I will want to read col2 into a vector or other container. It's not hard to write a function that parses this kind of file, but I would be surprised if there were no well-tested library to do it for me. Does such a library exist? (As I say, it's not hard to roll your own, but as I am not a C++ expert, it would be some trouble for me to use the templates that would allow an arbitrary container to hold arbitrary data types.)
EDIT:
I know the name of the column I want, but not what order the columns in this particular file will be in. Columns are separated by an unknown amount of white space, which may be tabs or spaces (probably not both). The first entry on each line may or may not be preceded by white space; sometimes that will change within one file, e.g.
number letter
8 g
9 h
10 i
Boost's split may do what you want, provided you can consistently split on whitespace.
I am not aware of any C++ library that will do this. A simple solution, however, would be to use the Linux cut utility. You would have to remove the comments first, which is easily done with sed:
sed -e '/^#/d' <your_file>
Then you could apply the following command which would select just the text from the third column:
cut -d' ' -f3 <your_file>
You could combine those together with a pipe to make it a single command:
sed -e '/^#/d' <your_file> | cut -d' ' -f3
You could run this command programmatically, then simply append each line of its output to an STL container.
// sketch: run the sed|cut pipeline via popen() and append each output
// line to an STL container (needs <cstdio>, <cstring>, <string>, <vector>)
std::vector<std::string> column;
FILE* p = popen("sed -e '/^#/d' <your_file> | cut -d' ' -f3", "r");
char buf[4096];
while (fgets(buf, sizeof buf, p))
    column.emplace_back(buf, strcspn(buf, "\n"));  // strip trailing newline
pclose(p);
For how to actually run cut from within code, see this answer.