find integer from Nth field in awk - regex

I've got a script producing output from Twitter's streaming API into a format like this
semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728
Where field 3 is the actual tweet.
What I want to do is grab the integer from that field and insert it into the database as a separate field/column.
Just inserting those fields is not a problem, but extracting the INT and handling it separately is. Could I somehow split the field after the INT?
Sorry about not including the expected output. Basically I'm constructing a MySQL insert like
"... insert into report values ("semmelracet_dev", 450587667, "1 semla till idag! #semmelreport", 1, 569866960802062336, 1424701845728)"
Any ideas?
EDIT again: or if that's not doable, maybe keep all the columns and, for field 3, keep just the int when inserting them into the database?
EDIT 2
Tried the solution from jeanrjc below, with mixed success:
cat tweetReport.txt | awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}')
-bash: syntax error near unexpected token `)'
I then removed the trailing ) and got
cat tweetReport.txt | awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}'
awk: warning: escape sequence `\|' treated as plain `|'
"semmelracet_dev ", 450587667 , " 1 semla till idag! #semmelreport ", 1, 569866960802062336 , 1424701845728
"",, "", 1, ,
Which is better, but with some gibberish I don't quite understand.
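The trailing "",, "", 1, , line is exactly what the print block emits for a blank input line (all fields empty, plus the int_val kept over from the previous record), so it most likely comes from an empty line at the end of tweetReport.txt; the warning is because | needs no escaping when passed with -F. A sketch that guards against both, assuming the same file name:
awk -F'|' 'NF {n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]; print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}' tweetReport.txt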

I'm not sure I fully understand what you want, but I guessed that you wanted to extract (or get rid of) the int value of the 3rd field, is that right?
To do so:
awk -F"|" '{print $3}' file | awk '{for (i=1; i<=NF; i++) if ($i + 0 == $i) print $i}'
where ($i + 0 == $i) tests whether the word is an int or not, and if so prints it.
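For example, running the test on the tweet field alone:
echo "1 semla till idag! #semmelreport" | awk '{for (i=1; i<=NF; i++) if ($i + 0 == $i) print $i}'
1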
I hope that from that, you'll manage to get what you want. Otherwise, please specify your expected output.
EDIT : To obtain desired output:
$ cat tweet.txt
semmelracet_dev | 999999999 | 2 foo bar! #fooreport | 999996696080209999 | 1429999845728
semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728
$ awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}' tweet.txt
"semmelracet_dev ", 999999999 , " 2 foo bar! #fooreport ", 2, 999996696080209999 , 1429999845728
"semmelracet_dev ", 450587667 , " 1 semla till idag! #semmelreport ", 1, 569866960802062336 , 1424701845728
Which you can capture in a variable and then pass it to construct your mysql insert.
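For instance, a minimal sketch of that last step (the database name mydb is made up here; the report table name comes from your example, and mysql -e is the stock client option):
while IFS= read -r vals; do
    mysql -e "insert into report values ($vals);" mydb
done < <(awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}' tweet.txt)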
HTH

I'm using a bashism (a here-string) to feed data to awk; you can use something else:
$ t="semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728"
$ awk -F'|' '{n=$3;sub(/^ */,"",n);sub(/ .*/,"",n);print n;}' <<<"$t"
1
This simply does a couple of substitutions to "trim" the data between the pipes, then removes anything after the first space.
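Step by step on the sample line, the intermediate values look like this:
n = $3                # " 1 semla till idag! #semmelreport "
sub(/^ */, "", n)     # n is now "1 semla till idag! #semmelreport "
sub(/ .*/, "", n)     # n is now "1"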
If you want help inserting this number into a database, you'll have to be a bit more explicit about what tools you're using. For example, this might work:
$ n=$(awk -F'|' '{n=$3;sub(/^ */,"",n);sub(/ .*/,"",n);print n;}' <<<"$t")
$ psql -c "$(printf 'INSERT INTO table (n) VALUES (%d);' "$n")"
Or if you'd prefer to get these data from a log file and pipe things through psql, you could do it this way:
awk -F'|' -vfmt="INSERT INTO table (n) VALUES (%d);" '
{
n=$3; sub(/^ */,"",n); sub(/ .*/,"",n);
printf(fmt,n);
}' input.txt \
| psql

awk 'BEGIN{FS="|";} {print($3);}' | sed -r 's/([0-9]+)(.*)/\1/'

Related

Awk if-statement to count the number of characters (wc -m) coming from a pipe

I've been scratching my head around this issue and couldn't understand what is wrong with my one-liner below.
Given that
echo "5" | wc -m
2
and that
echo "55" | wc -m
3
I tried to add a zero in front of all single-digit numbers with an awk if-statement as follows:
echo "5" | awk '{ if ( wc -m $0 -eq 2 ) print 0$1 ; else print $1 }'
05
which is "correct", however with 2 digits numbers I get the same zero in front.
echo "55" | awk '{ if ( wc -m $0 -eq 2 ) print 0$1 ; else print $1 }'
055
How come? I assumed this was going to return only 55 instead of 055. I now understand I'm constructing the if-statement wrong.
What, then, is the right way (if one exists) to ask awk to evaluate whether whatever comes from the pipe has 2 characters, as one would do with wc -m?
I'm not interested in the optimal way to add leading zeros in the command line (there are enough duplicates of that).
Thanks!
I suggest using printf:
printf "%02d\n" "$(echo 55 | wc -m)"
03
printf "%02d\n" "$(echo 123456789 | wc -m)"
10
Note: printf is available as a bash builtin. It mainly follows the conventions of the C function printf(). Check:
help printf # For the bash builtin in particular
man 3 printf # For the C function
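Note also that wc -m counts the trailing newline that echo adds, which is why a single digit already counts as 2 characters:
printf '5' | wc -m
1
echo '5' | wc -m
2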
Facts:
In AWK strings or variables are concatenated just by placing them side by side.
For example: awk '{b="v" ; print "a" b}'
In AWK undefined variables are equal to an empty string or 0.
For example: awk '{print a "b", -a}'
In AWK non-zero strings are true inside if.
For example: awk '{ if ("a") print 1 }'
wc -m $0 -eq 2 is parsed as follows (the - operator has higher precedence than string concatenation):
( wc - m ) ( $0 - eq ) 2
wc - m : undefined variables wc and m are both converted to integer 0, so the subtraction gives 0 - 0 = 0, converted to string "0"
$0 - eq : the input line is string "5", converted to integer 5; undefined variable eq is converted to integer 0, so the subtraction gives 5 - 0 = 5, converted to string "5"
2 : integer value 2, converted to string "2"
The three parts are then concatenated as strings, resulting in string "052".
The result of wc -m $0 -eq 2 is the string "052" (see awk '{ print wc -m $0 -eq 2 }' <<<'5'). Because that string is not empty, the if condition is always true.
It should return only 55 instead of 055
No, it should not.
Am I constructing the if statement wrong?
No, the if statement has valid AWK syntax. Your expectations to how it works do not match how it really works.
To actually make it work (not that you would want to):
echo 5 | awk '
{
cmd = "echo " $1 " | wc -m"
cmd | getline len
if (len == 2)
print "0"$1
else
print $1
}'
But why do that, when you can use this instead:
echo 5 | awk 'length($1) == 1 { $1 = "0"$1 } 1'
Or even simpler with the various printf solutions seen in the other answers.
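For instance, a printf-based version in awk itself (a sketch; %02d assumes the input is numeric):
echo 5 | awk '{printf "%02d\n", $1}'
05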

grep for a particular string and count the number of fatals and errors

I have a file called violations.txt as below:
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing|present | Esp | Status (*) | The runs are present | Normal|Higher level
I need the output like this:
violations.txt:
Fatal:
Bgn : 1
Gnp : 1
Total number of fatals : 2
Errors:
Nbgn : 1
Esp : 1
Total number of errors : 2
I want to treat a row as a fatal if column3 of violations.txt contains the word Status (!), and as an error if it contains the word Status (*), and also get the count of each.
I tried the code below but am not getting the expected output:
#!/bin/bash
pwd
echo " " ;
File="violations.txt"
for g in $File;
do
awk -F' +\\| +'
if "$3"== "Status (!) /" "$File" ; then
'BEGIN{ getline; getline }
truncate -s -1 "$File"
echo "$g:";
{ a[$2]++ }
END{ for(i in a){ print i, a[i]; s=s+a[i] };
print "Total numer of fatals:", s}' violations.txt
else
echo "$g:";
'BEGIN{ getline; getline }
truncate -s -1 "$File"
echo "$g:";
{ a[$2]++ }
END{ for(i in a){ print i, a[i]; s=s+a[i] };
print "Total numer of errors:", s}' violations.txt
fi
done
Haven't we already covered this in a somewhat different reincarnation?
$ cat tst.awk
BEGIN {
FS="[[:blank:]][|][[:blank:]]"
OFS=" : "
}
FNR>1{
gsub(/[[:blank:]]/, "", $2)
gsub(/[[:blank:]]/, "", $3)
a[$3][$2]++
}
END {
#PROCINFO["sorted_in"]="#ind_str_desc"
print "Out" OFS
for(i in a) {
print ($i~/*/?"Fatal":"Error") OFS
t=0
for(j in a[i]) {
print "\t" j, a[i][j]
t+=a[i][j]
}
print "Total", t
t=0
}
}
running awk -f tst.awk myFile results in:
Out :
Fatal :
Gnp : 1
Bgn : 1
Total : 2
Error :
Esp : 1
Nbgn : 1
Total : 2
Could you please try the following, written and tested with the shown samples at
https://ideone.com/rsVIV4
awk '
BEGIN{
FS="\\|"
}
FNR==1{ next }
/Status \(\!\)/{
match($0,/\| +[a-zA-Z]+ +\| Status/)
val=substr($0,RSTART,RLENGTH)
gsub(/\| +| +\| Status/,"",val)
countEr[val]++
val=""
}
/Status \(\*\)/{
match($0,/\| +[a-zA-Z]+ +\| Status/)
val=substr($0,RSTART,RLENGTH)
gsub(/\| +| +\| Status/,"",val)
countSu[val]++
val=""
}
END{
print "Fatal:"
for(i in countEr){
print "\t"i,countEr[i]
sumEr+=countEr[i]
}
print "Total number of fatal:" sumEr
print "Errors:"
for(i in countSu){
print "\t"i,countSu[i]
sumSu+=countSu[i]
}
print "Total number of errors:"sumSu
}
' Input_file
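With the shown samples this should print the following (the order of entries within each group may vary, since for(i in array) iterates in an unspecified order):
Fatal:
        Bgn 1
        Gnp 1
Total number of fatal:2
Errors:
        Nbgn 1
        Esp 1
Total number of errors:2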
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS="\\|" ##Setting field separator as | for all lines here.
}
FNR==1{ next } ##Checking condition if FNR==1 then go next and do not do anything on this line.
/Status \(\!\)/{ ##Checking condition if line contains Status (!) then do following.
match($0,/\| +[a-zA-Z]+ +\| Status/) ##Using match function to match pipe space letters space and | space and Status string here.
val=substr($0,RSTART,RLENGTH) ##Creating sub-string from current line here.
gsub(/\| +| +\| Status/,"",val) ##Globally substituting pipe space and Status keyword with NULL in val here.
countEr[val]++ ##Creating array countEr with index of val and increment its count with 1 here.
val="" ##Nullifying val here.
}
/Status \(\*\)/{ ##Checking condition if line contains Status (*) then do following.
match($0,/\| +[a-zA-Z]+ +\| Status/) ##Using match function to match pipe space letters space and | space and Status string here.
val=substr($0,RSTART,RLENGTH) ##Creating sub-string from current line here.
gsub(/\| +| +\| Status/,"",val) ##Globally substituting pipe space and Status keyword with NULL in val here.
countSu[val]++ ##Creating array countSu with index of val and increment its count with 1 here.
val="" ##Nullifying val here.
}
END{ ##Starting END block of this program from here.
print "Fatal:" ##Printing Fatal keyword here.
for(i in countEr){ ##Traversing through countEr here.
print "\t"i,countEr[i] ##Printing tab i and value of countEr with index i here.
sumEr+=countEr[i] ##Creating sumEr and keep adding value of countEr here.
}
print "Total number of fatal:" sumEr ##Printing string Total number of fatal/l and value of sumEr here.
print "Errors:" ##Printing the Errors: header here.
for(i in countSu){ ##Traversing through countSu here.
print "\t"i,countSu[i] ##Printing tab i and value of countSu with index i here.
sumSu+=countSu[i] ##Creating sumSu and keep adding value of countSu here.
}
print "Total number of errors:"sumSu ##Printing string Total number of errors: with value of sumSu here.
}
' Input_file ##Mentioning Input_file name here.
With GNU awk for various extensions and using the fact that your input is fixed-width fields:
$ cat tst.awk
BEGIN {
FIELDWIDTHS="24 1 11 1 15 1 27 1 *"
}
NR>1 {
type = ($5 ~ /!/ ? "Fatal" : "Error")
keyTot[type][gensub(/\s/,"","g",$3)]++
tot[type]++
}
END {
for (type in tot) {
print type ":"
for (key in keyTot[type]) {
print " " key " : " keyTot[type][key]
}
print "Total number of " type " : " tot[type]+0
}
}
$ awk -f tst.awk file
Error:
Esp : 1
Nbgn : 1
Total number of Error : 2
Fatal:
Gnp : 1
Bgn : 1
Total number of Fatal : 2
Your file looks very badly formatted from a computer's point of view; let me explain why:
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing|present | Esp | Status (*) | The runs are present | Normal|Higher level
The headers of columns 1, 3 and 4 start at the same character positions as their contents, but for columns 2 and 5 this is not the case.
You are using the pipe character "|" as a separator between your columns, but also as a separator within the columns themselves. This combination is very bad for automatic parsing based on the "|" character as a separator.
Therefore I have following proposals for improving your file:
First, let's align the first characters of the column headings with their columns:
column1                   column2     column3         column4                     Situation
Data is preesnt         | Bgn       | Status (!)    | There are no current runs | Critical level
Data is not existing    | Nbgn      | Status (*)    | There are runs            | Medium level
Data limit is exceeded  | Gnp       | Status (!)    | The runs are not present  | Higher level
Dats existing|present   | Esp       | Status (*)    | The runs are present      | Normal|Higher level
If you agree on this, you can use character counts (fixed column widths) to read your columns.
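For example, a minimal gawk sketch that reads the columns by width (widths assumed from the sample above, matching the FIELDWIDTHS answer; the * last-field width needs a recent gawk):
gawk 'BEGIN{ FIELDWIDTHS="24 1 11 1 15 1 27 1 *" } NR>1 { print $3, $5 }' violations.txt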
Second, let's change the internal separator (replace it by a slash character):
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing/present | Esp | Status (*) | The runs are present | Normal/Higher level
Do you agree with my first or second proposal? If so, please adapt your question (by adding the agreed proposal); this will make everything easier to handle.

Parse default Salt highstate output

Parsing the highstate output of Salt has proven to be difficult, without changing the output to JSON, because I still want it to be human-legible.
What's the best way to convert the Summary into something machine readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea, I'm trying to grep and awk the output and insert it into a CSV.
These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails, although it works in Reger:
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
EDIT1: @vintnes @ikegami I agree I'd much rather parse the JSON output, but Salt doesn't offer a summary of changes when outputting to JSON. So far this is what I have, and while very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0
Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
BEGIN { print "instance,unchanged,changed,warning,failed" }
/^Summary/ { instance=$NF }
/^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
/^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" Succeeded: ### and using any number of non-digits as a separator.
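You can see the split in isolation like this:
$ echo "Succeeded: 278 (unchanged=12, changed=6)" | awk '{ split($3 $4 $5, S, /[^0-9]+/); print S[2], S[3] }'
12 6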
edit: Printed on /^Fail/ instead of using /^Summ/ and END.
perl -e'
use strict;
use warnings qw( all );
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
my ( $instance, $unchanged, $changed, $warning, $failed );
while (<>) {
if (/^Summary for (\S+)/) {
( $instance, $unchanged, $changed, $warning, $failed ) = $1;
}
elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
( $unchanged, $changed ) = ( $1, $2 );
}
elsif (/^Warning:\s+(\d+)/) {
$warning = $1;
}
elsif (/^Failed:\s+(\d+)/) {
$failed = $1;
$csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
}
}
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
Terse version:
perl -MText::CSV_XS -ne'
BEGIN {
$csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
}
/^Summary for (\S+)/ and @row=$1;
/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and @row[1,2]=($1,$2);
/^Warning:\s+(\d+)/ and $row[3]=$1;
/^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \@row);
'
Improving the answer from @vintnes:
Producing the output as tab-separated CSV.
Write an awk script that reads values from the lines by their order.
Print each record as it is read.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
run script
Tab separated CSV output
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma separated CSV output
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the heading CSV line
FNR%8 == 1 {arr[1] = $3}
Extract the arr[1] value from 3rd field in (first line from 8 lines)
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract the arr[2,3,4] values from 2nd,3rd,4th fields in (third line from 8 lines)
FNR%8 == 4 {arr[5] = $2;}
Extract the arr[5] value from 2nd field in (4th line from 8 lines)
FNR%8 == 6 {arr[6] = $4;}
Extract the arr[6] value from 4th field in (6th line from 8 lines)
FNR%8 == 7 {arr[7] = $4;
Extract the arr[7] value from 4th field in (7th line from 8 lines)
print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
print the array elements for the extracted variable at the completion of reading 7th line from 8 lines.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract numbers from text field.

Search strings from bulk data

I have a folder with many files containing text like the following:
blabla
chargeableDuration 00 01 03
...
timeForStartOfCharge 14 55 41
blabla
...
blabla
calledPartyNumber 123456789
blabla
...
blabla
callingPartyNumber 987654321
I require output like:
987654321 123456789 145541 000103
I have been trying with the following awk:
awk -F '[[:blank:]:=,]+' '/findstr chargeableDuration|dateForStartOfCharge|calledPartyNumber|callingPartyNumber/ && $4{
if (calledPartyNumber != "")
print dateForStartOfCharge, "NIL"
dateForStartOfCharge=$5
next
}
/calledPartyNumber/ {
for(i=1; i<=NF; i++)
if ($i ~ /calledPartyNumber/)
break
print chargeableDuration, $i
chargeableDuration=""
}' file
Cannot make it work. Please help.
Assuming your file with the text is named "test.txt", the Linux shell command below will do the work for you.
egrep -o "[0-9 ]{1,}" test.txt | tr -d ' \t\r\f' | sort -nr | tr "\n" "\t"
Pretty much like Manish's answer:
tac test_regex.txt | grep -oP '(?<=chargeableDuration|timeForStartOfCharge|calledPartyNumber|callingPartyNumber)\s+([^\n]+)' | tr -d " \t\r\f" | tr "\n" " "
The only difference is that you keep the original order (reversed by tac) instead of sorting the result. For your example both solutions produce the same output, but on other inputs you could end up with different results.
awk '/[0-9 ]+$/{
x=substr($0,( index($0," ") + 1 ) );
gsub(" ","",x);
a[$1]=x
}
END {
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ");
for (i=1;i<=4;i++){
printf a[(b[i])]" "
}
}' file
/[0-9 ]+$/ : Find lines ending with numbers, with or without spaces between them.
x=substr($0,( index($0," ") + 1 ) ) : Get the index after the first space in $0 and save the substring after it (i.e. the digits) to a variable x
gsub(" ","",x) : Remove white spaces in x
a[$1]=x : Create an array a indexed by $1 (the keyword) and assign x to it
END:
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ") : Create array b where index 1,2,3 and 4 has value of your required field in the order you need
for (i=1;i<=4;i++){
printf a[(b[i])]" "
} : for loop printing the values in array a indexed by the values of b[1], b[2], b[3] and b[4]
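Running the above against the sample file should print the four values in the required order (with a trailing space, since each printf appends one):
987654321 123456789 145541 000103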

Parsing a .csv-like file in bash

I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and produce a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, that works fine:
#!/bin/bash
> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
if [[ "$line" =~ $regExp ]]
then
printf "${BASH_REMATCH[1]},${BASH_REMATCH[2]}\n" >> output
fi
done <<< "`gawk -F , '!/^$/ {print $2}' $1 | sort | uniq -c`"
My question is:
Is there a better and simpler way to do the job?
In particular, I don't know how to fix this:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespaces and, if so, the second call on gawk will truncate the string.
Nor do I know how to print all the fields "from 2 to NF" while maintaining the delimiter, which can occur several times in succession.
Thanks very much,
Goodbye
EDIT:
As asked, here is some sample data
(it's an exercise; sorry about the contrived data)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't contain a ,
You can make your final awk:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column.
The @F autosplit array starts at index $F[0], while awk fields start with $1.
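A quick illustration of the indexing difference:
echo a,b,c | perl -F, -lane 'print $F[1]'   # prints b
echo a,b,c | awk -F, '{print $2}'           # prints b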