Parsing the highstate output of Salt has proven to be difficult. I don't want to change the output to JSON, because I still want it to be human-legible.
What's the best way to convert the Summary into something machine readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea, I'm trying to grep and awk the output and insert it into a CSV.
These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails, even though the pattern works in a regex tester:
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
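If the lookbehind is what's failing, one workaround (a sketch, assuming the Succeeded line always reads "(unchanged=N, changed=M)") is to anchor on the comma and use \K to drop the prefix from the match:
grep -oP ', changed=\K[0-9]+' "${_FILE}" | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;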
EDIT1: @vintnes @ikegami I agree; I'd much rather parse the JSON output, but Salt doesn't offer a summary of changes when outputting to JSON. So far this is what I have, and while it's very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0
Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
BEGIN { print "instance,unchanged,changed,warning,failed" }
/^Summary/ { instance=$NF }
/^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
/^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" (Succeeded: ###) and using any run of non-digits as a separator.
edit: Printed on /^Fail/ instead of using /^Summ/ and END.
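A quick way to see the split in action on a sample line (S[2] is unchanged, S[3] is changed; S[4] would hold a warning count if one were present):
$ echo 'Succeeded: 278 (unchanged=12, changed=6)' | awk '{ split($3 $4 $5, S, /[^0-9]+/); print S[2], S[3] }'
12 6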
perl -e'
use strict;
use warnings qw( all );
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
my ( $instance, $unchanged, $changed, $warning, $failed );
while (<>) {
if (/^Summary for (\S+)/) {
( $instance, $unchanged, $changed, $warning, $failed ) = $1;
}
elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
( $unchanged, $changed ) = ( $1, $2 );
}
elsif (/^Warning:\s+(\d+)/) {
$warning = $1;
}
elsif (/^Failed:\s+(\d+)/) {
$failed = $1;
$csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
}
}
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
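For example, if the program above is saved to salt2csv.pl (a hypothetical name) and the highstate output is in highstate.txt:
perl salt2csv.pl highstate.txt > /tmp/highstate.csv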
Terse version:
perl -MText::CSV_XS -ne'
BEGIN {
$csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
}
/^Summary for (\S+)/ and @row=$1;
/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and @row[1,2]=($1,$2);
/^Warning:\s+(\d+)/ and $row[3]=$1;
/^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \@row);
'
Improving on the answer from @vintnes.
Producing output as tab- or comma-separated CSV.
The awk script reads values from each line by its position within the 8-line block.
Each record is printed as soon as it has been read.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Run the script.
Tab-separated CSV output:
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma-separated CSV output:
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the CSV heading line.
FNR%8 == 1 {arr[1] = $3}
Extract arr[1] (the computer name) from the 3rd field of the 1st line in each 8-line block.
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract arr[2], arr[3], and arr[4] from the 2nd, 3rd, and 4th fields of the 3rd line in each block.
FNR%8 == 4 {arr[5] = $2;}
Extract arr[5] from the 2nd field of the 4th line in each block.
FNR%8 == 6 {arr[6] = $4;}
Extract arr[6] from the 4th field of the 6th line in each block.
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
Extract arr[7] from the 4th field of the 7th line, then print all the extracted values to complete the record.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract a number from a text field (uses gawk's three-argument match()).
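If your summary blocks are not always exactly eight lines (for example, when warnings appear), a content-matching variant may be more robust. A sketch, assuming gawk for the three-argument match():
awk -v OFS="," '
BEGIN { print "computer","succeeded","unchanged","changed","failed","states run","run time" }
/^Summary for/       { host = $3 }
/^Succeeded:/        { succeeded = $2
                       match($0, /\(unchanged=([0-9]+), changed=([0-9]+)\)/, m)
                       unchanged = m[1] + 0; changed = m[2] + 0 }
/^Failed:/           { failed = $2 }
/^Total states run:/ { states = $4 }
/^Total run time:/   { print host, succeeded, unchanged, changed, failed, states, $4 }
' input-1.txt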
I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and produce a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, which works fine:
#!/bin/bash
> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
if [[ "$line" =~ $regExp ]]
then
printf "${BASH_REMATCH[1]},${BASH_REMATCH[2]}\n" >> output
fi
done <<< "`gawk -F , '!/^$/ {print $2}' $1 | sort | uniq -c`"
My question is:
Is there a better and simpler way to do the job?
In particular, I don't know how to fix this:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespace and, if so, the second call to gawk will truncate the string.
Nor do I know how to print all the fields from 2 to NF while keeping the delimiter, which can occur several times in succession.
Thanks very much.
EDIT:
As requested, here is some sample data (it's an exercise, so apologies for the contrived content):
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
A one-liner in awk:
awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:
awk -F, '{ x[$2]++ } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't itself contain a comma.
You can change your final awk to:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
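As for the other part of the question, printing fields 2 through NF while keeping the delimiter: one sketch, reusing the comma delimiter and the miocsv.csv name from the question (consecutive delimiters yield empty fields, which are preserved):
awk -F, '{ out = $2; for (i = 3; i <= NF; i++) out = out FS $i; print out }' miocsv.csv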
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column.
The @F autosplit array starts at index $F[0], while awk fields start with $1.
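A quick illustration of the indexing:
$ echo 'a,b,c' | perl -F, -lane 'print $F[1]'
b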
I've got a script producing output from Twitter's streaming API in a format like this:
semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728
Where field 3 is the actual tweet.
What I want to do is grab the integer from that field and insert it into a database as a separate field/column.
Inserting those fields as-is is not a problem, but extracting the INT and handling it separately is. Can I somehow split the field after the INT?
Sorry about not including the expected output. Basically I'm constructing a MySQL insert like:
"... insert into report values ("semmelracet_dev", 450587667, "1 semla till idag! #semmelreport", 1, 569866960802062336, 1424701845728)"
Any ideas?
EDIT again: or, if that's not doable, maybe keep all the columns and, in field 3, keep just the int when inserting them into the database?
EDIT 2
I tried the solution from jeanrjc below, with mixed success:
cat tweetReport.txt | awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if
(s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\",
"int_val", "$4", "$5}')
-bash: syntax error near unexpected token `)'
I then removed the trailing ) and got
cat tweetReport.txt | awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if
(s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\",
"int_val", "$4", "$5}'
awk: warning: escape sequence `\|' treated as plain `|'
"semmelracet_dev ", 450587667 , " 1 semla till idag! #semmelreport ", 1,
569866960802062336 , 1424701845728 "",, "", 1, ,
Which is better, but with some gibberish I don't quite understand.
I'm not sure I fully understand what you want, but my guess is that you want to extract (or get rid of) the int value in the 3rd field; is that right?
To do so:
awk -F"|" '{print $3}' file | awk '{for (i=1; i<=NF; i++) if ($i + 0 == $i) print $i}'
where ($i + 0 == $i) tests whether the word is an int, and prints it if so.
I hope that from this you'll manage to get what you want. Otherwise, please specify your expected output.
EDIT: To obtain the desired output:
$ cat tweet.txt
semmelracet_dev | 999999999 | 2 foo bar! #fooreport | 999996696080209999 | 1429999845728
semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728
$ awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}' tweet.txt
"semmelracet_dev ", 999999999 , " 2 foo bar! #fooreport ", 2, 999996696080209999 , 1429999845728
"semmelracet_dev ", 450587667 , " 1 semla till idag! #semmelreport ", 1, 569866960802062336 , 1424701845728
You can capture this in a variable and then use it to construct your MySQL insert.
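For example, a rough sketch of that capture-and-insert step, assuming the mysql command-line client (mydb is a placeholder database name; the table "report" comes from the question):
while IFS= read -r row; do
  # Note: real tweets can contain quotes, so a parameterized insert would be safer.
  mysql mydb -e "INSERT INTO report VALUES ($row);"
done < <(awk -F'|' '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}' tweet.txt)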
HTH
I'm using a bashism to feed data to awk, you can use something else:
$ t="semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728"
$ awk -F'|' '{n=$3;sub(/^ */,"",n);sub(/ .*/,"",n);print n;}' <<<"$t"
1
This simply does a couple of substitutions to "trim" the data around the pipe, then removes anything after the first space.
If you want help inserting this number into a database, you'll have to be a bit more explicit about what tools you're using. For example, this might work:
$ n=$(awk -F'|' '{n=$3;sub(/^ */,"",n);sub(/ .*/,"",n);print n;}' <<<"$t")
$ psql -c "$(printf 'INSERT INTO table (n) VALUES (%d);' "$n")"
Or if you'd prefer to get these data from a log file and pipe things through psql, you could do it this way:
awk -F'|' -vfmt="INSERT INTO table (n) VALUES (%d);" '
{
n=$3; sub(/^ */,"",n); sub(/ .*/,"",n);
printf(fmt,n);
}' input.txt \
| psql
awk 'BEGIN{FS="|";} {print($3);}' | sed -r 's/[^0-9]*([0-9]+).*/\1/'
I'm trying to parse lines with fields separated by "|" and space padding. I thought it would be as simple as this:
$ echo "1 a | 2 b | 3 c " | awk -F' *| *' '{ print "-->" $2 "<--" }'
However, what I get is
-->a<--
instead of the expected
-->2 b<--
I'm using GNU Awk 4.0.1.
When you use ' *| *', awk interprets it as space OR space. Hence the output you get is the correct one. If you need | as a delimiter, just escape it.
$ echo "1 a | 2 b | 3 c " | awk -F' *\\| *' '{ print "-->" $2 "<--" }'
-->2 b<--
Notice that you have to escape it twice, since in awk \| is also treated as plain |, which again gets interpreted as logical OR.
Because of this, it is very common to escape such special characters inside a character class [].
$ echo "1 a | 2 b | 3 c " | awk -F' *[|] *' '{ print "-->" $2 "<--" }'
-->2 b<--
echo "1 a | 2 b | 3 c " | awk -F '|' '{print $2}' | tr -d ' '
produces "2 b" for me
I am trying to count how many times each term from an input list (one term per line) matches a data file, and to create an output file containing each matched term with its count, returning zero where there is no match.
Input list:
+ 5S_rRNA
+ 7SK
+ AC001
+ AC000111.3
+ AC000111.6
The data.txt file:
chr10 101780038 101780209 5S_rRNA
chr10 103578280 103578430 5S_rRNA
chr10 112327234 112327297 5S_rRNA
chr10 120766459 120766601 7SK
chr10 127408228 127408317 7SK
chr10 127511874 127512063 AADAC
chr10 14614140 14614294 AC000111.3
I would like to create an output file containing all the unmatched and matched terms with the corresponding counts, looking like this:
+ 5S_rRNA 3
+ 7SK 2
+ AC001 0
+ AADAC 1
+ AC000111.3 1
+ AC000111.6 0
I can create an output file containing the matched terms and their counts, but I don't know how to return a zero when there isn't a match, nor how to print all of the output to a separate file.
This is the code I have used to report the matched terms (thanks perreal and Mark Setchell):
#!/bin/bash
while read line
do
line=${line##+ } # Strip off leading + and space
n=$(grep "$line" data.txt 2> /dev/null | wc -l)
if [ $n -gt 0 ]; then
echo $line
echo $n
fi
done < input_list.txt > output.txt
and
cut -d' ' -f2 input.txt | grep -o -f - data.txt | sort | uniq -c | \
sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2\t\1/' > output.txt
Any suggestions would be great. Thanks
Harriet
You can use this simple loop with grep -c:
while read l; do echo -n "+ $l "; grep -c "$l" file1; done < inputs
+ 5S_rRNA 3
+ 7SK 2
+ AC001 0
+ AC000111.3 1
+ AC000111.6 0
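If your list lines literally begin with "+ " (as in the question), a small variation strips the prefix first so it isn't printed twice; a sketch, assuming the data file is named data.txt:
while read -r _ l; do echo -n "+ $l "; grep -c "$l" data.txt; done < input_list.txt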
cut -d' ' -f2 input.txt | grep -o -f - data.txt | sort | uniq -c | \
sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2 \1/' | \
join -a 1 -e 0 -j 2 input.txt - -o '1.2 2.3' | \
sed 's/ /\t/;s/^/+ /'
When working with tab-, whitespace-, or similarly delimited files, think awk. Perhaps this is what you're looking for. I have used a ternary operator, but you could use if / else statements if you find them easier to read.
awk 'FNR==NR { a[$4]++; next } { print "+", $2, ($2 in a ? a[$2] : 0) }' data.txt inputlist.txt
Results:
+ 5S_rRNA 3
+ 7SK 2
+ AC001 0
+ AC000111.3 1
+ AC000111.6 0
$2 in a ? a[$2] : 0 means: if column two is in the array (called a), return the value for that key; otherwise, return zero. HTH.
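For comparison, the same logic with an explicit if / else:
awk 'FNR==NR { a[$4]++; next } { if ($2 in a) c = a[$2]; else c = 0; print "+", $2, c }' data.txt inputlist.txt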