awk: splitting with a regex

awk: splitting with a regex - regex

I'm trying to parse lines with fields separated by "|" and space padding. I thought it would be as simple as this:
$ echo "1 a | 2 b | 3 c " | awk -F' *| *' '{ print "-->" $2 "<--" }'
However, what I get is
-->a<--
instead of the expected
-->2 b<--
I'm using GNU Awk 4.0.1.

When you use ' *| *', awkinterprets it as space OR space. Hence the output you get is correct one. If you need to have | as a delimiter, just escape it.
$ echo "1 a | 2 b | 3 c " | awk -F' *\\| *' '{ print "-->" $2 "<--" }'
-->2 b<--
Notice that you have to escape it twice, since in awk, \| is considered | as well which will again get interpreted as logical OR.
Because of this, it is very popular to escape such special characters in character class [].
$ echo "1 a | 2 b | 3 c " | awk -F' *[|] *' '{ print "-->" $2 "<--" }'
-->2 b<--

echo "1 a | 2 b | 3 c " | awk -F '|' '{print $2}' | tr -d ' '
produces "2 b" for me

Related

Parse default Salt highstate output

I'm trying to parse the highstate output of Salt has proven to be difficult. Without changing the output to json due to the fact that I still want it to be human legible.
What's the best way to convert the Summary into something machine readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea I'm trying to grep and awk the output and insert it into a csv.
These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails but works in Reger
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
EDIT1: #vintnes #ikegami I agree I'd much rather take the json output parse the output but Salt doesn't offer a summary of changes when outputting to josn. So far this is what I have and while very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0

Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
BEGIN { print "instance,unchanged,changed,warning,failed" }
/^Summary/ { instance=$NF }
/^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
/^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" Succeeded: ### and using any number of non-digits as a separator.
edit: Printed on /^Fail/ instead of using /^Summ/ and END.

perl -e'
use strict;
use warnings qw( all );
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged change warning failed )]);
my ( $instance, $unchanged, $changed, $warning, $failed );
while (<>) {
if (/^Summary for (\S+)/) {
( $instance, $unchanged, $changed, $warning, $failed ) = $1;
}
elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
( $unchanged, $changed ) = ( $1, $2 );
}
elsif (/^Warning:\s+(\d+)/) {
$warning = $1;
}
elsif (/^Failed:\s+(\d+)/) {
$failed = $1;
$csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
}
}
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
Terse version:
perl -MText::CSV_XS -ne'
BEGIN {
$csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged change warning failed )]);
}
/^Summary for (\S+)/ and #row=$1;
/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and #row[1,2]=($1,$2);
/^Warning:\s+(\d+)/ and $row[3]=$1;
/^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \#row);
'

Improving answer from #vintnes.
Producing output as tab separated CSV
Write awk script that reads values from lines by their order.
Print each record as it is read.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
run script
Tab separated CSV output
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma separated CSV output
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the heading CSV line
FNR%8 == 1 {arr[1] = $3}
Extract the arr[1] value from 3rd field in (first line from 8 lines)
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract the arr[2,3,4] values from 2nd,3rd,4th fields in (third line from 8 lines)
FNR%8 == 4 {arr[5] = $2;}
Extract the arr[5] value from 2nd field in (4th line from 8 lines)
FNR%8 == 6 {arr[6] = $4;}
Extract the arr[6] value from 4th field in (6th line from 8 lines)
FNR%8 == 7 {arr[7] = $4;
Extract the arr[7] value from 4th field in (7th line from 8 lines)
print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
print the array elements for the extracted variable at the completion of reading 7th line from 8 lines.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract numbers from text field.

Dynamic pattern for matching incorrect characters in egrep

I have the next lines in files:
UserParameter=cassandra.status[*], curl -s "http://$1:$2/server-status?auto" | grep -e $3 | awk '{ print $$2 }'
UserParameter=ping.status[*],curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
and so on.
I want to display that line where comma after [*] is missed or there are any extra characters besides comma.
For example:
UserParameter=ping.status[*],,,curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
UserParameter=ping.status[*] curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
UserParameter=ping.status[*],;!curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
will be printed as long as there are extra characters and spaces besides single comma.
But:
UserParameter=ping.status[*],curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
will not be printed as long as there is single comma after [*].
I was trying to develop a pattern for egrep, but it doesn't fit for all cases where for example besides comma any other character which follows after [*]:
egrep (\[\*\].(|;|:|,|\.|))
I'll appreciate any help! Thank you!

grep -vE '\[\*\],[$/[:alpha:] ]' input
Do not print lines that match the pattern: [*], followed by any of: $, /, alphabetic character, or a space.

Parse IP and Download-Total from mikrotik

I wanna extract IP and download-total from mikrotik command /queue simple print stat
Here's some example :
0 name="101" target=192.168.10.101/32 rate=0bps/0bps total-rate=0bps
packet-rate=0/0 total-packet-rate=0 queued-bytes=0/0
total-queued-bytes=0 queued-packets=0/0 total-queued-packets=0
bytes=17574842/389197663 total-bytes=0 packets=191226/308561
total-packets=0 dropped=9/5899 total-dropped=0
1 name="102" target=192.168.10.102/32 rate=0bps/0bps total-rate=0bps
packet-rate=0/0 total-packet-rate=0 queued-bytes=0/0
total-queued-bytes=0 queued-packets=0/0 total-queued-packets=0
bytes=65593392/183786457 total-bytes=0 packets=163260/166022
total-packets=0 dropped=175/2403 total-dropped=0
2 name="103" target=192.168.10.103/32 rate=0bps/0bps total-rate=0bps
packet-rate=0/0 total-packet-rate=0 queued-bytes=0/0
total-queued-bytes=0 queued-packets=0/0 total-queued-packets=0
bytes=3263234/67407044 total-bytes=0 packets=41437/52602
total-packets=0 dropped=0/546 total-dropped=0
All that I need is :
192.168.10.101 389197663
192.168.10.102 183786457
192.168.10.103 67407044
But I get
target=192.168.10.101/32
bytes=17574842/389197663
target=192.168.10.102/32
bytes=65593392/183786457
target=192.168.10.103/32
bytes=3263234/67407044
I try it with grep -oP 'target=.*?\ |[^\-]bytes=.*?\ ' | sed 's/^ //g'.
So, how can I parse it? Sorry for bad english..

Just continue your line of parsing with another pipes (most easy way i think)
grep -oP 'target=.*?\ |[^\-]bytes=.*?\ ' file | sed 's/^ //g' | sed -r 's/target=([^/]*)[/].*/\1/; s/bytes=[^/]*[/]//' | sed 'N; s/\n/ /'
output
192.168.10.101 389197663
192.168.10.102 183786457
192.168.10.103 67407044

sed '/^[0-9]\{1,\}[[:blank:]]\{1,\}name/,/^[[:blank:]]*$/ {
/^[0-9]/{
s#.*target=\([^/]*\).*#\1#;h;d
}
\#^[[:blank:]]*bytes=[0-9]*/\([0-9]*\).*# !d
s//\1/
G
s/\(.*\)\n\(.*\)/\2 \1/p
}
d
' YourFile
A bit long but do the job in 1 sed
awk '{
if ( $3 ~ /target=/ ) split( $3, aIP, "[=/]")
if ( $1 ~ /^[[:blank:]]*bytes=[0-9]*/ ) {
split( $1, aByt, "/")
print aIP[2] " " aByt[2]
}
}' YourFile
same in awk
if always same exact structure
awk 'BEGIN{ RS="" }
{ split( $3, aIP, "[=/]"); split( $12, aByt, "/")
print aIP[2] " " aByt[2]
}' YourFile

Tokenize and capture with sed

Suppose we have a string like
"dir1|file1|dir2|file2"
and would like to turn it into
"-f dir1/file1 -f dir2/file2"
Is there an elegant way to do this with sed or awk for a general case of n > 2?
My attempt was to try
echo "dir1|file1|dir2|file2" | sed 's/\(\([^|]\)|\)*/-f \2\/\4 -f \6\/\8/'

An awk solution:
awk -F'|' '{ for (i=1;i<=NF;i+=2) printf "-f %s/%s%s", $i, $(i+1), ((i==NF-1) ? "\n" : " ") }' \
<<<"dir1|file1|dir2|file2"
-F'|' splits the input into fields by |
for (i=1;i<=NF;i+=2) loops over the field indices in increments of 2
printf "-f %s/%s%s", $i, $(i+1), ((i==NF-1) ? "\n" : " ") prints pairs of consecutive fields joined with / and prefixed with -f<space>
((i==NF-1) ? "\n" : " ") terminates each field-pair either with a space, if more fields follow, or a \n to terminate the overall output.
In a comment, the OP suggests a shorter variation, which may be of interest if you don't need/want the output to be \n-terminated:
awk -F'|' '{ for (i=1;i<=NF;++i) printf "%s", (i%2 ? " -f " $i : "/" $i ) }' \
<<<"dir1|file1|dir2|file2"

This might work for you (GNU sed):
sed 's/\([^|]*\)|\([^|]*\)|\?/-f \1\/\2 /g;s/ $//' file
This will work for dir1|file1|dir2|file2|dirn|filen type strings
The regexp forms two back references (\1,\2 used in the replacement part of the substitution command s/pattern/replacement/), the first is all non-|'s, then a |, the second is all non-|'s then an optional | i.e. for the first application of the substitution (N.B. the g flag is implemented and so the substitutions may be multiple) dir1 becomes \1 and file1 becomes \2. All that remains is to prepend -f and replace the first | by / and the second | by a space. The last space is not needed at the end of the line and is removed in the second substitution command.

$ awk -v RS='|' 'NR%2{p=$0;next} {printf " -f %s/%s", p, $0}' <<< 'dir1|file1|dir2|file2'
-f dir1/file1 -f dir2/file2

A gnu-awk solution:
s="dir1|file1|dir2|file2"
awk 'BEGIN{ FPAT="[^|]+\\|[^|]+" } {
for (i=1; i<=NF; i++) {
sub(/\|/, "/", $i);
if (i>1)
printf " ";
printf "-f " $i
};
print ""
}' <<< "$s"
-f dir1/file1 -f dir2/file2
FPAT is used for grabbing dir1|file2 into single field.

find integer from Nth field in awk

I've got a script producing output from Twitter's streaming API into a format like this
semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728
Where field 3 is the actual tweet.
What I want to do was to grab the integer from that field and insert it into a database as a separate field/column.
To just insert those fields is not a problem, but getting the INT and handling it separately is. Could I enforce usage and split the field after the INT?
Sorry about not including expexted output. Basically i'm constructing a mysql insert like
"... insert into report values ("semmelracet_dev", 450587667, "1 semla till idag! #semmelreport", 1, 569866960802062336, 1424701845728)"
Any ideas?
EDIT again, or if it's something that's not doable, maybe keep all the columns and in field 3 just keep the int when inserting them into the database?
EDIT 2
Tried the solution from jeanrjc below with mixed success
cat tweetReport.txt | awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if
(s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\",
"int_val", "$4", "$5}')
-bash: syntax error near unexpected token `)'
I then removed the trailing ) and got
cat tweetReport.txt | awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if
(s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\",
"int_val", "$4", "$5}'
awk: warning: escape sequence `\|' treated as plain `|'
"semmelracet_dev ", 450587667 , " 1 semla till idag! #semmelreport ", 1,
569866960802062336 , 1424701845728 "",, "", 1, ,
Which is better, but with some jibberish i don't quite understand..

I'm not sure I fully understand what you want, but I guessed that you wanted to extract (or get rid of) the int value of the 3rd field, is that right ?
To do so:
awk -F"|" '{print $3}' file | awk '{for (i=1; i<=NF; i++) if ($i + 0 == $i) print $i}'
where ($i + 0 == $i) tests whether this word is an int or not, then print it.
I hope that from that, you'll manage to get what you want. Precise your expected output otherwise.
EDIT : To obtain desired output:
$ cat tweet.txt
semmelracet_dev | 999999999 | 2 foo bar! #fooreport | 999996696080209999 | 1429999845728
semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728
$ awk -F"\|" '{n=split($3,s," "); for (i=1;i<=n;i++) if (s[i] + 0 == s[i]) int_val = s[i]}{print "\""$1"\","$2", \""$3"\", "int_val", "$4", "$5}' tweet.txt
"semmelracet_dev ", 999999999 , " 2 foo bar! #fooreport ", 2, 999996696080209999 , 1429999845728
"semmelracet_dev ", 450587667 , " 1 semla till idag! #semmelreport ", 1, 569866960802062336 , 1424701845728
Which you can capture in a variable and then pass it to construct your mysql insert.
HTH

I'm using a bashism to feed data to awk, you can use something else:
$ t="semmelracet_dev | 450587667 | 1 semla till idag! #semmelreport | 569866960802062336 | 1424701845728"
$ awk -F'|' '{n=$3;sub(/^ */,"",n);sub(/ .*/,"",n);print n;}' <<<"$t"
1
This simply does a couple of substitutions to "trim" data around the pipe, then remove anything after the first space.
If you want help inserting this number into a database, you'll have to be a bit more explicit about what tools you're using. For example, this might work:
$ n=$(awk -F'|' '{n=$3;sub(/^ */,"",n);sub(/ .*/,"",n);print n;}' <<<"$t")
$ psql -c $(printf 'INSERT INTO table (n) VALUES (%d);' "$n")
Or if you'd prefer to get these data from a log file and pipe thing through psql, you could do it this way:
awk -F'|' -vfmt="INSERT INTO table (n) VALUES (%d);" '
{
n=$3; sub(/^ */,"",n); sub(/ .*/,"",n);
printf(fmt,n);
}' input.txt \
| psql

awk 'BEGIN{FS="|";} {print($3);}' | sed -r 's/([0-9]+)(.*)/\1/'

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

awk: splitting with a regex - regex

I'm trying to parse lines with fields separated by "|" and space padding. I thought it would be as simple as this: $ echo "1 a | 2 b | 3 c " | awk -F' | ' '{ print "-->" $2 "<--" }' However, what I get is -->a<-- instead of the expected -->2 b<-- I'm using GNU Awk 4.0.1.

echo "1 a | 2 b | 3 c " | awk -F '|' '{print $2}' | tr -d ' ' produces "2 b" for me

Related

Parse default Salt highstate output

Dynamic pattern for matching incorrect characters in egrep

Parse IP and Download-Total from mikrotik

Tokenize and capture with sed

find integer from Nth field in awk

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

awk: splitting with a regex - regex

I'm trying to parse lines with fields separated by "|" and space padding. I thought it would be as simple as this: $ echo "1 a | 2 b | 3 c " | awk -F' *| *' '{ print "-->" $2 "<--" }' However, what I get is -->a<-- instead of the expected -->2 b<-- I'm using GNU Awk 4.0.1.

echo "1 a | 2 b | 3 c " | awk -F '|' '{print $2}' | tr -d ' ' produces "2 b" for me

Related

Parse default Salt highstate output

Dynamic pattern for matching incorrect characters in egrep

Parse IP and Download-Total from mikrotik

Tokenize and capture with sed

find integer from Nth field in awk

Categories

Resources

I'm trying to parse lines with fields separated by "|" and space padding. I thought it would be as simple as this: $ echo "1 a | 2 b | 3 c " | awk -F' | ' '{ print "-->" $2 "<--" }' However, what I get is -->a<-- instead of the expected -->2 b<-- I'm using GNU Awk 4.0.1.