Extracting columns from previous and next line by pattern matching

Extracting columns from previous and next line by pattern matching - regex

I have the following file : extract_info.txt
ABC
PNG
CHNS
and to_extractfrom.txt from which I need to retrieve information:
ABC 123 234 TCHSL
NBV 234 23764 DHG
CHNS 123 347 CGJKS
CVS 233 4747 JSHGD
PNG 122 324 HGH
SJDH 373 3487 JHG
and I am running the following code
while read line
do
gene=$(echo $line | awk -F' ' '{print $1}')
app1=$(awk -v comp1="$gene" '(comp1==$1) {print $1 }' to_extractfrom.txt)
done < extract_info.txt
However my desired output is to extract the information for the column in extract_info.txt from the file to_extractfrom.txt such that I get the first column of the previous line on the right and next line on the left of the pattern matched line i.e for the columns in the first file, I will have the output as :
NBV ABC -
SJDH PNG CVS
CVS CHNS NBV

awk '
BEGIN {prev = "-"}
NR == FNR {extract[$1] = 1; next}
is_match {print $1, m1, m2; is_match = 0}
$1 in extract {is_match = 1; m1 = $1; m2 = prev}
{prev = $1}
' extract_info.txt to_extractfrom.txt
NBV ABC -
CVS CHNS NBV
SJDH PNG CVS
If you must have the output in the same order as the extract_info file, and you use GNU awk, you can do
gawk '
BEGIN {prev = "-"}
NR == FNR {extract[$1] = FNR; next}
is_match {output[m1] = $1 FS m1 FS m2; is_match = 0}
$1 in extract {is_match = 1; m1 = $1; m2 = prev}
{prev = $1}
END {
PROCINFO["sorted_in"] = "#val_num_asc"
for (key in extract) print output[key]
}
' extract_info.txt to_extractfrom.txt
NBV ABC -
SJDH PNG CVS
CVS CHNS NBV

Related

How to print a line with no field separator in awk?

I have data like this (file is called list-in.dat)
a ; b ; c ; i
d
e ; f ; a ; b
g ; h ; i
and I want a list like this (output file list-out.dat) with all items, in alphabetically order (case insensitive) and each unique item only once.
a
b
c
d
e
f
g
h
i
My attempt is:
awk -F " ; " ' BEGIN { OFS="\n" ; } {for(i=0; i<=NF; i++) print $i} ' file-in.dat | uniq | sort -uf > file-out.dat
But I end up with all antries except those lines which has only one item:
a
b
c
e
f
g
h
i
How can I get all (unique, sorted) items no matter how many items are in one line / if the field separator is missing?

Using gnu-awk:
awk -F '[[:blank:]]*;[[:blank:]]*' '{
for (i=1; i<=NF; i++) uniq[$i]
}
END {
PROCINFO["sorted_in"]="#ind_str_asc"
for (i in uniq)
print i
}' file
a
b
c
d
e
f
g
h
i
For non-gnu awk use:
awk -F '[[:blank:]]*;[[:blank:]]*' '{for (i=1; i<=NF; i++) uniq[$i]}
END{for (i in uniq) print i}' file | sort

awk -F' ; ' -v OFS='\n' '{$1=$1} 1' ip.txt | sort -fu
-F' ; ' sets space followed by ; followed by space as field separator
-v OFS='\n' sets newline as output field separator
{$1=$1} change $0 as per new OFS
1 print $0
sort -fu sort uniquely ignoring case in alphabetic order

Could you please try following, awk + sort solution, written and tested with shown samples. In case you want to use ignorecase then add IGNORECASE=1 in awk code.
awk '
BEGIN{
FS=" ; "
}
{
for(i=1;i<=NF;i++){
if(!a[$i]++){ print $i }
}
}
' Input_file | sort
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=" ; " ##Setting field separator as space semi-colon space here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop till NF here for each line.
if(!a[$i]++){ print $i } ##Checking condition if current field is NOT present in array a then printing that field value here.
}
}
' Input_file | sort ##Mentioning Input_file name here and passing it to sort as Input to sort the data.

Parse default Salt highstate output

I'm trying to parse the highstate output of Salt has proven to be difficult. Without changing the output to json due to the fact that I still want it to be human legible.
What's the best way to convert the Summary into something machine readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea I'm trying to grep and awk the output and insert it into a csv.
These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails but works in Reger
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
EDIT1: #vintnes #ikegami I agree I'd much rather take the json output parse the output but Salt doesn't offer a summary of changes when outputting to josn. So far this is what I have and while very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0

Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
BEGIN { print "instance,unchanged,changed,warning,failed" }
/^Summary/ { instance=$NF }
/^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
/^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" Succeeded: ### and using any number of non-digits as a separator.
edit: Printed on /^Fail/ instead of using /^Summ/ and END.

perl -e'
use strict;
use warnings qw( all );
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged change warning failed )]);
my ( $instance, $unchanged, $changed, $warning, $failed );
while (<>) {
if (/^Summary for (\S+)/) {
( $instance, $unchanged, $changed, $warning, $failed ) = $1;
}
elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
( $unchanged, $changed ) = ( $1, $2 );
}
elsif (/^Warning:\s+(\d+)/) {
$warning = $1;
}
elsif (/^Failed:\s+(\d+)/) {
$failed = $1;
$csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
}
}
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
Terse version:
perl -MText::CSV_XS -ne'
BEGIN {
$csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged change warning failed )]);
}
/^Summary for (\S+)/ and #row=$1;
/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and #row[1,2]=($1,$2);
/^Warning:\s+(\d+)/ and $row[3]=$1;
/^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \#row);
'

Improving answer from #vintnes.
Producing output as tab separated CSV
Write awk script that reads values from lines by their order.
Print each record as it is read.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
run script
Tab separated CSV output
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma separated CSV output
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the heading CSV line
FNR%8 == 1 {arr[1] = $3}
Extract the arr[1] value from 3rd field in (first line from 8 lines)
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract the arr[2,3,4] values from 2nd,3rd,4th fields in (third line from 8 lines)
FNR%8 == 4 {arr[5] = $2;}
Extract the arr[5] value from 2nd field in (4th line from 8 lines)
FNR%8 == 6 {arr[6] = $4;}
Extract the arr[6] value from 4th field in (6th line from 8 lines)
FNR%8 == 7 {arr[7] = $4;
Extract the arr[7] value from 4th field in (7th line from 8 lines)
print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
print the array elements for the extracted variable at the completion of reading 7th line from 8 lines.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract numbers from text field.

Print only '+' or '-' if string matches (two files)

I would like to print only a '+' o '-' symbols if string is found or not. Basically, I have two files:
Input file 1 (tab-delimited):
HPNK_00457
HPNK_00458
HPNK_00459
Input file 2 (tab-delimited):
HPNK_00457 AAA50325 1e-43 437 28 43 83 ATP-binding protein.
HPNK_00458 P25256 8e-43 429 28 43 82 RecName: Full=Tylosin resistance ATP-binding protein tlrC.
HPNK_00458 CAM96590 1e-42 429 27 42 87 ABC transporter ATP-binding protein [Streptomyces ambofaciens].
Desired output (tab-delimited, maintaining order of strings in file 1):
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
This is what I've been using up to now, but need to update:
while read vl; do grep "^$vl " file2 || printf -- "- -\n" ; done < file1
Thanks, trying to learn everyday here.

Here's one way using awk:
awk 'FNR==NR { a[$1]; next } { print $1, ($1 in a ? "+" : "-" ) }' file2 file1
Results:
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -

You can use:
while read -r line
do
grep -q "$line" f2 && echo "$line +" || echo "$line -"
done < f1
As grep -q just returns true if it has matched something, in that case we print the file name + + otherwise, we print the file name + -.
It returns:
$ while read -r line; do grep -q "$line" f2 && echo "$line +" || echo "$line -"; done < f1
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -

perl -lane'
BEGIN{ $, ="\t"; $x=shift; #h{ map /(\S+)/, <> } =(); #ARGV=$x }
print #F, exists $h{$F[0]} ? "+" : "-";
' file1 file2
output
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -

Here's the algorithm:
Read file 2. For each line,
Get the first word
Store it in a hash.
Read file 1. For each line, chomp it, then
print $hash{$_}? '+' : '-'
I can write the code for you but if you want to learn everyday, it will be a useful exercise if you want to write it yourself.

This simple Perl script should do the work
#!/usr/local/bin/perl
## f1 and f2 are the 2 files containing your input data
open FILE1, f1;
open FILE2, f2;
#file1data = <FILE1>;
#file2data = <FILE2>;
my $row = 0;
foreach $data (#file1data) {
chomp($data);
if (grep (/$data/,$file2data[$row]) ) {
print $data . " " . "+\n";
}
else {
print $data . " " . "-\n";
}
$row++;
}

awk 'FNR==NR
{a[$1];next}
{b[$1]}
END{
for(i in a)
if(b[i]){print i,"+"}
else{print i,"-"}
}' file1 file2

Command line to match lines with matching first field (sed, awk, etc.)

What is fast and succinct way to match lines from a text file with a matching first field.
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output, alternative:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx 0.5 GB.
There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.

Here's a method where you only have to remember the previous line (therefore requires the input file to be sorted)
awk -F \| '
$1 == prev_key {print prev_line; matches ++}
$1 != prev_key {
if (matches) print prev_line
matches = 0
prev_key = $1
}
{prev_line = $0}
END { if (matches) print $0 }
' filename
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Alternate output
awk -F \| '
$1 == prev_key {
if (matches == 0) printf "%s", $1
printf "%s%s", FS, prev_value
matches ++
}
$1 != prev_key {
if (matches) printf "%s%s\n", FS, prev_value
matches = 0
prev_key = $1
}
{prev_value = $2}
END {if (matches) printf "%s%s\n", FS, $2}
' filename
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

For fixed width fields you can used uniq:
$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
If you don't have fixed width fields here are two awk solution:
awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

Using awk:
awk -F '|' '!($1 in a){a[$1]=$2; next} $1 in a{b[$1]=b[$1] FS a[$1] FS $2}
END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor

This might work for you (GNU sed):
sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' /file
This reads 2 lines into the pattern space then checks to see if the keys in both lines are the same. If so it removes the second key and repeats. If not it checks to see if more than two fields exist in the first line and if so prints it out and then deletes it otherwise it just deletes the first line.

$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

replace nth occurrence of character in a file using awk regardless of the line

I am trying to replace the nth occurrence of a character or string regardless of the line using awk.
So if our data was this
|||||||
||||||
|||||
|||
and we were trying to replace | with A
then the output should look like this, assuming we want to replace every 3rd occurance
||A||A|
|A||A|
|A||A
||A
The current awk command I am using is this
awk '/|/{c++;if(c==3){sub(/|/,"A");c=0}}1' test.data
and it wrongly outputs this
|||||||
||||||
A||||
|||
also the data can look like this
|||xfsafrwe|||asfasdf|
|safasf|||asfasdf||
||asfasf|||
|||
and the result of course is this
||Axfsafrwe||Aasfasdf|
|safasfA||asfasdfA|
|Aasfasf||A
||A
Thanks

With GNU awk:
awk '{
for (i = 0; ++i <= NF;)
++c % n || $i = v
}1' OFS= FS= n=3 v=A infile
Adjusted after OP clarification:
awk '{
for (i = 0; ++i <=NF;)
if ($i == o)
++C % c || $i = n
} 1' FS= OFS= c=3 o=\| n=A infile

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extracting columns from previous and next line by pattern matching - regex

Related

How to print a line with no field separator in awk?

Parse default Salt highstate output

Print only '+' or '-' if string matches (two files)

Command line to match lines with matching first field (sed, awk, etc.)

replace nth occurrence of character in a file using awk regardless of the line

Categories

Resources