In awk, split values into arrays, count them, and compare - regex

I have a csv file in which column-2 has values delimited by "," and column-3 has values delimited by "|". Now I need to count the values in both columns and compare them. If both counts are equal, column-4 should print passed; if not, it should print failed. I have written the awk script below but I'm not getting what I expected
cat /tmp/test.csv
awk -F '' 'BEGIN{ OFS=";"; print "sep=;\nresource;Required_packages;Installed_packages;Validation;"};
{
column=split($2,aray,",")
columns=split($3,aray,"|")
Count=${#column[#]}
Counts=${#column[#]}
if( Counts == Count)
print $1,$2,$3,"passed"
else
print $1,$2,$3,"failed";}'/tmp/test.csv
my csv file looks like this:
resource Required_Packages Installed_packages
--------------------------------------------------
Vm1 a,b,c,d a|b|c
vm2 a,b,c,d b|a
vm3 a,b,c,d c|b|a
my expected file:
resource Required_packages Installed_packages Validation
------------------------------------------------------------------
Vm1 a,b,c,d a|b|c Failed
vm2 a,b,c,d b|a Failed
vm3 a,b,c,d c|b|a|d Passed

Your code doesn't match the input/output data (where are the dashes printed, etc.), but
this code segment
column=split($2,aray,",")
columns=split($3,aray,"|")
Count=${#column[#]}
Counts=${#column[#]}
if( Counts == Count)
print $1,$2,$3,"passed"
else
print $1,$2,$3,"failed";
can be replaced with
print $1,$2,$3,(split($2,a,",")==split($3,a,"|")?"Passed":"Failed")
Also, just checking the counts may not be enough; I think you should be checking the matches as well.
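For instance, a sketch of a count-plus-membership check; the two sample rows are assumed from the question's data:

```shell
# Sketch: compare counts AND verify every required value is installed.
printf 'Vm1 a,b,c,d a|b|c\nvm3 a,b,c,d c|b|a|d\n' |
awk '{
  n = split($2, req, ",")           # required packages
  m = split($3, inst, "|")          # installed packages
  delete seen
  for (i = 1; i <= m; i++) seen[inst[i]]
  ok = (n == m)                     # counts must match...
  for (i = 1; i <= n; i++)          # ...and each required value must be present
    if (!(req[i] in seen)) ok = 0
  print $1, $2, $3, (ok ? "Passed" : "Failed")
}'
```

This prints Failed for the first row (missing d) and Passed for the second.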

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR<=2{
  print
  next
}
{
  num=split($2,array1,",")
  num1=split($3,array2,"|")
  for(i=1;i<=num;i++){
    value[array1[i]]
  }
  for(k=1;k<=num1;k++){
    if(array2[k] in value){ count++ }
  }
  if(count==num){ $(NF+1)="Passed" }
  else          { $(NF+1)="Failed" }
  count=num=num1=""
  delete value
}
1
' Input_file | column -t
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
FNR<=2{ ##Checking condition if line number is lesser or equal to 2 then do following.
print ##Printing current line here.
next ##next will skip all further statements from here.
}
{
num=split($2,array1,",") ##Splitting 2nd field into array named array1 with field separator of comma and num will have total number of elements of array1 in it.
num1=split($3,array2,"|") ##Splitting 3rd field into array named array2 with field separator of pipe and num1 will have total number of elements of array2 in it.
for(i=1;i<=num;i++){ ##Starting a for loop from 1 to till value of num here.
value[array1[i]] ##Creating an entry in array value whose key is array1[i].
}
for(k=1;k<=num1;k++){ ##Starting a for loop from from 1 to till value of num1 here.
if(array2[k] in value){ count++ } ##Checking condition if array2 with index k is present in value then increase variable of count here.
}
if(count==num){ $(NF+1)="Passed" } ##Checking condition if count equal to num then adding Passed to new last column of current line.
else { $(NF+1)="Failed" } ##Else adding Failed into new last field of current line.
count=num=num1="" ##Nullify variables count, num and num1 here.
delete value ##Deleting array value here.
}
1 ##1 will print current line.
' Input_file | column -t ##Mentioning Input_file and passing its output to column command here.

Related

awk, skip current rule upon sanity check

How to skip current awk rule when its sanity check failed?
{
if (not_applicable) skip;
if (not_sanity_check2) skip;
if (not_sanity_check3) skip;
# the rest of the actions
}
IMHO, it's much cleaner to write code this way than,
{
if (!not_applicable) {
if (!not_sanity_check2) {
if (!not_sanity_check3) {
# the rest of the actions
}
}
}
}
1;
I need to skip the current rule because I have a catch all rule at the end.
UPDATE, the case I'm trying to solve.
There are multiple match points in a file that I want to match & alter; however, there's no other obvious sign for me to match what I want.
hmmm..., let me simplify it this way: I want to match & alter the first match, and skip the rest of the matches and print them as-is.
As far as I understood your requirement, you are looking for if, else if here. You could also use the switch case statement available in newer versions of gawk.
Let's take an example of a Input_file here:
cat Input_file
9
29
Following is the awk code here:
awk -v var="10" '{if($0<var){print "Line " FNR " is less than var"} else if($0>var){print "Line " FNR " is greater than var"}}' Input_file
This will print as follows:
Line 1 is less than var
Line 2 is greater than var
So if you look at the code carefully, it's checking:
First condition: if the current line is less than var, then the if block is executed.
Second condition, in the else if block: if the current line is greater than var, then print it there.
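For completeness, here is the same chain with an explicit else branch for the equal case, in portable awk (the sample input is assumed, extended with a line equal to var):

```shell
printf '9\n10\n29\n' | awk -v var="10" '{
  if ($0 < var)      print "Line " FNR " is less than var"
  else if ($0 > var) print "Line " FNR " is greater than var"
  else               print "Line " FNR " is equal to var"
}'
```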
I'm really not sure what you're trying to do, but if I focus on just that last sentence in your question, I want to match & alter the first match and skip the rest of the matches and print them as-is ... is this what you're trying to do?
{ s=1 }
s && /abc/ { $0="uvw"; s=0 }
s && /def/ { $0="xyz"; s=0 }
{ print }
e.g. to borrow @Ravinder's example:
$ cat Input_file
9
29
$ awk -v var='10' '
{ s=1 }
s && ($0<var) { $0="Line " FNR " is less than var"; s=0 }
s && ($0>var) { $0="Line " FNR " is greater than var"; s=0 }
{ print }
' Input_file
Line 1 is less than var
Line 2 is greater than var
I used the boolean flag variable name s for sane, since you also mentioned in your question that the conditions tested are sanity checks; each condition can then be read as "is the input sane so far, and is this next condition true?".

awk to sum values among grouped lines after specific str match and header

I've got this program in awk:
BEGIN {
FS="[>;]"
OFS=";"
}
function p(a, i)
{
for(i in a)
print ">" i, "*nr=" ln
}
/^>/ {p(out);ln=0;split("",out);next}
/[*]/ {idx=$2 OFS $3; out[idx]}
{ln++}
END {
if (ln) p(out)
}
it works on a file like this:
>Cluster 300
0 151nt, >last238708;size=1... *
>Cluster 301
0 141nt, >last103379;size=1... at -/99.29%
1 151nt, >last104482;size=1... *
>Cluster 302
0 151nt, >last104505;size=1... *
>Cluster 303
0 119nt, >last325860;size=1... at +/99.16%
1 122nt, >last106751;size=1... at +/99.18%
2 151nt, >last284418;size=1... *
3 113nt, >last8067;size=3... at -/100.00%
4 122nt, >last8102;size=3... at -/100.00%
5 135nt, >last14200;size=2... at +/99.26%
>Cluster 304
0 151nt, >last285146;size=1... *
What I need is for the program to print, for each cluster, the id (lastxxxxxx) of the line with the asterisk, and to compute the sum of all the "size=" numbers. For example, for Cluster 303 it has to output this:
>last284418;nr=11
and for Cluster 304:
>last285146;nr=1
For the moment my code is only able to count the lines and sum them, but it doesn't take the "size=" value into account.
Thanks for your help!
Could you please try the following, written and tested with the shown samples in GNU awk only.
awk '
/^>Cluster [0-9]+/{
  if(sum){
    print clus_line ORS val_line" = "sum
  }
  val_line=sum=clus_line=""
  clus_line=$0
  next
}
{
  match($0,/size=[0-9]+/)
  line=substr($0,RSTART,RLENGTH)
  sub(/.*size=/,"",line)
  sum+=line
}
/\*$/{
  match($0,/>last[^;]*/)
  val_line=substr($0,RSTART+1,RLENGTH-1)
}
END{
  if(sum){
    print clus_line ORS val_line" = "sum
  }
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^>Cluster [0-9]+/{ ##Checking condition if line starts from Cluster with digits in line then do following.
if(sum){ ##Checking if variable sum is NOT NULL then do following.
print clus_line ORS val_line" = "sum ##Printing values of clus_line ORS(new line) val_line space = space and sum here.
}
val_line=sum=clus_line="" ##Nullifying val_line, sum and clus_line here.
clus_line=$0 ##Assigning current line to clus_line here.
next ##next will skip all further statements from here.
}
{
match($0,/size=[0-9]+/) ##Using match function to match size= digits in line.
line=substr($0,RSTART,RLENGTH) ##Creating line which has sub-string for current line starts from RSTART till RLENGTH.
sub(/.*size=/,"",line) ##Substituting everything till size= keyword here with NULL in line variable.
sum+=line ##Keep on adding value of digits in line variable in sum here.
}
/\*$/{ ##Checking condition if a line ends with * then do following.
match($0,/>last[^;]*/) ##Using match function to match >last till semi-colon comes here.
val_line=substr($0,RSTART+1,RLENGTH-1) ##Creating val_line which has sub-string of current line from RSTART+1 till RLENGTH-1 here.
}
END{ ##Starting END block of this program from here.
if(sum){ ##Checking if variable sum is NOT NULL then do following.
print clus_line ORS val_line" = "sum ##Printing values of clus_line ORS(new line) val_line space = space and sum here.
}
}' Input_file ##Mentioning Input_file name here.
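The match()/substr() idiom relied on above can be tried in isolation; this sketch uses one made-up input line in the same format:

```shell
printf '0 151nt, >last238708;size=3... *\n' |
awk '{
  match($0, /size=[0-9]+/)          # sets RSTART (position) and RLENGTH (length)
  s = substr($0, RSTART, RLENGTH)   # the matched text, e.g. "size=3"
  sub(/.*size=/, "", s)             # strip the prefix, leaving just the number
  print s + 0
}'
```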

AWK: dynamically change FS or RS

I cannot seem to get the trick to interchange the FS/RS variables dynamically, so that I get the following results from the input:
Input_file
header 1
header 2
{
something should not be removed
}
50
(
auto1
{
type good;
remove not useful;
}
auto2
{
type good;
keep useful;
}
auto3
{
type moderate;
remove not useful;
}
)
Output_file
header 1
header 2
{
something that should not be removed
}
50
(
auto1//good
{
type good;//good
}
auto2//good
{
type good;//good
keep useful;
}
auto3//moderate
{
type moderate;//moderate
}
)
The key things are:
No change happens when the code block {...} is not preceded by an autoX (X can be 1, 2, 3, etc.).
The changes should happen when autoX is followed by a code block {...}.
The value inside the code block & autoX is modified with the addition of //good or //moderate, which needs to be read from the {...} itself.
The whole line should be removed from {...} if it contains the phrase remove.
HINT: It might be something that can use regex and the idea explained here, with this particular example.
For now, I have only been able to meet the last requirement, with the following code:
awk ' {$1=="{"; FS=="}";} {$1!="}"; gsub("remove",""); print NR"\t\t"$0}' Input_file
Thanks in advance, for your skill & time, to tackle this problem with awk.
Here is my attempt to solve this problem:
awk '
FNR==NR{
  if($0~/auto[0-9]+/){
    found1=1
    val=$0
    next
  }
  if(found1 && $0 ~ /{/){
    found2=1
    next
  }
  if(found1 && found2 && $0 ~ /type/){
    sub(/;/,"",$NF)
    a[val]=$NF
    next
  }
  if($0 ~ /}/){
    found1=found2=val=""
  }
  next
}
found3 && /not useful/{
  next
}
/}/{
  found3=val1=""
}
found3 && /type/{
  sub($NF,$NF"//"a[val1])
}
/auto[0-9]+/ && $0 in a{
  print $0"//"a[$0]
  found3=1
  val1=$0
  next
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
FNR==NR{ ##FNR==NR will be TRUE when first time Input_file is being read.
if($0~/auto[0-9]+/){ ##Check condition if a line is having auto string followed by digits then do following.
found1=1 ##Setting found1 to 1 which makes sure that the line with auto is FOUND to later logic.
val=$0 ##Storing current line value to variable val here.
next ##next will skip all further statements from here.
}
if(found1 && $0 ~ /{/){ ##Checking condition if found1 is SET and line has { in it then do following.
found2=1 ##Setting found2 value as 1 which tells program further that after auto { is also found now.
next ##next will skip all further statements from here.
}
if(found1 && found2 && $0 ~ /type/){ ##Checking condition if found1 and found2 are SET AND line has type in it then do following.
sub(/;/,"",$NF) ##Substituting semi colon in last field with NULL.
a[val]=$NF ##creating array a with variable var and its value is last column of current line.
next ##next will skip all further statements from here.
}
if($0 ~ /}/){ ##Checking if line has } in it then do following, which basically means previous block is getting closed here.
found1=found2=val="" ##Nullify all variables value found1, found2 and val here.
}
next ##next will skip all further statements from here.
}
/}/{ ##Statements from here will be executed when 2nd time Input_file is being read, checking if line has } here.
found3=val1="" ##Nullifying found3 and val1 variables here.
}
found3 && /type/{ ##Checking if found3 is SET and line has type keyword in it then do following.
sub($NF,$NF"//"a[val1]) ##Substituting last field value with last field and array a value with index val1 here.
}
/auto[0-9]+/ && $0 in a{ ##Searching string auto with digits and checking if current line is present in array a then do following.
print $0"//"a[$0] ##Printing current line // and value of array a with index $0.
found3=1 ##Setting found3 value to 1 here.
val1=$0 ##Setting current line value to val1 here.
next ##next will skip all further statements from here.
}
1 ##1 will print all edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
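The two-pass FNR==NR idiom this answer relies on can be seen in a minimal form; the file name and contents here are made up:

```shell
# Pass 1 (FNR==NR): collect lines as array keys; pass 2: annotate lines seen in pass 1.
printf 'a\nb\n' > /tmp/two_pass_demo.txt
awk 'FNR==NR { seen[$0]; next } $0 in seen { print $0, "known" }' \
    /tmp/two_pass_demo.txt /tmp/two_pass_demo.txt
```

FNR==NR is only true while the first file is being read, so the first block never runs during the second pass.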
You can use two newlines as record separator and process each record which may contain one
autoX
{
...
...
}
block.
awk '
BEGIN{
RS="\n\n" # set record separator RS to two newlines
a["good"]; a["moderate"] # create array a with indices "good" and "moderate"
}
{
sub(/\n[ \t]+remove[^;]+;/, "") # remove line containing "remove xxx;"
for (i in a){ # loop array indices "good" and "moderate"
if (index($0, i)){ # if value exists in record
sub(i";", i";//"i) # add "//good" to "good;" or "//moderate" to "moderate;"
match($0, /(auto[0-9]+)/) # get pos. RSTART and length RLENGTH of "autoX"
if (RSTART){ # RSTART > 0 ?
# set prefix including "autox", "//value" and suffix
$0=substr($0, 1, RSTART+RLENGTH-1) "//"i substr($0, RSTART+RLENGTH)
}
break # stop looping (we already replaced "autoX")
}
}
printf "%s", (FNR==1 ? "" : RS)$0 # print modified line prefixed by RS if not the first line
}
' Input_file
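To see how the two-newline record separator groups the blocks, here is a stripped-down demonstration (GNU awk or another awk with multi-character RS support is assumed; the input is a toy version of the question's file):

```shell
printf 'auto1\n{\n  type good;\n}\n\nauto2\n{\n  type moderate;\n}\n' |
awk 'BEGIN { RS = "\n\n"; FS = "\n" } { print "record " NR " starts with: " $1 }'
```

Each autoX block, together with its braces, arrives as one record, so $1 is the autoX line.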

Bash - Extract a column from a tsv file whose header matches a given pattern

I've got a tab-delimited file called dataTypeA.txt. It looks something like this:
Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657
1007_s_at 1149.82818866431 1156.14191288693 743.515922643437 1219.55564561635 1291.68030259557 1110.83793199643
1053_at 253.507372571459 150.907554200493 181.107054946649 99.0610660103702 147.953428467212 178.841519788697
117_at 157.176825094869 147.807257232552 162.11169957066 248.732378039521 176.808414979907 112.885784025819
121_at 1629.87514240262 1458.34809770171 1397.36209234134 1601.83045996129 1777.53949459116 1256.89054921471
1255_g_at 91.9622298972477 29.644137111864 61.3949774595639 41.2554576367652 78.4403716513328 66.5624213750532
1294_at 313.633291641829 305.907304474766 218.567756319376 335.301256439494 337.349552407502 316.760658896597
1316_at 195.799277107983 163.176402437481 111.887056644528 194.008323756222 211.992656497053 135.013920706472
1320_at 34.5168433158599 19.7928225262233 21.7147425051394 25.3213322300348 22.4410631949167 29.6960283168278
1405_i_at 74.938724593443 24.1084307838881 24.8088845994911 113.28326338746 74.6406975005947 70.016519414531
1431_at 88.5010900723741 21.0652011409692 84.8954961447585 110.017339630928 84.1264201735067 49.8556999547353
1438_at 26.0276274326623 45.5977459152141 31.8633816890024 38.568939176828 43.7048363737468 28.5759163094148
1487_at 1936.80799770498 2049.19167519573 1902.85054762899 2079.84030768241 2088.91036902825 1879.84684705068
1494_f_at 358.11266607978 271.309665853292 340.738488775022 477.953251687206 388.441738062896 329.43505750512
1598_g_at 2908.90515715761 4319.04621682741 2405.62061966298 3450.85255814957 2573.97860992156 2791.38660060659
160020_at 416.089910909237 327.353902186303 385.030831004533 385.199279534446 256.512900212781 217.754025190117
1729_at 43.1079499314469 114.654670657195 133.191500889286 86.4106614983387 122.099426341898 218.536976034472
177_at 75.9653827137444 27.4348937420347 16.5837374743166 50.6758325717831 58.7568500760629 18.8061888366161
1773_at 31.1717741953018 158.225161489953 161.976679771553 139.173486349393 218.572194156366 103.916119454
179_at 1613.72113870554 1563.35465407698 1725.1817757679 1694.82209331327 1535.8108561345 1650.09670894426
Let's say I have a variable col="GSM24655". I want to extract the column from dataTypeA.txt that corresponds to this column name.
Additionally, I'd like to put this in a function, where I can just give it a file (i.e. dataTypeA.txt), and a column (i.e. GSM24655), and it'll return that column.
I'm not very proficient in Bash, so I've been having some trouble with this. I'd appreciate the help.
The script below, using awk, can be used to achieve the objective.
col="GSM24655";
awk -v column_val="$col" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' dataTypeA.txt
Working: Initially, the value of col is passed to the awk script using -v column_val="$col". Then the column number is found out: when NR==1 (i.e. the first row), it iterates through all the fields (for(i=1;i<=NF;i++); the awk variable NF contains the number of columns) and compares each with the value of column_val (if ($i == column_val)); when a match is found, the corresponding column number is stored (val=i). After that, from the next row onwards, the value in that column is printed (print $val).
If you copy the code below into a file called, say, find_column.sh, you can call sh find_column.sh GSM24655 dataTypeA.txt to display the column whose header is the first parameter (GSM24655) in the file named by the second parameter (dataTypeA.txt). $1 and $2 are positional parameters; the lines column=$1 and file=$2 assign the input values to the variables.
column=$1;
file=$2;
awk -v column_val="$column" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' $file
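If a reusable shell function is preferred over a separate script, the same awk can be wrapped like this (the function name extract_column is made up; note the argument order is file first, then column):

```shell
# extract_column FILE COLUMN_NAME: print the column whose header matches COLUMN_NAME.
extract_column() {
  awk -v column_val="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == column_val) val = i }
    val     { print $val }
  ' "$1"
}
```

As in the original, the header cell itself is printed as the first output line; if no header matches, nothing is printed.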
I would use the following; it is quick and easy.
In your script, you get the name of the file, let's say $1, and the word, $2.
Here I am iterating over the header line (head -1 "$1") and comparing each field with $2; when it matches, its position is saved in col. Note the counter must start at 1, because cut fields are 1-based:
c=1;
for each in `head -1 "$1"`;do if [[ $each == "$2" ]];then
echo $c;
col=$c;
break;
else c=$(( c + 1 ));
fi;
done
Right after this, you just do a cat "$1" | cut -d$'\t' -f"$col"

Extract line before first empty line after match

I have some CSV file in this form:
* COMMENT
* COMMENT
100 ; 1706 ; 0.18 ; 0.45 ; 0.00015 ; 0.1485 ; 0.03 ; 1 ; 1 ; 2 ; 280 ; 100 ; 100 ;
* COMMENT
* COMMENT
* ZT vector
0; 367; p; nan
1; 422; p; nan
2; 1; d; nan
* KS vector
0; 367; p; 236.27
1; 422; p; 236.27
2; 1; d; 236.27
*Total time: 4.04211
I need to extract the last line before an empty line after matching the pattern KS vector.
To be clearer, in the above example I would like to extract the line
2; 1; d; 236.27
since it's the non-empty line just before the first empty one after the match with KS vector.
I would also like to use the same script to extract the same kind of line after matching the pattern ZT vector, that in the above example would return
2; 1; d; nan
I need to do this because I need the first number of that line, since it tells me the number of consecutive non-empty lines after KS vector.
My current workaround is this:
# counting number of lines after matching "KS vector" until first empty line
var=$(sed -n '/KS vector/,/^$/p' file | wc -l)
# Subtracting 2 to obtain actual number of lines
var=$(($var-2))
But if I could extract the last line directly, I could take its first element (2 in the example) and add 1 to it to obtain the same number.
You're going about this the wrong way. All you need is to put awk into paragraph mode and print 1 less than the number of lines in the record (since you don't want to include the KS vector line in your count):
$ awk -v RS= -F'\n' '/KS vector/{print NF-1}' file
3
Here's how awk sees the record when you put it into paragraph mode (by setting RS to null) with newline-separated fields (by setting FS to a newline):
$ awk -v RS= -F'\n' '/KS vector/{ for (i=1;i<=NF;i++) print NF, i, "<"$i">"}' file
4 1 <* KS vector>
4 2 <0; 367; p; 236.27>
4 3 <1; 422; p; 236.27>
4 4 <2; 1; d; 236.27>
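If the line itself is wanted rather than the count, the same paragraph-mode trick can print the record's last field; this sketch inlines a cut-down version of the sample file:

```shell
printf '* ZT vector\n0; 367; p; nan\n\n* KS vector\n0; 367; p; 236.27\n2; 1; d; 236.27\n' |
awk -v RS= -F'\n' '/KS vector/ { print $NF }'
```

With RS empty and FS set to a newline, $NF is the last non-empty line of the matching paragraph.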
With awk expression:
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r; exit }f{ r=$0 }' file
vec - variable containing the needed pattern/vector
$0~vec{ f=1 } - on encountering the needed pattern/vector - set the flag f in active state
f{ r=$0 } - while the flag f is active (under the needed vector section) - capture the current line into variable r
f && !NF{ print r; exit } - (NF - total number of fields, if the line is empty - there's no fields !NF) on encountering empty line while iterating through the needed vector lines - print the last captured non-empty line r
exit - exit script execution immediately (avoiding redundant actions/iterations)
The output:
2; 1; d; 236.27
If you want to just print the actual number of lines under found vector use the following:
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r+1; exit }f{ r=$1 }' file
3
With awk:
awk '$0 ~ "KS vector" { valid=1;getline } valid==1 { cnt++;dat[cnt]=$0 } $0=="" { valid="" } END { print dat[cnt-1] }' filename
Check for any line matching "KS vector"; set a valid flag and then read in the next line. Read the data into an array with an incremented counter. When an empty line is encountered, reset the valid flag. At the end, print the last-but-one element of the dat array.