Matching strings across non-consecutive rows with AWK

I have been working with an AWK one-liner that does a good job of identifying string matches on previous rows, i.e. comparing field x on row n with field y on row n+1. E.g., say the input file consists of rows of 3 fields each:
A B C
B B B
C C C
D B D
The one-liner is:
awk "$2==a[2] { print a[1],a[2],a[3] } { for (i=1;i<=NF;i++) a[i]=$i }"
So this example prints all three fields of any immediately previous row that matches on field 2, which in this case is only row 1. The output would be:
A B C
Now, I'm wondering if there is a modification to this command that will allow me to find matches between the current row and the row that is 2 rows before it, or 3 rows before it, or even 4 rows before it.
So using the same sample input file, if I were trying to make matches for "2 rows before", on field 2, it would now only output
B B B
which is row 2, because it is the only instance of the 2nd field ("B") matching with the second field in the row that is 2 rows ahead (i.e. row 4).
I'm not terribly familiar with arrays. I'm guessing the run time will suffer, but is the original command modifiable in this way?

You could use this awk:
awk 'a[FNR%n,m]==$m {print a[FNR%n]} {a[FNR%n]=$0; a[FNR%n,m]=$m}' n=2 m=2 file.txt
This prints the line n rows before the current one whenever field m matches in both lines (here m=2, to match on field 2 as in your example).
It also keeps memory use nicely in check. If you don't care too much about memory consumption, you can instead index the array by the field value itself:
awk '(FNR-n,$m) in a {print a[FNR-n,$m]} {a[FNR,$m]=$0}' n=2 m=2 file.txt
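For illustration, here is a run of the first variant against the four-line sample above (assuming it is saved as file.txt; m=2 so the match is on field 2, as in the question): row 4 matches row 2 on field 2, so row 2 is printed.
$ awk 'a[FNR%n,m]==$m {print a[FNR%n]} {a[FNR%n]=$0; a[FNR%n,m]=$m}' n=2 m=2 file.txt
B B B
Since FNR%n cycles through 0..n-1, the array acts as a ring buffer holding only the last n lines, which is why memory stays bounded.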

You may use this awk solution:
cat prev.awk
FNR > p && (n = split(row[FNR-p], cols)) && $2 == cols[2] {
  print row[FNR-p]
}
{
  row[FNR] = $0
}
Then use it for current-2 row matching:
awk -v p=2 -f prev.awk file
B B B
and current-1 row matching:
awk -v p=1 -f prev.awk file
A B C
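If you also want the compared field to be configurable, like m in the first answer, a small variation of prev.awk works as a one-liner (a sketch; f is a hypothetical variable name for the field number):
$ awk -v p=2 -v f=2 'FNR > p && split(row[FNR-p], cols) && $f == cols[f] { print row[FNR-p] } { row[FNR] = $0 }' file
B B B
split() returns the number of fields it produced, so it is safely truthy for any non-empty stored row.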

Related

In awk, divide values into an array and count, then compare

I have a csv file in which column 2 has values delimited by "," and column 3 has values delimited by "|". Now I need to count the values in both columns and compare them. If both counts are equal, column 4 should print passed; if not, it should print failed. I have written the awk script below but am not getting what I expected.
cat /tmp/test.csv
awk -F '' 'BEGIN{ OFS=";"; print "sep=;\nresource;Required_packages;Installed_packages;Validation;"};
{
column=split($2,aray,",")
columns=split($3,aray,"|")
Count=${#column[#]}
Counts=${#column[#]}
if( Counts == Count)
print $1,$2,$3,"passed"
else
print $1,$2,$3,"failed";}'/tmp/test.csv
My csv file looks like this:
resource Required_Packages Installed_packages
--------------------------------------------------
Vm1 a,b,c,d a|b|c
vm2 a,b,c,d b|a
vm3 a,b,c,d c|b|a
My expected output file:
resource Required_packages Installed_packages Validation
------------------------------------------------------------------
Vm1 a,b,c,d a|b|c Failed
vm2 a,b,c,d b|a Failed
vm3 a,b,c,d c|b|a|d Passed
Your code doesn't match the input/output data (where are the dashes printed, etc.), but this code segment
column=split($2,aray,",")
columns=split($3,aray,"|")
Count=${#column[#]}
Counts=${#column[#]}
if( Counts == Count)
print $1,$2,$3,"passed"
else
print $1,$2,$3,"failed";
can be replaced with
print $1,$2,$3,(split($2,a,",")==split($3,a,"|")?"Passed":"Failed")
Also, just checking the counts may not be enough; I think you should be checking the actual matches as well. For example, a,b,c and a|a|a have equal counts but clearly different contents.
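For illustration, the whole program then collapses to something like this (a minimal sketch, assuming plain whitespace-separated columns as in the shown sample rather than the -F ''/OFS=";" setup of the original attempt; the two header lines pass through untouched):
awk 'FNR<=2 {print; next} {print $1, $2, $3, (split($2,a,",")==split($3,b,"|") ? "Passed" : "Failed")}' /tmp/test.csv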
Could you please try the following, written and tested in GNU awk with the shown samples.
awk '
FNR<=2{
  print
  next
}
{
  num=split($2,array1,",")
  num1=split($3,array2,"|")
  for(i=1;i<=num;i++){
    value[array1[i]]
  }
  for(k=1;k<=num1;k++){
    if(array2[k] in value){ count++ }
  }
  if(count==num){ $(NF+1)="Passed" }
  else { $(NF+1)="Failed" }
  count=num=num1=""
  delete value
}
1
' Input_file | column -t
Explanation: a detailed walkthrough of the above solution.
awk ' ##Starting awk program from here.
FNR<=2{ ##Checking condition if line number is less than or equal to 2, then do following.
  print ##Printing current line here.
  next ##next will skip all further statements from here.
}
{
  num=split($2,array1,",") ##Splitting 2nd field into array named array1 with field separator of comma; num will have total number of elements of array1 in it.
  num1=split($3,array2,"|") ##Splitting 3rd field into array named array2 with field separator of pipe; num1 will have total number of elements of array2 in it.
  for(i=1;i<=num;i++){ ##Starting a for loop from 1 till value of num here.
    value[array1[i]] ##Creating an entry in array value whose key is the element of array1 at index i.
  }
  for(k=1;k<=num1;k++){ ##Starting a for loop from 1 till value of num1 here.
    if(array2[k] in value){ count++ } ##Checking condition: if array2 with index k is present in value, then increase variable count here.
  }
  if(count==num){ $(NF+1)="Passed" } ##Checking condition: if count equals num, then adding Passed as a new last column of current line.
  else { $(NF+1)="Failed" } ##Else adding Failed as the new last field of current line.
  count=num=num1="" ##Nullify variables count, num and num1 here.
  delete value ##Delete array value so entries do not leak into the next line.
}
1 ##1 will print current line.
' Input_file | column -t ##Mentioning Input_file and passing its output to column command here.

Bash - Extract a column from a tsv file whose header matches a given pattern

I've got a tab-delimited file called dataTypeA.txt. It looks something like this:
Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657
1007_s_at 1149.82818866431 1156.14191288693 743.515922643437 1219.55564561635 1291.68030259557 1110.83793199643
1053_at 253.507372571459 150.907554200493 181.107054946649 99.0610660103702 147.953428467212 178.841519788697
117_at 157.176825094869 147.807257232552 162.11169957066 248.732378039521 176.808414979907 112.885784025819
121_at 1629.87514240262 1458.34809770171 1397.36209234134 1601.83045996129 1777.53949459116 1256.89054921471
1255_g_at 91.9622298972477 29.644137111864 61.3949774595639 41.2554576367652 78.4403716513328 66.5624213750532
1294_at 313.633291641829 305.907304474766 218.567756319376 335.301256439494 337.349552407502 316.760658896597
1316_at 195.799277107983 163.176402437481 111.887056644528 194.008323756222 211.992656497053 135.013920706472
1320_at 34.5168433158599 19.7928225262233 21.7147425051394 25.3213322300348 22.4410631949167 29.6960283168278
1405_i_at 74.938724593443 24.1084307838881 24.8088845994911 113.28326338746 74.6406975005947 70.016519414531
1431_at 88.5010900723741 21.0652011409692 84.8954961447585 110.017339630928 84.1264201735067 49.8556999547353
1438_at 26.0276274326623 45.5977459152141 31.8633816890024 38.568939176828 43.7048363737468 28.5759163094148
1487_at 1936.80799770498 2049.19167519573 1902.85054762899 2079.84030768241 2088.91036902825 1879.84684705068
1494_f_at 358.11266607978 271.309665853292 340.738488775022 477.953251687206 388.441738062896 329.43505750512
1598_g_at 2908.90515715761 4319.04621682741 2405.62061966298 3450.85255814957 2573.97860992156 2791.38660060659
160020_at 416.089910909237 327.353902186303 385.030831004533 385.199279534446 256.512900212781 217.754025190117
1729_at 43.1079499314469 114.654670657195 133.191500889286 86.4106614983387 122.099426341898 218.536976034472
177_at 75.9653827137444 27.4348937420347 16.5837374743166 50.6758325717831 58.7568500760629 18.8061888366161
1773_at 31.1717741953018 158.225161489953 161.976679771553 139.173486349393 218.572194156366 103.916119454
179_at 1613.72113870554 1563.35465407698 1725.1817757679 1694.82209331327 1535.8108561345 1650.09670894426
Let's say I have a variable col="GSM24655". I want to extract the column from dataTypeA.txt that corresponds to this column name.
Additionally, I'd like to put this in a function, where I can just give it a file (i.e. dataTypeA.txt), and a column (i.e. GSM24655), and it'll return that column.
I'm not very proficient in Bash, so I've been having some trouble with this. I'd appreciate the help.
The awk script below can be used to achieve the objective.
col="GSM24655";
awk -v column_val="$col" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' dataTypeA.txt
Working: Initially, the value of col is passed to the awk script using -v column_val="$col". Then the column number is found out: when NR==1, i.e. the first row, it iterates through all the fields (for(i=1;i<=NF;i++); the awk variable NF contains the number of columns) and compares each with the value of column_val (if ($i == column_val)); when a match is found, the corresponding column number is stored (val=i). From then on, the value in that column is printed for every row (print $val).
If you copy the code below into a file called, say, find_column.sh, you can call sh find_column.sh GSM24655 dataTypeA.txt to display the column whose header is the first parameter (GSM24655) in the file named by the second parameter (dataTypeA.txt). $1 and $2 are positional parameters; the lines column=$1 and file=$2 assign the input values to the variables.
column=$1;
file=$2;
awk -v column_val="$column" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' "$file"
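A sample run against the data shown above (output truncated with head; note that the header value is printed too, since print $val also fires on row 1 once the match is found):
$ sh find_column.sh GSM24655 dataTypeA.txt | head -4
GSM24655
1219.55564561635
99.0610660103702
248.732378039521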
I would use the following; it is quick and easy.
In your script, you get the name of the file, let's say $1, and the column name, $2.
In the for loop below I hard-coded the whole header for illustration, but you can feed it from head -1 "$1" instead, and compare against $2 in the if; the loop outputs the matching column's position.
c=0
for each in $(echo "Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657"); do
  c=$(( c + 1 ))                  # 1-based position of the current header word
  if [[ $each == "Probe_ID" ]]; then
    echo "$c"
    col=$c
  fi
done
Right after this, you just do cut -f"$col" "$1" (tab is already cut's default delimiter, so the cat and -d are unnecessary).
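Putting those pieces together, here is a minimal sketch of the function the asker wanted (get_col is a hypothetical name; it assumes a tab-separated file whose first line is the header):
# usage: get_col FILE COLUMN_NAME
get_col() {
  local file=$1 name=$2 idx
  # 1-based index of the header field that exactly matches the column name
  idx=$(head -1 "$file" | tr '\t' '\n' | grep -nx "$name" | cut -d: -f1)
  # print that column (tab is cut's default delimiter)
  [ -n "$idx" ] && cut -f"$idx" "$file"
}
For example, get_col dataTypeA.txt GSM24655 prints the GSM24655 column, header included.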

Extract line before first empty line after match

I have some CSV file in this form:
* COMMENT
* COMMENT
100 ; 1706 ; 0.18 ; 0.45 ; 0.00015 ; 0.1485 ; 0.03 ; 1 ; 1 ; 2 ; 280 ; 100 ; 100 ;
* COMMENT
* COMMENT
* ZT vector
0; 367; p; nan
1; 422; p; nan
2; 1; d; nan

* KS vector
0; 367; p; 236.27
1; 422; p; 236.27
2; 1; d; 236.27

*Total time: 4.04211
I need to extract the last line before an empty line after matching the pattern KS vector.
To be clearer, in the above example I would like to extract the line
2; 1; d; 236.27
since it's the non-empty line just before the first empty one after the match on KS vector.
I would also like to use the same script to extract the same kind of line after matching the pattern ZT vector, which in the above example would return
2; 1; d; nan
I need to do this because I need the first number of that line, since it tells me the number of consecutive non-empty lines after KS vector.
My current workaround is this:
# counting number of lines after matching "KS vector" until first empty line
var=$(sed -n '/KS vector/,/^$/p' file | wc -l)
# Subtracting 2 to obtain actual number of lines
var=$(($var-2))
But if I could extract the last line directly, I could take its first element (2 in the example) and add 1 to obtain the same number.
You're going about this the wrong way. All you need is to put awk into paragraph mode and print 1 less than the number of lines in the record (since you don't want to include the KS vector line in your count):
$ awk -v RS= -F'\n' '/KS vector/{print NF-1}' file
3
Here's how awk sees the record when you put it into paragraph mode (by setting RS to null) with newline-separated fields (by setting FS to a newline):
$ awk -v RS= -F'\n' '/KS vector/{ for (i=1;i<=NF;i++) print NF, i, "<"$i">"}' file
4 1 <* KS vector>
4 2 <0; 367; p; 236.27>
4 3 <1; 422; p; 236.27>
4 4 <2; 1; d; 236.27>
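If you do want the line itself rather than a count, the same paragraph-mode trick works, since the last line of the block is simply the record's last field (a small extension of the idea above):
$ awk -v RS= -F'\n' '/KS vector/{print $NF}' file
2; 1; d; 236.27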
With an awk expression:
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r; exit }f{ r=$0 }' file
vec - a variable containing the needed pattern/vector
$0~vec{ f=1 } - on encountering the needed pattern/vector, set the flag f to the active state
f{ r=$0 } - while the flag f is active (i.e. within the needed vector section), capture the current line into variable r
f && !NF{ print r; exit } - NF is the total number of fields, and an empty line has no fields (!NF); on encountering an empty line while iterating through the needed vector lines, print the last captured non-empty line r
exit - exit script execution immediately (avoiding redundant actions/iterations)
The output:
2; 1; d; 236.27
If you just want to print the actual number of lines under the found vector, use the following (r now captures just the first field, whose numeric value is the line's index):
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r+1; exit }f{ r=$1 }' file
3
With awk:
awk '$0 ~ "KS vector" { valid=1;getline } valid==1 { cnt++;dat[cnt]=$0 } $0=="" { valid="" } END { print dat[cnt-1] }' filename
Check for any line matching "KS vector"; set a valid flag and then read in the next line. Read the data into an array with an incremented counter. When an empty line is encountered, reset the valid flag. At the end, print the last-but-one element of the dat array.
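For reference, a run against the sample above (assuming it is saved as filename, with the blank lines shown):
$ awk '$0 ~ "KS vector" { valid=1;getline } valid==1 { cnt++;dat[cnt]=$0 } $0=="" { valid="" } END { print dat[cnt-1] }' filename
2; 1; d; 236.27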

Bash - word/term frequency per line (i.e. document)

I have a file rev.txt like this:
header1,header2
1, some text here
2, some more text here
3, text and more text here
I also have a vocabulary document with all unique words from rev.txt, like so (but sorted):
a
word
list
text
here
some
more
and
I want to generate a term frequency table that, for each line in rev.txt, lists the occurrence of each vocabulary word, like so:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1
They could be comma separated as well.
This is similar to a question here. However, instead of search through the entire document, I want to do this line by line, using the complete vocabulary I already have.
Re: Jean-François Fabre
Actually, I am performing these in MATLAB. However, bash (I believe) would be faster for this preprocessing as I have direct disk access to the files.
Normally, I would use python, but limiting myself to using bash, this hacky one-liner solution works for the given test case.
perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
Explanation/My thinking...
In the first part, perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt is used to pull out everything after the first comma (removing the leading whitespace) from rev.txt.
In the next part, sed '1d' removes the first, i.e. header, line.
In the next part, we specify awk -F' ' ... ORS="" wordlist.txt - to use whitespace as the field delimiter, set the output record separator to the empty string (note: we print separators ourselves as we go), and read input from wordlist.txt (i.e. the "vocabulary document with all unique words from rev.txt") and stdin.
In the awk command, if FILENAME equals "wordlist.txt", then (1) initialize the array wc, whose keys are the vocab words and whose counts are 0, and (2) initialize a list wl whose word order is the same as in wordlist.txt.
FILENAME=="wordlist.txt" {
wc[$1]=0;
wl[wllen++]=$1;
next
};
After initialization, for each word in a line of stdin (i.e. the tidy rev.txt), increment the count of the word in wc.
{ for (i=1; i<=NF; i++) {
    wc[$i]++
  };
After the word counts have been added for a line, for each word in the list wl, print the count of that word followed by a space and reset the count in wc back to 0. If the word is the last in the list, then print a newline.
  for (i=0; i<wllen; i++) {
    print wc[wl[i]]" ";
    wc[wl[i]]=0;
    if(i+1==wllen){
      print "\n"
    }
  }
}
Overall, this should produce the specified output.
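For reference, running the full pipeline against the shown samples (with the vocabulary saved as wordlist.txt, in the order given in the question) reproduces the expected matrix:
$ perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1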
Here's one in awk. It reads in the vocabulary file voc.txt (it's a piece of cake to produce it automatically in awk), copies the word list for each row of text and counts the word frequencies:
$ cat program.awk
BEGIN {
  PROCINFO["sorted_in"]="#ind_str_asc" # order for copying vocabulary array w (GNU awk)
}
NR==FNR {                # store the voc.txt to w
  w[$1]=0
  next
}
FNR>1 {                  # process text file to matrix
  for(i in w)            # copy voc array
    a[i]=0
  for(i=2; i<=NF; i++)   # count freqs
    a[$i]++
  for(i in a)            # output matrix row
    printf "%s%s", a[i], OFS
  print ""
}
Run it:
$ awk -f program.awk voc.txt rev.txt
0 0 1 0 0 1 1 0
0 0 1 0 1 1 1 0
0 1 1 0 1 0 2 0

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way we can separate the columns is by finding character columns that contain only one or more spaces. How can we identify these columns and replace them with a unique separator like ,?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find all contiguous columns consisting of one or more white spaces (and nothing else) and replace each whole such column block with a ,, the problem will be solved.
Better explanation of the question by josifoski:
Per block of matrix characters, if all are 'space' then the whole block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ } # one char per field; queue the file a second time
NR==FNR {
  for (i=1;i<=NF;i++) {
    if ($i == " ") {
      space[i]
    }
    else {
      nonSpace[i]
    }
  }
  next
}
FNR==1 {
  for (i in nonSpace) {
    delete space[i]
  }
}
{
  for (i in space) {
    $i = ","
  }
  gsub(/,+/,",")
  print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""}                         # Sets field separator to nothing so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} # First pass: increment a[i] whenever position i holds
                                              # a non-space character, so a[i]==0 marks all-space columns.
                                              # next skips all further commands for the first pass.
{                                             # In second file (same file, second time)
  for(i=1;i<=NF;i++)                          # Loops through character positions
    if(!a[i]){                                # If the column held nothing but spaces
      $i=","                                  # Change this character to ","
      x=i                                     # Set x to the field number
      while(!a[++x]){                         # While the following columns are also all-space
        $x=""                                 # Change field to nothing
        i=x                                   # Set i to x so it doesn't revisit those fields
      }
    }
}1' test{,}                                   # 1 prints; test{,} is brace expansion for "test test" (same file twice)
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)