awk to print specific fields based on regex match in file

I am trying to use awk to search the input for keywords and, if one is found, print specified fields. The awk below runs but does not produce the desired output. If TYPE=ins or TYPE=del is found in the line, then $1, $2, $4, $5, the TYPE= value, and the LEN= value should print. LEN= is also a field in the line, with a number after the =. Thank you :).
input
chr1 1647893 . C CTTTCTT 31.9545 PASS AF=0.330827;AO=179;DP=695;FAO=132;FDP=399;FR=.;FRO=267;FSAF=67;FSAR=65;FSRF=124;FSRR=143;FWDB=0.0145873;FXX=0.00249994;HRUN=1;LEN=6;MLLD=190.481;OALT=TTTCTT;OID=.;OMAPALT=CTTTCTT;OPOS=1647894;OREF=-;PB=0.5;PBP=1;QD=0.320346;RBI=0.0146526;REFB=-0.0116875;REVB=0.00138131;RO=471;SAF=85;SAR=94;SRF=236;SRR=235;SSEN=0;SSEP=0;SSSB=-0.0324817;STB=0.528856;STBP=0.43;TYPE=ins;VARB=0.0222858 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:31:695:399:471:267:179:132:0.330827:94:85:236:235:65:67:124:143
chr1 1650787 . T C 483.012 PASS AF=0.39;AO=181;DP=459;FAO=156;FDP=400;FR=.;FRO=244;FSAF=100;FSAR=56;FSRF=162;FSRR=82;FWDB=-0.00931067;FXX=0;HRUN=1;LEN=1;MLLD=210.04;OALT=C;OID=.;OMAPALT=C;OPOS=1650787;OREF=T;PB=0.5;PBP=1;QD=4.83012;RBI=0.018986;REFB=-0.0114993;REVB=-0.0165463;RO=276;SAF=116;SAR=65;SRF=184;SRR=92;SSEN=0;SSEP=0;SSSB=-0.0305478;STB=0.515311;STBP=0.652;TYPE=snp;VARB=0.019956 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:483:459:400:276:244:181:156:0.39:65:116:184:92:56:100:162:82
chr1 17034455 . CGCGCGCGT C 50 PASS AF=0.205882;AO=56;DP=272;FR=.;LEN=8;OALT=-;OID=.;OMAPALT=C;OPOS=17034456;OREF=GCGCGCGT;RO=216;SAF=27;SAR=29;SRF=112;SRR=104;TYPE=del GT:GQ:DP:RO:AO:SAF:SAR:SRF:SRR:AF 0/1:99:272:216:56:27:29:112:104:0.205882
awk
awk '/TYPE=ins/ {print $1,$2,$4,$5, "/TYPE=*/" "/LEN=*/" $0;next} /TYPE=del/ {print $1,$2,$4,$5, "/TYPE=*/" "/LEN=*/" $0;next} 1' input > out
desired output
chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

You can use this awk command:
awk 'function find(str) {
return substr($0, match($0, str "=[^; \t]+"), RLENGTH);
}
/TYPE=(ins|del)/ {
print $1, $2, $4, $5, find("TYPE"), find("LEN")
}' file
Output:
chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8
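One caveat with a plain "LEN=" search is that it would also match inside a longer key such as a hypothetical MLEN=. A minimal sketch of an alternative, assuming the key=value pairs always sit in field 8 and are separated by ";", is to load them into an array first. The sample lines and the MLEN key are made up for illustration:

```shell
out=$(printf '%s\n' \
  'chr1 100 . C CTT 31.9 PASS AF=0.3;MLEN=9;LEN=2;TYPE=ins GT 0/1' \
  'chr1 200 . T C 48.0 PASS AF=0.4;LEN=1;TYPE=snp GT 0/1' |
awk '{
    split("", info)                      # clear the map from the previous line
    n = split($8, kv, ";")               # one key=value pair per element
    for (i = 1; i <= n; i++) {
        eq = index(kv[i], "=")
        if (eq) info[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
    }
    if (info["TYPE"] == "ins" || info["TYPE"] == "del")
        print $1, $2, $4, $5, "TYPE=" info["TYPE"], "LEN=" info["LEN"]
}')
printf '%s\n' "$out"
```

Because keys are compared exactly, MLEN and LEN stay separate, at the cost of splitting field 8 on every line.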

Here is an awk-solution:
awk '$0~"TYPE=del" || $0~"TYPE=ins"{max=split($0,ar,";")
len=""
type=""
for(i=1; i<=max; i++){
if(ar[i]~"LEN="){len=ar[i]}
if(ar[i]~"TYPE="){type=ar[i]}
}
print $1,$2,$4,$5,type,len}' input
Output:
chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

Related

AWK - add value based on regex

I have to add the numbers returned by REGEX using awk in linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does anyone have any idea?
UPDATE:
it also should work for
john123 result - > 123
123john result - > 123

$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123

With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^\+/,"","g",a); a=gensub(/\+$/,"","g",a); print a}' inputFile | bc
The regex-part replaces all sequences of letters with '+' (e.g., '12johnny34' becomes 12+34). Finally, this mathematical operation is evaluated by bc.
(To be safe, I remove leading and trailing '+' signs with ^\+ and \+$; the + has to be escaped there because it is a regex operator.)

You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
Demo:
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - records are split into fields with : char
n=split($1, a, /[^0-9]+/) - takes Field 1 and splits it into digit-only chunks, saving them in the array a; the n variable holds the number of these chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterates over the array a and sums the values
print b - prints the result.
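To see why this also works when the field starts with letters, it can help to print how many chunks split() returns. A small sketch with made-up sample names follows. A leading non-digit run produces an empty first chunk, which counts as 0 in the sum:

```shell
out=$(printf '%s\n' '123john456:x' 'john123:x' '1j2o3h4n5:x' |
awk -F ':' '{
    n = split($1, a, /[^0-9]+/)              # digit-only chunks of field 1
    sum = 0
    for (i = 1; i <= n; i++) sum += a[i]     # an empty chunk adds 0
    print $1, "chunks=" n, "sum=" sum
}')
printf '%s\n' "$out"
```

For john123 the chunks are "" and "123", so n is 2 and the sum is still 123.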

I used awk's split() to separate the first field on any string not containing numbers.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123

You can use [^0-9]+ as a field separator, and :[^\n]*\n as a record separator instead:
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123

You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$

Here is another awk variant that adds all the numbers present in the first field (fields being separated by :):
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

add plus or minus in awk if no match

I am trying to match all the lines in file against bigfile. The awk below handles exact matches; the problem is that lines that do not match exactly should still match if $3 is within plus or minus 10 of a value in file. I am not sure how to tell awk that if an exact match is not found it should then try the coordinate plus or minus 10. If no match is found after that, then there is no match in the file. Thank you :).
file
955763
957852
976270
bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1
awk
awk 'NR==FNR{A[$1];next}$3 in A' file bigfile > output
desired output (same as bigfile)
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2

If there's no difference between a row that matches and one that's close, you could just set all of the keys in the range in the array:
awk 'NR == FNR { for (i = -10; i <= 10; ++i) A[$1+i]; next }
$3 in A' file bigfile > output
The advantage of this approach is that only one lookup is performed per line of the big file.
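A quick way to check the range-key idea is to run it against a near miss. In the sketch below, the temporary files, the tol variable and the altered second row (957845 instead of 957852) are assumptions made for the demonstration only:

```shell
tmp=$(mktemp -d)
printf '%s\n' 955763 957852 976270 > "$tmp/file"
printf '%s\n' \
  'chr1 955543 955763 a' \
  'chr1 957571 957845 b' \
  'chr1 970621 970740 c' > "$tmp/bigfile"
# first file: store every key from value-tol to value+tol; second file: one lookup per line
out=$(awk -v tol=10 'NR == FNR { for (i = -tol; i <= tol; ++i) A[$1 + i]; next }
$3 in A' "$tmp/file" "$tmp/bigfile")
rm -rf "$tmp"
printf '%s\n' "$out"
```

Row b is kept because 957845 is 7 away from 957852, while row c is dropped. The trade-off is memory: each value in file costs 2*tol+1 array entries.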

You need to run a loop on array a:
awk 'NR==FNR {
a[$1]
next
}
{
for (i in a)
if (i+0 <= $3+10 && i+0 >= $3-10) { # i+0 forces a numeric comparison; array keys are strings
print
next # print each line only once
}
}' file bigfile > output

Your data already produces the desired output (all exact match).
$ awk 'NR==FNR{a[$1];next} $3 in a{print; next}
{for(k in a)
if((k-$3)^2<=10^2) {print $0, " --> within 10 margin"; next}}' file bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 976251 976261 chr1:976251-976261 AGRN-8|gc=57.1 --> within 10 margin
I added a fake 4th row to get the margin match

awk to count and sum total using matching string from file

I am trying to get the count of each matching string and its total length in a file using awk. The string to match is in $5: the count is the number of lines with that string, and the total length is the sum of $3 - $2 over those lines. Hopefully the awk below is a good start. Thank you :).
input
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
desired output
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119
awk
awk '{count[$5]++}
END {
for (word in count)
print $1,$2,$3,$4,word, count[word]
}' input > count |
awk 'print $1,$2,$3,$4,word, count[word]
}
{ $6 = $3 - $2 }
1' count.txt > length
edit
SCNN1D 1 119
GABRD 3 240
TAS1R3 4 223

You can do:
awk '{c1[$5]++; c2[$5]+=($3-$2)}
END{for (e in c1) print e, c1[e], c2[e]}' input
Note that the order of the records may be different than the order in the original file.
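If the output order matters, one option is to remember each name the first time it is seen and loop over that list in END. This is only a sketch; the array names cnt, len and order are my own, and the three sample rows are taken from the question with the 4th column shortened:

```shell
out=$(printf '%s\n' \
  'chr1 1266716 1266926 x TAS1R3' \
  'chr1 1956371 1956503 x GABRD' \
  'chr1 1267008 1267328 x TAS1R3' |
awk '!($5 in cnt) { order[++n] = $5 }            # first time this name is seen
     { cnt[$5]++; len[$5] += $3 - $2 }           # running count and total length
     END { for (i = 1; i <= n; i++) print order[i], cnt[order[i]], len[order[i]] }')
printf '%s\n' "$out"
```

Unlike a for (e in c1) loop, the numeric loop over order[] always follows the input order, even when rows for the same name are not adjacent.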

With awk, it's possible to do the entire thing in a single script,
by keeping a running count of both the cumulative length, and the number of instances for each word.
Try this (yet untested):
awk '{
offset1=$2; offset2=$3; word=$5
TotalLength[word]+=offset2 - offset1 # accumulate; or just += $3-$2
count[word]++}
END {
for (word in count)
print word, count[word], TotalLength[word]
}' input
The original script had three errors.
The second awk chunk had an ambiguous input specification: Reading from pipe and a file argument (count.txt). In this case, awk cannot decide where to read from.
In an END section, the numbered fields will only refer to the fields of the last line/record read. This is not what you want.
Finally, the second awk script is missing the opening brace { for the print statement.

$ cat tst.awk
$5 != prev { if (NR>1) print prev, cnt, sum; prev=$5; cnt=sum=0 }
{ cnt++; sum+=($3-$2) }
END { print prev, cnt, sum }
$ awk -f tst.awk file
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119

Awk tab-delimited columns with comma-delimited values, split them up

I have a file with columns like this:
TNFRSF14 chr1 2487803,2489164,2489781,2491261,2492062,2493111,2494303,2494586, 2488172,2489273,2489907,2491417,2492153,2493254,2494335,2497061,
ID3 chr1 23884420,23885425,23885617, 23884906,23885510,23886285,
In case the tabs cannot be seen on your browser:
TNFRSF14"\t"chr1"\t"2487803,2489164,2489781,2491261,2492062,2493111,2494303,2494586,"\t"2488172,2489273,2489907,2491417,2492153,2493254,2494335,2497061,
ID3"\t"chr1"\t"23884420,23885425,23885617,"\t"23884906,23885510,23886285,
I would like to have the output say:
TNFRSF14 chr1 2487803 2488172
TNFRSF14 chr1 2489164 2489273
...
ID3 chr1 23885425 23885510
ID3 chr1 23885617 23886285
As you can see, my original input is of varying lengths in columns 3 and 4, but the length of column 3 will always equal column 4. So far I have been able to split the files into varying column lengths, and have a python script that can place them. I was hoping there was a way for awk to do this though!
Thanks for any suggestions!

You can use the split function:
gawk '{
split($3,a,",");
split($4,b,",");
for(i=1; i<length(a); i++){
print $1, $2, a[i], b[i];
}
}' input
Note: length(array) is GNU awk specific.
You get:
TNFRSF14 chr1 2487803 2488172
TNFRSF14 chr1 2489164 2489273
TNFRSF14 chr1 2489781 2489907
TNFRSF14 chr1 2491261 2491417
TNFRSF14 chr1 2492062 2492153
TNFRSF14 chr1 2493111 2493254
TNFRSF14 chr1 2494303 2494335
TNFRSF14 chr1 2494586 2497061
ID3 chr1 23884420 23884906
ID3 chr1 23885425 23885510
ID3 chr1 23885617 23886285

$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
n = split($3,a,/,/)
split($4,b,/,/)
for (i=1;i<n;i++) {
print $1, $2, a[i], b[i]
}
}
$
$ awk -f tst.awk file
TNFRSF14 chr1 2487803 2488172
TNFRSF14 chr1 2489164 2489273
TNFRSF14 chr1 2489781 2489907
TNFRSF14 chr1 2491261 2491417
TNFRSF14 chr1 2492062 2492153
TNFRSF14 chr1 2493111 2493254
TNFRSF14 chr1 2494303 2494335
TNFRSF14 chr1 2494586 2497061
ID3 chr1 23884420 23884906
ID3 chr1 23885425 23885510
ID3 chr1 23885617 23886285
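The i<n loop bound works because every list ends in a comma, so split() returns one extra, empty element at the end. If some lists might lack the trailing comma (an assumption, the question does not say so), a sketch that checks for the empty element instead could look like this, using a shortened made-up row:

```shell
out=$(printf 'ID3\tchr1\t1,2,3,\t4,5,6,\n' |
awk 'BEGIN { FS = OFS = "\t" }
{
    n = split($3, a, /,/)
    split($4, b, /,/)
    if (a[n] == "") n--          # drop the empty chunk left by a trailing comma
    for (i = 1; i <= n; i++) print $1, $2, a[i], b[i]
}')
printf '%s\n' "$out"
```

With "1,2,3," split() returns 4 elements, the last one empty, so three tab-separated rows are printed.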

awk -F',? ' '
{
split($3, a, /,/)
split($4, b, /,/)
for (i in a) print $1, $2, a[i], b[i]
}' file

print between two pattern matches on same line

I have a file that looks like the following. I want to print the first five columns, then split the eighth column on the pipes "|" and print the sixth element, and finally print the string between "EFF=" and the following "(" on each line.
chr1 10150 . C T 6.72 . DP=6;VDB=0.0074;AF1=0.2932;CLR=6;AC1=1;DP4=3,1,1,1;MQ=30;FQ=7.98;PV4=1,0.33,1,0.22;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1724|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/0:0,6,26:2:0:9 0/1:38,0,48:4:0:36
chr1 10291 . C T 3.55 . DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=52;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4071|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1583|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:31,3,0:1:0:5
chr1 10297 . C T 3.55 . DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=52;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4065|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1577|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:31,3,0:1:0:5
chr1 10327 . T C 3.02 . DP=3;VDB=0.0160;AF1=1;AC1=4;DP4=0,0,1,0;MQ=56;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4035|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1547|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/1:30,3,0:1:0:5 0/0:0,0,0:0:0:3
output
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
I can print the columns and the sixth element on the eighth column between the pipes "|" using the following, but not the string that matches between the "EFF=" and the next "(".
awk '{split($8,a,"|"); print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" a[8]}'

You can use match() with a regular expression that matches from EFF= up to the opening parenthesis. With GNU awk's three-argument form, the matched text (EFF=DOWNSTREAM) is stored in eff[0]; then use substr() to drop the leading EFF=, like:
awk '
{split($8,a,"|");
match($8, "EFF=[^(]*", eff);
print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" substr(eff[0], 5)}
' infile
It yields:
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
UPDATE: You are using an old (or at least non-GNU) version of awk, whose match() function only accepts two parameters, so you have to work with the RSTART and RLENGTH variables. Try this version:
awk '
{split($8,a,"|");
pos = match($8, "EFF=[^(]*");
print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" substr($8, RSTART + 4, RLENGTH - 4)}
' infile
The result is the same as the previous one.
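To make the RSTART and RLENGTH arithmetic concrete, here is a small sketch on a shortened, made-up INFO string. match() sets RSTART to where the match begins and RLENGTH to its length, so skipping the 4 characters of EFF= leaves just the name:

```shell
out=$(printf '%s\n' 'DP=6;EFF=DOWNSTREAM(MODIFIER||4212)' |
awk '{
    # match() returns 0 when nothing matches, so nothing is printed for such lines
    if (match($0, /EFF=[^(]*/))
        print RSTART, RLENGTH, substr($0, RSTART + 4, RLENGTH - 4)
}')
printf '%s\n' "$out"
```

Here the match starts at character 6 and is 14 characters long (EFF=DOWNSTREAM), so the substr() call returns the 10 characters after EFF=.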

$ cat tst.awk
{
split($8,a,/[|(]|EFF=/)
print $1, $2, $3, $4, $5, a[8], a[2]
}
$ awk -f tst.awk file
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
