Counting with complicated conditions using awk (if statement)

I need to count inner and outer water bridges in my data. Here is how they should be counted.
For example, I have a data file:
MGD12 SOL54
MGD14 SOL74
MGD10 SOL37
MGD16 SOL65
MGD21 SOL66
MGD2 SOL65
MGD64 SOL74
MGD10 SOL37
MGD72 SOL74
MGD12 SOL54
An inner water bridge is when both MGD and SOL are the same (duplicate lines). An outer water bridge is when MGD is different but SOL is the same.
For example, below I have added a third column that marks each line as an inner or outer water bridge:
1.MGD12 SOL54 inner (the same in line 10)
2.MGD14 SOL74 outer (the same SOL in 7, 9)
3.MGD10 SOL37 inner (the same in line 8)
4.MGD16 SOL65 outer (the same SOL in 6)
5.MGD21 SOL66 no water bridge
6.MGD2 SOL65 outer (the same SOL in 4)
7.MGD64 SOL74 outer (the same SOL in 2, 9)
8.MGD10 SOL37 inner (the same in line 3)
9.MGD72 SOL74 outer (the same SOL in 2, 7)
10.MGD12 SOL54 inner (the same in line 1)
In the output, I want just the number of inner and outer water bridges. In this case, it will be only numbers 4 and 5.
4 5
I tried to write a script, but I don't know what to put in the conditions. Maybe I should use arrays?
#!/bin/bash
awk '{ if () inner++; else if () outer++} END { print inner " " outer}' probe.txt
Edit: I tried the following script, but it's not working:
#!/bin/bash
awk 'NR==FNR {a[$1,$2]++; s[$2]++; next}
a[$1,$2]!=s[$2] {outer++; next}
s[$2]!=1 {inner++}
END {print inner,outer}' probe.txt | tee probe2.txt
input
MGD12 SOL54
MGD14 SOL74
MGD10 SOL37
MGD16 SOL65
MGD21 SOL66
MGD2 SOL65
MGD64 SOL74
MGD10 SOL37
MGD72 SOL74
MGD12 SOL54
The output (probe2.txt) contains only an empty line.
When I try another script
#!/bin/bash
awk 'NR==FNR {a[$1,$2]++; s[$2]++; next}
{print $0, (a[$1,$2]==s[$2]?(s[$2]==1?"no":"inner"):"outer")}' probe.txt | tee probe2.txt
I again get empty output.

A double-scan approach is easier: count on the first pass over the file and classify on the second (file{,} expands to file file).
$ awk 'NR==FNR {a[$1,$2]++; s[$2]++; next}
{print $0, (a[$1,$2]==s[$2]?(s[$2]==1?"no":"inner"):"outer")}' file{,}
MGD12 SOL54 inner
MGD14 SOL74 outer
MGD10 SOL37 inner
MGD16 SOL65 outer
MGD21 SOL66 no
MGD2 SOL65 outer
MGD64 SOL74 outer
MGD10 SOL37 inner
MGD72 SOL74 outer
MGD12 SOL54 inner
To print just the counts:
$ awk 'NR==FNR {a[$1,$2]++; s[$2]++; next}
a[$1,$2]!=s[$2] {outer++; next}
s[$2]!=1 {inner++}
END {print inner,outer}' file{,}
4 5
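The attempts in the question print an empty line because probe.txt is named only once on the command line, so NR==FNR is true for every line and the second-pass rules never run. If reading the file twice is inconvenient (for example, when the data arrives on a pipe), the classification can instead be deferred to the END block. The following is only a sketch under the assumption that the whole input fits in memory; count_bridges is a hypothetical helper name, not something from the answer above.

```shell
#!/bin/bash
# Hypothetical helper: count inner and outer water bridges in a single pass.
# Usage: count_bridges probe.txt
count_bridges() {
    awk '
    {
        pair[$1, $2]++              # how often this exact MGD/SOL pair occurs
        sol[$2]++                   # how often this SOL occurs with any MGD
        key[NR] = $1 SUBSEP $2      # remember each line for the END block
        s[NR]   = $2
    }
    END {
        for (i = 1; i <= NR; i++) {
            if (pair[key[i]] != sol[s[i]]) outer++   # same SOL, different MGD
            else if (sol[s[i]] != 1)       inner++   # exact duplicate pair
        }
        print inner + 0, outer + 0  # + 0 prints 0 instead of an empty field
    }' "$1"
}
```

The rules are the same as in the two-pass version; only the second pass over the file is replaced by a loop over the stored lines.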

How to rename chromosome_position column in a Beagle file and match it with the index fai?

I have tab-separated text files with different columns. I need to rename my chromosome_position column (see the MM.beagle.gz file below), since a program that I use doesn't allow multiple underscores in the chromosome name (NC_044592.1_3795 causes a parsing error and does not work as a name).
My indexed genome looks like this:
head my.fna.fai
Contains this:
NC_044571.1 115307910 88 80 81
NC_044572.1 151975198 116749435 80 81
NC_044573.1 113180500 270624411 80 81
NC_044574.1 71869398 385219756 80 81
The beagle file looks like this:
zcat MM.beagle.gz | head | cut -f 1-3
Which gives:
marker allele1 allele2
NC_044571.1_3795 G T
NC_044573.1_3796 G T
NC_044572.1_3801 T C
NC_044574.1_3802 G A
In R I can get the chromosome and position:
beag = read.table("MM.beagle.gz", header = TRUE)
chr=gsub("_\\d+$", "", beag$marker)
pos=gsub("^[A-Z]*_[0-9]*.[0-9]_", "", beag$marker)
But I'm not able to rename the beagle file in-place. I'd like to rename all contigs in the .fai file from 1:nrow(my.fna.fai) and match it to the beagle file.
So in the end the .fai should look like:
head my.fna.fai
Desired output:
1 115307910 88 80 81
2 151975198 116749435 80 81
3 113180500 270624411 80 81
4 71869398 385219756 80 81
And the beagle file:
zcat MM.beagle.gz | head | cut -f 1-3
Would give:
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
where, for example, 1_3795 is the concatenation of contig 1 and position 3795, separated by an _.
The solution should preferably be in bash, as R is not practical given the large size of my final compressed beagle file (>210GB).
Someone proposed to change the .fai with this:
awk 'BEGIN{OFS="\t"}{print NR, $2, $3, $4, $5}' my.fna.fai
What I can't figure out now is how to make sure that the .fai and the .beagle file stay consistent with each other. For example, even if the first column (marker) of the .beagle file is shuffled, it should be possible to match it with the .fai file and rename the chromosome names in the .beagle file. If NC_1234.1 is renamed to 142 in the .fai, then every NC_1234.1_XXX in the .beagle should become 142_XXX, where XXX is a number.
Here is an attempt at the solution:
awk 'BEGIN{OFS="\t"}{print $1, NR}' my.fna.fai > my.fna.fai.nr
awk -F'\t' -v OFS='\t' '{split($1,a,"_"); print $0,a[1]"_"a[2],a[3]}' MM.beagle.txt | awk 'NR!=1 {print}' | awk 'BEGIN{OFS="\t"}{print $0, NR}'> file2.sep.txt
sort file2.sep.txt > file2.1.s.txt
join -1 4 -2 1 file2.1.s.txt my.fna.fai.nr | sort -k6 -n | awk 'BEGIN{OFS="\t"}{$1=$2=$6="";print $7"_"$5,$0}' | awk 'BEGIN{OFS="\t"}{$4=$5="";print $0}' > file4.txt
echo $(awk 'NR==1 {print}' MM.beagle.txt); cat file4.txt
Gives
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
To ensure the new FASTA index and the modified Beagle file are consistent, we can read the FASTA index into an associative array that maps each chromosome name to its line number. We can then parse the Beagle file and use the chromosome name to look up the line number in the array. Here's one way using awk:
Contents of rename_chroms.awk:
BEGIN {
FS=OFS="\t"
}
FNR==NR {
arr[$1]=NR
next
}
FNR==1 {
print
next
}
{
n = split($1, a, "_")
chrom = substr($1, 1, length($1) - (length(a[n]) + 1))
pos = a[n]
print arr[chrom] "_" pos, $2, $3
}
Run using:
awk -f rename_chroms.awk my.fna.fai <(zcat MM.beagle.gz)
Results:
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
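The question only shows the first three columns (cut -f 1-3), so the real Beagle file presumably has more. A variant that rewrites just the marker field and prints the whole record might look like the sketch below. It assumes tab-separated input and that every marker ends in _<digits>; rename_markers is a hypothetical name, and stopping on an unknown contig is a design choice, not something the original answer does.

```shell
#!/bin/bash
# Hypothetical helper: renumber contigs in Beagle markers from a FASTA index.
# Usage: rename_markers my.fna.fai <(zcat MM.beagle.gz)
rename_markers() {
    awk 'BEGIN { FS = OFS = "\t" }
    FNR == NR { id[$1] = FNR; next }      # contig name -> line number in the .fai
    FNR == 1  { print; next }             # keep the Beagle header unchanged
    {
        chrom = $1; sub(/_[0-9]+$/, "", chrom)   # drop the trailing _position
        pos   = $1; sub(/^.*_/, "", pos)         # keep only the position
        if (!(chrom in id)) {
            print "unknown contig: " chrom > "/dev/stderr"
            exit 1
        }
        $1 = id[chrom] "_" pos            # rewrite the marker, keep other columns
        print
    }' "$1" "$2"
}
```

Because the lookup goes through the contig name, the order of markers in the Beagle file does not matter.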

Add a condition for a specific row range in a script

I want to modify the following script:
awk 'NR>242 && $1 =='$t' {print $4, "\t" '$t'}' test.txt > file
I want to add a condition for rows 1 to 121 (the first 121 points) and another for rows 122 to 242 (the other 121 points),
so it becomes:
when NR>242, take the values corresponding to rows 1 to 121 and print them to file1
when NR>242, take the values corresponding to rows 122 to 242 and print them to file2
Thanks!
Generic solution: here is a more generic version, where you list all the line numbers in the lines variable of the awk program. Whenever the current line number matches one of those values, the output switches to the next file, e.g. from file1 to file2, from file2 to file3, and so on.
awk -v val="$t" -v lines="121,242" -v count=1 '
BEGIN{
  num=split(lines,arr,",")
  for(i=1;i<=num;i++){
    line[arr[i]]            # set of line numbers at which to switch files
  }
  outputfile="file" count
}
FNR in line{
  close(outputfile)
  outputfile="file" (++count)
}
($1 == val){
  print $4 "\t" val > (outputfile)
}
' Input_file
With your shown samples, please try the following. It prints matching lines from the 1st to the 242nd line to file1, and matching lines from the 243rd line onwards to file2. The shell variable t is passed into the awk program as the awk variable val.
awk -v val="$t" '
FNR==1{
outputfile="file1"
}
FNR==243{
outputfile="file2"
}
($1 == val){
print $4 "\t" val > (outputfile)
}
' Input_file
$ awk -v val="$t" '{c=int((NR-1)%242/121)+1}
$1==val {print $4 "\t" $1 > ("output" c)}' file
This should send the first, third, etc. blocks of 121 records to output1 and the second, fourth, etc. blocks of 121 records to output2, for the records that satisfy the condition.
If you want to skip the first two blocks (the first 242 records), just add && NR>242 to the existing condition.
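To see how the expression int((NR-1)%242/121)+1 maps record numbers onto the two outputs, it can be evaluated on its own for a few values. This is just a quick check, with block_of as a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: print the output index (1 or 2) for record number $1,
# using the same arithmetic as the one-liner above.
block_of() {
    awk -v n="$1" 'BEGIN { print int((n - 1) % 242 / 121) + 1 }'
}

# Records 1-121 map to 1, records 122-242 map to 2, then the pattern repeats.
for n in 1 121 122 242 243 364; do
    printf '%s -> output%s\n' "$n" "$(block_of "$n")"
done
```

The % 242 wraps the record number into one 242-record cycle, and the division by 121 picks the first or second half of that cycle.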

add plus or minus in awk if no match

I am trying to match all the lines in file against bigfile. The awk below does that for exact matches; the problem is that values without an exact match should still count if they are within plus or minus 10. I am not sure how to tell awk that, when no exact match is found, it should try the coordinates in file plus or minus 10. If nothing is found after that, then there is no match in the file. Thank you :).
file
955763
957852
976270
bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1
awk
awk 'NR==FNR{A[$1];next}$3 in A' file bigfile > output
desired output (same as bigfile)
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
If there's no difference between a row that matches and one that's close, you could just set all of the keys in the range in the array:
awk 'NR == FNR { for (i = -10; i <= 10; ++i) A[$1+i]; next }
$3 in A' file bigfile > output
The advantage of this approach is that only one lookup is performed per line of the big file.
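As a quick check of the key-expansion idea, the sketch below wraps it in a function with the margin as a parameter. The name match_within and the margin argument are assumptions added for illustration; the awk body is otherwise the same as above.

```shell
#!/bin/bash
# Hypothetical helper: print rows of bigfile whose 3rd field is within
# +/- margin of any coordinate listed in file.
# Usage: match_within file bigfile [margin]     (margin defaults to 10)
match_within() {
    awk -v m="${3:-10}" '
    NR == FNR { for (i = -m; i <= m; ++i) A[$1 + i]; next }   # expand each coordinate
    $3 in A' "$1" "$2"                                         # one lookup per row
}
```

Memory use grows with (2 * margin + 1) keys per coordinate, so this suits small margins like the one in the question.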
You need to run a loop on array a:
awk 'NR==FNR {
a[$1]
next
}
{
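for (i in a)
if (i+0 <= $3+10 && i+0 >= $3-10) {   # i+0: array keys are strings, force a numeric comparison
print
break                                 # print each line at most once
}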
}' file bigfile > output
Your data already produces the desired output (all exact match).
$ awk 'NR==FNR{a[$1];next} $3 in a{print; next}
{for(k in a)
if((k-$3)^2<=10^2) {print $0, " --> within 10 margin"; next}}' file bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 976251 976261 chr1:976251-976261 AGRN-8|gc=57.1 --> within 10 margin
I added a fake 4th row to get the margin match

AWK display the line number of last match

I'm new to AWK. Does anyone know how to print the line number of the last match in a file using awk?
Here's a small part of the Test.txt file content:
CLOSE #140,value=140
WAIT = #14039,value=143
CLOSE #140,value=144
WAIT #0,value=155
WAIT = #14039,value=158
CLOSE #140,value=160
This is the code I have used so far.
It works for the first matching line:
awk -F= '{if($NF >= 143 && $NF <= 158){print NR; exit}}' Test.txt
But for the last matching line,
awk -F= '{if($NF >= 143 && $NF <= 158){a=$0}} END{print a,NR}' Test.txt
prints only the last matching line itself and the last line number of the file.
How can I get the line number of the last match?
Please help me with some advice.
Use a = NR instead of a = $0 (because it's the line number you want to remember, not the line itself).
Apart from that, it would arguably be more awkish to write
awk -F= '$NF >= 143 && $NF <= 158 { a = NR } END { print a }' Test.txt
{if(){}} is a bit ugly.
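Put together as a small reusable function, with the bounds passed in as awk variables, this could look like the sketch below. The name last_match_line and the lo/hi parameters are assumptions for illustration:

```shell
#!/bin/bash
# Hypothetical helper: print the line number of the last line whose final
# "="-separated field lies in [lo, hi].
# Usage: last_match_line Test.txt 143 158
last_match_line() {
    awk -F= -v lo="$2" -v hi="$3" '
    $NF >= lo && $NF <= hi { last = NR }   # overwritten on every match
    END { if (last) print last }           # prints nothing if no line matched
    ' "$1"
}
```

Because last is overwritten on each match, only the final matching line number survives to the END block.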

Matching blocks with conditions

I am in need of some regexp guru help.
I am trying to make a small config system for a home project, but it seems that this needs a bit more regexp code than my regexp skills can come up with.
I need to be able to extract some info inside blocks based on conditions and actions. For example:
action1 [condition1 condition2 !condition3] {
Line 1
Line 2
Line 3
}
The conditions are stored in simple variables, separated by spaces. I use these variables to create the regexp used to extract the block info from the file. Most of this is working fine, except that I have no idea how to do the "not matching" part, which basically means that a "word" must not be present in the condition variable.
VAR1="condition1 condition2"
VAR2="condition1 condition2 condition3"
When matched against the above, it should match VAR1 but not VAR2.
This is what I have so far
PARAMS="con1 con2 con3"
INPUT_PARAMS="[^!]\\?\\<$(echo $PARAMS | sed 's/ /\\>\\|[^!]\\?\\</g')\\>"
sed -n "/^$ACTION[ \t]*\(\[\($INPUT_PARAMS\)*\]\)\?[ \t]*{/,/}$/p" default.cfg | sed '/^[^{]\+{/d' | sed '/}/d'
Not sure how pretty this is, but it does work, except for not-matching.
EDIT:
Okay I will try to elaborate a bit.
Let's say that I have the below text/config file
action1 [con1 con2 con3] {
Line A
Line B
}
action2 [con1 con2 !con3] {
Line C
}
action3 [con1 con2] {
Line D
}
action4 {
Line E
}
and I have the following conditions to match against
ARG1="con1 con2 con3"
ARG2="con1 con2"
ARG3="con1"
ARG4="con1 con4"
# Matching against ARG1 should print Line A, B, D and E
# Matching against ARG2 should print Line C, D and E
# Matching against ARG3 should print Line E
# Matching against ARG4 should print Line E
Below is a Java-like example of action2 using a normal conditional check. It gives a better idea of what I am trying to do:
if (ARG2.contains("con1") && ARG2.contains("con2") && !ARG2.contains("con3")) {
// Print all lines in this block
}
The logic of how you're selecting which records to print lines from isn't clear to me so here's how to create sets of positive and negative conditions using awk:
$ cat tst.awk
BEGIN{
RS = ""; FS = "\n"
# create the set of the positive conditions in the "conds" variable.
n = split(conds,tmp," ")
for (i=1; i<=n; i++)
wanted[tmp[i]]
}
{
# create sets of the positive and negative conditions
# present in the first line of the current record.
delete negPresent # use split("",negPresent) in non-gawk
delete posPresent
n = split($1,tmp,/[][ {]+/)
for (i=2; i<n; i++) {
cond = tmp[i]
sub(/^!/,"",cond) ? negPresent[cond] : posPresent[cond]
}
allPosInWanted = 1
for (cond in posPresent)
if ( !(cond in wanted) )
allPosInWanted = 0
someNegInWanted = 0
for (cond in negPresent)
if (cond in wanted)
someNegInWanted = 1
if (allPosInWanted && !someNegInWanted)
for (i=2;i<NF;i++)
print $i
}
.
$ awk -v conds='con1 con2 con3' -f tst.awk file
Line A
Line B
Line D
Line E
$
$ awk -v conds='con1 con2' -f tst.awk file
Line C
Line D
Line E
$
$ awk -v conds='con1' -f tst.awk file
Line E
$
$ awk -v conds='con1 con4' -f tst.awk file
Line E
$
Now you just have to code whatever logic you like in that final block, where the printing is done, to compare the conditions in each of the sets.
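One detail worth spelling out: RS = "" puts awk in paragraph mode, so the script relies on the blocks in the config file being separated by blank lines. The condensed sketch below applies the same rule as the Java-like example in the question (every plain condition must be set, every !condition must not be). print_blocks is a hypothetical name, and the blank-line-separated layout is an assumption about default.cfg.

```shell
#!/bin/bash
# Hypothetical helper: print the body lines of every block whose conditions
# are satisfied by the given space-separated list.
# Usage: print_blocks 'con1 con2' default.cfg
print_blocks() {
    awk -v conds="$1" '
    BEGIN {
        RS = ""; FS = "\n"                  # one blank-line-separated block per record
        n = split(conds, t, " ")
        for (i = 1; i <= n; i++) wanted[t[i]]
    }
    {
        ok = 1
        n = split($1, t, /[][ {]+/)         # t[1] is the action name, t[n] is empty
        for (i = 2; i < n; i++) {
            c = t[i]
            if (sub(/^!/, "", c)) {         # negated: must NOT be in the list
                if (c in wanted) ok = 0
            } else if (!(c in wanted)) {    # plain: must be in the list
                ok = 0
            }
        }
        if (ok)
            for (i = 2; i < NF; i++) print $i   # skip the header and the closing }
    }' "$2"
}
```

A block without any [conditions], such as action4, always passes, which matches the expected output in the question.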