File comparison using awk in Linux - regex

I have two files.
File A.txt (Groupname:Groupid)
wheel:1
www:2
ftpshare:3
others:4
File B.txt (username:UserID:Groupid)
pi:1:1
useradmin:2:3
usertwo:3:3
trout:4:3
apachecri:5:2
guestthree:6:4
I need to create an output that shows username:userID:Groupname, like below:
pi:1:wheel
useradmin:2:ftpshare
(and so on)
This needs to be done using awk for a Unix class. After spending countless hours trying to figure it out, here is what I came up with:
awk -F ':' 'NR==FNR{ if ($2==[a-z]) a[$1] = $4;next} NF{ print $1, $2, $4}' fileA.txt fileB.txt
OR
awk -F, 'NR==FNR{ a[$2]=$2$1;next } NF{ print $1, $2 ((a[$2]==$2$3)?",ok":",nok") }' FileA.txt FileB.txt
Can someone help me figure out how to get the right output, and explain what I am doing wrong?

You can use awk:
awk 'BEGIN{FS=OFS=":"} FNR==NR{a[$2]=$1; next} $3 in a{print $1, $2, a[$3]}' a.txt b.txt
pi:1:wheel
useradmin:2:ftpshare
usertwo:3:ftpshare
trout:4:ftpshare
apachecri:5:www
guestthree:6:others
How it works:
BEGIN{FS=OFS=":"} - Set both the input and output field separators to a colon.
FNR==NR - This condition is true only while reading the first file, so the next block runs for fileA only.
{a[$2]=$1; next} - Build an associative array a with the group ID ($2) as the key and the group name ($1) as the value, then skip to the next record.
$3 in a - While reading the second file, execute the following block only if $3 exists as a key in array a.
print $1, $2, a[$3] - Print field 1, field 2, and the group name looked up via a[$3].
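If you also want to see lines whose group ID is missing from fileA, a small variation (a sketch, not from the original answer; the UNKNOWN placeholder is made up) replaces the $3 in a guard with a ternary:
awk 'BEGIN{FS=OFS=":"} FNR==NR{a[$2]=$1; next} {print $1, $2, ($3 in a ? a[$3] : "UNKNOWN")}' a.txt b.txt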

I know you said you want to use awk, but you should also consider the standard tool designed for a task like this, namely join. Here is one way you could apply it:
join -o '2.1 2.2 1.1' -t: -1 2 -2 3 <(sort -t: -k2,2n fileA.txt) \
<(sort -t: -k3,3n fileB.txt)
Because the input to join needs to be sorted on the join field, this method leaves the output unordered; if the order is important, use anubhava's answer.
Output in this case:
pi:1:wheel
apachecri:5:www
trout:4:ftpshare
useradmin:2:ftpshare
usertwo:3:ftpshare
guestthree:6:others
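If the original ordering matters but you still want join, you can also re-sort the joined output by user ID; a sketch, assuming the user IDs in field 2 are plain numbers:
join -o '2.1 2.2 1.1' -t: -1 2 -2 3 <(sort -t: -k2,2n fileA.txt) \
<(sort -t: -k3,3n fileB.txt) | sort -t: -k2,2n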

Interval expressions in gawk to awk

I hope this is an easy fix.
I originally wrote a clean and simple script that used gawk, mainly because that is what I found when I was solving the original issue. I now need to adapt it to use only awk.
sample file.fasta:
>gene1
>gene235
ATGCTTAGATTTACAATTCAGAAATTCCTGGTCTATTAACCCTCCTTCACTTTTCACTTTTCCCTAACCCTTCAAAATTTTATATCCAATCTTCTCACCCTCTACAATAATACATTTATTATCCTCTTACTTCAAAATTTTT
>gene335
ATGCTCCTTCTTAATCTAAACCTTCAAAATTTTCCCCCTCACATTTATCCATTATCACCTTCATTTCGGAATCCTTAACTAAATACAATCATCAACCATCTTTTAACATAACTTCTTCAAAATTTTACCAACTTACTATTGCTTCAAAATTTTTCAT
>gene406
ATGTACCACACACCCCCATCTTCCATTTTCCCTTTATTCTCCTCACCTCTACAATCCCCTTAATTCCTCTTCAAAATTTTTGGAGCCCTTAACTTTCAATAACTTCAAAATTTTTCACCATACCAATAATATCCCTCTTCAAAATTTTCCACACTCACCAAC
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
What I know works in awk is the following:
awk '/[ACTG]GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
The culprit, therefore, is the interval expression {21,}.
What I want is for it to match each line that contains at least 21 nucleotides to the left of my "GG" match.
Can anyone help?
Edit:
Thanks for all the help:
There are various solutions that worked. To reply to some of the comments, here is a more basic example of the initial input and the desired effect achieved...
Prior to the awk command:
cat file1.fasta
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
The following commands all produced the same desired output:
Original code:
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
A slight modification that enables interval expressions on gawk 3.x (they are on by default in 4.0+):
awk --re-interval '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
Allows the required count (usr_count) to be modified; untested, but should work with older versions of awk:
awk -v usr_count="21" '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>= usr_count){print id ORS $0};id=""}' file1.fasta
awk --re-interval '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS} /^>/{name=$0; seq=""; next} {seq = seq $0 } END { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS }' file1.fasta
Desired output: only grab the gene names and sequences for sequences that have at least 21 nucleotides prior to a GG match.
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Lastly, just to show the discarded lines:
>gene2
ATG-GG-TGCCTTAACTTTCAATAACTG # only 3 nt prior to any GG combo
>gene3
ATGTCAAAATTTTTCATTTCAAT # No GG match found
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA # only 14 nt prior to any GG combo
Hope this helps others!
EDIT: As per the OP's comment, the gene IDs need to be printed too; in that case, try the following.
awk '
/gene/{
id=$0
next
}
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print id ORS $0
}
id=""
}
' Input_file
Or the one-liner form of the above solution, as per the OP's request:
awk '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>=21){print id ORS $0};id=""}' Input_file
Could you please try the following, written and tested with the shown samples only.
awk '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print
}
}
' Input_file
Or a more generic approach, which creates a variable in which the user can set how many nucleotides must be present before GG:
awk -v usr_count="21" '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=usr_count){
print
}
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Start the awk program from here.
match($0,/.*GG/){ ##Use the match function to match everything up to the final GG in the current line.
val=substr($0,RSTART,RLENGTH-2) ##Store the substring of the current line from RSTART to RLENGTH-2 (the match without the trailing GG) in the variable val.
if(gsub(/[ACTG]/,"&",val)>=21){ ##Count the [ACTG] characters in val (gsub replaces each with itself and returns the number of substitutions); if there are at least 21, do the following.
print ##Print the current line.
}
}
' Input_file ##Mention the Input_file name here.
GNU awk has accepted interval expressions in regular expressions since version 3.0. However, they only became enabled by default in version 4.0. If you have gawk 3.x, you have to use the --re-interval flag to enable them.
awk --re-interval '/a{3,6}/{print}' file
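If you are unsure whether your awk honors interval expressions, here is a quick test (a sketch: on an awk without interval support, the braces are treated as literal characters and nothing is printed):
echo aaaa | awk '/^a{3,}$/{print "interval expressions supported"}'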
There is an issue that people often overlook with FASTA files and awk: when you have multi-line sequences, it is possible that your match spans multiple lines. To handle this you need to combine each sequence first.
The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time you have read a full sequence, you can process it. Note that, for reliable processing, the sequence should be stored as one continuous string, without any newlines or whitespace. A generic awk program for processing FASTA looks like this:
awk '/^>/ && seq { **process_sequence_here** }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { **process_sequence_here** }' file.fasta
In the presented case, your sequence processing looks like:
awk '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS}
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { if (match(seq,"[ACTG]{21,}GG")) print ">" name ORS seq ORS }' file.fasta
Sounds like what you want is:
awk 'match($0,/[ACTG]+GG/) && RLENGTH>22{print a; print} {a=$0}' file
but this is probably all you need given the sample input you provided:
awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
They'll both work in any awk.
Using your updated sample input:
$ awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

Numeric expression in if condition of awk

I'm pretty new to awk programming. I have a file1 with entries like:
15>000000513609200>000000513609200>B>I>0011>>238/PLMN/000100>File Ef141109.txt>0100-75607-16156-14 09-11-2014
15>000000513609200>000000513609200>B>I>0011>Danske Politi>238/PLMN/000200>>0100-75607-16156-14 09-11-2014
15>000050354428060>000050354428060>B>I>0011>Danske Politi>238/PLMN/000200>>4100-75607-01302-14 31-10-2014
I want to write an awk script where, if subtracting the 2nd field from the 3rd field gives 0, it prints field 2. Else, if the difference is greater than 0, it prints all the intermediate numbers, incrementing by 1, starting from the 2nd field and ending at the 3rd field. There will be no scenario where the 3rd field is less than the 2nd, so that condition can be ignored.
I was doing something like:
awk 'NR > 2 { print p } { p = $0 }' file1 | awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
(( Someone told me awk is close to C in terms of syntax ))
But from the output it looks to me like the string-to-numeric or numeric-to-string conversions are not taking place at the right place at the right time. Shouldn't that be taken care of by awk automatically?
The OUTPUT that I get:
513609200
513609201
513609200
This is not quite as expected. One evident issue is that it ignores the leading zeros.
Kindly help me modify the AWK script to get the desired result.
NOTE:
awk 'NR > 2 { print p } { p = $0 }' file1 is just there to remove the first and last entries of my original file1. So the part that needs to be fixed is:
awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
In awk, think of $ as an operator that retrieves the value of the numbered field ($0 being a special case):
$1 is the value of field 1
$NF is the value of the field whose number is held in the NF variable
So, $($3 - $2) will try to get the value of the field whose number is given by the expression ($3 - $2).
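A quick illustration of $ as an operator (not from the original thread):
echo 'a b c' | awk '{ n = 2; print $n, $(n+1), $NF }'   # prints: b c c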
You need fewer $ signs:
awk -F">" '{
if ($3 == $2)
print $2
else {
v=$2
while (v < $3)
print v++
}
}'
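If the leading zeros matter (as your note about the preceding 0s suggests), one way is to zero-pad to the width of the original field; a sketch (not from the original answer), assuming the width of field 2 is the desired output width:
awk -F">" '{
    if ($3 == $2)
        print $2
    else {
        fmt = "%0" length($2) "d\n"   # zero-pad to the width of field 2
        v = $2 + 0
        while (v < $3 + 0)
            printf fmt, v++
    }
}'
Like the loop above, this relies on awk's floating-point integers, which are exact only up to 2^53.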
Normally this would work, but your numbers are beyond awk's integer bounds, so you need another solution to handle them. I'm posting this to prompt other solutions and to better illustrate your specification.
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
Note that this will skip the rows that you say are impossible (where the 3rd field is less than the 2nd).
A small-scale example:
$ cat file_0
x>1000>1000>etc
x>2000>2003>etc
x>3000>2999>etc
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file_0
1000
2000
2001
2002
2003
Apparently, newer versions of gawk have a --bignum option for arbitrary-precision integers. If you have a compatible version, that may solve your problem, but I don't have access to one to verify.
For anyone who does not have ready access to gawk with bigint support, it may be simpler to consider other options if some kind of "big integer" support is required. Since ruby has an awk-like mode of operation,
let's consider ruby here.
To get started, there are just four things to remember:
invoke ruby with the -n and -a options (-n for the awk-like loop; -a for automatic parsing of lines into fields ($F[i]));
awk's $n becomes $F[n-1];
explicit conversion of numeric strings to integers is required;
To specify the lines to be executed on the command line, use the '-e TEXT' option.
Thus a direct translation of:
awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
would be:
ruby -an -F'>' -e '($F[1].to_i .. $F[2].to_i).each {|i| puts i }' file
To guard against empty lines, the following script would be slightly better:
($F[1].to_i .. $F[2].to_i).each {|i| puts i } if $F.length > 2
This could be called as above, or if the script is in a file (say script.rb) using the incantation:
ruby -an -F'>' script.rb file
Given the OP input data, the output is:
513609200
513609200
50354428060
The left-padding can be accomplished in several ways -- see for example this SO page.
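For instance, a quick ruby sketch of the zero-padding (an illustration, assuming the width of the original second field is the desired output width):
ruby -an -F'>' -e 'w = $F[1].length
($F[1].to_i .. $F[2].to_i).each {|i| puts i.to_s.rjust(w, "0") } if $F.length > 2' file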

awk - if values match then print file1 and file2

I googled my problem a lot and tested different solutions, but none seem to work. I have even used the same command before with success, but now I cannot get my desired output.
I have file1
AAA;123456789A
BBB;123456789B
CCC;123456789C
And file2
1;2;3;CCC;pippo
1;2;3;AAA;pippo
1;2;3;BBB;pippo
1;2;3;*;pippo
My desired output is this:
1;2;3;CCC;pippo;CCC;123456789C
1;2;3;AAA;pippo;AAA;123456789A
1;2;3;BBB;pippo;BBB;123456789B
I tried with this command:
awk -F";" -v OFS=";" 'FNR == NR {a[$10]=$1; b[$20]=$2; next}($10 in a){ if(match(a[$10],$4)) print $0,a[$10],b[$20]}' file1 file2
But I get this output (only one entry, even with bigger files):
1;2;3;CCC;pippo;CCC;123456789C
What am I doing wrong? If it works for one entry, it should work for all the others. Why is this not happening?
Also, why doesn't it work if I set a[$1]=$1?
Thank you for helping!
If possible, could you explain the answer, so next time I won't have to ask for help?
EDIT: Sorry, I did not mention (since I wanted to keep the example minimal) that in file2 some fields are just "*", and I'd like to add an "else: doesn't match, do something" branch.
awk to the rescue!
$ awk 'BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
{print $0,a[$4]}' file1 file2
1;2;3;CCC;pippo;CCC;123456789C
1;2;3;AAA;pippo;AAA;123456789A
1;2;3;BBB;pippo;BBB;123456789B
UPDATE:
Based on the original input file, this was only doing an exact lookup. If you want to skip the entries where there is no match, you need to qualify the print block with $4 in a:
$ awk 'BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
$4 in a{print $0,a[$4]}' file1 file2
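For the "else doesn't match do something" part of your edit, a ternary in the print works; a sketch where NOMATCH is just a made-up placeholder:
$ awk 'BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
{print $0, ($4 in a ? a[$4] : "NOMATCH")}' file1 file2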
join is made for this sort of thing:
$ join -t';' -1 4 -o1.{1..5} -o2.{1..2} <(sort -t';' -k4 file2) <(sort -t';' file1)
1;2;3;AAA;pippo;AAA;123456789A
1;2;3;BBB;pippo;BBB;123456789B
1;2;3;CCC;pippo;CCC;123456789C
The output is what you asked for except for the ordering of lines, which I assume isn't important. The -o options to join are needed because you want the full set of fields; you can try omitting them, and you'll get the join field on the left a single time instead, which might also be fine.

Find duplicates of a file by name in a directory recursively - Linux

I have a folder which contains subfolders and some more files in them.
The files are named in the following way
abc.DEF.xxxxxx.dat
I'm trying to find duplicate files by matching only the 'xxxxxx' part of the above pattern and ignoring the rest. The .dat extension doesn't change, but the lengths of abc and DEF might. The order of the period-separated parts also doesn't change.
I'm guessing I need to use find in the following way:
find -regextype posix-extended -regex '\w+\.\w+\.\w+\.dat'
I need help coming up with the regular expression. Thanks.
Example:
For a file named 'epg.ktt.crwqdd.dat', I need to find duplicate files containing 'crwqdd'.
You can use awk for that:
find /path -type f -name '*.dat' | awk -F. 'a[$4]++'
Explanation:
Suppose find gives the following output:
./abd.DdF.TTDFDF.dat
./cdd.DxdsdF.xxxxxx.dat
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
Basically, spoken in the words of a computer, you want to count the occurrences of the pattern between .dat and the dot before it, and print those lines where the pattern appears at least a second time.
To achieve this, we split the file names on the . which gives us 5(!) fields:
echo ./abd.DEF.xxxxxx.dat | awk -F. '{print $1 " " $2 " " $3 " " $4 " " $5}'
/abd DEF xxxxxx dat
Note the first, empty field. The pattern of interest is $4.
To count the occurrences of a pattern in $4 we use an associative array a and increment its value on each occurrence. Unoptimized, the awk command looks like:
... | awk -F. '{if(a[$4]++ > 0){print}}'
However, you can write an awk program in the form:
CONDITION { ACTION }
Which gives us:
... | awk -F. 'a[$4]++ > 0 {print}'
print is the default action in awk: it prints the whole current line, and since it is the default action it can be omitted. The > 0 comparison can also be dropped, because awk treats integer values greater than zero as true. This gives us the final command:
... | awk -F. 'a[$4]++'
To generalize the command, note that the pattern of interest isn't really the 4th column; it is the next-to-last column. This can be expressed using awk's built-in NF (number of fields) variable:
... | awk -F. 'a[$(NF-1)]++'
Output:
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
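Note that a[$(NF-1)]++ prints only the second and later occurrences, so the first file of each duplicate group is not listed. If you want every member of each group, you can buffer the lines and print at the end (a sketch, assuming the file names contain no newlines):
find /path -type f -name '*.dat' | awk -F. '
{ key[NR] = $(NF-1); line[NR] = $0; count[$(NF-1)]++ }   # remember each line and count its key
END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }   # print all members of duplicated keys
'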

Split string on a backslash ("\") delimiter in awk?

I am trying to split the strings in a file on a delimiter, but I am not able to achieve it correctly... Here is my code:
awk 'var=split($2,arr,'\'); {print $var}' file1.dat
Here is my sample data:
Col1 Col2
abc 123\abc
abcd 123\abcd
Desired output:
Col1 Col2
abc abc
abcd abcd
You don't need to call split. Just use \\ as the field separator:
echo 'a\b\c\d' | awk -F\\ '{printf("%s,%s,%s,%s\n", $1, $2, $3, $4)}'
OUTPUT:
a,b,c,d
The sample data and output below are my best guess at your requirement:
echo '1:2\\a\\b:3' | awk -F: '{
n=split($2,arr,"\\")
# print "#dbg:n=" n
var=arr[3]
print var
}'
Output:
b
Recall that split returns the number of fields it produced. You can uncomment the debug line and you'll see the value 3 returned.
Note also that for my test, I had to use two '\' characters for one to be processed. I don't think you'll need that in a file, but if this doesn't work with a file, then try adding extra '\' characters to your data as needed. I tried several variations on how to use '\', and this seems the most straightforward. Others are welcome to comment!
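The layers of escaping are the subtle part here. A regex that matches one literal backslash is \\, and written as an awk string literal that becomes "\\\\", because the string parser consumes one level of escaping. Many awks also accept the shorter "\\", but POSIX leaves a lone backslash in a dynamic regex undefined. A minimal sketch (assuming bash, whose builtin echo leaves backslashes alone without -e):
echo 'a\b\c' | awk '{ n = split($0, parts, "\\\\"); print parts[n] }'   # prints: c
echo 'a\b\c' | awk -F'\\' '{ print $2 }'                                # prints: b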
I hope this helps.
As some of the comments mentioned, you have nested single quotes. Switching the delimiter to a double-quoted string, and printing an element of the array rather than $var, fixes it:
awk '{n=split($2,arr,"\\"); print $1, arr[n]}' file1.dat
I'd prefer piping to another awk command over using split. I don't know that one is better than the other; it is just a preference.
awk '{print $2}' file1.dat | awk -F'\' '{...}'
You need to escape the backslash you're trying to split on. You can do this in your split using double quotes, like this: "\\"
Also, you can index into the array to make your code more readable (and avoid defining another variable). This should work for you:
awk 'NR==1 { print } NR>=2 { split($0,array,"\\"); print $1,array[2] }' file1.dat
HTH
awk '{sub(/123\\/,"")}1' file
Col1 Col2
abc abc
abcd abcd