I want to count presence of 3 strings in each line of a file,
for example how many times there is "a" and the is not "b" and ...(note tha a and b are some regex codes not exactly a and b)
I have used this:
#! /bin/sh
array1("/(a)/"
"/(b)/"
"/(c)/")
for i in "${array1[#]}"
do
for j in "${array1[#]}"
do
echo `/usr/bin/awk -v i="$i" -v j="$j" '( i &&! j)|| ( j && i) {count++ } END { print count }' myfile.txt`
done
done
but its not working with the valu of i and j (a,b,c) and it is executed as this:
/usr/bin/awk -v i=/(a)/ -v j=/(b)/ ( i &&! j)|| ( j && i) {count++ } END { print count } myfile.txt
You can use eval, like this:
eval "awk '( "$i" &&! "$j")|| ( "$j" && "$i") {count++ } END { print count }' myfile.txt"
Related
I have 2 files,
file1:
YARRA2
file2:
59204.9493055556
59205.5930555556
So, file1 has 1 line and file2 has 2 lines. If file1 has 1 line, and file2 has more than 1 line, I want to repeat the lines in file1 according to the number of lines in file2.
So, my code is this:
eprows=$(wc -l < file2)
awk '{ if( NR<2 && eprows>1 ) {print} {print}}' file1
but the output is
YARRA2
Any idea? I have also tried with
awk '{ if( NR<2 && $eprows>1 ) {print} {print}}' file1
but it is the same
You may use this awk solution:
awk '
NR == FNR {
++n2
next
}
{
s = $0
print;
++n1
}
END {
if (n1 == 1)
for (n1=2; n1 <= n2; ++n1)
print s
}' file2 file1
YARRA2
YARRA2
eprows=$(wc -l < file2)
awk '{ if( NR<2 && eprows>1 ) {print} {print}}' file1
Oops! You stepped hip-deep in mixed languages.
The eprows variable is a shell variable. It's not accessible to other processes except through the environment, unless explicitly passed somehow. The awk program is inside single-quotes, which would prevent interpreting eprows even if used correctly.
The value of a shell variable is obtained with $, so
echo $eprows
2
One way to insert the value into your awk script is by interpolation:
awk '{ if( NR<2 && '"$eprows"'>1 ) {print} {print}}' file1
That uses a lesser known trick: you can switch between single- and double-quotes as long as you don't introduce spaces. Because double-quoted strings in the shell are interpolated, awk sees
{ if( NR<2 && 2>1 ) {print} {print} }
Awk also lets you pass values to awk variables on the command line, thus:
awk -v eprows=$eprows '{ if( NR<2 && eprows >1 ) {print} {print}}' file1
but you'd have nicer awk this way:
awk -v eprows=$eprows 'NR < 2 && eprows > 1 { {print} {print} }' file1
whitespace and brevity being elixirs of clarity.
That works because in the awk pattern / action paradigm, pattern is anything that can be reduced to true/false. It need not be a regex, although it usually is.
One awk idea:
awk '
FNR==NR { cnt++; next } # count number of records in 1st file
# no specific processing for 2nd file => just scan through to end of file
END { if (FNR==1 && cnt >=2) # if 2nd file has just 1 record (ie, FNR==1) and 1st file had 2+ records then ...
for (i=1;i<=cnt;i++) # for each record in 1st file ...
print # print current (and only) record from 2nd file
}
' file2 file1
This generates:
YARRA2
YARRA2
I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with > and followed by a line with a long string of characters -AGTCNR. The two lines make a complete record.
Here is how a simple file looks like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------NNNN
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
--------NNNTCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAANNN-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA--NNAGTNNNNNNNNNNNNNNNAATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
NNNNNNNNNNNTCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
With the following code I can search through the second line, that contains the string of characters, for each record and extract records which have up to a certain maximum number of - or N or n characters at the beginning of line using $start_gaps variable and end of line using $end_gaps variable, this is done in the thread here:
start_Ns=10
end_Ns=10
awk -v start_N=$start_Ns -v end_N=$end_Ns ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs.aln
Now i need to search for the occurrence of - or N or n characters in the region "not including" the beginning or end terminals of the second line for every record and filter out records with more than a specific maximum number of - or N or n characters. The code below does it but i need to use a variable that i can easily reset:
start_Ns=10
end_Ns=10
awk -v start_N=10 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N && match($0,/N{0,11}/) {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
As for a variable i tried the following but failed:
awk -v start_N=10 -v mid_N=11 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/N{0,mid_N}/) && match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
Expected results:
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAANNN-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
I would suggest the following logic in order not to overcomplicate things.
Search the beginning part, remove it from the string
Search the end part, remove it from the string
Search the middle part in the remainder:
awk -v start_N=10 -v mid_N=11 -v end_N=10 '
/^>/{hdr=$0; next}
{ seq=$0 }
match(seq,/^[-Nn]*/) && RLENGTH > start_N { next }
{ seq=substr(seq,RSTART+RLENGTH) }
match(seq,/[-Nn]*$/) && RLENGTH > end_N { next }
{ seq=substr(seq,1,RSTART-1) }
{ while (match(seq,/[-Nn]+/)) {
if(RLENGTH>mid_N) next
seq=substr(seq,RSTART+RLENGTH)
}
}
{ print hdr; print $0 }' file
An alternative method would be making use of Extended Regular expressions with character duplication:
awk -v start_N=10 -v mid_N=11 -v end_N=10 '
(FNR==1) { ere_start = "^[-Nn]{" start_N+1 ",}"
ere_end = "[-Nn]{" mid_N+1 ",}$"
ere_mid = "[^-Nn][-Nn]{" end_N+1 ",}[^-Nn]"
/^>/{hdr=$0; next}
{ seq=$0 }
match(seq,ere_start) { next }
match(seq,ere_end) { next }
match(seq,ere-mid) { next }
{ print hdr; print $0 }' file
You can use a string as the second argument to match and then the regular string interpolation operators in Awk work fine.
awk -v start_N=10 -v mid_N=11 -v end_N=10 ' /^>/ {
hdr=$0; next }
match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,"N{0," mid_N "}") &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }'
Just to spell this out a bit, if you use /regex/ then the text between the slashes is immediately interpreted as a regular expression, but if you use "regex" or a piece of code which evaluates to a string, the regular Awk string-handling functions are processed first, and only then is the resulting string interpreted as a regular expression.
Thanks for your question. In my humble opinion, you should rephrase a bit your question and make sure your objective is 100% clear to all potential readers of this thread.
With regards to having a variable inside a construct in which awk doesn't allow the use of a variable, there is a standard trick that would apply whichever scripting tool you would use (e.g. sed or even some more complex stuff in perl or Python): interrupt your awk script by breaking the single-quote construct, and you insert in there a variable expansion that is performed by the shell, not by awk. For instance, here, you would define mid_N in Bash and then use "${mid_N}" in the middle of your awk script, with a closing single quote immediately before and a (re-)opening single quote immediately after. Like so:
mid_N=11
awk -v start_N=10 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/N{0,'"${mid_N}"'}/) && match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
That's a minimal-edit solution to the specific issue you mentioned below your "As for a variable i tried the following but failed:"
Hi I am trying to match the following string to no avail
echo '[xxAA][xxBxx][C]' | awk -F '/\[.*\]/' '{ for (i = 1; i <= NF; i++) printf "-->%s<--\n", $i }'
I basically want to have each field be an enclosing bracket such that
field 1 = xxAA
field 2 = xxBxx
field 3 = C
but i keep getting the following result
-->[xxAA][xxBxx][C]<--
any pointers where I am going wrong?
You can use a regex in Field Separator. We enclose the [ and ] in character class to have it considered as literal. Both are separated by | which is logical OR. Since we target them as field separator we just iterate over even field numbers to get the output.
$ echo '[xxAA][xxBxx][C]' | awk -v FS="[]]|[[]" '{ for (i=2;i<=NF;i+=2) print $i }'
xxAA
xxBxx
C
The regex /\[.*\]/ matches the entire input, because the .* matches the ][ inside the input as well as matching the letters.
You could split fields on the ']' character instead, then put it back again in the output:
echo '[xxAA][xxBxx][C]' | awk -F ']' '{ for (i = 1; i <= NF; i++) if ($i != "") printf "-->%s]<--\n", $i }'
This is a job for GNU awk's FPAT variable which lets you specify the pattern of the fields rather than the pattern of the field separators:
$ echo '[xxAA][xxBxx][C]' | awk -v FPAT='[^][]+' '{ for (i = 1; i <= NF; i++) printf "-->%s<--\n", $i }'
-->xxAA<--
-->xxBxx<--
-->C<--
With other awks I'd use:
$ echo '[xxAA][xxBxx][C]' | awk -F'\\]\\[' '{ gsub(/^\[|\]$/,""); for (i = 1; i <= NF; i++) printf "-->%s<--\n", $i }'
-->xxAA<--
-->xxBxx<--
-->C<--
I have an output file that I am trying to process into a formatted csv for our audit team.
I thought I had this mastered until I stumbled across bad data within the output. As such, I want to be able to handle this using awk.
MY OUTPUT FILE EXAMPLE
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
THE OUTPUT I WANT AFTER PROCESSING
joe,bloggs,s01234565;uid=bloggsj,cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=E09876543;cn=andy-peters,ou=appserver,ou=components,o=hoster
As you can see:
we always have a cn= variable that contains o=hoster
uid can have any value
we may have multiple cn= variables without o=hoster
I have acheived the following:
cat output | awk '!/^o.*/ && !/^Enter.*/{print}' | awk '{getline a; getline b; getline c; getline d; print $0,a,b,c,d}' | awk -v srch1="cn=" -v repl1="" -v srch2="sn=" -v repl2="" '{ sub(srch1,repl1,$2); sub(srch2,repl2,$3); print $4";"$2" "$3";"$1 }'
Any pointers or guidance is greatly appreciated using awk. Or should I give up and just use the age old long winded method a large looping script to process the file?
You may try following awk code
$ cat file
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
Awk Code :
awk '
function out(){
print s,u,last
i=0; s=""
}
/^cn/,!NF{
++i
last = i == 1 ? $0 : last
s = i>1 && !/uid/ && NF ? s ? s "," $NF : $NF : s
u = /uid/ ? $0 : u
}
i && !NF{
out()
}
END{
out()
}
' FS="=" OFS=";" file
Resulting
joe,bloggs,S01234565;uid=bloggsj;cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=petersa;cn=andy-peters,ou=appserver,ou=components,o=hoster
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
This awk script works for your sample and produces the sample output:
BEGIN { delete cn[0]; OFS = ";" }
function print_info() {
if (length(cn)) {
names = cn[1] "," sn
for (i=2; i <= length(cn); ++i) names = names "," cn[i]
print names, uid, dn
delete cn
}
}
/^cn=/ {
if ($0 ~ /o=hoster/) dn = $0
else {
cn[length(cn)+1] = substr($0, index($0, "=") + 1)
uid = $0; sub("cn", "uid", uid)
}
}
/^sn=/ { sn = substr($0, index($0, "=") + 1) }
/^uid=/ { uid = $0 }
/^$/ { print_info() }
END { print_info() }
This should help you get started.
awk '$1 ~ /^cn/ {
for (i = 2; i <= NF; i++) {
if ($i ~ /^uid/) {
u = $i
continue
}
sub(/^[^=]*=/, x, $i)
r = length(r) ? r OFS $i : $i
}
print r, u, $1
r = u = x
}' OFS=, RS= infile
I assume that there is an error in your sample output: in the 3d record the uid should be petersa and not E09876543.
You might want look at some of the "already been there and done that" solutions to accomplish the task.
Apache Directory Studio for example, will do the LDAP query and save the file in CSV or XLS format.
-jim
Is there a nice bash one liner to map strings inside a file to a unique number?
For instance,
a
a
b
b
c
c
should be converted into
1
1
2
2
3
3
I am currently implementing it in C++ but a bash one-liner would be great.
awk '{if (!($0 in ids)) ids[$0] = ++i; print ids[$0]}'
This maintains an associative array called ids. Each time it finds a new string it assigns it a monotically increasing id ++i.
Example:
jkugelman$ echo $'a\nb\nc\na\nb\nc' | awk '{if (!($0 in ids)) ids[$0] = ++i; print ids[$0]}'
1
2
3
1
2
3
The awk solutions here are fine, but here's the same approach in pure bash (>=4)
declare -A stringmap
counter=0
while read string < INPUTFILE; do
if [[ -z ${stringmap[$string]} ]]; then
let counter+=1
stringmap[$string]=$counter
fi
done
for string in "${!stringmap[#]}"; do
printf "%d -> %s\n" "${stringmap[$string]}" "$string"
done
awk 'BEGIN { num = 0; }
{
if ($0 in seen) {
print seen[$0];
} else {
seen[$0] = ++num;
print num;
}
}' [file]
(Not exactly one line, ofcourse.)
slight modification without the if
awk '!($0 in ids){ids[$0]=++i}{print ids[$0]}' file