I have a text file consisting of xyz coordinates, each defining a particular depth contour of a slope.
All of these lines are stored in one file, with each contour separated by ">"
The file looks like:
>
x1 y1 z1
x2 y2 z2
>
x3 y3 z3
...
The file is huge and unwieldy and I want to print out the 7th point along each contour and pipe it into a tab delimited new file.
My code looks like this:
awk -v OFS='\t' -v count=1 '{if ($1 == ">") {count/=count}; else if (count%7 == 0) {{count+=1} print $0}; else {count+=1}}' infile > outfile
I keep getting an error message that says
awk: syntax error at source line 1
context is
{if ($1 == ">") {count/=count}; >>> else <<< if (count%7 == 0) {{count+=1}; print $0}; else {count+=1}}
awk: illegal statement at source line 1
I've spent a while checking my syntax and bracketing and it seems ok, I just might be missing something with the variable reassignment?
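Your syntax is very close; just a bit off. The parser trips on the semicolons placed after the closing braces: {count/=count}; already ends the if statement, so the else that follows has nothing to attach to. There also seems to be some confusion between the curly braces { } and the normal parentheses ( ). As you play around more with awk the difference should become much clearer.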
Before I get to your particular syntax issue, note that a simpler approach can solve the same problem:
awk -v OFS='\t' '$1 == ">" { count = 1; next } !(count++ % 7)' file
A multiline version of your corrected code would be:
{
if ($1 == ">") {
count = 1
}
else
if (count % 7 == 0) {
count += 1
print $0
}
else
count += 1
}
As long as a statement is on its own line, you don't need a semicolon. But note that to make it a one-liner, you will need one as shown:
{ if ($1 == ">") { count = 1 } else if (count % 7 == 0) { count += 1; print $0 } else count += 1 }
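As a quick check, the counter approach can be run against a small made-up file. This is only a sketch: contours.txt, every7th.tsv and the coordinates are invented for illustration, and the `$1 = $1` assignment is an addition of mine, included because awk only applies OFS when it rebuilds the record.

```shell
# Hypothetical sample: two contours with 8 and 7 points
{
  echo '>'
  for i in 1 2 3 4 5 6 7 8; do echo "$i 0 -10"; done
  echo '>'
  for i in 1 2 3 4 5 6 7; do echo "$i 5 -20"; done
} > contours.txt

awk -v OFS='\t' '
  $1 == ">"        { count = 0; next }   # new contour: restart the counter
  ++count % 7 == 0 { $1 = $1; print }    # every 7th point; $1 = $1 rebuilds $0 with OFS
' contours.txt > every7th.tsv

cat every7th.tsv
```

With this sample, one tab-separated line per contour should come out, namely the 7th point of each.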
I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with > and followed by a line with a long string of characters -AGTCNR. The two lines make a complete record.
Here is how a simple file looks like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------NNNN
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
--------NNNTCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAANNN-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA--NNAGTNNNNNNNNNNNNNNNAATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
NNNNNNNNNNNTCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
With the following code I can search the second line of each record (the string of characters) and extract the records that have at most a given number of -, N or n characters at the beginning of the line (set with the start_Ns variable) and at the end of the line (set with the end_Ns variable). This was worked out in an earlier thread:
start_Ns=10
end_Ns=10
awk -v start_N=$start_Ns -v end_N=$end_Ns ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs.aln
Now I need to search for runs of - or N or n characters in the region "not including" the beginning or end terminals of the second line of every record, and filter out records with more than a specific maximum number of - or N or n characters. The code below does it, but I need to use a variable that I can easily reset:
start_Ns=10
end_Ns=10
awk -v start_N=10 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N && match($0,/N{0,11}/) {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
As for a variable i tried the following but failed:
awk -v start_N=10 -v mid_N=11 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/N{0,mid_N}/) && match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
Expected results:
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAANNN-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
I would suggest the following logic in order not to overcomplicate things.
Search the beginning part, remove it from the string
Search the end part, remove it from the string
Search the middle part in the remainder:
awk -v start_N=10 -v mid_N=11 -v end_N=10 '
/^>/{hdr=$0; next}
{ seq=$0 }
match(seq,/^[-Nn]*/) && RLENGTH > start_N { next }
{ seq=substr(seq,RSTART+RLENGTH) }
match(seq,/[-Nn]*$/) && RLENGTH > end_N { next }
{ seq=substr(seq,1,RSTART-1) }
{ while (match(seq,/[-Nn]+/)) {
if(RLENGTH>mid_N) next
seq=substr(seq,RSTART+RLENGTH)
}
}
{ print hdr; print $0 }' file
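To see what match() reports for the two terminal runs before they are stripped, here is a small sketch. The sequence is made up, and the names lead and trail are mine, not part of the answer above.

```shell
# Hypothetical sequence: 4 leading and 2 trailing gap characters
result=$(echo '--NNACGT--ACGTnn' | awk '{
  match($0, /^[-Nn]*/); lead  = RLENGTH   # length of the run at the start
  match($0, /[-Nn]*$/); trail = RLENGTH   # length of the run at the end
  print "leading=" lead, "trailing=" trail
}')
echo "$result"
```

For this input it should print leading=4 trailing=2; the run of two dashes in the middle is not counted by either call.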
An alternative method is to use extended regular expressions with interval expressions (bounded repetition):
awk -v start_N=10 -v mid_N=11 -v end_N=10 '
BEGIN { ere_start = "^[-Nn]{" (start_N+1) ",}"
ere_end = "[-Nn]{" (end_N+1) ",}$"
ere_mid = "[^-Nn][-Nn]{" (mid_N+1) ",}[^-Nn]"
}
/^>/{hdr=$0; next}
{ seq=$0 }
match(seq,ere_start) { next }
match(seq,ere_end) { next }
match(seq,ere_mid) { next }
{ print hdr; print $0 }' file
You can use a string as the second argument to match and then the regular string interpolation operators in Awk work fine.
awk -v start_N=10 -v mid_N=11 -v end_N=10 ' /^>/ {
hdr=$0; next }
match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,"N{0," mid_N "}") &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }'
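Just to spell this out a bit: if you use /regex/, the text between the slashes is immediately interpreted as a regular expression, but if you use "regex" or any expression which evaluates to a string, the usual Awk string operations (such as concatenation) are evaluated first, and only then is the resulting string interpreted as a regular expression. Note that in an awk which supports interval expressions, N{0,11} can also match the empty string, so this particular condition is true for every line; the snippet only shows how to get the variable into the pattern.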
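A dynamic regex can also be assembled without interval expressions, which some awk implementations do not support. The following is a sketch under that assumption: the variable too_long and the three sample lines are made up, and the pattern is built by repeating the bracket expression mid_N+1 times.

```shell
mid_N=3
kept=$(printf '%s\n' 'AANNNAA' 'AA-NN-AA' 'AAnAA' |
awk -v mid_N="$mid_N" '
  BEGIN {
    # Build a pattern that matches a run of more than mid_N gap characters
    for (i = 0; i <= mid_N; i++) too_long = too_long "[-Nn]"
  }
  $0 !~ too_long   # keep only lines without such a run
')
echo "$kept"
```

Here the second line contains a run of four gap characters and is dropped; the other two are printed.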
Thanks for your question. In my humble opinion, you should rephrase your question a bit and make sure your objective is 100% clear to all potential readers of this thread.
With regards to having a variable inside a construct in which awk doesn't allow the use of a variable, there is a standard trick that would apply whichever scripting tool you would use (e.g. sed or even some more complex stuff in perl or Python): interrupt your awk script by breaking the single-quote construct, and you insert in there a variable expansion that is performed by the shell, not by awk. For instance, here, you would define mid_N in Bash and then use "${mid_N}" in the middle of your awk script, with a closing single quote immediately before and a (re-)opening single quote immediately after. Like so:
mid_N=11
awk -v start_N=10 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/N{0,'"${mid_N}"'}/) && match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
That's a minimal-edit solution to the specific issue you mentioned below your "As for a variable i tried the following but failed:"
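The quote-breaking trick can be tried in isolation with a tiny sketch. The variable limit and the three input lines are invented; this assumes the shell variable holds a plain integer, since the shell pastes its text straight into the awk program (passing the value with -v is usually the safer route).

```shell
limit=2   # hypothetical shell variable
# The single-quoted program is closed, "$limit" is expanded by the shell,
# and the single-quoted program is reopened: awk sees  length($0) > 2 { print }
longer=$(printf '%s\n' a bb ccc | awk 'length($0) > '"$limit"' { print }')
echo "$longer"
```

Only the line longer than two characters is printed.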
I am trying to follow the solution at
Moving matching lines in a text file using sed
The situation is that pattern2 needs to be applied just once in the whole file. How can I change the following to get this done?
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ {t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
Here is the file on which I applied the pattern2 (RELOC_DIR)
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
RELOC_DIR = /i/j/k/l/m
asdasd3
asdasd4
DEFAULTS {
asdasd6
$(RELOC_DIR)/some other text1
$(RELOC_DIR)/some other text2
$(RELOC_DIR)/some other text3
$(RELOC_DIR)/some other text4
The result is shown below; the last 4 lines of the file got deleted because they also match pattern2:
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
asdasd3
asdasd4
DEFAULTS {
RELOC_DIR = /i/j/k/l/m
asdasd6
I am assuming you need to check pattern2 along with some other condition. If that is the case, then try:
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && /check_other_text_in_current_line/{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
The above checks that the string check_other_text_in_current_line (a placeholder which you should change to your actual string) is present on the same line as pattern2. If this is not what you are looking for, then please post samples of input and expected output in your post.
OR, in case you want only the 1st match for pattern2 in Input_file and want to skip all others, then try the following. It stores only the very first match for pattern2 and leaves all later matches alone. (Since samples were not provided by the OP, this code is written only for the specific pattern-matching ask.)
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && ++count==1{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
OR
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && !found2{t[2]=$0;found2=1;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
EDIT: My 2nd solution looks like the one that fits the OP's ask, but since the complete picture of the requirement is not given, here is code that only prints the first occurrence of pattern2 (the string RELOC_DIR).
awk '/RELOC_DIR/ && ++ count==1{print}' Input_file
RELOC_DIR = /i/j/k/l/m
OR
awk '!found2 && /RELOC_DIR/ { t[2]=$0; found2=1; print}' Input_file
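Put together, the "first match only" idea can be tested on a cut-down version of the file from the question. This is a sketch, not the OP's exact setup: reloc_sample.txt, reloc_out.txt and the variable names held and found are made up, and DEFAULTS is assumed to be the target line.

```shell
cat > reloc_sample.txt <<'EOF'
asdasd0
RELOC_DIR = /i/j/k/l/m
asdasd3
DEFAULTS {
asdasd6
$(RELOC_DIR)/some other text1
$(RELOC_DIR)/some other text2
EOF

awk '
  /RELOC_DIR/ && !found { held = $0; found = 1; next }  # hold only the first match
  1                                                     # print every other line
  /DEFAULTS/ { print held }                             # re-emit the held line after the target
' reloc_sample.txt > reloc_out.txt

cat reloc_out.txt
```

The definition line moves to just after DEFAULTS {, and the later $(RELOC_DIR) lines are printed unchanged instead of being swallowed.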
I am trying to write an AWK script to parse a file of the form
> field1 - field2 field3 ...
lineoftext
anotherlineoftext
anotherlineoftext
and I am checking using regex if the first line is correct (begins with a > and then has something after it) and then print all the other lines. This is the script I wrote but it only verifies that the file is in a correct format and then doesn't print anything.
#!/bin/bash
# FASTA parser
awk ' BEGIN { x = 0; }
{ if ($1 !~ />.*/ && x == 0)
{ print "Not a FASTA file"; exit; }
else { x = 1; next; }
print $0 }
END { print " - DONE - "; }'
Basically you can use the following awk command:
awk 'NR==1 && /^>./ {p=1} p' file
On the first row NR==1 it checks whether the line starts with a > followed by "something" (/^>./). If that condition is true the variable p will be set to one. The p at the end checks whether p evaluates true and prints the line in that case.
If you want to print the error message, you need to invert the logic a bit:
awk 'NR==1 && !/^>./ {print "Not a FASTA file"; exit 1} 1' file
In this case the program prints the error message and exits if the first line does not start with a > followed by at least one character. Otherwise all lines get printed because 1 always evaluates to true.
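To do literally what the OP asks (print the file only if the first line starts with >):
awk 'NR==1{p=$0~/^>/}p' YourFile
# shorter version, following a hint from @EdMorton
awk 'NR==1{p=/^>/}p' YourFile
To print from the first line that starts with > onward (that line included):
awk '!p{p=$0~/^>/}p' YourFile
# shorter version, following a hint from @EdMorton
awk '!p{p=/^>/}p' YourFile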
Since all you care about is the first line, you can just check that, then exit.
awk 'NR > 1 { exit (0) }
! /^>/ { print "Not a FASTA file" >"/dev/stderr"; exit (1) }' file
As noted in comments, the >"/dev/stderr" is a nonportable hack which may not work for you. Regard it as a placeholder for something slightly more sophisticated if you want a tool which behaves as one would expect from a standard Unix tool (run silently if no problems; report problems to standard error).
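One commonly used portable way to reach standard error from awk is to pipe the message to a shell command that redirects it. The sketch below assumes that idiom; the function name check_fasta and the two sample files are made up for illustration.

```shell
printf '%s\n' '>seq1 - demo' 'ACGT' > good.fa   # hypothetical valid file
printf '%s\n' 'ACGT' 'ACGT' > bad.fa            # hypothetical invalid file

check_fasta() {
  # Look at the first line only; complain on stderr via "cat 1>&2"
  awk 'NR > 1 { exit 0 }
       !/^>./ { print "Not a FASTA file" | "cat 1>&2"; exit 1 }' "$1"
}

for f in good.fa bad.fa; do
  if check_fasta "$f"; then echo "$f: ok"; else echo "$f: rejected"; fi
done
```

The function stays silent and returns 0 for a file whose first line starts with > plus at least one character, and returns 1 with a message on standard error otherwise.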
I have an output file that I am trying to process into a formatted csv for our audit team.
I thought I had this mastered until I stumbled across bad data within the output. As such, I want to be able to handle this using awk.
MY OUTPUT FILE EXAMPLE
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
THE OUTPUT I WANT AFTER PROCESSING
joe,bloggs,s01234565;uid=bloggsj,cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=E09876543;cn=andy-peters,ou=appserver,ou=components,o=hoster
As you can see:
we always have a cn= variable that contains o=hoster
uid can have any value
we may have multiple cn= variables without o=hoster
I have achieved the following:
cat output | awk '!/^o.*/ && !/^Enter.*/{print}' | awk '{getline a; getline b; getline c; getline d; print $0,a,b,c,d}' | awk -v srch1="cn=" -v repl1="" -v srch2="sn=" -v repl2="" '{ sub(srch1,repl1,$2); sub(srch2,repl2,$3); print $4";"$2" "$3";"$1 }'
Any pointers or guidance on doing this with awk are greatly appreciated. Or should I give up and just use the age-old, long-winded method of a large looping script to process the file?
You may try the following awk code:
$ cat file
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
Awk Code :
awk '
function out(){
print s,u,last
i=0; s=""
}
/^cn/,!NF{
++i
last = i == 1 ? $0 : last
s = i>1 && !/uid/ && NF ? s ? s "," $NF : $NF : s
u = /uid/ ? $0 : u
}
i && !NF{
out()
}
END{
out()
}
' FS="=" OFS=";" file
Resulting
joe,bloggs,S01234565;uid=bloggsj;cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=petersa;cn=andy-peters,ou=appserver,ou=components,o=hoster
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
This awk script works for your sample and produces the sample output:
BEGIN { delete cn[0]; OFS = ";" }
function print_info() {
if (length(cn)) {
names = cn[1] "," sn
for (i=2; i <= length(cn); ++i) names = names "," cn[i]
print names, uid, dn
delete cn
}
}
/^cn=/ {
if ($0 ~ /o=hoster/) dn = $0
else {
cn[length(cn)+1] = substr($0, index($0, "=") + 1)
uid = $0; sub("cn", "uid", uid)
}
}
/^sn=/ { sn = substr($0, index($0, "=") + 1) }
/^uid=/ { uid = $0 }
/^$/ { print_info() }
END { print_info() }
This should help you get started.
awk '$1 ~ /^cn/ {
for (i = 2; i <= NF; i++) {
if ($i ~ /^uid/) {
u = $i
continue
}
sub(/^[^=]*=/, x, $i)
r = length(r) ? r OFS $i : $i
}
print r, u, $1
r = u = x
}' OFS=, RS= infile
I assume that there is an error in your sample output: in the third record the uid should be petersa and not E09876543.
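If the records are not separated by blank lines, paragraph mode (RS=) has nothing to split on. A possible variation, given only as a sketch, is to start a new record whenever a cn= line ends in o=hoster. The file names ldap_sample.txt and ldap_out.txt, the function emit and the variables dn, names and uid are all made up here, and the sample is a shortened copy of the question's data.

```shell
cat > ldap_sample.txt <<'EOF'
Enter password ==>
o=hoster
ou=people,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
EOF

awk '
  function emit() {                         # print the record collected so far
    if (dn != "") print names ";" uid ";" dn
    dn = names = uid = ""
  }
  /^cn=.*o=hoster$/ { emit(); dn = $0; next }   # a DN line starts a new record
  dn == ""          { next }                    # skip the preamble before the first DN
  /^uid=/           { uid = $0; next }
  /^(cn|sn)=/       { v = substr($0, 4); names = (names == "" ? v : names "," v) }
  END { emit() }
' ldap_sample.txt > ldap_out.txt

cat ldap_out.txt
```

Because the uid line is stored separately, it does not matter whether it comes before or after the sn line within a record.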
You might want to look at some of the "already been there and done that" solutions to accomplish the task.
Apache Directory Studio for example, will do the LDAP query and save the file in CSV or XLS format.
-jim
I have the following statement in awk
$ gawk '$2 == "-" { print $1 }' file
I was wondering what exactly this instruction does, because I can't extract exactly the words I need.
Edit: How can I skip the lines before the following asterisks?
Let's say I have the following lines:
text
text
text
* * * * * * *
line1 -
line2 -
And then I want to filter just
line1
line2
with the sentence I posted above...
$ gawk '$2 == "-" { print $1 }' file
Thanks for your time and response!
This will find all lines on which the second column (Separated by spaces) is a -, and will then print the first column.
The first part ($2 == "-") checks for the second column being a -, and then if that is the case, runs the attached {} block, which prints the first column ($0 being the whole line, and $1, $2, etc being the first, second, ... columns.)
Spaces are the separator here simply because they are the default separator in awk.
Edit: To do what you want to do now, try the following (Not the most elegant, but it should work.)
gawk 'BEGIN { p = 0 } { if (p != 0 && $2 == "-") { print $1 } else { p = ($0 == "* * * * * * *")? 1 : 0 } }'
Spread over more lines for clarity on what's happening:
gawk 'BEGIN { p = 0 }
{ if (p != 0 && $2 == "-")
{ print $1 }
else
{ p = ($0 == "* * * * * * *")? 1 : 0 }
}'
Answer to the original question:
If the second column in a line from the file matches the string "-", then it prints out the first column of the line. Columns are by default separated by spaces.
This would match and print out one:
one - two three
This would not:
one two three four
Answer to the revised question:
This code should do what you need after the match of the given string:
awk '/\* \* \* \* \* \* \*/{i++}i && $2 == "-" { print $1 }' data2.txt
Testing on this data gives the following output:
2two
2two
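For the sample lines from the question, the same flag idea can be sketched as follows. The input is made up to include a "-" line before the asterisks, and the flag name seen is mine.

```shell
picked=$(printf '%s\n' 'skip -' 'text' '* * * * * * *' 'line1 -' 'other x' 'line2 -' |
awk '
  $0 == "* * * * * * *" { seen = 1; next }   # marker line: start filtering after it
  seen && $2 == "-"     { print $1 }         # only lines after the marker qualify
')
echo "$picked"
```

This prints line1 and line2; the "skip -" line comes before the marker and is ignored.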