AWK issue : counting "non-matches" - if-statement

I want to count the occurrences of some words in a file. I then want to extend my code to also count how many lines did not match any of the words.
For example here is my input file (test.txt):
fred
fred
fred
bob
bob
john
BILL
BILL
Here is my code:
awk '
/fred/ { count["fred"]++ }
/bob/ { count["bob"]++ }
/john/ { count["john"]++ }
END { for (name in count) print name, "was found on", count[name], "lines." }
' test.txt
This works fine and gives me this output:
john was found on 1 lines.
bob was found on 2 lines.
fred was found on 3 lines.
Now I want to get a count of the lines that didn't match so I did the following code:
awk '
found=0
/fred/ { count["fred"]++; found=1 }
/bob/ { count["bob"]++; found=1 }
/john/ { count["john"]++; found=1 }
if (found==0) { count["none"]++ }
END { for (name in count) print name, "was found on", count[name], "lines." }
' test.txt
I get an error on the if statement like this:
awk: syntax error at source line 6
context is
>>> if <<< (found==0) { count["none"]++; }
awk: bailing out at source line 8
Any ideas why this isn't working?

Your script has a syntax problem in how the condition is written. This statement is not valid:
awk 'if (found==0) { count["none"]++ }' # syntax error
because if is a statement, not a pattern, so it cannot appear outside {}. You should use either:
awk '{ if (found==0) count["none"]++ }'
or
awk 'found==0{ count["none"]++ }'
Also, found = 0 at the beginning of your script should be inside {}, as it is meant to be a statement. Outside {} it is parsed as a pattern; it happens to reset found on every line, and it never triggers the default print only because the assigned value is 0. In general, patterns go outside and in front of {}, and actions go inside {}.
Your script with only the necessary modifications could be:
BEGIN { count["fred"]; count["bob"]; count["john"]; count["none"] }
{ found = 0 }
/fred/ { count["fred"]++; found=1 }
/bob/ { count["bob"]++; found=1 }
/john/ { count["john"]++; found=1 }
found==0{ count["none"]++ }
END { for (name in count) print name, "was found on", count[name]+0, "lines." }
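Changes made:
The if test and the found = 0 reset are now written as proper pattern/action rules.
Added initialisation of the items in BEGIN, because without it no line would be printed for "fred" if "fred" never occurs.
Added count[name]+0 so that an item that was never incremented (an empty string) prints as zero.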
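For reference, here is the corrected script as a complete command run against the sample input from the question. This is only a minimal sketch: the wrapper function count_names is a hypothetical name, the input is fed on stdin instead of test.txt, and the output is piped through sort because the order of a for (name in count) loop is unspecified in awk.

```shell
count_names() {
    # Reads lines from stdin and reports how many lines matched each name.
    awk '
    BEGIN      { count["fred"]; count["bob"]; count["john"]; count["none"] }
               { found = 0 }                      # reset the flag for every line
    /fred/     { count["fred"]++; found = 1 }
    /bob/      { count["bob"]++;  found = 1 }
    /john/     { count["john"]++; found = 1 }
    found == 0 { count["none"]++ }                # none of the rules above matched
    END        { for (name in count) print name, "was found on", count[name] + 0, "lines." }
    ' | sort
}

# Same eight lines as test.txt in the question.
printf '%s\n' fred fred fred bob bob john BILL BILL | count_names
```

With the sample input, the two BILL lines are the ones counted under "none".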

Could you please try the following, assuming you want to print the names that occur only once. You do not need a single flag variable shared by all the rules, because it may give false positives; it is better to test the count stored in the array itself.
awk '
/fred/{ count["fred"]++ }
/bob/{ count["bob"]++}
/john/{ count["john"]++}
END{
for(name in count){
if(count[name]==1){
print name, "was found only 1 time ", name
}
}
}
' Input_file
NOTE: On your syntax error: awk programs are made of condition { action } rules, and the action runs whenever its condition is true, e.g. /test/{print "something...."}. In your script the assignment was written directly where a condition is expected; it would have worked as intended if you had wrapped it in braces, as in {found=0}. This note only addresses the syntax error part.

There are a couple of ways you can achieve what you want. While the method that the OP presents works, it is not really flexible. We assume you have a string str which contains your words of interest:
awk -v str="fred bob john" \
'BEGIN{split(str,b);for(i in b) a[b[i]]; delete b }
($0 in a) {a[$0]++; c++}
END {for(i in a) print i,"was found",a[i]+0,"times"
print NR-c, "lines did not match" }' file1 file2 file3
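Note that ($0 in a) only counts lines that are exactly equal to one of the words, whereas the /fred/ rules in the question match anywhere in the line. If substring matching is what is wanted, one possible sketch (an assumption about the intent, not part of the answer above) loops over the words with index(). The function name count_words and the variable names are hypothetical.

```shell
count_words() {
    # $1: space-separated list of words; the text to scan is read from stdin.
    awk -v str="$1" '
    BEGIN { n = split(str, words, " ") }
    {
        hit = 0
        for (i = 1; i <= n; i++)
            if (index($0, words[i])) {   # literal substring test, no regex
                seen[words[i]]++
                hit = 1
            }
        if (!hit) none++
    }
    END {
        # Walk the list by index so the report follows the order of str.
        for (i = 1; i <= n; i++)
            print words[i], "was found on", seen[words[i]] + 0, "lines."
        print none + 0, "lines did not match."
    }'
}

printf '%s\n' fred fred fred bob bob john BILL BILL | count_words "fred bob john"
```

Because index() compares literal text, words containing regex metacharacters need no escaping.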

Related

Match strings in two files using awk and regexp

I have two files.
File 1 includes various types of SeriesDescriptions
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
...
File 2 contains information with only one SeriesDescription
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
...
I want to
compare the two files and find the lines that match for "SeriesDescription" and
print the line number of the matched text from File 1.
Expected Output:
"SeriesDescription": "Type_*" 24 11 (the correct line numbers in my files)
"SeriesDescription" will always be found on line 11 of File 2. I am having trouble matching given the * and have also tried changing it to .* without luck.
Code I have tried:
grep -nf File1.txt File2.txt
Successfully matches, but I want the line number from File1
awk 'FNR==NR{l[$1]=NR; next}; $1 in l{print $0, l[$1], FNR}' File2.txt File1.txt
This finds a match and prints the line number from both files, however, this is matching on the first column and prints the last line from File 1 as the match (since every line has the same column 1 for File 1).
awk 'FNR==NR{l[$2]=$3;l[$2]=NR; next}; $2 in l{print $0, l[$2], FNR}' File2.txt File1.txt
Does not produce a match.
I have also tried various settings of FS=":" without luck. I am not sure if the trouble is coming from the regex or the use of "" in the files or something else. Any help would be greatly appreciated!
With your shown samples, please try the following. Written and tested in GNU awk; it should work in any POSIX awk.
awk '
{ val="" }
match($0,/^[^_]*_/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
if(val){
arr[val]=$0 OFS FNR
}
next
}
(val in arr){
print arr[val] OFS FNR
}
' SeriesDescriptions file2
With your shown samples output will be:
"SeriesDescription": "Type_*" 1 3
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{ val="" } ##Nullifying val here.
match($0,/^[^_]*_/){ ##Using match to match value till 1st occurrence of _ here.
val=substr($0,RSTART,RLENGTH) ##Creating val which has sub string of above matched regex.
gsub(/[[:space:]]+/,"",val) ##Globally substituting spaces with NULL in val here.
}
FNR==NR{ ##This will execute when first file is being read.
if(val){ ##If val is NOT NULL.
arr[val]=$0 OFS FNR ##Create arr with index of val, which has value of current line OFS and FNR in it.
}
next ##next will skip all further statements from here.
}
(val in arr){ ##Checking if val is present in arr then do following.
print arr[val] OFS FNR ##Printing arr value with OFS, FNR value.
}
' SeriesDescriptions file2 ##Mentioning Input_file name here.
Bonus solution: If the above works for you AND this match occurs only once in your file2, you can exit right after printing it to make the program quicker. In that case, write the code as follows.
awk '
{ val="" }
match($0,/^[^_]*_/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
if(val){
arr[val]=$0 OFS FNR
}
next
}
(val in arr){
print arr[val] OFS FNR
exit
}
' SeriesDescriptions file2
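To try the approach against the sample data from the question, the two files can be recreated in a scratch directory. This is a sketch under assumptions: the directory comes from mktemp, the function name match_series is hypothetical, and [ \t] is used in place of [[:space:]] only because some older awk builds do not support POSIX character classes.

```shell
dir=$(mktemp -d)
cat > "$dir/SeriesDescriptions" <<'EOF'
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
EOF
cat > "$dir/file2" <<'EOF'
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
EOF

match_series() {
    # $1: file with the wildcard descriptions, $2: file to look up.
    awk '
    { key = "" }
    match($0, /^[^_]*_/) {                 # everything up to the first "_"
        key = substr($0, RSTART, RLENGTH)
        gsub(/[ \t]+/, "", key)            # ignore spacing differences
    }
    FNR == NR { if (key != "") seen[key] = $0 OFS FNR; next }
    (key in seen) { print seen[key] OFS FNR }
    ' "$1" "$2"
}

match_series "$dir/SeriesDescriptions" "$dir/file2"
```

The key is the text before the first underscore, so the * in the first file is never interpreted as a regex at all.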

awk convert number to date format to select line bigger than specific mmyy

INPUT:
test,1120,1
test,1219,2
Expected Output
test,1120,1
Goal: print the lines where $2, which is in mmyy format, is later than a given value, for example 1020.
I've tried the following:
awk -F, '{ if ( $2 > 1020 ) { print $0 }}' file
That does not give the expected output because $2 is still compared as a plain number, so 1219 counts as bigger than 1020.
Assuming the 2nd field always contains 4 digits, how about:
awk -F, 'substr($2, 3, 2) substr($2, 1, 2) > 2010' input
Please note that I have interpreted the word bigger as later, meaning 0921 is bigger than 1020. If my assumption is incorrect, please let me know.
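The same idea can be sketched with the threshold passed in as a variable instead of being hard-coded as 2010. The names later_than, yymm and min are hypothetical; the sketch assumes, as above, that the field always holds exactly 4 digits and that both years fall in the same century.

```shell
later_than() {
    # $1: threshold in mmyy form; comma-separated records are read from stdin.
    awk -F, -v min="$1" '
    # Reorder mmyy into yymm so that plain comparison follows calendar order.
    function yymm(d) { return substr(d, 3, 2) substr(d, 1, 2) }
    yymm($2) > yymm(min)
    '
}

printf '%s\n' 'test,1120,1' 'test,1219,2' | later_than 1020
```

Only test,1120,1 is printed, since 1120 becomes 2011 (later than 2010) while 1219 becomes 1912.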
EDIT: Since the OP has now mentioned that dates earlier than the provided value are required, in that case one could try the following.
awk -v val="1020" '
BEGIN{
FS=OFS=","
user_year=substr(val,3)
user_month=substr(val,1,2)
}
{
year=substr($2,3)
month=substr($2,1,2)
if(year==user_year){
if(month<user_month){
print
}
}
else if(year<user_year){
print
}
}
' Input_file
Could you please try the following. I have created a variable named val, which holds the value the user wants to compare against every line of Input_file. In this case it is set to 1020.
awk -v val="1020" '
BEGIN{
FS=OFS=","
user_year=substr(val,3)
user_month=substr(val,1,2)
}
{
year=substr($2,3)
month=substr($2,1,2)
if(year==user_year){
if(month>user_month){
print
}
}
if(year>user_year){
print
}
}
' Input_file

Match only very first occurrence of a pattern using awk

I am trying to follow the solution at
Moving matching lines in a text file using sed
The situation is that pattern2 needs to be applied just once in the whole file. How can I change the following to get this done
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ {t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
Here is the file on which I applied the pattern2 (RELOC_DIR)
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
RELOC_DIR = /i/j/k/l/m
asdasd3
asdasd4
DEFAULTS {
asdasd6
$(RELOC_DIR)/some other text1
$(RELOC_DIR)/some other text2
$(RELOC_DIR)/some other text3
$(RELOC_DIR)/some other text4
and the last 4 lines of the file got deleted because of the match.
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
asdasd3
asdasd4
DEFAULTS {
RELOC_DIR = /i/j/k/l/m
asdasd6
I am assuming you need to check pattern2 along with some other condition; if that is the case, then try:
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && /check_other_text_in_current_line/{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
The above checks that the string check_other_text_in_current_line (a placeholder; change it to your actual string) is present on the same line as pattern2. If this is not what you are looking for, then please post samples of input and expected output in your post.
OR, in case you want only the first match for pattern2 in Input_file to be picked up, then try the following. It stores only the very first match for pattern2 and treats all later matches as ordinary lines (since samples were not provided by the OP, this code is written only for the specific pattern-matching ask).
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && ++count==1{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
OR
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && !found2{t[2]=$0;found2=1;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
EDIT: Though my 2nd solution looks like what the OP asked for, the complete picture of the requirement is not given, so here is code that only prints the first occurrence of pattern2 (the string RELOC_DIR).
awk '/RELOC_DIR/ && ++count==1{print}' Input_file
RELOC_DIR = /i/j/k/l/m
OR
awk '!found2 && /RELOC_DIR/ { t[2]=$0; found2=1; print}' Input_file
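Putting the first-match guard together with the move itself, a reduced sketch for the sample file could look like the following. It is an assumption about the intent, based on the sample output in the question: the first RELOC_DIR line is held back and re-emitted after the DEFAULTS line, and the later $(RELOC_DIR) lines pass through untouched. The names move_first_match, held, found and placed are hypothetical.

```shell
move_first_match() {
    # Reads the file from stdin and writes the rearranged text to stdout.
    awk '
    /RELOC_DIR/ && !found { held = $0; found = 1; next }     # first match only
    1                                                         # print every other line
    /DEFAULTS/ && found && !placed { print held; placed = 1 } # re-emit after target
    '
}

sample='asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
RELOC_DIR = /i/j/k/l/m
asdasd3
asdasd4
DEFAULTS {
asdasd6
$(RELOC_DIR)/some other text1
$(RELOC_DIR)/some other text2
$(RELOC_DIR)/some other text3
$(RELOC_DIR)/some other text4'

printf '%s\n' "$sample" | move_first_match
```

The !found guard is what keeps the last four lines in the output: once the first match has been stored, later lines containing RELOC_DIR no longer hit the rule that ends in next.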

AWK script to check first line of a file and then print the rest

I am trying to write an AWK script to parse a file of the form
> field1 - field2 field3 ...
lineoftext
anotherlineoftext
anotherlineoftext
and I am checking using regex if the first line is correct (begins with a > and then has something after it) and then print all the other lines. This is the script I wrote but it only verifies that the file is in a correct format and then doesn't print anything.
#!/bin/bash
# FASTA parser
awk ' BEGIN { x = 0; }
{ if ($1 !~ />.*/ && x == 0)
{ print "Not a FASTA file"; exit; }
else { x = 1; next; }
print $0 }
END { print " - DONE - "; }'
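Your script prints nothing because the else branch executes next for every line that passes the check, so print $0 is never reached. Basically you can use the following awk command: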
awk 'NR==1 && /^>./ {p=1} p' file
On the first row NR==1 it checks whether the line starts with a > followed by "something" (/^>./). If that condition is true the variable p will be set to one. The p at the end checks whether p evaluates true and prints the line in that case.
If you want to print the error message, you need to invert the logic a bit:
awk 'NR==1 && !/^>./ {print "Not a FASTA file"; exit 1} 1' file
In this case the program prints the error message and exits if the first line does not start with a > followed by at least one character. Otherwise all lines get printed because 1 always evaluates to true.
Taking the OP literally (check only the first line, then print everything):
awk 'NR==1{p=$0~/^>/}p' YourFile
# shorter version, per Ed Morton's comment
awk 'NR==1{p=/^>/}p' YourFile
To print from the first line that starts with > onwards (inclusive):
awk '!p{p=$0~/^>/}p' YourFile
# shorter version, per Ed Morton's comment
awk '!p{p=/^>/}p' YourFile
Since all you care about is the first line, you can just check that, then exit.
awk 'NR > 1 { exit (0) }
! /^>/ { print "Not a FASTA file" >"/dev/stderr"; exit (1) }' file
As noted in comments, the >"/dev/stderr" is a nonportable hack which may not work for you. Regard it as a placeholder for something slightly more sophisticated if you want a tool which behaves as one would expect from a standard Unix tool (run silently if no problems; report problems to standard error).
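One way to avoid depending on /dev/stderr is to pipe the message through a command that writes to standard error. The following is a sketch under that assumption, combining the header check with printing the file; the function name print_fasta and the sample file contents are made up for illustration.

```shell
print_fasta() {
    # $1: file to check. Prints it unchanged if line 1 starts with ">" followed
    # by at least one character; otherwise reports on stderr and returns 1.
    awk '
    NR == 1 && !/^>./ {
        print "Not a FASTA file" | "cat 1>&2"   # message goes to stderr via cat
        exit 1
    }
    { print }
    ' "$1"
}

good=$(mktemp)
printf '%s\n' '> field1 - field2 field3' 'lineoftext' 'anotherlineoftext' > "$good"
print_fasta "$good"
```

An empty file passes silently here because the NR == 1 rule never runs; whether that should count as valid is left open.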

Awk replace entire line when match is found

I have the following code:
function replaceappend() {
awk -v old="^$2" -v new="$3" '
sub(old,new) { replaced=1 }
{ print }
END { if (!replaced) print new }
' "$1" > /tmp/tmp$$ &&
mv /tmp/tmp$$ "$1"
}
replaceappend "/etc/ssh/sshd_config" "Port" "Port 222"
It works perfectly, but I am looking to modify it so it replaces the entire line's contents rather than just the matching text.
At the moment it would do this:
Port 1234 -> Port 222 1234
I want it to work like this:
Port 1234 -> Port 222
The closest code I can find to do this is found here:
awk 'NR==4 {$0="different"} { print }' input_file.txt
This would replace the entire line of the match with the new content. How can I implement this into my existing code?
Just change:
sub(old,new) { replaced=1 }
to:
$0~old { $0=new; replaced=1 }
or:
sub(".*"old".*",new) { replaced=1 }
If you want to replace the entire line you can simplify your function. To avoid problems with metacharacters in the variables you pass to awk, I would suggest using a simple string search too:
awk -vold="$2" -vnew="$3" 'index($0,old)==1{f=1;$0=new}1;END{if(!f)print new}' "$1"
index returns the character position of the string that you are searching for, starting at 1. If the string old is at the start of the line, then the line is changed to the value of new. The 1 after the block is always true so every line is printed (this is a common shorthand for an unconditional {print} block).
As mklement0 has pointed out in the comments, the variables you pass to awk are still subject to some interpretation: for example, the string \n will be interpreted as a newline character, \t as a tab character, etc. However, this issue is much less significant than it would be using regular expressions, where things like a . would match any character.
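Folded back into the original function, the index() approach might look like the sketch below. It assumes a POSIX shell and uses mktemp for the temporary file rather than /tmp/tmp$$; that substitution, the variable names, and the sample config lines are choices made for this illustration. The caveat about escape sequences in -v values still applies.

```shell
replaceappend() {
    # $1: file to edit, $2: literal text the line must start with,
    # $3: replacement line (appended if no line starts with $2).
    tmp=$(mktemp) || return 1
    awk -v old="$2" -v new="$3" '
    index($0, old) == 1 { $0 = new; replaced = 1 }   # literal prefix match
    { print }
    END { if (!replaced) print new }
    ' "$1" > "$tmp" && mv "$tmp" "$1"
}

conf=$(mktemp)
printf '%s\n' '# sample config' 'Port 1234' 'PermitRootLogin no' > "$conf"
replaceappend "$conf" "Port" "Port 222"
cat "$conf"
```

Because the whole record is assigned with $0 = new, nothing after the matched prefix survives, which is the Port 1234 -> Port 222 behaviour asked for.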
Alternatively, pass a regular expression that covers everything you want to replace:
replaceappend port.txt "Port.*" "Port 222"
Here you are replacing Port (if it starts the line, as per your function definition) plus whatever follows until the end of the line with "Port 222".
EDIT: To make this part of the function instead of requiring it in the call, modify it to
function replaceappend() {
awk -v old="^$2.*" -v new="$3" '
sub(old,new) { replaced=1 }
{ print }
END { if (!replaced) print new }
' "$1" > /tmp/tmp$$ &&
mv /tmp/tmp$$ "$1"
}