Using awk, how can I find the max value in one column, print it, then print the matching value in another column?

Let's say I have this data:
1 text1 1 1 5
2 text2 2 2 10
3 text3 3 3 15
4 text4 4 4 50
5 text5 5 5 25
I obtain the max value of column #5 with this code:
awk 'BEGIN {a=0} {if ($5>0+a) a=$5} END{print a}' data.txt
My question is: how do I extend that code to also report the associated value from whichever (single) column I choose? For example, I want to find the max value of column #5 and the associated value from column #2
The output I want is:
50 text4
I don't know how to add more parameters in order to obtain the matching value.

The right way to do this is with this awk:
awk 'NR==1 || $5>max { max=$5; val=$2 } END { print max, val }' file
50 text4
This sets max=$5 and val=$2 for the first record, or whenever $5 is greater than the current max.
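If the associated column should be selectable (the OP asked for "whatever column I choose"), a sketch that passes the column number in with -v; here c=2 is just an example:
awk -v c=2 'NR==1 || $5>max { max=$5; val=$c } END { print max, val }' file
50 text4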

When you find a new max, save both the new max and the associated value from column #2.
One idea, along with some streamlining of the current code:
$ awk '$5>(a+0) { a=$5; col2=$2 } END {print a, col2}' data.txt
50 text4
NOTE:
this assumes that at least one value in column #5 is positive. If all values in column #5 are negative then $5>(a+0) will always be false, so a (and col2) will never be set, which in turn means print a, col2 will print a line containing just a single space (the OFS between two empty strings). A better solution is to seed a with the first value processed and then go from there (see anubhava's answer above for an example)
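For completeness, a sketch of that idea: seed a from the first record and guard against an empty file (same data.txt as above):
$ awk 'NR==1 || $5>a { a=$5; col2=$2 } END { if (NR) print a, col2 }' data.txt
50 text4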

An alternative using sort
% sort -nk 5 file | tail -1 | awk '{print $5, $2}'
50 text4

With your shown samples, please try the following sort + awk option. GNU sort sorts the file numerically on the 5th column in descending order (-s keeps the sort stable); the result is piped to awk, which prints the needed fields from the very first line (the one containing the max value) and exits so the rest of the input is not read.
sort -s -rnk5 file1 | awk 'FNR==1{print $NF,$2;exit}'
50 text4

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it as simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on BASH/SHELL variables being passed, but I don't want to learn BASH programming just to pass a value to a variable.
Bonus: I actually have to search for 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more, but how would I take a separate file of these 125 values and pass them individually to someVariable?
I have all sorts of ways to do this in BASH, but I don't understand them, and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps with an array, since I do not believe awk has a list type)
Thank you as I am tired of beating my head into a wall.
I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
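For the single-value case in the original question, awk's -v option passes a shell value into an awk variable; a sketch (the variable names dept and pat are arbitrary):
awk -v dept="CHEM" '$1==dept {print NR, $0}' file.txt
awk -v pat="^CHEM" '$0 ~ pat {print NR, $0}' file.txt
The first compares field 1 against the string exactly; the second treats the value as a regex applied to the whole line.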
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here the lines between the two exclamation marks are the range input, and as the data file I used the numbers 1..100 (via seq). Notice the double quotes instead of single quotes around the awk script: they let the shell expand $b and $e before awk sees the program.
If you have the start and end values in the file ranges, and your data in the file data:
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this (the shell expands $b$e before awk runs, so each range is written to a file named after its boundaries, e.g. 15 for the range 1 5):
while read b e; do awk "/^$b$/,/^$e$/ {print > $b$e}" data; done < ranges
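A variant of the same loop that passes the boundaries in with -v rather than splicing them into the script with double quotes (a sketch; it compares whole lines as strings, so it is safer if the values could contain regex metacharacters):
while read b e; do awk -v b="$b" -v e="$e" '$0==b,$0==e' data; done < ranges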
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
BEGIN {
    count = 1
    # read file1 via an external command; getline returns 1 while lines remain
    while (("cat file1" | getline) > 0)
    {
        a[count] = $3
        count++
    }
    close("cat file1")
}
The rest continues in much the same way. Anyway, maybe that works for you as well.
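A possible refinement of that sketch: getline can read the file directly, which avoids spawning cat; note that getline returns -1 on error, so comparing against 0 is the safer loop test:
BEGIN {
    while ((getline line < "file1") > 0)   # read file1 directly, line by line
    {
        split(line, f)                     # split on FS to reach individual fields
        a[++count] = f[3]
    }
    close("file1")
}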

Add consecutive entries in a column

I have a file that has the format
0.99987799 17743.000
1.9996300 75.000000
2.9993899 75.000000
3.9991500 102.00000
4.9988999 131.00000
5.9986601 130.00000
6.9984102 152.00000
7.9981699 211.00000
8.9979200 256.00000
9.9976797 259.00000
10.997400 341.00000
11.997200 373.00000
What I would like to do is add the data in the second column, every four lines. So a desired output would be
1 17743+75+75+102
2 131+130+152+211
3 256+259+341+373
How can this be done in awk?
I know that I can find a specific element in the file using
awk 'FNR == 5 {print $2}' file
but I don't know how to add 4 elements in a row. If I try for instance
awk '$2 {print FNR == 5}' file
I get nothing but zeros, so I don't know how to parse the column first and then the line. I also tried
awk 'BEGIN{i=4}
{
for (NR>=1 || NR<=i)
{
print $2
}
}' filename
but I get a syntax error at NR<=i. I also don't have any idea how to loop over the entire file. Any help or ideas would be more than welcome! Or would it perhaps be better to do it in C++? I don't know which is more convenient...
I also tried
awk 'BEGIN{sum=0} {{sum += $2} if(FNR%4 == 0) { print sum; sum=0}}' infile.dat
but it doesn't seem to work properly...
awk 'NR%4==1{sum=$2; next}{sum+=$2} NR%4==0{print ++j,sum;}' input.txt
Output:
1 17995
2 624
3 1229
For the first row of a group it stores the value of the second column in sum; for the next three rows it adds the value of the second column to sum; and for the last row of a group (NR%4==0) it prints the result.
If you don't need the row numbers before the sum results just remove ++j,.
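If you literally want the '+'-joined expressions shown in the question rather than the evaluated sums, a sketch along the same modulo lines (the +0 strips trailing zeros from values like 17743.000; assumes the row count is a multiple of 4):
awk '{ e = (NR%4==1) ? $2+0 : e "+" ($2+0) } NR%4==0 { print ++j, e }' input.txt
1 17743+75+75+102
2 131+130+152+211
3 256+259+341+373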
Or, without an explicit loop: extract column 2, group the values four per line with paste, and let bc do the additions:
awk '{print $2}' file | paste -d+ - - - - | bc
This works fine for me:
awk '{sum += $2}
FNR%4==0 {print FNR/4, sum; sum = 0}
END {if(FNR%4){print int(FNR/4)+1, sum}}' awktest.txt
with the result of:
1 17995
2 624
3 1229

Two float numbers are attached together in my output text file

In my output file two columns corresponding to two float numbers are attached together, forming one column; is there any way to separate these two columns from each other? An example is shown here: this is supposed to be 5 columns separated by whitespace, but the space between columns 3 and 4 is missing. Is there any way to correct this mistake with UNIX commands such as cut, awk, or sed, or even regular expressions?
3.77388 0.608871 -8216.342.42161 1.88655
4.39243 0.625 -8238.241.49211 0.889258
4.38903 0.608871 -7871.71.52994 0.883976
4.286 0.653226 -8287.322.3195 2.13736
4.29313 0.629032 -7954.651.59168 1.02046
The corrected version should look like this:
3.77388 0.608871 -8216.34 2.42161 1.88655
4.39243 0.625 -8238.24 1.49211 0.889258
4.38903 0.608871 -7871.7 1.52994 0.883976
4.286 0.653226 -8287.32 2.3195 2.13736
4.29313 0.629032 -7954.65 1.59168 1.02046
More info: column 4 is always less than 10, so it only has one digit to the left of the decimal point.
I have tried to use awk:
tail -n 5 output.dat | awk '{print $3}'
-8216.342.42161
-8238.241.49211
-7871.71.52994
-8287.322.3195
-7954.651.59168
Is there any way to separate this column into two columns?
One solution with sed: capture the fractional part of the third number (a dot plus digits), followed by the start of the fourth number (a single digit and a dot, since column 4 has one digit before its decimal point), and insert a space between the two groups:
sed 's/\(\.[0-9]*\)\([0-9]\.\)/\1 \2/'
Using a Perl one-liner:
perl -pe 's/(\d+\.\d+)(\d\.\d+)/$1 $2/' < output.dat > fixed_output.dat
Your input file
$ cat file
3.77388 0.608871 -8216.342.42161 1.88655
4.39243 0.625 -8238.241.49211 0.889258
4.38903 0.608871 -7871.71.52994 0.883976
4.286 0.653226 -8287.322.3195 2.13736
4.29313 0.629032 -7954.651.59168 1.02046
Awk approach
awk '{
n = index($3,".")                          # position of the decimal point in field 3
x = substr($3,1,n+3) ~ /\.$/ ? n+1 : n+2   # column 3 keeps one decimal digit if the next dot comes early, otherwise two
$3 = substr($3,1,x) OFS substr($3,x+1)     # split field 3 into the two original columns
$0 = $0                                    # recalculate the fields (updates NF)
$1 = $1                                    # rebuild the record with OFS (a tab) between all fields
}1' OFS='\t' file
Result:
3.77388 0.608871 -8216.34 2.42161 1.88655
4.39243 0.625 -8238.24 1.49211 0.889258
4.38903 0.608871 -7871.7 1.52994 0.883976
4.286 0.653226 -8287.32 2.3195 2.13736
4.29313 0.629032 -7954.65 1.59168 1.02046

awk: Handle positions with NR in an if loop. Next and previous position

I have this line in my bash script.
#Trying to find the FIRST maximum in the column $10
awk 'BEGIN{max=0} {if($10>=max){max=$10} else{exit}} END{print NR}'
And it works.
But I need something more sophisticated (for another purpose). I need awk to check whether the next and the previous row values are higher than the current one (something like this):
awk 'BEGIN{max=0} {if($10[NR]>=max && $10[NR-1]>=$10[NR] && $10[NR+1]>=$10[NR] ){max=$10} else{exit}} END{print NR}'
But it doesn't work, probably because I don't know how to handle the positions in the column. Can you help me please?
Clarification:
I just want to read one column completely (column 10) and find the row whose previous and next rows both hold higher values. For instance, if the column has the values 1,2,3,4,1,2 then I want to get the row number "5" (corresponding to the second 1 in the data), because that row has a higher value on both sides.
awk '{ if ($10 > old1 && old1 < old2) print NR-1; old2 = old1; old1 = $10; }'
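The same logic spread out with comments, to make the state handling explicit (assuming, as in the question, that the values are in column 10):
awk '{
    # old1 = value from the previous row, old2 = value from two rows back
    if ($10 > old1 && old1 < old2)   # the previous row is lower than both neighbours
        print NR - 1                 # so report that row number
    old2 = old1                      # slide the two-row window forward
    old1 = $10
}' file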
To test, I replaced $10 with $1. Run on this data:
1
2
3
4
3
4
5
4
3
2
1
2
3
4
5
4
5
4
3
2
1
It produces this output:
5
11
16

unix regex for adding contents in a file

I have a file with contents like
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
I want to write a unix command that adds 1 + 2 + 3 and gives the result, 6.
From what I am aware, grep and awk would be handy; any pointers would help.
I believe the following is what you're looking for. It will sum up the last field in each record for the data that is read from stdin.
awk '{ sum += $NF } END { print sum }' < file.txt
Some things to note:
With awk you don't need to declare variables; they are willed into existence by assigning values to them.
The variable NF is the number of fields in the current record. Prefixing it with $ gives the field at that position, so $NF is the last field in the record.
The END { } block is run only once, after all records have been processed by the other blocks.
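As a quick illustration of NF and $NF (a sketch using one of the sample lines):
$ echo "asdfb ... 1" | awk '{ print NF, $NF }'
3 1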
An awk script is all you need for that, since it has grep facilities built in as part of the language.
Let's say your actual file consists of:
asdfb zz 1
adfsdf yyy 2
sdfdf xx 3
and you want to sum the third column. You can use:
echo 'asdfb zz 1
adfsdf yyy 2
sdfdf xx 3' | awk '
BEGIN {s=0;}
{s = s + $3;}
END {print s;}'
The BEGIN clause is run before processing any lines, the END clause after processing all lines.
The other clause happens for every line but you can add more clauses to change the behavior based on all sorts of things (grep-py things).
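For instance, a clause can restrict the summing to lines matching a pattern; a sketch (the /^a/ pattern is just an example, and file.txt stands in for your data file):
awk '/^a/ {s = s + $3} END {print s}' file.txt
With the sample data above this sums only the asdfb and adfsdf rows, printing 3.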
This might not exactly be what you're looking for, but I wrote a quick Ruby script to accomplish your goal:
#!/usr/bin/env ruby
total = 0
while gets
total += $1.to_i if $_ =~ /([0-9]+)$/
end
puts total
Here's one in Perl.
$ cat foo.txt
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
$ perl -a -n -E '$total += $F[2]; END { say $total }' foo.txt
6
Golfed version:
perl -anE'END{say$n}$n+=$F[2]' foo.txt
6