Add consecutive entries in a column - c++

I have a file that has the format
0.99987799 17743.000
1.9996300 75.000000
2.9993899 75.000000
3.9991500 102.00000
4.9988999 131.00000
5.9986601 130.00000
6.9984102 152.00000
7.9981699 211.00000
8.9979200 256.00000
9.9976797 259.00000
10.997400 341.00000
11.997200 373.00000
What I would like to do is add the data in the second column, every four lines. So a desired output would be
1 17743+75+75+102
2 131+130+152+211
3 256+259+341+373
How can this be done in awk?
I know that I can find a specific element in the file using
awk 'FNR == 5 {print $2}' file
but I don't know how to add 4 elements in a row. If I try for instance
awk '$2 {print FNR == 5}' file
I get nothing but zeros, so I don't know how to parse the column first and then the line. I also tried
awk 'BEGIN{i=4}
{
for (NR>=1 || NR<=i)
{
print $2
}
}' filename
but I get a syntax error at NR<=i. I also don't have any idea how to loop on the entire file. Any help or idea would be more than welcome! Or perhaps would it be better to do it in C++? I don't know which is more convenient...
I also tried
awk 'BEGIN{sum=0} {{sum += $2} if(FNR%4 == 0) { print sum; sum=0}}' infile.dat
but it doesn't seem to work properly...

awk 'NR%4==1{sum=$2; next}{sum+=$2} NR%4==0{print ++j,sum;}' input.txt
Output:
1 17995
2 624
3 1229
For the first row of each group (NR%4==1) it stores the value of the second column in sum; for the next 3 rows it adds the value of the second column to sum. On the last row of a group (NR%4==0) it prints the result.
If you don't need the row numbers before the sum results just remove ++j,.

awk '{print $2}' file | paste -d+ - - - - | bc

This works fine for me:
awk '{sum += $2}
FNR%4==0 {print FNR/4, sum; sum = 0}
END {if(FNR%4){print int(FNR/4)+1, sum}}' awktest.txt
with the result of:
1 17995
2 624
3 1229
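The same idea can be made independent of the group size. The following is a minimal sketch, not a definitive version: the function name sum_groups and the variables n and idx are illustrative names that do not appear in the answers above, and it uses NR rather than FNR, so it assumes all input is treated as one stream.

```shell
# sum_groups N [FILE...]: sum column 2 over consecutive groups of N lines.
# sum_groups, n and idx are hypothetical names chosen for this sketch.
sum_groups() {
    group_size=$1
    shift
    awk -v n="$group_size" '
        { sum += $2 }                               # accumulate on every line
        NR % n == 0 { print ++idx, sum; sum = 0 }   # group is full: emit and reset
        END { if (NR % n) print ++idx, sum }        # leftover partial group, if any
    ' "$@"
}
```

Called as sum_groups 4 input.txt on the file from the question, this should reproduce the three sums shown above.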

Related

Using awk, how can I find the max value in one column, print it; then print the match value in another column

Let's say I have this data:
1 text1 1 1 5
2 text2 2 2 10
3 text3 3 3 15
4 text4 4 4 50
5 text5 5 5 25
I obtain the max value of column #5 with this code:
awk 'BEGIN {a=0} {if ($5>0+a) a=$5} END{print a}' data.txt
My question is how do I add more parameters in that code in order to find the associated value in whatever column I choose (but just one)? For example, I want to find the max value of column #5 and the associated value from column #2
The output I want is:
50 text4
I don't know how to add more parameters in order to obtain the match value.
The right way to do this in awk is:
awk 'NR==1 || $5>max { max=$5; val=$2 } END { print max, val }' file
50 text4
This sets max=$5 and val=$2 for the first record or when $5 is greater than max variable.
When you find a new max then save both the new max and the associated value from column #2.
One idea, along with some streamlining of the current code:
$ awk '$5>(a+0) { a=$5; col2=$2 } END {print a, col2}' data.txt
50 text4
NOTE:
This assumes that at least one value in column #5 is positive. If every value in column #5 is zero or negative, $5>(a+0) is always false, so a (and col2) never get set, and print a, col2 prints a line containing a single space. A more robust approach is to set a from the first record processed and go from there (see anubhava's answer for an example).
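To illustrate the point about negative values, here is a hedged sketch that seeds the maximum from the first record and takes the column numbers as arguments. The function name max_with_col and the variables k, c, max and val are assumptions made for this example only.

```shell
# max_with_col KEYCOL VALCOL [FILE...]
# Prints the largest value in column KEYCOL and the VALCOL entry from the same row.
# max_with_col, k and c are hypothetical names for this sketch.
max_with_col() {
    key_col=$1
    val_col=$2
    shift 2
    awk -v k="$key_col" -v c="$val_col" '
        # NR == 1 seeds max from the first record, so all-negative data still works
        NR == 1 || $k + 0 > max { max = $k + 0; val = $c }
        END { if (NR) print max, val }   # print nothing for empty input
    ' "$@"
}
```

With the sample data, max_with_col 5 2 data.txt should print 50 text4; on ties it keeps the first row that reached the maximum.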
An alternative using sort
% sort -nk 5 file | tail -1 | awk '{print $5, $2}'
50 text4
With your shown samples, please try the following sort + awk option. GNU sort orders the file by the 5th column in reverse numeric order, and its output is piped to awk, which reads only the very first line (the one containing the max value), prints it, and exits immediately to save time.
sort -s -rnk5 file1 | awk 'FNR==1{print $NF,$2;exit}'
50 text4

Numeric expression in if condition of awk

Pretty new to AWK programming. I have a file1 with entries as:
15>000000513609200>000000513609200>B>I>0011>>238/PLMN/000100>File Ef141109.txt>0100-75607-16156-14 09-11-2014
15>000000513609200>000000513609200>B>I>0011>Danske Politi>238/PLMN/000200>>0100-75607-16156-14 09-11-2014
15>000050354428060>000050354428060>B>I>0011>Danske Politi>238/PLMN/000200>>4100-75607-01302-14 31-10-2014
I want to write an awk script where, if the 3rd field minus the 2nd field is 0, it prints field 2. Otherwise, if the difference is > 0, it prints every number from the 2nd field up to the 3rd field, incrementing by 1. There will be no scenario where the 3rd field is less than the 2nd, so I am ignoring that condition.
I was doing something as:
awk 'NR > 2 { print p } { p = $0 }' file1 | awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
(( Someone told me awk is close to C in terms of syntax ))
But from the output it looks to me as if the string-to-numeric and numeric-to-string conversions are not taking place at the right place and time. Shouldn't AWK take care of that automatically?
The OUTPUT that I get:
513609200
513609201
513609200
Which is not quite as expected. One evident issue is that it's ignoring the leading 0s.
Kindly help me modify the AWK script to get the desired result.
NOTE:
awk 'NR > 2 { print p } { p = $0 }' file1 is just to remove the 1st and last entry in my original file1. So the part that needs to be fixed is:
awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
In awk, think of $ as an operator to retrieve the value of the named field number ($0 being a special case)
$1 is the value of field 1
$NF is the value of the field given in the NF variable
So, $($3 - $2) will try to get the value of the field number given by the expression ($3 - $2).
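A small sketch of how $ behaves as an operator may make this concrete. The function name show_fields and the variable n are made up for this illustration.

```shell
# show_fields: reads lines on stdin and prints three fields selected with $.
# show_fields and n are hypothetical names for this sketch.
show_fields() {
    awk '{
        n = 2
        # $n is field 2; $(n + 1) evaluates the expression first, giving field 3;
        # $NF is the last field because NF holds the number of fields.
        print $n, $(n + 1), $NF
    }'
}
```

For example, echo 'a b c d' | show_fields prints b c d. By the same rule, $($3 - $2) on a line such as x 1 3 selects field 2, not the number 2.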
You need fewer $ signs
awk -F">" '{
if ($3 == $2)
print $2
else {
v=$2
while (v <= $3)   # <= so that the 3rd field itself is also printed
print v++
}
}'
Normally, this will work, but your numbers are beyond awk integer bounds so you need another solution to handle them. I'm posting this to initiate other solutions and better illustrate your specifications.
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
Note that this will skip rows where the 3rd field is less than the 2nd, which you say cannot happen.
A small scale example
$ cat file_0
x>1000>1000>etc
x>2000>2003>etc
x>3000>2999>etc
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file_0
1000
2000
2001
2002
2003
Apparently, newer versions of gawk have a --bignum option for arbitrary-precision integers. If you have a compatible version, that may solve your problem, but I don't have access to one to verify.
For anyone who does not have ready access to gawk with bigint support, it may be simpler to consider other options if some kind of "big integer" support is required. Since ruby has an awk-like mode of operation,
let's consider ruby here.
To get started, there are just four things to remember:
invoke ruby with the -n and -a options (-n for the awk-like loop; -a for automatic parsing of lines into fields ($F[i]));
awk's $n becomes $F[n-1];
explicit conversion of numeric strings to integers is required;
To specify the lines to be executed on the command line, use the '-e TEXT' option.
Thus a direct translation of:
awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
would be:
ruby -an -F'>' -e '($F[1].to_i .. $F[2].to_i).each {|i| puts i }' file
To guard against empty lines, the following script would be slightly better:
($F[1].to_i .. $F[2].to_i).each {|i| puts i } if $F.length > 2
This could be called as above, or if the script is in a file (say script.rb) using the incantation:
ruby -an -F'>' script.rb file
Given the OP input data, the output is:
513609200
513609200
50354428060
The left-padding can be accomplished in several ways -- see for example this SO page.
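For comparison, the left-padding can also be done in awk itself with printf. This is only a sketch under stated assumptions: the function name padded_range and the variables w and fmt are hypothetical, and it formats with %.0f rather than %d on the assumption that the values fit exactly in a double (true for integers of up to 15 digits), which sidesteps any narrower integer limit an awk build might apply to %d.

```shell
# padded_range WIDTH [FILE...]
# For each '>'-separated line, print every integer from field 2 to field 3,
# zero-padded to WIDTH digits. padded_range, w and fmt are hypothetical names.
padded_range() {
    pad_width=$1
    shift
    awk -F'>' -v w="$pad_width" '
        BEGIN { fmt = "%0" w ".0f\n" }       # e.g. "%015.0f\n" for w = 15
        NF > 2 {                             # skip empty or short lines
            last = $3 + 0
            for (v = $2 + 0; v <= last; v++)
                printf fmt, v
        }
    ' "$@"
}
```

With the OP's data this would be called as padded_range 15 file1, restoring the leading zeros that the numeric conversion drops.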

awk search column from one file, if match print columns from both files

I'm trying to compare column 1 from file1 and column 3 from file 2, if they match then print the first column from file1 and the two first columns from file2.
here's a sample from each file:
file1
Cre01.g000100
Cre01.g000500
Cre01.g000650
file2
chromosome_1 71569 |655|Cre01.g000500|protein_coding|CODING|PAC:26902937|1|1)
chromosome_1 93952 |765|Cre01.g000650|protein_coding|CODING|PAC:26903448|11|1)
chromosome_1 99034 |1027|Cre01.g000100 |protein_coding|CODING|PAC:26903318|9|1)
desired output
Cre01.g000100 chromosome_1 99034
Cre01.g000500 chromosome_1 71569
Cre01.g000650 chromosome_1 93952
I've been looking at various threads that are somewhat similar, but I can't seem to get it to print the columns from both files. Here are some links that are somewhat related:
awk compare 2 files, 2 fields different order in the file, print or merge match and non match lines
Obtain patterns from a file, compare to a column of another file, print matching lines, using awk
awk compare columns from two files, impute values of another column
Obtain patterns in one file from another using ack or awk or better way than grep?
Awk - combine the data from 2 files and print to 3rd file if keys matched
I feel like I should have been able to figure it out based on these threads, but it's been two days that I've been trying different variations of the codes and I haven't gotten anywhere.
Here is some code that I've tried using on my files:
awk 'FNR==NR{a[$3]=$1;next;}{print $0 ($3 in a ? a[$3]:"NA")}' file1 file2
awk 'NR==FNR{ a[$1]; next} ($3 in a) {print $1 $2 a[$1]}' file1 file2
awk 'FNR==NR{a[$1]=$0; next}{print a[$1] $0}' file1 file2
I know I have to create a temporary array that contains the first column of file1 (or the 3rd column of file2) and then compare it to the other file. If there is a match, then print the first column from file1 and columns 1 and 2 from file2.
Thanks for the help!
You can use this awk:
awk -F '[| ]+' -v OFS='\t' 'NR==FNR{a[$4]=$1 OFS $2; next}
$1 in a{print $1, a[$1]}' file2 file1
Cre01.g000100 chromosome_1 99034
Cre01.g000500 chromosome_1 71569
Cre01.g000650 chromosome_1 93952
Your middle attempt of the three is closest, but:
You haven't specified the field delimiter is |.
You don't assign to a[$1].
Your sample output is inconsistent with your desired output (the sample output shows column 1 from file 1 and column 1 from file 2; the desired output is reputedly column 1 from file 1 and columns 1 and 2 from file 2, though this interpretation depends on the interpretation of $3 in file 2 being the name between two pipe symbols).
Citing the question at the time this answer was created:
… compare column 1 from file1 and column 3 from file 2, if they match then print the first column from file1 and the two first columns from file2.
desired output
Cre01.g000100 chromosome_1 99034
Cre01.g000500 chromosome_1 71569
Cre01.g000650 chromosome_1 93952
We can observe that if $3 in file 2 is equal to a value from file 1, then it is as easy to print $3 as a saved value.
So, fixing this up:
awk -F'|' 'NR==FNR { a[$1]=1; next } ($3 in a) { print $3, $1 }' file1 file2
The key change is the assignment to a[$1] (and the -F'|'); the rest is cosmetic and can be tweaked to suit your requirements (since the question is self-inconsistent, it is hard to give a better answer).
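Putting the pieces together, the following sketch prints the ID followed by the first two whitespace-separated columns of file2, and trims stray blanks around the ID (the third sample line of file2 has a space before the pipe). The names join_ids, trim, want and pos are assumptions for this example, and the output follows the order of file2, so pipe it through sort if the order of file1 matters.

```shell
# join_ids FILE1 FILE2
# FILE1: one ID per line. FILE2: "chrom pos |n|ID|..." records.
# join_ids, trim, want and pos are hypothetical names for this sketch.
join_ids() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NR == FNR { want[trim($1)] = 1; next }   # first file: remember each ID
        {
            id = trim($3)                        # ID sits between the 2nd and 3rd pipe
            if (id in want) {
                split($1, pos, " ")              # "chrom pos " -> pos[1], pos[2]
                print id, pos[1], pos[2]
            }
        }
    ' "$1" "$2"
}
```

As with any NR == FNR script, this assumes file1 is not empty; otherwise the first block would run on file2 instead.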

Two float numbers are attached together in my output text file

In my output file two columns corresponding to two float numbers are attached together, forming one column. An example is shown here. Is there any way to separate these two columns from each other?
Here, this is supposed to be 5 columns separated by whitespace, but the space between columns 3 and 4 is missing. Is there any way to correct this mistake with some UNIX commands such as cut, awk, sed or even regular expressions?
3.77388 0.608871 -8216.342.42161 1.88655
4.39243 0.625 -8238.241.49211 0.889258
4.38903 0.608871 -7871.71.52994 0.883976
4.286 0.653226 -8287.322.3195 2.13736
4.29313 0.629032 -7954.651.59168 1.02046
The corrected version should look like this:
3.77388 0.608871 -8216.34 2.42161 1.88655
4.39243 0.625 -8238.24 1.49211 0.889258
4.38903 0.608871 -7871.7 1.52994 0.883976
4.286 0.653226 -8287.32 2.3195 2.13736
4.29313 0.629032 -7954.65 1.59168 1.02046
More info: column 4 is always less than 10, so it only has one digit to the left of decimal point.
I have tried to use awk:
tail -n 5 output.dat | awk '{print $3}'
-8216.342.42161
-8238.241.49211
-7871.71.52994
-8287.322.3195
-7954.651.59168
Is there any way to separate this column into two columns?
One solution:
sed 's/\(\.[0-9]*\)\([0-9]\.\)/\1 \2/'
Using Perl one-liner:
perl -pe 's/(\d+\.\d+)(\d\.\d+)/$1 $2/' < output.dat > fixed_output.dat
Your input file
$ cat file
3.77388 0.608871 -8216.342.42161 1.88655
4.39243 0.625 -8238.241.49211 0.889258
4.38903 0.608871 -7871.71.52994 0.883976
4.286 0.653226 -8287.322.3195 2.13736
4.29313 0.629032 -7954.651.59168 1.02046
Awk approach
awk '{
n = index($3,".") # index of dot from field 3
x = substr($3,1,n+3) ~/\.$/ ? n+1 : n+2 # Decision for no of char to consider
$3 = substr($3,1,x) OFS substr($3,x+1) # separate out fields
$0 = $0 # Recalculate fields (number of fields NF)
$1 = $1 # recalculate the record, removing excess spacing (the new field separator becomes OFS, default is a single space)
}1' OFS='\t' file
Resulting
3.77388 0.608871 -8216.34 2.42161 1.88655
4.39243 0.625 -8238.24 1.49211 0.889258
4.38903 0.608871 -7871.7 1.52994 0.883976
4.286 0.653226 -8287.32 2.3195 2.13736
4.29313 0.629032 -7954.65 1.59168 1.02046
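An alternative awk sketch relies only on the fact stated in the question, that column 4 has exactly one digit before its decimal point, rather than on the number of decimals in column 3. The function name split_fused is hypothetical, and the guard on two dots is an added assumption so that lines that are already correct pass through unchanged.

```shell
# split_fused [FILE...]
# If field 3 contains two dots, split it one digit before the last dot.
# split_fused is a hypothetical name for this sketch.
split_fused() {
    awk '{
        # Only touch fields that really hold two fused numbers (two dots).
        if ($3 ~ /\..*\./ && match($3, /[0-9]\.[0-9]*$/))
            $3 = substr($3, 1, RSTART - 1) " " substr($3, RSTART)
        print    # assigning to $3 rebuilds the line with single spaces (OFS)
    }' "$@"
}
```

Here match() finds the trailing "digit, dot, digits" run, which is the fused fourth column, and RSTART marks where the space has to go.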

unix regex for adding contents in a file

I have contents in a file
like
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
I want to write a unix command that should be able to add 1 + 2 + 3 and give the result as 6
From what I am aware grep and awk would be handy, any pointers would help.
I believe the following is what you're looking for. It will sum up the last field in each record for the data that is read from stdin.
awk '{ sum += $NF } END { print sum }' < file.txt
Some things to note:
With awk you don't need to declare variables, they are willed into existence by assigning values to them.
The variable NF is the number of fields in the current record. Prefixing it with $ retrieves the field with that number, so $NF is the value of the last field.
The END { } block is only run once all records have been processed by the other blocks.
An awk script is all you need for that, since it has grep facilities built in as part of the language.
Let's say your actual file consists of:
asdfb zz 1
adfsdf yyy 2
sdfdf xx 3
and you want to sum the third column. You can use:
echo 'asdfb zz 1
adfsdf yyy 2
sdfdf xx 3' | awk '
BEGIN {s=0;}
{s = s + $3;}
END {print s;}'
The BEGIN clause is run before processing any lines, the END clause after processing all lines.
The other clause happens for every line but you can add more clauses to change the behavior based on all sorts of things (grep-py things).
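As an illustration of those "grep-py" clauses, here is a minimal sketch that sums the last field only on lines matching a regular expression. The function name sum_matching and the variables re and s are assumptions for this example; note that awk processes backslash escapes in values passed with -v, so patterns containing backslashes may need extra care.

```shell
# sum_matching REGEX [FILE...]
# Sum the last field of every line that matches REGEX.
# sum_matching, re and s are hypothetical names for this sketch.
sum_matching() {
    pattern=$1
    shift
    awk -v re="$pattern" '
        $0 ~ re { s += $NF }    # runs only on matching lines, like grep would select
        END { print s + 0 }     # + 0 prints 0 instead of an empty line if nothing matched
    ' "$@"
}
```

With the three sample lines, a pattern of ^a selects only the first two lines, while a pattern that matches every line gives the full total of 6.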
This might not exactly be what you're looking for, but I wrote a quick Ruby script to accomplish your goal:
#!/usr/bin/env ruby
total = 0
while gets
total += $1.to_i if $_ =~ /([0-9]+)$/
end
puts total
Here's one in Perl.
$ cat foo.txt
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
$ perl -a -n -E '$total += $F[2]; END { say $total }' foo.txt
6
Golfed version:
perl -anE'END{say$n}$n+=$F[2]' foo.txt
6