AWK: Pattern match multiline data with variable line number - regex

I am trying to write a script which will analyze data from a pipe. The problem is, a single element is described in a variable number of lines. Look at the example data set:
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
In this case, the first column defines the event to which the data on that line belongs. In the case of event number 8, we have data on 3 lines. To simplify the rather complex problem that I am trying to solve, let us imagine that I want to calculate the following expression:
sum_i($2 * ($3 + $4))
where i runs over all lines belonging to a given event. The output I want to produce would then look like:
3=-185.5 [14(-30.48+17.23) ]
4=-30.28 [1(-18.01+12.69) + 3(-11.01+2.69)]
8=-1106.4 [...]
I thus need a script which reads all the lines that have the same index entry.
I am an AWK newbie and I started learning the language a couple of days ago. I am now uncertain whether I will be able to achieve what I want. Therefore:
Is this doable with AWK?
If not, with what? SED?
If yes, how? I would be grateful for a link describing how this can be implemented.
Finally, I know that there is a similar question: Can awk patterns match multiple lines?, however, I do not have a constant pattern which separates my data.
Thanks!

You could try this:
awk '{ ar[$1] += $2 * ($3 + $4) }
     END { for (key in ar) print key "=" ar[key] }' inputFile
For each line of input we do the desired calculation and add the result into an array, with $1 serving as the key.
When the entire file has been read, we print the results in the END{...} block.
The output for the given sample input is:
4=-30.28
8=-1133.4
9=43.36
10=-67.2
3=-185.5
If sorting of the output is required, you might want to have a look at gawk's asorti function or the sort command (e.g. awk '{...}' inputFile | sort -n).
This solution does not require that the input is sorted.
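For instance, to order the summary lines numerically by event number, this sketch pipes the same script into sort as suggested above:
awk '{ar[$1]+=$2*($3+$4)} END{for (key in ar) print key"="ar[key]}' inputFile | sort -n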

This one-liner relies on lines for the same event being contiguous, as they are in the sample:
awk 'id!=$1{if(id){print id"="sum;sum=0};id=$1}{sum+=$2*($3+$4)} END{print id"="sum}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2

yet another similar awk
$ awk -v OFS="=" 'NR==1{p=$1}
p!=$1{print p,s; s=0; p=$1}
{s+=$2*($3+$4)}
END{print p,s}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
PS: Your calculation for "8" seems off; the three lines for event 8 actually sum to -1133.4.

Related

Bash - count a pattern and print the line containing the pattern

Hi everyone! While I was reading this discussion, "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could add the line containing the pattern next to the count values.
Somehow I wasn't able to add a comment on that discussion, so I'm posting a new question. Can somebody enlighten me?
There must be some misunderstanding here, so I put an example.
Let's say, I have a DNA sequence like below and want to find out how many 'CG' are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I wish the example above will help to understand the question better.
Thank you!
You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
This takes advantage of tee and process substitution.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2
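Note that grep -c counts matching lines rather than total occurrences, and it does not print the lines themselves. To reproduce the per-line counts from the question, a minimal awk sketch (with the pattern CG hard-coded) could be:
awk '{ n = gsub(/CG/, "&"); print n, $0 }' file
gsub returns the number of substitutions it made; replacing each match with itself (&) leaves the line intact so it can be printed alongside the count.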
cat fileName | grep pattern | uniq -c
I just found a really simple and elegant solution using EXCEL.
The formula goes like below...
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula basically does is count the total length of the string in a cell and the length after removal of the pattern ("CG" in this case), then subtract the two. Since each "CG" is replaced by an empty string, two characters disappear per occurrence, so dividing the difference by the length of your pattern (2 in this case) gives the number of occurrences.
For example, the following sequence contains 50 characters and 13 CGs.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After substituting "CG" with blanks, you are left with 24 characters.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurrences:
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
How simple is that!
You can see the original post in the following link.
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel
English is not my first language, so please excuse any errors in my writing.
People are geniuses!

Efficiently extracting columns/regex on a large file

I need to pull some information out of a file I have. I had been doing it in R previously, but the file is very, very big, and it's taking quite a while, so I feel like using command line tools is a much better alternative.
The file basically consists of 100 tab-delimited columns, I'm only interested in the 1st, 2nd and 4th columns however.
An example of the first four columns in the file:
10 rs149353603:74656:C:G 0 74656 ...
10 rs140638708:75794:G:T 0 75794 ...
10 rs201043140:76210:A:G 0 76210 ...
10 rs202007578:76294:T:C 0 76294 ...
10 rs75914453 0 77582 ...
I would like it to be in the format 2nd column, 1st column, 4th column. Furthermore, I'd like to trim the second column down to the first rs# in the cases where it contains colons (i.e. keep everything before the first colon).
E.g. the first line would be
rs149353603 10 74656
I fully intend to learn awk when I have the time, but that time is not now unfortunately! Could anyone lend a hand here?
You can use awk command like this:
awk 'BEGIN{FS=OFS="\t"} {sub(/:.*$/, "", $2); print $2, $1, $4}' file
rs149353603 10 74656
rs140638708 10 75794
rs201043140 10 76210
rs202007578 10 76294
rs75914453 10 77582
Since your file is very large, you might find that using "cut" first is faster, along the lines of the following pipeline:
cut -f 1,2,4 | awk ....
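Filled in, that pipeline might look like the following sketch; note that after cut the surviving columns are renumbered, so the original fourth column becomes $3:
cut -f 1,2,4 file | awk 'BEGIN{FS=OFS="\t"} {sub(/:.*$/, "", $2); print $2, $1, $3}'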

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on passing BASH/SHELL variables, but I don't want to learn BASH programming just to pass a value to a variable.
Bonus: I actually have to search for 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more: how would I take a separate file of these 125 values and pass them, one by one, into someVariable?
I have all sorts of ways to do this in BASH, but I don't understand them, and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps with an array, since I do not believe awk has a list type).
Thank you, as I am tired of beating my head against a wall.
I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
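As an aside, for the single-value case in the original question, awk's standard -v option is all that is needed. A minimal sketch for the string comparison:
awk -v term="CHEM" '$1 == term {print NR, $0}' file.txt
and for a regex match anchored at the start of the line:
awk -v term="CHEM" '$0 ~ "^" term {print NR, $0}' file.txt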
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here, between the two exclamation points, is the input defining the ranges; as the data file I used the numbers 1 to 100 (via seq). Notice the double quotes instead of single quotes around the awk script: they let the shell expand $b and $e inside it.
If you have entered start and end values in the file ranges, and your data in the file data:
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
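If you would rather not expand shell variables inside the awk program, an equivalent sketch passes them in with -v and builds the regexes inside awk:
while read b e; do awk -v b="$b" -v e="$e" '$0 ~ "^" b "$", $0 ~ "^" e "$"' data; done < ranges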
If you want to print the various ranges to different files, you can do something like this (the filename is quoted so awk treats it as a string rather than a numeric expression):
while read b e; do awk "/^$b$/,/^$e$/ {print > \"$b-$e\"}" data; done < ranges
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array:
BEGIN {
    count = 1
    # getline returns 1 while there are lines left in file1
    while (("cat file1" | getline) > 0) {
        a[count] = $3
        count++
    }
    close("cat file1")
}
The rest continues in much the same way. Anyway, maybe that works for you as well.

How can I replace certain characters in a line, based on existence of data...?

I have several files that have formatted data. Depending on the file, the format will be different.
Based on this, I use variables to define the positions, so I only have to change the variable in my scripts.
In the script I'm working on now, I want to look for the existence of data at a given position in the file. If data exists (non-blank), then I need to split that data, move half of it to one position, and move the second half to another position.
Below are the positions and some notional data to describe what I'm trying to do.
Field_1_Position=26
Field_2_Position=41
Field_3_Position=56
Field_Length=10
CURRENT DATA:
1 2 3 4 5 6 7
1234567890123456789012345678901234567890123456789012345678901234567890
---------|---------|---------|---------|---------|---------|---------|
201401010001AABBCCDDXXXXX1122334455XXXXX----------XXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXX1122334455XXXXXZZYYXXWWVVXXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXX1122334455XXXXX----------XXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXX1122334455XXXXX----------XXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXX1122334455XXXXXMMNNOOPPQQXXXXXAABBCCDDEEZZZZZ
My problem is with what I'm calling "Field 2". Most of the time these 10 characters are blank (not dashes, as represented here!). However, if data does exist, I need to take the first five characters and put them into the first five characters of Field 1, then take the second five characters of Field 2 and put them into the first five characters of Field 3. Lines where these 10 characters are empty need to remain as is (though I would like to keep them as fields so I can insert escape codes to color the columns and delineate the fields defined by the variables).
DESIRED:
1 2 3 4 5 6 7
1234567890123456789012345678901234567890123456789012345678901234567890
---------|---------|---------|---------|---------|---------|---------|
201401010001AABBCCDDXXXXX1122334455XXXXX----------XXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXXZZYYX34455XXXXXZZYYXXWWVVXXXXXXWWVVCDDEEZZZZZ
201401010001AABBCCDDXXXXX1122334455XXXXX----------XXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXX1122334455XXXXX----------XXXXXAABBCCDDEEZZZZZ
201401010001AABBCCDDXXXXXMMNNO34455XXXXXMMNNOOPPQQXXXXXOPPQQCDDEEZZZZZ
Appreciate any thoughts on this!
--Edited to show actual position numbers for sample data.
KSL.
awk '{
  if (substr($0,41,10) == "----------") { print }
  else {
    print substr($0,1,25) substr($0,41,5) substr($0,31,5) \
          substr($0,36,20) substr($0,46,5) substr($0,61)
  }
}' current.txt
You can pass the field positions as variables to awk with:
awk -v field1=26 -v field2=41 -v field3=56
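Building on that, here is a parameterized sketch of the transformation above (the names f1/f2/f3/len are my own; on real data you would test field 2 for blanks instead of dashes):
awk -v f1=26 -v f2=41 -v f3=56 -v len=10 '
{
  h = len / 2
  if (substr($0, f2, len) == "----------") {   # field 2 empty: leave the line alone
    print
  } else {
    # first half of field 2 overwrites the first half of field 1,
    # second half of field 2 overwrites the first half of field 3
    print substr($0, 1, f1-1) substr($0, f2, h) substr($0, f1+h, h) \
          substr($0, f1+len, f3-f1-len) substr($0, f2+h, h) substr($0, f3+h)
  }
}' current.txt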
I find your question very hard to understand, but I want to show you how to split the fields easily with gawk's FIELDWIDTHS variable:
awk 'BEGIN{FIELDWIDTHS="7 13 5 10 10 10 10"} {print $1,$2,$3,$4}' file
Output:
2014010 10001AABBCCDD XXXXX 1122334455
2014010 10001AABBCCDD XXXXX 1122334455
2014010 10001AABBCCDD XXXXX 1122334455
2014010 10001AABBCCDD XXXXX 1122334455
2014010 10001AABBCCDD XXXXX 1122334455
Of course, you could pass a value for FIELDWIDTHS into awk via a variable too, if you want.
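For example (gawk only; a sketch):
gawk -v fw='7 13 5 10 10 10 10' 'BEGIN{FIELDWIDTHS = fw} {print $1, $2, $3, $4}' file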

Calculating in Notepad++ with regular expressions

My document has x|y| in the beginning of each line, where x,y are integers between 0 and 300. E.g.:
1|1|text1
1|2|text2
1|3|text3
Now I want to make the following simple change: the second number of every line should be decreased by 1. So the above lines should be changed to
1|0|text1
1|1|text2
1|2|text3
Is that possible?
Okay, so here's something funny you can do, assuming the text is formatted like you indicated:
Do a first search and replace, replacing (\d+\|(\d+)\|.*) with \1\n####\2. This will grab the index and output
1|1|text1
####1
1|2|text2
####2
1|3|text3
####3
Do a second search and replace to update the index of the row following the #### marker, replacing
####(\d+)\n(\d+\|)\d+(\|.*)
with
\2\1\3
You need to update the first and last lines manually, and you're good to go!
Since the numbers in the second column are consecutive, the index captured from the previous row is exactly the current row's value minus one.
A perl way to do the job:
perl -i.back -ape 's/\|(\d+)\|/"|".($1-1)."|"/e' in.txt
This replaces every second number with that number minus one, directly in the file (in.txt).
The original file is saved beforehand as in.txt.back.
Input file before:
1|1|text1
1|2|text2
1|3|text3
2|1|text4
3|1|text5
3|2|text6
after:
1|0|text1
1|1|text2
1|2|text3
2|0|text4
3|0|text5
3|1|text6
Regex can't do math. If the file is huge, I suggest you look into a simple Python script or a small C++ program to do the parsing job for you. Good luck.
missing awk ??? awwwww !!
awk -F'[|]' '{num = $2-1; print $1 "|" num "|" $3}' text.txt
where text.txt is the file containing your contents.
You can write the output into a file using the > operator, like this:
awk -F'[|]' '{num = $2-1; print $1 "|" num "|" $3}' text.txt > final.txt