how does this AWK associative array with two files work?

I am writing to ask for an explanation of some elements of this short AWK command, which I am using to print fields from test-file_long.txt that match fields in test-file_short.txt. The code works fine; I would just like to know exactly what the program is doing, since I am very new to programming and I would like to be able to think on my feet for future commands that I will need to write. Here is the example:
$ cat test-file_long.txt
2 41647 41647 A G
2 45895 45895 A G
2 45953 45953 T C
2 224919 224919 A G
2 230055 230055 C G
2 233239 233239 A G
2 234130 234130 T G
$ cat test-file_short.txt
2 41647 41647 A G
2 45895 45895 A G
2 FALSE 224919 A G
2 233239 233239 A G
2 234130 234130 T G
$ awk 'NR==FNR{a[$2];next}$2 in a{print $0,FNR}' test-file_short.txt test-file_long.txt
2 41647 41647 A G 1
2 45895 45895 A G 2
2 233239 233239 A G 6
2 234130 234130 T G 7
It is a very simple matching problem, for which I found the commands on this site a few weeks ago. My questions are:
1) What exactly does NR==FNR do? I know the variables stand for the number of records and the number of records in the current input file, respectively, but why is this necessary for the code to operate? When I remove it from the command, the result is the same as paste test-file_long.txt test-file_short.txt.
2) For $2 in a, does AWK automatically read field 2 from file 2 as part of the syntax here?
3) I just want to confirm that ;next simply means to skip all other blocks and go to the next line. So in other words, the code first performs a[$2] for every line and then goes back and performs the other blocks for each line? When I remove ;next I still get the filtered output, but only trailing a full printout of test-file_short.txt.
Thanks for any and all input, my goal is just to understand better how AWK works, since it has been extraordinarily useful for my current work (processing large genomics datasets).

Here is some information related to your code:
NR==FNR is true only for the first file: once awk starts reading the second file, FNR restarts from 1, whereas NR keeps increasing across files.
$2 in a is evaluated only for the second file. This is due to the next statement inside the first rule: while the first file is being read, next skips the remaining rules, so the second rule is never reached for file number 1.
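Here is the same command written out with comments (just an annotated restatement of the command from the question, nothing new):
$ awk '
    NR == FNR {         # true only while reading the first file, test-file_short.txt
        a[$2]           # create an array element keyed by field 2 (no value needed)
        next            # skip the remaining rules and read the next line
    }
    $2 in a {           # reached only for test-file_long.txt lines, thanks to next
        print $0, FNR   # print the matching line plus its line number in that file
    }
' test-file_short.txt test-file_long.txt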

Related

Keeping specific rows with grep function

I have a large data set and the variable comes in different formats:
Subject Result
1 3
2 4
3 <4
4 <3
5 I need to go to school<>
6 I need to <> be there
7 2.3 need to be< there
8 <.3
9 .<9
10 ..<9
11 >3 need to go to school
12 <16.1
13 <5.0
I just want to keep the rows whose Result is "< number" or "> number", and not the rows in text format (for example, I want to exclude ">3 need to go to school" and "I need to go to school<>"). The problem is that some records look like .<3, ..<9, >9., >:9. So how can I remove ".", "..", ":" from the data set and then keep the rows with the "< a number" notation? How can I use the "grep" function?
Again, I just want to keep the following rows:
Subject Result
3 <4
4 <3
8 <.3
9 .<9
10 ..<9
12 <16.1
13 <5.0
You can simply apply two greps: one to find lines containing "<" or ">", and one to eliminate lines containing letters:
grep "[><]" | grep -v "[A-Za-z]"
If you want to be pedantic, you can also apply a third grep to keep only lines that contain numbers:
grep "[><]" | grep -v "[A-Za-z]" | grep "[0-9]"
"grep -v" means match and don't return, by the way.
Assuming you're certain that [.,:;] are the only problematic punctuation:
df$Result<-gsub("^[.,;:]+","", df$Result) # strip leading punctuation such as . .. : (leaving decimals like 16.1 intact)
df[grep("^\\s*[<>][0-9.]+$", df$Result),] # keep rows where < or > (with possible leading spaces) is followed by a number and nothing else

Regular Expression - Pattern

I am new to shell scripting. I am trying to write a script that greps a few lines from a huge file based on certain conditions.
Contents of file, say names.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
3 chdc4326aee6 om,bhuvi,01-Oct-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
5 praaeei5 om,lucknow,11-Nov-2016
6 aetaeen6pana om,phanto,13-Oct-2016
and goes on for 500 or more entries.
Now, I am looking for output for the following :
Filter lines with only "aee" available in it. So, the output will look
like:
3 chdc4326aee6 om,bhuvi,01-Oct-2016
5 praaeei5 om,lucknow,11-Nov-2016
Filter lines with only "ae" and "ae + "aee" available in the file. So,
the output will look like:
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Filter lines with only "ae" from the file. So, the output will look like:
2 chdc501ae om,patna,26-Aug-2016
Any suggestions, please? You could also point me to a good place for more information about this, so I can learn.
Use grep with the -P option (Perl-compatible regular expressions) and lookarounds.
The file:
$ cat data.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
3 chdc4326aee6 om,bhuvi,01-Oct-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
5 praaeei5 om,lucknow,11-Nov-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Find aee but not ae:
$ grep -P '^(?:(?=.*aee[^e]))?(?!.*ae[^e]).*(aee)[^e]' data.txt
3 chdc4326aee6 om,bhuvi,01-Oct-2016
5 praaeei5 om,lucknow,11-Nov-2016
Find ae or ae + aee:
$ grep -P '^(?:(?!.*aee[^e]))?(?=.*ae[^e]).*(aee?)[^e]' data.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Find ae only:
$ grep -P '^(?!.*aee[^e])(?=.*ae[^e]).*(ae)[^e]' data.txt
2 chdc501ae om,patna,26-Aug-2016
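If grep -P is not available, the same three filters can be written in plain awk. This is an alternative sketch, not part of the answer above; it appends a space as a sentinel so a trailing "ae" or "aee" still has a following character to test:
$ awk '{
    line = $0 " "                  # sentinel: a trailing match still has a next char
    has_aee = (line ~ /aee[^e]/)   # line contains an "aee" run
    has_ae  = (line ~ /ae[^e]/)    # line contains a bare "ae" (not the start of "aee")
    if (has_aee && !has_ae) print  # case 1: only "aee"; test has_ae alone for case 2,
                                   # and has_ae && !has_aee for case 3
}' data.txt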

AWK: Pattern match multiline data with variable line number

I am trying to write a script which will analyze data from a pipe. The problem is that a single element is described by a variable number of lines. Look at the example data set:
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
Here, the first field defines the event to which the following data belong. In the case of event number 8, the data spans 3 lines. To simplify the rather complex problem that I am trying to solve, let us imagine that I want to calculate the following expression:
sum_i($2 * ($3 + $4))
where the sum over i runs across all lines belonging to a given element. The output I want to produce would then look like:
3=-185.5 [14(-30.48+17.23) ]
4=-30.28 [1(-18.01+12.69) + 3(-11.01+2.69)]
8=-1106.4 [...]
I thus need a script which reads all the lines that have the same index entry.
I am an AWK newbie; I started learning the language a couple of days ago. I am now uncertain whether I will be able to achieve what I want. Therefore:
Is this doable with AWK?
If not, with what? sed?
If yes, how? I would be grateful if someone provided a link describing how this can be implemented.
Finally, I know that there is a similar question: Can awk patterns match multiple lines?, however, I do not have a constant pattern which separates my data.
Thanks!
You could try this:
awk '{ar[$1]+=$2*($3+$4)}
END{for (key in ar)
{print key"="ar[key]}}' inputFile
For each input line we do the desired calculation and add the result to an array, with $1 serving as the array key.
When the entire file has been read, we print the results in the END block.
The output for the given sample input is:
4=-30.28
8=-1133.4
9=43.36
10=-67.2
3=-185.5
If sorting of the output is required, you might want to have a look at gawk's asorti function or the sort command (e.g. awk '{...}' inputFile | sort -n).
This solution does not require that the input is sorted.
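For example, with gawk 4.0 or later, asorti can sort the keys numerically right inside the END block (a gawk-only sketch):
awk '{ar[$1]+=$2*($3+$4)}
END{
    n = asorti(ar, keys, "@ind_num_asc")   # sort the array indices numerically
    for (i = 1; i <= n; i++)
        print keys[i] "=" ar[keys[i]]
}' inputFile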
This one-liner instead streams the input and prints each sum as soon as the ID changes (it assumes lines with the same ID are contiguous, as in the sample):
awk 'id!=$1{if(id){print id"="sum;sum=0};id=$1}{sum+=$2*($3+$4)} END{print id"="sum}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
Yet another similar awk:
$ awk -v OFS="=" 'NR==1{p=$1}
p!=$1{print p,s; s=0; p=$1}
{s+=$2*($3+$4)}
END{print p,s}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
P.S. Your expected value for "8" seems off: 12(-21.14-8.76) + 14(-18.01-5.69) + 12(-35.14-1.76) = -358.8 - 331.8 - 442.8 = -1133.4, not -1106.4.

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it as simple as possible:
This is written into an awk script:
#!/usr/bin/awk -f
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/usr/bin/awk -f
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of material on passing BASH/shell variables, but I don't want to have to learn BASH programming simply to pass a value to a variable.
Bonus: I actually have to search 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more, but how would I take a separate file of these 125 values, pass them individually to someVariable?
I have all sorts of ways to do this in BASH, but I don't understand them, and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps with an array, since I do not believe awk has a list type).
Thank you as I am tired of beating my head into a wall.
I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
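For completeness (this goes beyond the file-based answer above): a single search term can also be passed straight into awk with the -v option. Here term and file.txt are placeholder names:
$ awk -v term="CHEM" '$1 == term {print NR, $0}' file.txt
Or, treating the term as a regular expression anchored to the start of the line:
$ awk -v term="^CHEM" '$0 ~ term {print NR, $0}' file.txt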
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
The lines between the two exclamation points are the input ranges; as the data file I used the numbers 1 to 100 (via seq). Notice the double quotes instead of single quotes around the awk script: they let the shell expand $b and $e before awk parses the patterns.
If you have the start and end values in the file ranges, and your data in the file data:
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this (the escaped quotes turn $b$e into a filename string inside awk, which is more robust than a bare expression):
while read b e; do awk "/^$b$/,/^$e$/ {print > \"$b$e\"}" data; done < ranges
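The boundaries can also be passed in with -v instead of being spliced into the script, which sidesteps the quoting questions entirely (a sketch under the same assumptions about the ranges and data files):
while read b e; do
    awk -v b="$b" -v e="$e" '$0 ~ "^"b"$", $0 ~ "^"e"$"' data
done < ranges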
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
BEGIN {
    count = 1
    while (("cat file1" | getline) > 0)   # getline returns 1 per line read, 0 at EOF
    {
        a[count] = $3                     # keep the third field of each line
        count++
    }
    close("cat file1")                    # close the command when done
}
The rest continues in much the same way. Anyway, maybe that works for you as well.
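A variant of the same idea that reads the file directly instead of spawning cat (a sketch; getline < file is plain POSIX awk):
BEGIN {
    count = 1
    while ((getline line < "file1") > 0) {   # read file1 directly, no external command
        split(line, f)                       # split the line into fields
        a[count] = f[3]                      # keep the third field, as above
        count++
    }
    close("file1")
}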

sed: Dynamically remove all text columns except positions defined by pattern

By searching and trying (I am no regex expert), I have managed to process a text output using sed or grep and extract some lines, formatted this way:
Tree number 280:
1 0.500 1 node_15 6 --> H 1551.code
1 node_21 S ==> H node_20
Tree number 281:
1 0.500 1 node_16 S ==> M 1551.code
1 node_20 S --> H node_19
Then, using
sed 's/^.\{35\}\(.\{9\}\).*/\1/' infile, I get the desired part, plus some output which I get rid of later (not a problem):
Tree number 280:
6 --> H
S ==> H
Tree number 281:
S ==> M
S --> H
However, the horizontal position of the C --> C pattern may vary from file to file, although it is always aligned. Is there a way to extract the --> or ==>, including the single preceding and following characters, no matter which columns they are found in?
The Tree number # part is not necessary and could be left blank as well, but there has to be a separator of some kind.
UPDATE (alternative approach)
Trying to use grep, I issued:
grep -Eo '(([a-zA-Z0-9] -- |[a-zA-Z0-9] ==)> [a-zA-Z0-9]|Changes)' infile
A sample of my initial file follows; if anyone thinks of a better, more efficient approach, or if my use of regex is insane, please comment!
..MISC TEXT...
Character change lists:
Character CI Steps Changes
----------------------------------------------------------------
1 0.000 1 node_235 H --> S node
1 node_123 S ==> 6 1843
1 node_126 S ==> H 2461
1 node_132 S ==> 6 1863
1 node_213 H --> I 1816
1 node_213 H --> 8 1820
..CT...
Character change lists:
Character CI Steps Changes
----------------------------------------------------------------
1 0.000 1 node_165 H --> S node
1 node_123 S ==> 6 1843
1 node_231 H ==> S 1823
..MISC TEXT...
Grep is a bit easier for just extracting the matching text. If you need different separator characters, you can add them inside the bracket expression [-=] (inside [...] each character is literal, so no | is needed for alternation):
grep -o '. [-=][-=]> .' infile
Or, if you really want to use sed for this, this should do: the first part matches only lines that contain the pattern, and the second part extracts just the matching text.
sed -n '/[-=][-=]>/{s/.*\(. [-=][-=]> .\).*/\1/p}' infile
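An awk version of the same extraction, in case it helps (a sketch; match, RSTART and RLENGTH are plain POSIX awk):
awk 'match($0, /. [-=][-=]> ./) { print substr($0, RSTART, RLENGTH) }' infile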