Efficiently extracting columns/regex on a large file

Efficiently extracting columns/regex on a large file - regex

I need to pull some information out of a file I have. I had been doing it in R previously, but the file is very, very big, and it's taking quite a while, so I feel like using command line tools is a much better alternative.
The file basically consists of 100 tab-delimited columns, I'm only interested in the 1st, 2nd and 4th columns however.
An example of the first four columns in the file:
10 rs149353603:74656:C:G 0 74656 ...
10 rs140638708:75794:G:T 0 75794 ...
10 rs201043140:76210:A:G 0 76210 ...
10 rs202007578:76294:T:C 0 76294 ...
10 rs75914453 0 77582 ...
I would like it to be in the format 2nd column, 1st column, 4th column. Furthermore, I'd like to trim off everything but the first rs# in the cases where the second row has a colon in it (everything before the first colon).
E.g. the first line would be
rs149353603 10 74656
I fully intend to learn awk when I have the time, but that time is not now unfortunately! Could anyone lend a hand here?

You can use awk command like this:
awk 'BEGIN{FS=OFS="\t"} {sub(/:.*$/, "", $2); print $2, $1, $4}' file
rs149353603 10 74656
rs140638708 10 75794
rs201043140 10 76210
rs202007578 10 76294
rs75914453 10 77582

Since your file is very large, you might find that using "cut" first is faster, along the lines of the following pipeline:
cut -f 1,2,4 | awk ....

Related

Awk: make an average in file with both number and string

here is my problem (not sure that my title was clear), I have to display an average of number in a file. However there is string in the file too.
file: test
Richie;jack;27 Yo;07Richiej#gmail.com
Cash;tom;29 Yo;Ctom01#gmail.com
Megane;susan;37 Yo;meganeSusan#gmail.com
...
it has to display the average age of the people in my file, I'm not supposed to know how many people there are.
I thought about using RegEx to only get number in my 3rd field, but got errors each time.
awk 'BEGIN{FS=";"} / /

To compute the average of the number in the third column:
$ awk -F\; '{s+=$3} END{print s/NR}' test
31
How it works
-F\;
This tells awk to use ; as the field separator. Because ; is a shell-active character, we have to either escape it (as shown above) or quote it.
s+=$3
For each line read, this adds the number in the third column to s. Because += is an arithmetic operation, awk converts the third field to a number.
This code also illustrates awk's automatic conversion of fields to numbers:
$ awk -F\; '{printf "field=\"%s\" number=%s\n", $3, $3+0}' test
field="27 Yo" number=27
field="29 Yo" number=29
field="37 Yo" number=37
When we print $3, we get the full string including the Yo. When we print $3+0, the conversion to a number is forced and, as shown above, we just get the number.
END{print s/NR}
After we have reached the end of the file, this prints the total of the third columns, save in s, divided by the number of lines read, NR.

AWK: Pattern match multiline data with variable line number

I am trying to write a script which will analyze data from a pipe. The problem is, a single element is described in a variable number of lines. Look at the example data set:
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
In this case, the first entry is what defines the event to which the following data belongs. In the case of event number 8, we have data in 3 lines. To simplify the rather complex problem that I am trying to solve, let us imagine, that I want to calculate the following expression:
sum_i($2 * ($3 + $4))
Where i is taken over all lines belonging to a given element. The output I want to produce would then look like:
3=-185.5 [14(-30.48+17.23) ]
4=-30.28 [1(-18.01+12.69) + 3(-11.01+2.69)]
8=-1106.4 [...]
I thus need a script which reads all the lines that have the same index entry.
I am an AWK newbie and I've started learning the language a couple of days ago. I am now uncertain whether I will be able to achieve what I want. Therefore:
Is this doable with AWK?
If not, whith what? SED?
If yes, how? I would be grateful if one provided a link describing how this can be implemented.
Finally, I know that there is a similar question: Can awk patterns match multiple lines?, however, I do not have a constant pattern which separates my data.
Thanks!

You could try this:
awk '{ar[$1]+=$2*($3+$4)}
END{for (key in ar)
{print key"="ar[key]}}' inputFile
For each line input we do the desired calculation and sum the result in an array. $1 serves as the key of the array.
When the entire file is read, we print the results in the END{...}-block.
The output for the given sample input is:
4=-30.28
8=-1133.4
9=43.36
10=-67.2
3=-185.5
If sorting of the output is required, you might want to have a look at gawk's asorti function or Linux' sort-command (e.g. awk '{...} inputFile' | sort -n).
This solution does not require that the input is sorted.

awk 'id!=$1{if(id){print id"="sum;sum=0};id=$1}{sum+=$2*($3+$4)} END{print id"="sum}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2

yet another similar awk
$ awk -v OFS="=" 'NR==1{p=$1}
p!=$1{print p,s; s=0; p=$1}
{s+=$2*($3+$4)}
END{print p,s}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
ps. Your calculation for "8" seems off.

Bash - count a pattern and print the line containing the pattern

everyone! While I was reading this discussion, "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could add the line containing the pattern next to the count values.
Somehow I wasn't able to add any comment on the discussion, so I'm posting a new question. Can somebody en-light me?
There must be some misunderstanding here, so I put an example.
Let's say, I have a DNA sequence like below and want to find out how many 'CG' are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I wish the example above will help to understand the question better.
Thank you!

You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
Taking help of tee and process substitution.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2

cat fileName | grep pattern | uniq -c

I just found a really simple and elegant solution using EXCEL.
The formula goes like below...
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula basically does is it counts total length of strings in a cell and length after removal of the pattern ("CG" in this case), then subtract them. Since each "CG" is replaced by blanks, 2 strings are missing after substitution, and you can get the number of the pattern by dividing it with length of your pattern which is 2 in this case.
For example, following sequence contains 50 strings and 13 CG's.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After substituting "CG" to blanks, you get 24 strings.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurances,
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
How simple is that!
You can see the original post in the following link.
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel#sthash.H4VfOkGB.dpbs
English is not my primary language, so please understand errors in my writing.
People are geniuses!

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on BASH/SHELL variables being passed but I don't want to learn BASH programming to simply pass a value to a variable.
Bonus: I actually have to search 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more, but how would I take a separate file of these 125 values, pass them individually to someVariable?
I have all sorts of ways to do this in BASH but I don't understand them and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps by an array since I do not believe a list exists yet)
Thank you as I am tired of beating my head into a wall.

I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.

"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here between the two exclamation points is the input for ranges and as the data file I used 1..100. Notice the double quotes instead of single quotes in the awk script.
If you have entered start end values in the file ranges, and your data in file data
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this
while read b e; do awk "/^$b$/,/^$e$/ {print > $b$e}" data; done < ranges

A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
BEGIN {
count = 1
while ("cat file1" | getline)
{
a[count] = $3
count++
}
}
The rest continues in much the same way. Anyway, maybe that works for you as well.

Calculating in NP++ with regular expressions

My document has x|y| in the beginning of each line, where x,y are integers between 0 and 300. E.g.:
1|1|text1
1|2|text2
1|3|text3
Now I want to make the following simple change: Each second number of every line should be subtracted by 1. So the above lines should be changed to
1|0|text1
1|1|text2
1|2|text3
Is that possible?

Okay, so here's something funny you can do, assuming the text is formatted like you indicated:
Do a first search and replace, replacing (\d+\|(\d+)\|.*) with ####\2\n\1. this will grab the index and output
1|1|text1
####1
1|2|text2
####2
1|3|text3
####3
do second search and replace to update the index of the row following the #### marker, replacing
####(\d)\n(\d+\|)\d+(\|.*)
with
\2\1\3
You need to update manually the first and last line, and you're good to go !
Since the numbers in the second column are following each other, all you need to do is increment their row.

A perl way to do the job:
perl -i.back -ape 's/\|(\d+)\|/"|".($1-1)."|"/e' in.txt
This replace all second number by this number minus one directly in the file (in.txt).
This file is save before in in.txt.back
Input file before:
1|1|text1
1|2|text2
1|3|text3
2|1|text4
3|1|text5
3|2|text6
after:
1|0|text1
1|1|text2
1|2|text3
2|0|text4
3|0|text5
3|1|text6

Regex can't do math. If the file is huge I suggest you to look into a simple python script or small C++ program to do the parsing job for you. Good luck.

missing awk ??? awwwww !!
awk -F'[|]' 'NR % 2 {num = $2-1;print $1"|"num"|"$3}' text.txt
where text.txt is the file containing your contents.
you can write the output into a file using the > operator
like this
awk -F'[|]' 'NR % 2 {num = $2-1;print $1"|"num"|"$3}' text.txt > final.txt

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Efficiently extracting columns/regex on a large file - regex

You can use awk command like this: awk 'BEGIN{FS=OFS="\t"} {sub(/:.*$/, "", $2); print $2, $1, $4}' file rs149353603 10 74656 rs140638708 10 75794 rs201043140 10 76210 rs202007578 10 76294 rs75914453 10 77582

Since your file is very large, you might find that using "cut" first is faster, along the lines of the following pipeline: cut -f 1,2,4 | awk ....

Related

Awk: make an average in file with both number and string

AWK: Pattern match multiline data with variable line number

Bash - count a pattern and print the line containing the pattern

simply pass a variable into a regex OR string search in awk

Calculating in NP++ with regular expressions

Categories

Resources