Keeping specific rows with grep function - regex

I have a large data set, and one variable contains values in different formats:
Subject Result
1 3
2 4
3 <4
4 <3
5 I need to go to school<>
6 I need to <> be there
7 2.3 need to be< there
8 <.3
9 .<9
10 ..<9
11 >3 need to go to school
12 <16.1
13 <5.0
I just want to keep the rows that contain "< number" or "> number" and drop the rows that contain free text (for example, I want to exclude ">3 need to go to school" and "I need to go to school<>"). The problem is that some records look like .<3, ..<9, >9., >:9. So how can I remove ".", ".." and ":" from the data set and then keep the rows with the "< a number" notation? How can I use the "grep" function?
Again, I just want to keep the following rows
Subject Result
3 <4
4 <3
8 <.3
9 .<9
10 ..<9
12 <16.1
13 <5.0

You can simply chain two greps: one to keep the lines containing < or >, and one to drop the lines containing letters:
grep "[><]" | grep -v "[A-Za-z]"
If you want to be pedantic, you can add a third grep to keep only the lines that also contain a digit:
grep "[><]" | grep -v "[A-Za-z]" | grep "[0-9]"
By the way, "grep -v" inverts the match: it prints only the lines that do not match the pattern.

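The grep pipeline above can be tried directly on the sample rows from the question. This is only a sketch: the variable names data and kept are made up for illustration, and the rows are fed in from a shell variable rather than a file.

```sh
# Sample rows copied from the question; the header line is dropped
# by the letter filter because it contains letters.
data='Subject Result
1 3
2 4
3 <4
4 <3
5 I need to go to school<>
6 I need to <> be there
7 2.3 need to be< there
8 <.3
9 .<9
10 ..<9
11 >3 need to go to school
12 <16.1
13 <5.0'

# Keep lines with < or >, drop lines with letters, keep lines with a digit.
kept=$(printf '%s\n' "$data" | grep '[<>]' | grep -v '[A-Za-z]' | grep '[0-9]')
printf '%s\n' "$kept"
```

On this sample the pipeline keeps subjects 3, 4, 8, 9, 10, 12 and 13, which is the set the question asks for. The stray "." and ":" characters are left in place; the pipeline only selects rows.
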
Assuming you're certain that [.,:;] are the only problematic punctuation:
# Remove every [.,;:] from the Result column. Note that this also strips decimal points, so "<16.1" becomes "<161".
df$Result <- gsub("[.,;:]", "", df$Result)
# Keep the rows where Result is < or > (optionally preceded by whitespace) followed only by digits.
df[grep("^\\s*[<>][0-9]+$", df$Result), ]

Related

Grep ascending order of cards. Why does it work?

The collection of cards I need to grep is defined as:
{h ∈ H | h contains only cards in ascending order regardless of their suit}
Example:
h = Ah2c2d3s5h6d8s8d9h9cTdTcKh
h != 3d4dQc3sKcAh2sAc7hKdKsKh4h62 (Q is followed by lower rank 3)
The ascending ranks of cards are:
A(ace) 2 3 4 5 6 7 8 9 T(ten) J Q K
The suits are defined as such:
c(clover) s(spade) h(heart) d(diamond)
I have tried the following grep and it is correct but I still don't
understand why it works.
Edit*** added -P flag (forgot about it) as pointed out by tripleee that just grep -v is indeed invalid.
grep -Pv "[KQJT].*[2-9A].* |[KQ].*[JT].* |[6-9].*[2-5A].* "
What baffles me is how K followed by Q got matched with this pattern or even 5 followed by [A2-4]
The solution has a total of 31027 lines
The text file provided for the exercise can be found here:
http://computergebruik.ugent.be/oefeningenreeks1/kaarten1.txt
Your regex is not at all valid, so I don't understand why you say it works.
Plain grep does not understand | to mean alternation. You can add an -E option to specify ERE (traditionally, egrep) regex semantics; or, with grep implementations that support it as an extension (such as GNU grep), backslash the |; or you can specify multiple -e options. (See e.g. https://en.wikipedia.org/wiki/Regular_expression#Standards for some background about the various regex dialects in common use.)
grep -Ev "[KQJT].*[2-9A].* |[KQ].*[JT].* |[6-9].*[2-5A].* "
grep -v "[KQJT].*[2-9A].* \|[KQ].*[JT].* \|[6-9].*[2-5A].* "
grep -ve "[KQJT].*[2-9A].* " -e "[KQ].*[JT].* " -e "[6-9].*[2-5A].* "
Even with this fix, the regex is obviously insufficient for removing matches where e.g. 3 is followed by 2. The only way to make it cover all cases is to enumerate every possibility. (Disallow 2 followed by any lower rank, 3 followed by any lower rank, 4 followed by any lower rank, etc.) An altogether better approach would be to use a scripting language of some sort, and basically just map the symbols to ones with the desired sort order, then check if the input is sorted.
If that is not an option, maybe try
grep -E '^(A.)*(2.)*(3.)*(4.)*(5.)*(6.)*(7.)*(8.)*(9.)*(T.)*(J.)*(Q.)*(K.)* '
which looks for zero or more aces, followed by zero or more twos, followed by zero or more threes, etc.
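The scripting approach mentioned above (map each rank to its position in the sort order, then check that the positions never decrease) could be sketched in awk roughly as follows. The names hands, order, ascending and by_regex are invented for this example. The two hands are the ones from the question, and the grep -E pattern is anchored with $ instead of the trailing space, on the assumption that these sample lines end right after the last card.

```sh
# The two example hands from the question (one per line).
hands='Ah2c2d3s5h6d8s8d9h9cTdTcKh
3d4dQc3sKcAh2sAc7hKdKsKh4h62'

# Ranks sit at odd positions (1, 3, 5, ...); suits at even positions.
# index() returns the 1-based position of the rank in the order string.
ascending=$(printf '%s\n' "$hands" | awk -v order='A23456789TJQK' '
{
    prev = 0
    ok = 1
    for (i = 1; i <= length($0); i += 2) {
        rank = index(order, substr($0, i, 1))
        # Reject unknown rank characters and any drop in rank.
        if (rank == 0 || rank < prev) { ok = 0; break }
        prev = rank
    }
    if (ok) print
}')
printf '%s\n' "$ascending"

# Same check with the enumerating regex, anchored at both ends.
by_regex=$(printf '%s\n' "$hands" |
    grep -E '^(A.)*(2.)*(3.)*(4.)*(5.)*(6.)*(7.)*(8.)*(9.)*(T.)*(J.)*(Q.)*(K.)*$')
```

Both checks keep only the first hand; the second is rejected as soon as Q is followed by 3.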

AWK: Pattern match multiline data with variable line number

I am trying to write a script which will analyze data from a pipe. The problem is, a single element is described in a variable number of lines. Look at the example data set:
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
In this case, the first entry is what defines the event to which the following data belongs. In the case of event number 8, we have data in 3 lines. To simplify the rather complex problem that I am trying to solve, let us imagine, that I want to calculate the following expression:
sum_i($2 * ($3 + $4))
Where i is taken over all lines belonging to a given element. The output I want to produce would then look like:
3=-185.5 [14(-30.48+17.23) ]
4=-30.28 [1(-18.01+12.69) + 3(-11.01+2.69)]
8=-1106.4 [...]
I thus need a script which reads all the lines that have the same index entry.
I am an AWK newbie and I've started learning the language a couple of days ago. I am now uncertain whether I will be able to achieve what I want. Therefore:
Is this doable with AWK?
If not, with what? SED?
If yes, how? I would be grateful if one provided a link describing how this can be implemented.
Finally, I know that there is a similar question: Can awk patterns match multiple lines?, however, I do not have a constant pattern which separates my data.
Thanks!
You could try this:
awk '{ar[$1]+=$2*($3+$4)}
END{for (key in ar)
{print key"="ar[key]}}' inputFile
For each line input we do the desired calculation and sum the result in an array. $1 serves as the key of the array.
When the entire file is read, we print the results in the END{...}-block.
The output for the given sample input is:
4=-30.28
8=-1133.4
9=43.36
10=-67.2
3=-185.5
If the output needs to be sorted, you might want to have a look at gawk's asorti function or the sort command (e.g. awk '{...}' inputFile | sort -n).
This solution does not require that the input is sorted.
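As a small sketch of the sort suggestion, the array-based script can be piped through sort so that the events come out in numeric order. The variable names data and sorted are made up for the example, and the rows are the sample data from the question held in a shell variable instead of inputFile.

```sh
# Sample rows from the question.
data='3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28'

# Sum $2*($3+$4) per event id, then sort numerically on the part before "=".
sorted=$(printf '%s\n' "$data" |
    awk '{ sum[$1] += $2 * ($3 + $4) }
         END { for (key in sum) print key "=" sum[key] }' |
    sort -t= -k1,1n)
printf '%s\n' "$sorted"
```

The -t= -k1,1n options restrict the numeric comparison to the event id, so the sign of the sum does not influence the order.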
awk 'id!=$1{if(id){print id"="sum;sum=0};id=$1}{sum+=$2*($3+$4)} END{print id"="sum}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
yet another similar awk
$ awk -v OFS="=" 'NR==1{p=$1}
p!=$1{print p,s; s=0; p=$1}
{s+=$2*($3+$4)}
END{print p,s}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
ps. Your calculation for "8" seems off.

Bash - count a pattern and print the line containing the pattern

Hi everyone! While I was reading the discussion "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could print the line containing the pattern next to the count values.
Somehow I wasn't able to comment on that discussion, so I'm posting a new question. Can somebody enlighten me?
There must be some misunderstanding here, so here is an example.
Let's say, I have a DNA sequence like below and want to find out how many 'CG' are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I hope the example above helps to make the question clearer.
Thank you!
You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
This uses tee and process substitution.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2
cat fileName | grep pattern | uniq -c
I just found a really simple and elegant solution using EXCEL.
The formula goes like below...
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula does is take the total number of characters in a cell and the number of characters left after removing the pattern ("CG" in this case), then subtract the second from the first. Since each "CG" is replaced by an empty string, 2 characters go missing per occurrence, so dividing the difference by the length of the pattern (2 in this case) gives the number of occurrences.
For example, the following sequence contains 50 characters and 13 CG's.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After replacing each "CG" with an empty string, 24 characters remain.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurrences:
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
How simple is that!
You can see the original post in the following link.
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel#sthash.H4VfOkGB.dpbs
English is not my primary language, so please excuse any errors in my writing.
People are geniuses!
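The same length-difference idea can be carried over to awk, which also prints each line next to its count as the question asks. This is a sketch rather than a tested recipe for arbitrary patterns: seqs, counts and pat are names invented here, and pat is passed to gsub as a regular expression, so it is assumed to contain no regex metacharacters.

```sh
# The four sample sequences from the question.
seqs='ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG'

# For each line: delete every occurrence of pat from a copy, then divide
# the number of characters removed by the length of pat.
counts=$(printf '%s\n' "$seqs" | awk -v pat='CG' '{
    stripped = $0
    gsub(pat, "", stripped)
    n = (length($0) - length(stripped)) / length(pat)
    print n, $0
}')
printf '%s\n' "$counts"
```

For these four lines the counts are 0, 1, 0 and 4, matching the output requested in the question.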

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it as simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on BASH/SHELL variables being passed but I don't want to learn BASH programming to simply pass a value to a variable.
Bonus: I actually have to search for 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more: how would I take a separate file of these 125 values and pass them individually to someVariable?
I have all sorts of ways to do this in BASH but I don't understand them and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps by an array since I do not believe a list exists yet)
Thank you as I am tired of beating my head into a wall.
I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
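For the simpler case in the question, where a single pattern is handed to the script, awk's -v option assigns a value to an awk variable, and the ~ operator matches a line against the regex held in that variable. A minimal sketch, with invented sample lines and the made-up names lines, pat and matches:

```sh
# Invented sample lines standing in for file.txt.
lines='CHEM101 Intro to Chemistry
BIO200 Cell Biology
ENG150 Writing
ORGCHEM301 Organic Chemistry'

# Try each pattern in turn; -v pat=... sets the awk variable pat,
# and $0 ~ pat tests the whole line against it as a regex.
matches=$(for pat in '^CHEM' '^BIO' '^ENG'; do
    printf '%s\n' "$lines" | awk -v pat="$pat" '$0 ~ pat { print NR, $0 }'
done)
printf '%s\n' "$matches"
```

Writing /pat/ would instead search for the literal text "pat", which is why the dynamic form uses $0 ~ pat. For an exact comparison against the first field, $1 == pat works the same way.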
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here the input for the ranges is the here-document between the two exclamation points, and the data file is the output of seq 1 100. Notice the double quotes instead of single quotes around the awk script, so that the shell expands $b and $e.
If you have the start and end values in the file ranges, and your data in the file data:
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this (the output file name is quoted so that awk treats it as a string):
while read b e; do awk "/^$b$/,/^$e$/ {print > \"$b$e\"}" data; done < ranges
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
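BEGIN {
    count = 1
    # getline returns 1 per line read, 0 at end of input, -1 on error
    while (("cat file1" | getline) > 0) {
        a[count] = $3   # store the third field of each line
        count++
    }
    close("cat file1")
}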
The rest continues in much the same way. Anyway, maybe that works for you as well.

Substring in UNIX

Suppose I have a string "123456789".
I want to extract the 3rd, 6th, and 8th element. I guess I can use
cut -3, -6, -8
But if this gives
368
Suppose I want to separate them by a white space to get
3 6 8
What should I do?
Actually shell parameter expansion lets you do substring slicing directly, so you could just do:
x='123456789'
echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
Update
To do this over an entire file, read the line in a loop:
while read x; do
echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
done < file
(By the way, bash slicing is zero-indexed, so if you want the numbers '3', '6' and '8' you'd really want ${x:2:1}, ${x:5:1} and ${x:7:1}.)
You can use the sed tool and issue this command in your terminal:
sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
Explained RegEx: http://regex101.com/r/fH7zW6
To "generalize" this on a file you can pipe it after a cat like so:
cat file.txt|sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
Perl one-liner.
perl -lne '@A = split //; print "$A[2] $A[5] $A[7]"' file
Using cut:
$ cat input
1234567890
2345678901
3456789012
4567890123
5678901234
$ cut -b3,6,8 --output-delimiter=" " input
3 6 8
4 7 9
5 8 0
6 9 1
7 0 2
The -b option selects only the specified bytes. The output delimiter can be specified using --output-delimiter.
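The --output-delimiter option is a GNU extension and may be missing from other cut implementations. Where that is a concern, awk's substr function (which counts positions from 1) can pick the same characters. A sketch using the same five input lines, with input and picked as made-up variable names:

```sh
# Same sample input as in the cut example above.
input='1234567890
2345678901
3456789012
4567890123
5678901234'

# substr(s, start, length): take one character at positions 3, 6 and 8.
# The commas in print insert the output field separator (a space by default).
picked=$(printf '%s\n' "$input" |
    awk '{ print substr($0, 3, 1), substr($0, 6, 1), substr($0, 8, 1) }')
printf '%s\n' "$picked"
```

This produces the same five lines as the cut command above.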