Regular Expression - Pattern - regex

I am new to shell scripting. I am trying to write code that greps a few lines from a huge file based on certain conditions.
Contents of the file, say names.txt:
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
3 chdc4326aee6 om,bhuvi,01-Oct-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
5 praaeei5 om,lucknow,11-Nov-2016
6 aetaeen6pana om,phanto,13-Oct-2016
and goes on for 500 or more entries.
Now, I am looking for output for the following :
Filter lines that contain "aee" only. So, the output will look like:
3 chdc4326aee6 om,bhuvi,01-Oct-2016
5 praaeei5 om,lucknow,11-Nov-2016
Filter lines that contain "ae", with or without "aee". So, the output will look like:
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Filter lines that contain "ae" only (no "aee"). So, the output will look like:
2 chdc501ae om,patna,26-Aug-2016
Any suggestions, please? You could also point me to a good place to get more information about this, so I can learn.

Use grep with the -P option (Perl-compatible regex) and lookaheads.
The file:
$ cat data.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
3 chdc4326aee6 om,bhuvi,01-Oct-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
5 praaeei5 om,lucknow,11-Nov-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Find aee but not ae:
$ grep -P '^(?:(?=.*aee[^e]))?(?!.*ae[^e]).*(aee)[^e]' data.txt
3 chdc4326aee6 om,bhuvi,01-Oct-2016
5 praaeei5 om,lucknow,11-Nov-2016
Find ae or ae + aee:
$ grep -P '^(?:(?!.*aee[^e]))?(?=.*ae[^e]).*(aee?)[^e]' data.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Find ae only:
$ grep -P '^(?!.*aee[^e])(?=.*ae[^e]).*(ae)[^e]' data.txt
2 chdc501ae om,patna,26-Aug-2016
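A note on how these patterns work: the [^e] after the literal is what separates ae from aee (it forbids a following e), and the anchored lookaheads assert whether such an occurrence exists anywhere on the line. The optional groups (?:...)? wrapping the first lookahead in the first two commands always succeed, so they can be dropped. If I read the patterns right, these shorter variants (a sketch, checked only against the sample data) give the same output:
$ grep -P '^(?!.*ae[^e]).*aee[^e]' data.txt   # aee but not ae
$ grep -P 'ae[^e]' data.txt                   # ae, with or without aee
$ grep -P '^(?!.*aee).*ae[^e]' data.txt       # ae only
One caveat for both versions: [^e] must consume a character, so an ae or aee at the very end of a line would be missed; ae(?!e) and aee(?!e) avoid that if it matters.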

Related

Keeping specific rows with grep function

I have a large data set, and the variable comes in different formats:
Subject Result
1 3
2 4
3 <4
4 <3
5 I need to go to school<>
6 I need to <> be there
7 2.3 need to be< there
8 <.3
9 .<9
10 ..<9
11 >3 need to go to school
12 <16.1
13 <5.0
I just want to keep the rows that include "< number" or "> number", and not the rows with text (for example, I want to exclude ">3 need to go to school" and "I need to go to school<>"). The problem is that some records look like .<3, ..<9, >9., >:9. So how can I remove ".", "..", ":" from the data set and then keep the rows with the "< a number" notation? How can I use the "grep" function?
Again, I just want to keep the following rows
Subject Result
3 <4
4 <3
8 <.3
9 .<9
10 ..<9
12 <16.1
13 <5.0
You can simply apply two greps: one to find the "<" or ">" keys, and one to eliminate fields with letters:
grep "[><]" | grep -v "[A-Za-z]"
If you want to be pedantic, you can also apply another grep to find those with numbers
grep "[><]" | grep -v "[A-Za-z]" | grep "[0-9]"
"grep -v" means match and don't return, by the way.
Assuming you're certain that [.,;:] are the only problematic punctuation characters:
cleaned <- gsub("[.,;:]", "", df$Result)  # strip the stray punctuation into a temporary, for matching only
df[grep("^\\s*[<>][0-9]+$", cleaned),]    # keep rows whose cleaned result is < or > followed by a number
Matching against a cleaned copy (rather than overwriting df$Result) keeps values like <16.1 and ..<9 intact in the output.

Removing Leading 0 and applying Regex to Sed

I have several file names, for ease I've put them in a file as follows:
01.action1.txt
04action2.txt
12.action6.txt
2.action3.txt
020.action9.txt
10action4.txt
15action7.txt
021action10.txt
11.action5.txt
18.action8.txt
As you can see, the formats aren't consistent. What I'm trying to do is extract the leading numbers from these file names: 1, 4, 12, 2, 20, etc.
I have the following regex
(\.)?action\d{1,}.txt
This successfully matches .action[number].txt, but I also need to match the leading 0 and apply the whole thing as a substitute-with-blank in sed, so I'm left with only the leading numbers. I'm having trouble matching the leading 0 and putting the whole thing into sed.
Thanks
With GNU sed:
sed -r 's/0*([0-9]*).*/\1/' file
Output:
1
4
12
2
20
10
15
21
11
18
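The substitution works because sed takes the leftmost match: 0* greedily consumes any leading zeros, ([0-9]*) captures the digits that follow, and .* discards the rest of the name. An anchored variant that may read more clearly (same output on this data; -E is the newer spelling of -r):
$ sed -E 's/^0*([0-9]+).*/\1/' file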
See: The Stack Overflow Regular Expressions FAQ
I don't know if the awk below is helpful, but it works as well:
awk '{print $1 + 0}' file
1
4
12
2
20
10
15
21
11
18
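The trick here is awk's numeric coercion: adding 0 forces $1 to be converted to a number, and the conversion reads only the leading number of the string, dropping leading zeros. A quick check:
$ echo '020.action9.txt' | awk '{print $1 + 0}'
20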

Using AWK to compare two files each having a single column and get a count against each matched item

I am going to split my problem into two problems.
Problem 1
I have two numerically sorted files, each having a single column, as below. File t1.txt has unique values. File t2.txt has duplicate values.
file1: t1.txt
1
2
3
4
5
file2: t2.txt
0
2
2
3
4
7
8
9
9
The output I require is as below:
item matched --> number of times it matched in t2.txt
With awk I am using this:
awk 'FNR==NR {a[$1]; next} $1 in a' t2.txt t1.txt
The output I get is:
2
3
4
However I want this:
2 --> 2
3 --> 1
4 --> 1
Problem 2
I am going to run this on large files. The actual target files have the line counts below:
t1.txt 9702304
t2.txt 32412065
How can we enhance the performance of the script/solution as the file size increases? Please consider that both files will have exactly one column and will be numerically sorted.
Will appreciate your help here. Thanks!
If you don't need to use awk, this pipeline gets you most of the way there:
$ grep -Fxf t1.txt t2.txt | sort | uniq -c
2 2
1 3
1 4
To get the exact output format you asked for, join the two sorted files, count, and reformat with awk:
$ join <(sort t1.txt) <(sort t2.txt) | uniq -c | awk '{ print $2 " --> " $1}'
2 --> 2
3 --> 1
4 --> 1
(Of course you can skip the sort if the files are really already sorted, though note that join needs lexicographic order, not numeric: once values go past one digit, 10 sorts before 9 lexicographically, so a numerically sorted file is not sorted the way join expects.)
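For reference, the flags in the grep command do the heavy lifting: -F treats the patterns as fixed strings rather than regexes, -x matches whole lines only, and -f t1.txt reads one pattern per line from t1.txt. Without -x, substring hits would inflate the counts:
$ echo 12 | grep -F 1     # matches: 1 is a substring of 12
12
$ echo 12 | grep -Fx 1    # no output: 1 is not the whole line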
For your problem 1, this one-liner should help:
awk 'NR==FNR{a[$1];next}$1 in a{b[$1]++}END{for(x in b)printf "%s --> %s\n", x, b[x]}' f1 f2
tested with your data:
kent$ head f*
==> f1 <==
1
2
3
4
5
==> f2 <==
2
3
4
2
7
8
9
9
0
kent$ awk 'NR==FNR{a[$1];next}$1 in a{b[$1]++}END{for(x in b)printf "%s --> %s\n", x, b[x]}' f1 f2
2 --> 2
3 --> 1
4 --> 1
For problem 2, you can test this one-liner on your files and see whether the performance is acceptable.
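If memory becomes a problem (the hash above holds all of t1.txt), the numeric sort order can be exploited to stream both files in a single pass with constant memory. A sketch, untested at that scale, assuming both files are sorted exactly as stated:
awk -v t2=t2.txt '
BEGIN { ok = (getline v < t2) > 0 }          # prime the first t2 value
{
    while (ok && v < $1)                     # skip t2 values below the current t1 value
        ok = (getline v < t2) > 0
    c = 0
    while (ok && v == $1) {                  # count t2 values equal to it
        c++
        ok = (getline v < t2) > 0
    }
    if (c) printf "%s --> %d\n", $1, c
}' t1.txt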

grep -c value NH:i:1 only for every line in file, not also NH:i:12

cat samtry.txt | grep -c NH:i:1
See an example of three lines below; the bold information is what's important:
HWI-ST697:178:D1U9CACXX:1:2111:12787:5687 153 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DCDDDDDDDDDDDEEEEEEEEFGHGJIHGHFHJIJIJJIJJJJIHJJIJIIIFJJIGGGIJJJIIJJHIGJIJJJGHJJIJIJIGFJJGHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:1**
HWI-ST697:178:D1U9CACXX:3:1310:18383:72540 89 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DDDDDDDDDDDDDEEEEEEFFFHHHIIJJIIIJIJJJJJJJJJJHJJJJJJJJJJJJJIJJJJJJJJIJJJIJJIJJJJJJJJIHFJJHHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:11**
HWI-ST697:178:D1U9CACXX:7:1212:17559:76798 89 scaffold_1 33007 50 101M * 0 0 CTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACAAG DDDDDDDDDDDDDEEEECDFFHGHIGJIIHJJJIIJJJJJJHHJJJJJJJJJJJIIIJJJJGIIGBJJIJJJJIJJJJJIHHHFJJIJHHHHGFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:16T26G57YT:Z:UU **NH:i:1**
I am trying to use a shell script to count all the lines in a tab-delimited file (test file: samtry.txt, which contains 10 lines to test on) that contain the regular expression NH:i:1.
The problem is, of course, that I get the information I wanted, but it also counts the lines containing NH:i:1x (where x is any possible digit: 0-9).
The NH:i:x field (where x can be any number up to around 50) is field 20 on every line; it is not the last field. Every line has 23 fields.
Does anyone know how to do this with grep or another tool?
I have around 100 files, each around 3 GB in size, and I don't know how to solve this problem.
I hope I have given enough information; I am happy about every answer.
Try grep with word boundaries:
grep -c '\<NH:i:1\>' samtry.txt
Or grep -w:
grep -wc 'NH:i:1' samtry.txt
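If word boundaries ever prove too loose, a PCRE lookahead states the requirement directly (a sketch; assumes a grep with -P support):
$ grep -cP 'NH:i:1(?!\d)' samtry.txt    # the 1 must not be followed by another digit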

Substring in UNIX

Suppose I have a string "123456789".
I want to extract the 3rd, 6th, and 8th characters. I guess I can use
cut -b3,6,8
But this gives
368
and I want to separate them with a space to get
3 6 8
What should I do?
Actually shell parameter expansion lets you do substring slicing directly, so you could just do:
x='123456789'
echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
Update
To do this over an entire file, read the line in a loop:
while IFS= read -r x; do
    echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
done < file
(By the way, bash slicing is zero-indexed, so if you want the characters '3', '6' and '8' you'd really want ${x:2:1}, ${x:5:1} and ${x:7:1}.)
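Applying that correction, a quick check:
$ x='123456789'
$ echo "${x:2:1}" "${x:5:1}" "${x:7:1}"
3 6 8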
You can use the sed tool and issue this command in your terminal:
sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
Explained RegEx: http://regex101.com/r/fH7zW6
To "generalize" this on a file you can pipe it after a cat like so:
cat file.txt|sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
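As an aside, the cat is not strictly needed, since sed reads files directly:
$ sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/" file.txt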
Perl one-liner:
perl -lne '@A = split //; print "$A[2] $A[5] $A[7]"' file
Using cut:
$ cat input
1234567890
2345678901
3456789012
4567890123
5678901234
$ cut -b3,6,8 --output-delimiter=" " input
3 6 8
4 7 9
5 8 0
6 9 1
7 0 2
The -b option selects only the specified bytes. The output delimiter can be specified using --output-delimiter.
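Note that --output-delimiter is a GNU extension; where it is unavailable, a small sed post-process gives the same spacing (a sketch):
$ cut -b3,6,8 input | sed 's/./& /g; s/ $//'
3 6 8
4 7 9
5 8 0
6 9 1
7 0 2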