grep -c value NH:i:1 only for every line in file, not also NH:i:12 - regex

cat samtry.txt | grep -c NH:i:1
See an example of three lines below; the bold information is what's important.
HWI-ST697:178:D1U9CACXX:1:2111:12787:5687 153 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DCDDDDDDDDDDDEEEEEEEEFGHGJIHGHFHJIJIJJIJJJJIHJJIJIIIFJJIGGGIJJJIIJJHIGJIJJJGHJJIJIJIGFJJGHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:1**
HWI-ST697:178:D1U9CACXX:3:1310:18383:72540 89 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DDDDDDDDDDDDDEEEEEEFFFHHHIIJJIIIJIJJJJJJJJJJHJJJJJJJJJJJJJIJJJJJJJJIJJJIJJIJJJJJJJJIHFJJHHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:11**
HWI-ST697:178:D1U9CACXX:7:1212:17559:76798 89 scaffold_1 33007 50 101M * 0 0 CTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACAAG DDDDDDDDDDDDDEEEECDFFHGHIGJIIHJJJIIJJJJJJHHJJJJJJJJJJJIIIJJJJGIIGBJJIJJJJIJJJJJIHHHFJJIJHHHHGFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:16T26G57YT:Z:UU **NH:i:1**
I am trying to use a shell script to count all the lines in a tab-delimited file (test file: samtry.txt, which contains 10 lines to test on) that contain the tag NH:i:1.
The problem is that I get the lines I want, but the count also includes lines containing NH:i:1x (where x is any digit, 0-9).
NH:i:x (x = any number up to around 50) is at position 20 in every line of the file; it's not the last position of the line. Every line has 23 'positions'.
Does anyone know how to do this with grep or another tool?
I've got around 100 files, each around 3 GB in size, and I don't know how to solve this problem.
I hope I have given enough information; I am grateful for any answer.

Try grep with word boundaries:
grep -c '\<NH:i:1\>' samtry.txt
Or use grep -w, which only accepts matches that are whole words:
grep -wc 'NH:i:1' samtry.txt
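A minimal sketch of the difference, assuming a hypothetical three-line sample that keeps only a read name and a few tags (the real lines carry many more columns). It also shows an awk variant that compares whole tab-separated fields instead of using a regex; that variant is an alternative, not part of the answer above.

```shell
# Hypothetical sample: two lines tagged NH:i:1 and one tagged NH:i:11.
sample=$(mktemp)
printf 'r1\tAS:i:-11\tNH:i:1\tYT:Z:UU\n'  > "$sample"
printf 'r2\tAS:i:-11\tNH:i:11\tYT:Z:UU\n' >> "$sample"
printf 'r3\tAS:i:-11\tNH:i:1\n'           >> "$sample"

# Plain substring match: also counts the NH:i:11 line.
count_plain=$(grep -c 'NH:i:1' "$sample")

# -w requires a non-word character (or a line edge) on both sides of the match.
count_w=$(grep -wc 'NH:i:1' "$sample")

# Alternative: exact comparison of each tab-separated field, no regex involved.
count_awk=$(awk -F'\t' '{ for (i = 1; i <= NF; i++) if ($i == "NH:i:1") { n++; break } } END { print n + 0 }' "$sample")

echo "plain: $count_plain, grep -w: $count_w, awk: $count_awk"
rm -f "$sample"
```

On this sample the plain count is 3, while both stricter variants report 2.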

grep single digit occurs one time in line

I need help with a grep command that matches lines in which some digit occurs exactly once.
My solution doesn't work:
egrep "^(\s*[1]\s*)(\s*[^1]+\s*)+$|^(\s*[^1]\s*)(\s*[1]+\s*)+$|^(\s*[2]\s*)(\s*[^2]+\s*)+$|^(\s*[^2]\s*)(\s*[2]+\s*)+$|^(\s*[3]\s*)(\s*[^3]+\s*)+$|^(\s*[^3]\s*)(\s*[3]+\s*)+$|^(\s*[4]\s*)(\s*[^4]+\s*)+$|^(\s*[^4]\s*)(\s*[4]+\s*)+$|^(\s*[5]\s*)(\s*[^5]+\s*)+$|^(\s*[^5]\s*)(\s*[5]+\s*)+$|^(\s*[6]\s*)(\s*[^6]+\s*)+$|^(\s*[^6]\s*)(\s*[6]+\s*)+$|^(\s*[7]\s*)(\s*[^7]+\s*)+$|^(\s*[^7]\s*)(\s*[7]+\s*)+$|^(\s*[8]\s*)(\s*[^8]+\s*)+$|^(\s*[^8]\s*)(\s*[8]+\s*)+$|^(\s*[9]\s*)(\s*[^9]+\s*)+$|^(\s*[^9]\s*)(\s*[9]+\s*)+$"
For example, in this text:
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
grep highlights only the second line.
I want grep to highlight every line, because in each line some digit occurs exactly once: in the first line it is 5, in the second line it is 5, and in the third line it is 7.
A pattern that detects if a digit is unique on a line (if I'm understanding the question correctly):
For the digit 5:
^[^5]*(5)[^5]*$
^ // start of line
[^5]* // any char not 5, 0-or-more
(5) // 5
[^5]* // any char not 5, 0-or-more
$ // end of line
To test all digits, it becomes:
^(?:[^0]*(0)[^0]*|[^1]*(1)[^1]*)$ and so on for all digits. The matched digit is captured in the group of whichever alternative matched.
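As a rough sketch, the per-digit pattern can be tried one digit at a time with grep -E. The helper name unique_digits is made up for this illustration; it reports which digits occur exactly once in a line.

```shell
# Print every digit that occurs exactly once in the given line
# (unique_digits is a hypothetical helper name).
unique_digits() {
    line=$1
    out=''
    for d in 0 1 2 3 4 5 6 7 8 9; do
        # ^[^d]*d[^d]*$ : one d, surrounded only by characters that are not d
        if printf '%s\n' "$line" | grep -Eq "^[^${d}]*${d}[^${d}]*\$"; then
            out="$out$d"
        fi
    done
    printf '%s\n' "$out"
}

unique_digits '012 210 5'
unique_digits '6343 232 5 3423'
```

The first call prints 5 and the second prints 56, since both 5 and 6 occur once in that line.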
I'm really unsure what the expected output should be (please add it to the question), but here is an attempt using GNU awk. First, the test data:
$ cat foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then:
$ awk -F '' '{
    delete a
    for(i=1;i<=NF;i++)
        if($i~/[0-9]/)
            a[$i]++
    for(i in a)
        if(a[i]==1 && match($0, "[^" i "]*" i "[^" i "]*")) {
            print $0
            next # second data line has 2 matches
        }
}' foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then again, it's shorter just to do:
$ awk '{for(i=0;i<=9;i++)if(gsub(i,i,$0)==1){print;next}}' foo
I'm not absolutely sure what you're after, but if it's matching lines that only contain one instance of a digit, try this:
[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*
or grepified
grep -x "[^0]*0[^0]*\|[^1]*1[^1]*\|[^2]*2[^2]*\|[^3]*3[^3]*\|[^4]*4[^4]*\|[^5]*5[^5]*\|[^6]*6[^6]*\|[^7]*7[^7]*\|[^8]*8[^8]*\|[^9]*9[^9]*"
(-x makes grep match the full line.)
The regex uses 10 alternatives of identical shape, one for each digit. Each alternative:
makes sure zero or more of anything but the digit starts the line,
matches the one allowed digit,
makes sure zero or more of anything but the digit ends the line.
See it here at regex101.
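The ten alternatives do not have to be typed by hand. The following is a sketch that builds the same pattern in a loop and passes it to grep -E -x; the function name build_unique_digit_pattern is an assumption made for this example.

```shell
# Build the ten alternatives in a loop instead of typing them out
# (build_unique_digit_pattern is a hypothetical helper name).
build_unique_digit_pattern() {
    pattern=''
    for d in 0 1 2 3 4 5 6 7 8 9; do
        # ${pattern:+...} adds the "|" separator only after the first alternative
        pattern="${pattern:+$pattern|}[^${d}]*${d}[^${d}]*"
    done
    printf '%s\n' "$pattern"
}

# -E enables "|" as alternation; -x requires the pattern to match the full line.
printf '%s\n' '012 210 5' '11 22' '234 12 43' | grep -Ex "$(build_unique_digit_pattern)"
```

With this input, grep prints the first and third lines; '11 22' is dropped because no digit in it occurs exactly once.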

Regular Expression - Pattern

I am new to shell scripting. I am trying to write code that greps a few lines from a huge file based on certain conditions.
Contents of file, say names.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
3 chdc4326aee6 om,bhuvi,01-Oct-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
5 praaeei5 om,lucknow,11-Nov-2016
6 aetaeen6pana om,phanto,13-Oct-2016
and goes on for 500 or more entries.
Now, I am looking for the following outputs:
Filter lines that contain only "aee". So, the output will look like:
3 chdc4326aee6.om,bhuvi,01-Oct-2016
5 praaeei5 om,lucknow,11-Nov-2016
Filter lines that contain only "ae", or both "ae" and "aee". So, the output will look like:
1 ae1aee2sonata.hqr,vadodara,23-Aug-2016
2 chdc501ae.om,patna,26-Aug-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Filter lines with only "ae" from the file. So, the output will look like:
2 chdc501ae.om,patna,26-Aug-2016
Any suggestions, please? You can also point me to a good place to learn more about this.
Use grep with the -P option and lookaheads.
The file:
$ cat data.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
3 chdc4326aee6 om,bhuvi,01-Oct-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
5 praaeei5 om,lucknow,11-Nov-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Find aee but not ae :
$ grep -P '^(?:(?=.*aee[^e]))?(?!.*ae[^e]).*(aee)[^e]' data.txt
3 chdc4326aee6 om,bhuvi,01-Oct-2016
5 praaeei5 om,lucknow,11-Nov-2016
Find ae or ae + aee :
$ grep -P '^(?:(?!.*aee[^e]))?(?=.*ae[^e]).*(aee?)[^e]' data.txt
1 ae1aee2sonata om,vadodara,23-Aug-2016
2 chdc501ae om,patna,26-Aug-2016
4 ae3aee6prsons hqr,bangalore,29-Aug-2016
6 aetaeen6pana om,phanto,13-Oct-2016
Find ae only :
$ grep -P '^(?!.*aee[^e])(?=.*ae[^e]).*(ae)[^e]' data.txt
2 chdc501ae om,patna,26-Aug-2016
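If grep -P is not available, the same three groups can be separated with plain awk. This is a different technique from the lookahead answer above, offered only as a sketch: it masks every "aee" first, so any "ae" that remains must be a standalone one. The function name classify and the labels are made up for the illustration.

```shell
# Label each line as both / aee-only / ae-only / neither
# (classify is a hypothetical helper; reads files given as arguments, or stdin).
classify() {
    awk '{
        rest = $0
        has_aee = (gsub(/aee/, "\n", rest) > 0)   # mask every "aee" with a newline
        has_ae  = (rest ~ /ae/)                    # an "ae" that is left is standalone
        if (has_aee && has_ae) label = "both"
        else if (has_aee)      label = "aee-only"
        else if (has_ae)       label = "ae-only"
        else                   label = "neither"
        print label ": " $0
    }' "$@"
}

printf '%s\n' \
    '1 ae1aee2sonata om,vadodara,23-Aug-2016' \
    '2 chdc501ae om,patna,26-Aug-2016' \
    '3 chdc4326aee6 om,bhuvi,01-Oct-2016' | classify
```

Piping the result through grep '^aee-only' (or one of the other labels) gives the individual lists. The newline is used as the mask because it cannot occur inside an awk record, so masking never creates a new "ae" by accident.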

Removing Leading 0 and applying Regex to Sed

I have several file names; for ease, I've put them in a file as follows:
01.action1.txt
04action2.txt
12.action6.txt
2.action3.txt
020.action9.txt
10action4.txt
15action7.txt
021action10.txt
11.action5.txt
18.action8.txt
As you can see, the formats aren't consistent. What I'm trying to do is extract the leading numbers from these file names: 1, 4, 12, 2, 20, etc.
I have the following regex
(\.)?action\d{1,}.txt
This successfully matches .action[number].txt, but I also need to match the leading 0 and use the whole thing in a sed substitution that replaces it with nothing, so that I'm left with only the leading numbers. I'm having trouble matching the leading 0 and applying the whole thing in sed.
Thanks
With GNU sed:
sed -r 's/0*([0-9]*).*/\1/' file
Output:
1
4
12
2
20
10
15
21
11
18
See: The Stack Overflow Regular Expressions FAQ
I don't know if the below awk is helpful but it works as well:
awk '{print $1 + 0}' file
1
4
12
2
20
10
15
21
11
18
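A slightly stricter variant of the sed approach, as a sketch: it anchors the match at the start of the line and requires at least one digit after the stripped zeros, so a name such as 0.action.txt (a hypothetical input, not in the question) yields 0 rather than an empty line. It uses -E, which both GNU and BSD sed accept; the function name leading_number is made up here.

```shell
# Strip leading zeros and keep only the leading number of each line
# (leading_number is a hypothetical helper; reads stdin).
leading_number() {
    # ^0*        : leading zeros, dropped
    # ([0-9]+)   : the number itself, at least one digit, kept as \1
    # .*         : rest of the name, dropped
    sed -E 's/^0*([0-9]+).*/\1/'
}

printf '%s\n' 01.action1.txt 04action2.txt 020.action9.txt 10action4.txt | leading_number
```

For these four names the output is 1, 4, 20 and 10, one per line.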

Bash script to split a file by grep everything till the second time match in a column into one file and the rest into another

I am trying to split a file with data like this:
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if there is a match of "411" in the second column (as in "67 0.4411"), my script cuts prematurely at that line.
I am unable to restrict the match to the first column only; "411" can occur any number of times in the second column, and those occurrences are not of interest.
Any help would be greatly appreciated.
One idea is to use this combination of commands:
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -vxF -f f1 input > f2
If your input file is bigger, you should repeat the second step.
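I don't know anything about Bash, but for the regex I think you should indicate that the line begins with 411, for example with ^411 (a word boundary such as \b411 is not enough, because it would still match a value like 0.411 in the second column).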
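A single awk pass can also do the split by comparing the first column as a field rather than matching a regex against the whole line. This is a sketch, not the method from the answers above; the function name split_at_first_column is hypothetical, while the File1_ and File2_ prefixes follow the question.

```shell
# Split a file after the first line whose FIRST column equals a given value.
# $1 = input file, $2 = value that ends the first block (e.g. 411)
# (split_at_first_column is a hypothetical helper name).
split_at_first_column() {
    awk -v last="$2" -v f1="File1_$1" -v f2="File2_$1" '
        { print > (second ? f2 : f1) }          # write to the current block
        !second && $1 == last { second = 1 }    # switch after the first match
    ' "$1"
}
```

Calling split_at_first_column data.txt 411 (data.txt being an assumed file name) writes everything up to and including the first line whose first column is 411 to File1_data.txt and the remainder to File2_data.txt. A value such as 0.4411 in the second column is never inspected, so it cannot trigger the split.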

Retrieving digits from multiple file names using regex

Given files:
aaabbcc.43.311b.file
ddeeff.x51.311b.file
ffg.1.311b.file
hh.ii.jj.x26.311b.file
ll.m.311.311b.file
How would I get the numbers within the file name but not 311b? So I would like to get 43, 51, 1, 26 and 311.
You can do it with grep:
grep -o '[0-9]\+\b' test.text
sed 's#[^0-9]\+\([0-9]\+\).*#\1#' INPUTFILE
This gives the needed output for the example lines. It searches each input line for the first group of digits that follows at least one non-digit character, and prints only that group.
% ls
aaabbcc.43.311b.file ddeeff.x51.311b.file ffg.1.311b.file hh.ii.jj.x26.311b.file ll.m.311.311b.file
% ls|grep -o -P '\d+(?=\.311b\.file)'
43
51
1
26
311
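When the names are already in shell variables, the same number can be pulled out with POSIX parameter expansion alone, without starting grep or sed for each name. This is a sketch that assumes every name ends in the fixed suffix .311b.file, as in the question; the function name number_before_suffix is made up.

```shell
# Extract the number in the component just before ".311b.file"
# (number_before_suffix is a hypothetical helper name).
number_before_suffix() {
    base=${1%.311b.file}               # drop the fixed suffix
    last=${base##*.}                   # keep the last dot-separated component
    printf '%s\n' "${last##*[!0-9]}"   # drop everything up to the last non-digit
}

for name in aaabbcc.43.311b.file ddeeff.x51.311b.file ll.m.311.311b.file; do
    number_before_suffix "$name"
done
```

The loop prints 43, 51 and 311, one per line.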