Retrieving digits from multiple file names using regex

Given files:
aaabbcc.43.311b.file
ddeeff.x51.311b.file
ffg.1.311b.file
hh.ii.jj.x26.311b.file
ll.m.311.311b.file
How would I get the numbers within the file name but not 311b? So I would like to get 43, 51, 1, 26 and 311.

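You can do it with grep. The trailing \b requires a word boundary right after the digits, so the 311 in 311b is skipped (\+ and \b are GNU extensions):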
grep -o '[0-9]\+\b' test.text

sed 's#[^0-9]\+\([0-9]\+\).*#\1#' INPUTFILE
This gives the desired output for the example lines. It finds the first run of digits on each input line and prints only that run.

% ls
aaabbcc.43.311b.file ddeeff.x51.311b.file ffg.1.311b.file hh.ii.jj.x26.311b.file ll.m.311.311b.file
% ls|grep -o -P '\d+(?=\.311b\.file)'
43
51
1
26
311
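
Where grep -P is not available, the same idea (take the digits directly in front of the fixed .311b.file suffix) can be sketched with sed -E. The helper name extract_number is hypothetical, and the suffix is assumed to be constant, as it is in the question.

```shell
# Hypothetical helper: print the run of digits that sits directly in front
# of the ".311b.file" suffix (assumed to be the same for every file name).
# The pattern expects at least one non-digit character before the digits.
extract_number() {
    printf '%s\n' "$1" | sed -nE 's/.*[^0-9]([0-9]+)\.311b\.file$/\1/p'
}

# File names taken from the question.
for name in aaabbcc.43.311b.file ddeeff.x51.311b.file ffg.1.311b.file \
            hh.ii.jj.x26.311b.file ll.m.311.311b.file; do
    extract_number "$name"
done
```

Because the suffix is matched literally, the last example yields the 311 in front of the suffix rather than the 311 inside 311b.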

Related

Using grep, how to match beginning of line with pattern from stdin

I have a one-liner that prints out a series of numbers:
124
132
186
I then pipe this output into grep to match these numbers against the beginning of lines in another file. Sometimes, however, the second number in a line matches one of the patterns and I get an incorrect match, like so:
$ get_id_command | grep -f - users.list
124 => 3456, Charles Charmichael, ccharmichael
132 => 2498, Sarah Walker, swalker
186 => 8934, John Casey, jcasey
240 => 1245, Morgan Grimes, mgrimes
What options do I need for grep to only match patterns at the beginning of the line? I would really like to keep this as a one-liner.
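Prepend a circumflex to each pattern, not to the lines of users.list. A circumflex anchors a pattern to the start of the line, and here the patterns are the numbers arriving on stdin, so add the circumflex inside the pipeline, e.g.
$ get_id_command | sed 's/^/^/' | grep -f - users.list
This leaves users.list untouched and only matches the numbers at the beginning of a line. Note that ^124 would still match a line starting with 1245; appending a space to each pattern (sed 's/.*/^& /') avoids that.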
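
A minimal sketch that anchors each incoming pattern on the fly and leaves users.list untouched. The wrapper name match_ids is hypothetical, printf stands in for get_id_command, and the data is copied from the question into a temporary file.

```shell
# Hypothetical wrapper: read one ID per line on stdin, turn each into the
# pattern "^ID " (anchored at line start, followed by a space), then search
# the file given as the first argument.
match_ids() {
    sed 's/.*/^& /' | grep -f - "$1"
}

# Demo data copied from the question, written to a temporary file.
users=$(mktemp)
cat > "$users" <<'EOF'
124 => 3456, Charles Charmichael, ccharmichael
132 => 2498, Sarah Walker, swalker
186 => 8934, John Casey, jcasey
240 => 1245, Morgan Grimes, mgrimes
EOF

# printf stands in for get_id_command here.
result=$(printf '124\n132\n186\n' | match_ids "$users")
printf '%s\n' "$result"
rm -f "$users"
```

The trailing space in each pattern is a design choice for this data layout, where every ID is followed by " =>"; with a different separator the sed expression would need adjusting.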

Phone Numbers in separate lines in UNIX

On UNIX, I have a sample file, and I want all the phone numbers starting with 987 written to another file as a list.
That is, if a single row contains two phone numbers, they should end up on separate lines.
Sample file contents:
ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212
sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278
9874561231
This should work:
echo "ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212 sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278 9874561231" > sample.txt
egrep -o '987([0-9]+)' sample.txt
which returns:
9871177454
9877745212
9874789645
9874561231
Or, to be specific to 10-digit phone numbers (987 followed by exactly 7 more digits):
egrep -o '987([0-9]{7})' sample.txt
which returns the same four numbers for this sample.
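
If the input may contain longer digit runs, the pattern can be made stricter by requiring each match to be a whole word, and the result can be redirected into a second file as the question asks. A minimal sketch, assuming a grep that supports -o and -w (GNU and BSD grep do); the scratch directory and the output name phones.txt are made up for the example.

```shell
# Use a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Sample rows from the question.
cat > "$workdir/sample.txt" <<'EOF'
ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212
sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278
9874561231
EOF

# -o prints each match on its own line; -w only accepts matches that are
# whole words, so 987 followed by more than 7 digits is not reported.
grep -Eow '987[0-9]{7}' "$workdir/sample.txt" > "$workdir/phones.txt"
cat "$workdir/phones.txt"
```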

Removing Leading 0 and applying Regex to Sed

I have several file names; for ease, I've put them in a file as follows:
01.action1.txt
04action2.txt
12.action6.txt
2.action3.txt
020.action9.txt
10action4.txt
15action7.txt
021action10.txt
11.action5.txt
18.action8.txt
As you can see, the formats aren't consistent. What I'm trying to do is extract the leading numbers from these file names: 1, 4, 12, 2, 20, etc.
I have the following regex:
(\.)?action\d{1,}.txt
It successfully matches .action[number].txt, but I also need to match the leading 0 and use the whole thing in a sed substitution that replaces the match with nothing, so I'm left with only the leading numbers. I'm having trouble matching the leading 0 and applying the whole thing in sed.
Thanks
With GNU sed:
sed -r 's/0*([0-9]*).*/\1/' file
Output:
1
4
12
2
20
10
15
21
11
18
See: The Stack Overflow Regular Expressions FAQ
The following awk also works. Adding 0 forces a numeric conversion of the leading digits of the name, which drops any leading zeros:
awk '{print $1 + 0}' file
1
4
12
2
20
10
15
21
11
18
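
One edge case worth noting: with 0*([0-9]*), a name whose number is just 0 (for example a hypothetical 0.action0.txt) comes out as an empty line. A small variation that requires at least one digit inside the group keeps that zero. The name leading_number is illustrative, and sed -E is assumed to be available (GNU and BSD sed both accept it).

```shell
# Hypothetical filter: strip leading zeros from the number at the start of
# each line, but keep a single 0 when the number itself is zero.
# Lines that do not start with a digit are passed through unchanged.
leading_number() {
    sed -E 's/^0*([0-9]+).*/\1/'
}

# Three names from the question plus one made-up name with a zero prefix.
printf '%s\n' 01.action1.txt 020.action9.txt 10action4.txt 0.action0.txt |
    leading_number
```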

Regular expressions in shell script

I am trying to find a line in a log file using a regular expression. With the command below I get the expected output. Platform: Solaris, Shell: Bash
grep 14:[00-29]
Output: Apr 02 14:07:35 [192.168.162.117.113.169]
But with the command below I get no output:
grep 14:[00-29]:[00-59].
Am I missing something?
[00-29] matches only the characters 0, 1, 2 and 9.
[00-59] matches only the characters 0, 1, 2, 3, 4, 5 and 9.
The [] construct creates a character class, not a numeric range.
You might want grep -E 14:[0-2][0-9].
Your regex is not doing what you think. What it actually says is: 14:[ONE character that is 0, 1, 2 or 9]:[ONE character that is 0 through 5, or 9]. You should change it to 14:([0-9]|[0-2][0-9]):([0-9]|[0-5][0-9]) and run it with grep -E so that the grouping and alternation take effect.
You need a different regex. For example, this works:
grep "14:[0-2][0-9]:[0-5][0-9]" file
          ^^^  ^^^   ^^^  ^^^
           |    |     |   any digit
           |    |     from 0 to 5
           |    any digit
           from 0 to 2
See the output:
$ grep "14:[0-2][0-9]:[0-5][0-9]" file
Apr 02 14:07:35 [192.168.162.117.113.169]
The first one matched, but only by coincidence:
$ grep 14:[00-29] file
Apr 02 14:07:35 [192.168.162.117.113.169]
          ^^
          ||
          |7 is not matched
          [00-29] matches this 0
Both of these regexes:
14:[00-29]
and
14:[00-29]:[00-59]
are incorrect; they do not match values from 00 to 29 or from 00 to 59, because a bracket expression matches a single character.
For the range 00 to 59 you can use:
\b(0[0-9]|[1-5][0-9])\b
And for the range 00 to 29:
\b(0[0-9]|[12][0-9])\b
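
To see the difference on actual input, here is a small sketch that checks both the character-class behaviour and a corrected pattern. The second log line is made up for the test, and the patterns are quoted so the shell does not try to expand the brackets as a glob.

```shell
# [00-29] is a set of single characters (0, 1, 2, 9), so a lone "9" matches
# and a lone "7" does not.
class_hits=$(printf '9\n7\n' | grep -c '[00-29]')

# Corrected pattern: hour 14, minutes 00-29, seconds 00-59.
pattern='14:[0-2][0-9]:[0-5][0-9]'

# The first line is from the question; the second is a hypothetical line
# that falls outside the 14:00-14:29 window.
matches=$(printf '%s\n' \
    'Apr 02 14:07:35 [192.168.162.117.113.169]' \
    'Apr 02 14:45:10 [192.168.162.117.113.169]' | grep "$pattern")

printf 'lines matched by [00-29]: %s\n' "$class_hits"
printf '%s\n' "$matches"
```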

grep -c value NH:i:1 only for every line in file, not also NH:i:12

cat samtry.txt | grep -c NH:i:1
Below is an example of three lines; the field in bold is the important part.
HWI-ST697:178:D1U9CACXX:1:2111:12787:5687 153 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DCDDDDDDDDDDDEEEEEEEEFGHGJIHGHFHJIJIJJIJJJJIHJJIJIIIFJJIGGGIJJJIIJJHIGJIJJJGHJJIJIJIGFJJGHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:1**
HWI-ST697:178:D1U9CACXX:3:1310:18383:72540 89 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DDDDDDDDDDDDDEEEEEEFFFHHHIIJJIIIJIJJJJJJJJJJHJJJJJJJJJJJJJIJJJJJJJJIJJJIJJIJJJJJJJJIHFJJHHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:11**
HWI-ST697:178:D1U9CACXX:7:1212:17559:76798 89 scaffold_1 33007 50 101M * 0 0 CTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACAAG DDDDDDDDDDDDDEEEECDFFHGHIGJIIHJJJIIJJJJJJHHJJJJJJJJJJJIIIJJJJGIIGBJJIJJJJIJJJJJIHHHFJJIJHHHHGFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:16T26G57YT:Z:UU **NH:i:1**
I am trying to use a shell script to count all the lines in a tab-delimited file (test file: samtry.txt, which contains 10 lines to test on) that match the regular expression NH:i:1.
The problem is that I do get the lines I want, but the count also includes lines containing NH:i:1x (where x is any digit 0-9).
The NH:i:x field (x being a number up to around 50) is at position 20 in every line of the file; it is not the last position in the line. Every line has 23 'positions'.
Does anyone know how to do this with grep or another tool?
I have around 100 files of roughly 3 GB each, and I don't know how to solve this problem.
I hope I have given enough information, and I am grateful for any answer.
Try grep with word boundaries:
grep -c '\<NH:i:1\>' samtry.txt
Or use grep -w:
grep -wc 'NH:i:1' samtry.txt
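
Another option is to compare whole tab-separated fields, which sidesteps word-boundary rules entirely. The sketch below uses awk; count_nh1 is a hypothetical name, and the three shortened records are invented stand-ins for real lines.

```shell
# Hypothetical helper: count lines that contain a field exactly equal to
# "NH:i:1" (fields are tab-separated, as in the question's file).
# With several file arguments it prints one total across all of them.
count_nh1() {
    awk -F '\t' '{
        for (i = 1; i <= NF; i++)
            if ($i == "NH:i:1") { n++; break }
    } END { print n + 0 }' "$@"
}

# Shortened, made-up records standing in for real lines.
sample=$(mktemp)
printf 'read1\tscaffold_1\tNH:i:1\n'          >  "$sample"
printf 'read2\tscaffold_1\tNH:i:11\n'         >> "$sample"
printf 'read3\tscaffold_1\tNH:i:1\tXS:A:+\n'  >> "$sample"

count=$(count_nh1 "$sample")
echo "$count"
rm -f "$sample"
```

For the roughly 100 files mentioned in the question, the helper could be called once per file inside a for loop to get a separate count for each.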