Not able to match colon using grep regexp

I have a large ASCII file. Each line contains a field like:
"id":"N119PM-1442267121-144-0"
The double quotes are actually in the file, not my addition. The fields are delimited by commas but they do not necessarily appear in the same order from line to line, which means that using cut is not a viable option.
I have been using:
grep -o '"id":"[A-Aa-z0-9-]\+' <filename>
and it works for the type of field shown above. But, there is a problem. A large number of these fields look like
"id":"JBU19-1442091600-schedule-0000:4"
In other words, they have an extra colon and number at the end. I have not been able to select fields with these extra characters.
I've tried:
grep -o '"id":"[A-Aa-z0-9:-]\+' <filename>
grep -o '"id":"[A-Aa-z0-9\:-]\+' <filename>
grep -o '"id":"[A-Aa-z0-9-]\+\(:[0-9]\+\)' <filename>
without success. Any help would be appreciated.
EDIT: I have also tried changing the : to % first and then searching for %, but this didn't work either.

If you are using GNU grep, you can use the -P option:
grep -oP '"id":"[A-Za-z0-9-:]+"' <filename>
"id":"N119PM-1442267121-144-0"
"id":"JBU19-1442091600-schedule-0000:4"
The -P (--perl-regexp) option tells grep to interpret the pattern as a Perl-compatible regular expression.
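If -P is not available, the same idea can probably be expressed with extended regular expressions. The following is only a sketch: the two sample records are made up to resemble the lines in the question, and it assumes a grep that supports -o and -E (GNU grep does). The hyphen is placed last in the bracket expression so that it is taken literally rather than as part of a range.

```shell
# Two sample records modelled on the question; the other fields are invented.
sample='{"x":1,"id":"N119PM-1442267121-144-0","y":2}
{"id":"JBU19-1442091600-schedule-0000:4","z":3}'

# -E lets + be written unescaped; the colon needs no escaping inside [...].
ids=$(printf '%s\n' "$sample" | grep -oE '"id":"[A-Za-z0-9:-]+"')
printf '%s\n' "$ids"
```

Requiring the closing double quote in the pattern means a value is only reported when every character up to that quote is in the allowed set.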

Related

grep -o multiple occurrences of variable string in same line

I have the following line of text in a file:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/), I checked Ronit's [DNased ctenidia RNA (from 20181016)](http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I would like to extract each of the strings that match this pattern:
(http://onsnetwork.org/kubu4/.*/)
I've tried the following command, but it returns the entire line, despite the -o flag:
grep -o "(http://onsnetwork.org/kubu4/.*/)" file.txt
The output I'd like is this:
(http://onsnetwork.org/kubu4/2018/10/16/qpcr-c-gigas-primer-and-gdna-tests-with-18s-and-ef1-primers/)
(http://onsnetwork.org/kubu4/2018/10/16/dnase-treatment-ronits-c-gigas-ploiyddessication-ctenidia-rna/)
I'll be applying the grep command to a series of files that will have different text after (http://onsnetwork.org/kubu4/, so the command needs to allow for that flexibility.
I'm just not sure why the regex portion of the grep causes grep to return the entire line instead of each matching occurrence.
The URLs are enclosed in parentheses, so match everything up to the closing parenthesis:
grep -o '(http://onsnetwork.org/kubu4/[^)]*/)' # So, [^)]* and not .*
With .*/, grep will extract from ( to the last /) encountered, because .* is greedy.
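A small sketch of the difference, using a shortened, made-up line with two links (the a/ and b/ paths are placeholders, not real posts):

```shell
line='(http://onsnetwork.org/kubu4/a/), text [x](http://onsnetwork.org/kubu4/b/)'

# Greedy .* runs to the last "/)" on the line, so both links come out as one match.
greedy=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/.*/)')

# [^)]* cannot cross a closing parenthesis, so each link is matched separately.
bounded=$(printf '%s\n' "$line" | grep -o '(http://onsnetwork.org/kubu4/[^)]*/)')

printf 'greedy:\n%s\nbounded:\n%s\n' "$greedy" "$bounded"
```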

Grep and Egrep options

When I use grep -ow it affects the regex so I'm wondering what the regex would be without these options
I know that:
-o prints only the part of the line that matches the pattern
-w selects only matches that form whole words
I'd like to convert egrep -ow '[1-9][0-9][0-9]+' text
to egrep '[1-9][0-9][0-9]+' text, but this regex gives the wrong result without the options.
You need to add word boundaries (\b) to replace -w:
egrep -o '\b[1-9][0-9][0-9]+\b' file
OR
Since egrep is deprecated, it's better to use grep with the -E option.
grep -Eo '\b[1-9][0-9][0-9]+\b' file
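To check that the two forms agree, here is a sketch on a made-up test string. It assumes GNU grep, where \b is supported as an extension.

```shell
text='abc 123 12 1234x 0123 4567'

# -w makes grep itself reject matches that touch other word characters.
with_w=$(printf '%s\n' "$text" | grep -Eow '[1-9][0-9][0-9]+')

# \b states the same requirement inside the pattern.
with_b=$(printf '%s\n' "$text" | grep -Eo '\b[1-9][0-9][0-9]+\b')

printf '%s\n' "$with_b"
```

In this sample only 123 and 4567 are standalone numbers of three or more digits that do not start with 0, and both commands report exactly those two.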

simulating tail -1 command by using grep command

I want to simulate the tail -1 command using grep, i.e. I want to print the last line of the file using grep. It can be done easily using sed or awk, but I couldn't find any option for it in grep.
Why on earth would you want to do that? There are better tools for this task, as everyone is suggesting.
This is the solution you wanted:
grep "^" -n filename | grep -Po "(?<=^$(grep -c "^" filename):)(.*)"
The trick is to display all lines with their line numbers (the -n option).
Then match the line whose number equals the line count of the file, i.e. the last line.
The grep -c "^" filename part gives the line count.
The -P option enables PCRE, which is needed for the positive lookbehind.
If you don't have access to -P (which I doubt), use another filter as follows, although it won't work for lines containing a : character:
grep "^" -n filename | grep "^$(grep -c "^" filename):" | grep -o "[^:]*$"
The reason behind this post is to show that this can be done using only grep.
Moral: use tail -1! (It's highly recommended.)
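For anyone who wants to try the grep-only approach, here is a sketch on a throwaway file. The file contents are made up, and it assumes GNU grep built with PCRE support for -P.

```shell
tmp=$(mktemp)
printf 'first\nsecond: with colon\nlast: line\n' > "$tmp"

# Number of lines in the file (every line matches ^).
count=$(grep -c '^' "$tmp")

# Prefix each line with "N:", then keep only the text after "<count>:".
last=$(grep -n '^' "$tmp" | grep -Po "(?<=^${count}:).*")

printf '%s\n' "$last"
rm -f "$tmp"
```

Because the lookbehind only consumes the numeric prefix, colons inside the line itself are preserved.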

Matching arbitrary number of digits using grep regex

I've got a file with lines that look similar to the following:
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.
Use + instead of *.
+ matches one or more of the preceding element.
* matches zero or more.
^[Dd]ata[0-9]+later$
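In grep's default (basic) syntax you need to escape the + as \+. The example also uses \d, a Perl-style shorthand for a single digit; not every grep understands \d in the default syntax, so fall back to [0-9] if it does not match.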
^[Dd]ata\d\+later$
In your example file you also have a line:
datafhj893724897290384later
This currently will not be matched because there are letters between data and the numbers. We can fix this by adding [^0-9]* to match anything after data up to the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*\d\+later$' filename
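A variant that avoids \d altogether may be more portable. This sketch swaps in [0-9] and -E (so + needs no escaping) and runs against the sample lines from the question:

```shell
sample='data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later'

# [^0-9]* skips any non-digits after "data"; [0-9]+ then demands at least one digit.
matches=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[^0-9]*[0-9]+later$')
printf '%s\n' "$matches"
```

datalater is rejected because nothing satisfies [0-9]+, which is exactly the difference between + and * here.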
Using Cygwin, the above commands didn't work. I had to modify the commands given above to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag. I also dropped the [^0-9]*, whose necessity @Tom_Cammann aptly pointed out, and used .* instead, which matches any sequence of characters and gives characters back as needed so that the rest of the pattern can still match. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$
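One detail that probably explains the empty result of the third command: with -P, \+ is a literal plus sign, not a quantifier. A sketch on two made-up lines, assuming GNU grep (where \+ in the default syntax means "one or more"):

```shell
lines='data1later
data1+later'

# Default (basic) syntax in GNU grep: \+ is the "one or more" quantifier.
bre=$(printf '%s\n' "$lines" | grep 'data[0-9]\+later')

# PCRE syntax: \+ is an escaped, literal plus sign.
pcre=$(printf '%s\n' "$lines" | grep -P 'data[0-9]\+later')

printf 'basic: %s\nperl:  %s\n' "$bre" "$pcre"
```

So a pattern written for the default syntax has to drop the backslash before + when it is moved to -P.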
You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. You were also finding commas at the beginning of the string (e.g. ",ata1234later"). And \d is a shortcut to finding any digit character. So I changed those as well.
You should use "+" (which means one or several) instead of "*" (which means zero, one or several).
The unescaped "+" syntax only works with extended regular expressions, not with standard grep.
At least, that's my experience on RHEL.
To use extended regular expressions, run egrep or pass "-E" / "--extended-regexp".
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH
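For completeness, standard grep can also express "one or more" without -E. This is a sketch: the \+ form is a GNU extension, while the \{1,\} interval is part of POSIX basic regular expressions.

```shell
# GNU extension: escaped plus in the default (basic) syntax.
escaped=$(echo abc123n1 | grep 'abc[0-9]\+n1')

# POSIX interval expression: at least one repetition.
interval=$(echo abc123n1 | grep 'abc[0-9]\{1,\}n1')

printf '%s\n%s\n' "$escaped" "$interval"
```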
grep -Eio "^(data)[0-9]+(later)$"
^[dD]ata=^d later$=r$
🎯 MOTIVATION
The other answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
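Applied to the data/later question, the template above might look like the following sketch, with digit substituted for group and the sample lines taken from the question:

```shell
# [[:digit:]] is the POSIX class for 0-9; + requires at least one of them.
digits_only=$(printf '%s\n' data datalater 983290842 Data387428later \
  | grep --extended-regexp '^[Dd]ata[[:digit:]]+later$')

printf '%s\n' "$digits_only"
```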

Having trouble with GREP and REGEX

I have a text file that stores combinations of four numbers in the following format:
Num1,Num2,Num3,Num4
Num5,Num6,Num7,Num8
.............
I have a whole bunch of such files, and what I need is to grep for all filenames that contain the pattern described above.
I constructed my grep as follows:
grep -l "{d+},{d+},{d+},{d+}" /some/path/to/file/name
The grep terminates without returning anything.
Can somebody point out what I might be doing wrong with my grep statement?
Thanks
This should do what you want:
egrep -l '[[:digit:]]+,[[:digit:]]+,[[:digit:]]+,[[:digit:]]+' /some/path/to/file/name
One way is using a perl regexp:
grep -Pl "\d+,\d+,\d+,\d+" /some/path/to/file/name
In your pattern, d is a literal letter d and the braces do not group anything. The digit shorthand would have to be written \d, but that is not accepted by grep's default regular expressions.
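A quick way to try this out is on a couple of throwaway files. In this sketch the file names and contents are made up, and the pattern is additionally anchored with ^ and $ so that a line must consist of exactly four comma-separated numbers; that anchoring is an assumption about what is wanted, not part of the answers above.

```shell
dir=$(mktemp -d)
printf '1,2,3,4\n5,6,7,8\n' > "$dir/numbers.txt"
printf 'a,b,c,d\n1,2,3\n'   > "$dir/other.txt"

# -l prints only the names of files that contain at least one matching line.
found=$(cd "$dir" && grep -lE '^[0-9]+,[0-9]+,[0-9]+,[0-9]+$' numbers.txt other.txt)

printf '%s\n' "$found"
rm -rf "$dir"
```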