Grep does not show results, online regex tester does - regex

I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?

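Your regex does match the input in an engine that understands \d, which is why the online tester highlights it. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* matches zero characters here (the next character is .), and \. then matches the dot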
Note that grep and family don't support the \d syntax by default. Try either [0-9] or [[:digit:]] instead.
Finally, when using regular expressions, prefer egrep (or grep -E) to grep; it supports more regex operators, such as the {4} interval, without backslashes. Also, in many shells (including bash on OS X, as you mentioned), use single quotes around the pattern. Left unquoted, * can be expanded by the shell to a list of files in the current directory before grep sees it, and inside double quotes the shell still interprets $, backticks and some backslashes. Bash won't touch anything in single quotes.
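A minimal sketch of the two points above, assuming a POSIX-style shell and a grep that supports -E and -o (-o is available in GNU and BSD grep but is not required by POSIX). The file is recreated from the two sample lines in the question; the variable names sample and ids are made up for illustration.

```shell
# Recreate the two sample lines from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
EOF

# Single quotes keep the shell away from *, \ and [ ].
# [[:digit:]] (or [0-9]) replaces \d, which grep and grep -E do not understand.
ids=$(grep -E -o '[a-z][[:digit:]]{4}[a-z]*\.[a-z]*[[:digit:]]*' "$sample")
printf '%s\n' "$ids"

rm -f "$sample"
```

With the sample lines above, this should print the two identifiers after the slash, one per line.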

grep doesn't support \d by default. To match a digit, use [0-9], or enable Perl-compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml

grep uses "basic" regular expressions (excerpt from the man page):
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also, depending on which shell you are executing in, an unquoted '*' character might get expanded.
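To make the man page excerpt concrete, here is a small sketch that counts matching lines for the same interval written three ways. It assumes a POSIX-style shell; the variable names are hypothetical.

```shell
line='g1234.ab012345'

# BRE (plain grep): the interval needs backslashes.
bre_escaped=$(printf '%s\n' "$line" | grep -c '[0-9]\{4\}')

# BRE with a bare {4}: the braces are literal characters, so nothing matches.
# grep exits non-zero when there is no match; "|| true" keeps the script going.
bre_bare=$(printf '%s\n' "$line" | grep -c '[0-9]{4}' || true)

# ERE (grep -E): the bare interval works.
ere_bare=$(printf '%s\n' "$line" | grep -E -c '[0-9]{4}')

printf 'BRE escaped: %s, BRE bare: %s, ERE bare: %s\n' \
    "$bre_escaped" "$bre_bare" "$ere_bare"
```

The middle case is the one that bites in the question: plain grep silently looks for a digit followed by the literal text {4}.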

You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$

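Try this with grep -P (the sample identifiers have four digits after the first letter, optionally followed by one more letter):
[a-z]\d{4}[a-z]?[.][a-z]{2}\d{6}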

Try this expression with grep -P:
[a-z]\d{4}[a-z]*\.[a-z]*\d*

Related

Why doesn't grep -n "[^aeiou]+" return lines that don't contain vowels when they exist?

I am running the bash shell on Android using Termux.
The aim is to print characters or words that don't contain any vowels.
Sequence of commands typed:
$ cat f4
a
b
c
bb
$ grep -n "[^aeiou]+" f4
$
I am unable to understand why the regular expression is not giving the expected output.
Actually, in GNU grep you don't need to enable -E for extended regular expression support; just backslash-escape the + to give it its special meaning:
grep -n "[^aeiou]\+" file
2:b
3:c
4:bb
Quoting from the page Basic vs Extended Regular Expressions,
In basic regular expressions the meta-characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning; instead use the backslashed versions ‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’.
Traditional egrep did not support the ‘{’ meta-character, and some egrep implementations support ‘\{’ instead, so portable scripts should avoid ‘{’ in ‘grep -E’ patterns and should use ‘[{]’ to match a literal ‘{’.
Also you can simply enable the -E, --extended-regexp flag in GNU grep for that
grep -En "[^aeiou]+" file
2:b
3:c
4:bb
Refer to the Bracket Expressions section of the same page.
First: + is an ERE extension. An equivalent BRE command might look like:
grep '[^aeiou]\{1,\}'
...or you can add the -E argument or use egrep to enable such extensions.
Second: If your aim is to find words with no vowels, rather than simply words that contain at least one non-vowel character, you need to anchor your regex:
grep '^[^aeiou]\{1,\}$'
or, as an ERE,
grep -E '^[^aeiou]+$'
The ^ on the front and the $ on the back are anchors: they ensure that what you're matching goes all the way from the start of the line to the end of it, rather than merely existing somewhere in the line.
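The difference the anchors make can be seen with a short sketch. It uses the file contents from the question plus one extra line, ab, which is an addition for illustration only; the variable names are likewise hypothetical.

```shell
# The lines from the question's f4, plus "ab" (a vowel next to a consonant).
words=$(printf '%s\n' a b c bb ab)

# Unanchored: any line containing at least one non-vowel character.
unanchored=$(printf '%s\n' "$words" | grep -En '[^aeiou]+')

# Anchored: the whole line must consist of non-vowel characters.
anchored=$(printf '%s\n' "$words" | grep -En '^[^aeiou]+$')

printf 'unanchored:\n%s\nanchored:\n%s\n' "$unanchored" "$anchored"
```

The unanchored pattern also reports 5:ab, because the b alone satisfies it; the anchored one stops at 4:bb.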

Grep hashes via regex in bash

I want to grep for hexadecimal hashes in strings and only extract those hashes.
I've tested a regex in online regex testing tools that does the trick:
\b[0-9a-f][0-9a-f]+[0-9a-f]\b
The \b sets word boundaries (start and end) around a run of characters that must each be 0-9 or a-f. Since I do not know whether the hashes are 128-bit or longer, I do not know their length in advance. Therefore I put [0-9a-f]+ in the middle in order to match any number of [0-9a-f], but at least one (since no hash consists of just the two characters next to the boundaries \b).
However, I noticed that
grep --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
does not work in the shell, while the regex \b[0-9a-f][0-9a-f]+[0-9a-f]\b works in online regex testing tools.
In fact, the shell version does only work if I escape the quantifier + with a backslash:
grep --only-matching -e "\b[0-9a-f][0-9a-f]\+[0-9a-f]\b"
                                            ^
                                            |_ escaped +
Why does grep need this escaping in the shell?
Is there any downside of my rather simple approach?
I don't know why a metacharacter would need to be escaped in bash, but your regex could be rewritten like this (with -E, so that the {3,} interval is recognized):
grep -E --only-matching -e "\b[0-9a-f]{3,}\b"
The + quantifier is not part of the POSIX Basic Regular Expressions (aka BRE) so you must escape it with grep in BRE mode.
As an alternative, you can:
add the -E flag to grep:
grep -E --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
use [0-9a-f][0-9a-f]* or [0-9a-f]\{1,\} in place of [0-9a-f]+
Grep uses basic regular expressions by default. You need to escape the + quantifier with a backslash, as the documentation says:
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
Also, there is no need for -e option, just
grep -o '\b[0-9a-f]\+\b' file
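On the question of downsides, one that is easy to demonstrate is that short English words made only of the letters a-f also look like hex. The sketch below assumes a grep that supports -o and \b (both are extensions found in GNU grep, not POSIX); the sample text and variable names are invented, and the hash shown is simply a 32-character hex string used as test data.

```shell
text='added file d41d8cd98f00b204e9800998ecf8427e to the index'

# Three or more hex characters between word boundaries: also catches "added".
loose=$(printf '%s\n' "$text" | grep -E -o '\b[0-9a-f]{3,}\b')

# Requiring at least 32 characters (a 128-bit hash written in hex)
# filters out short dictionary words.
strict=$(printf '%s\n' "$text" | grep -E -o '\b[0-9a-f]{32,}\b')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

If the shortest hash you expect is 128 bits, a lower bound of 32 is a reasonable assumption; adjust it if shorter digests can occur.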

Grep Regex Exclusion Special Character

I am having a difficult time trying to search for a phrase but exclude the phrase if it is directly followed by a colon-space.
I am looking for Delet! (i.e. "Delet.*" in regex syntax) but I do not want anything returned that is "Deleted: " (includes a space after the colon). However, I would like anything returned that is "Deleted" followed by anything other than a colon-space.
I have tried the following expressions
grep -ri 'delet.*[^:]'
grep -ri 'delet[a-zA-Z0-9\;\".....]{0,10}'
(including all special characters in the range preceded by escapes)
Using a lookahead expression:
grep -Pi 'Delet(?!ed: )'
Note the modification of the parameters of grep: -P enables the use of lookahead expressions.
Try this, using -P because plain grep has no lazy quantifiers. The ? after the * instructs it to select as few non-space characters as possible, followed by any one character that is not a colon, followed by a space.
grep -rPi 'delet[^ ]*?[^:] '
If I understood you correctly, you want anything starting with delet, but not starting with "deleted: ":
grep -Ei '^delet((([^e]|e$)|e([^d]|d$)|ed([^:]|:$)|ed:[^ ]).*)?$'
This basically says:
Match [start]deletX[anything][end] or [start]delete[end] where X is not e
Match [start]deleteX[anything][end] or [start]deleted[end] where X is not d
Match [start]deletedX[anything][end] or [start]deleted:[end] where X is not :
Match [start]deleted:X[anything][end] where X is not space.
It would be far easier to use a pipe and a second, negated grep, if that is applicable:
grep -i ^delet | grep -vi '^deleted: '
It sounds like all you need is:
awk -v IGNORECASE=1 '/delet/ && !/deleted: /' file
The above uses GNU awk for IGNORECASE, other awks would use tolower().
The benefit of awk over grep is that awk tests for conditions, not just regexps, so you can create compound conditions using && and || out of tests for regexps which makes it MUCH simpler and clearer to just code the condition you want to test - that the line contains delet and (&&) not (!) deleted:.
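For comparison, here is a sketch of the two simplest approaches from the answers above run on the same input: the two-grep pipe (without the ^ anchor) and an awk condition that uses tolower() so it does not depend on GNU awk. The sample lines and variable names are made up for illustration.

```shell
lines=$(printf '%s\n' 'Deleted: old entry' 'Deleted file foo' \
                      'please delete this' 'nothing here')

# Two greps: keep lines containing "delet", then drop those with "deleted: ".
piped=$(printf '%s\n' "$lines" | grep -i 'delet' | grep -vi 'deleted: ')

# One awk call with a compound condition; tolower() stands in for IGNORECASE.
awked=$(printf '%s\n' "$lines" |
    awk '{ l = tolower($0) } l ~ /delet/ && l !~ /deleted: /')

printf '%s\n' "$piped"
```

Both variants judge whole lines, so a line that contains "Deleted: " anywhere is dropped even if "delet" also appears elsewhere on it. If that matters, the lookahead form with grep -P is the more precise tool.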

How can I grep for a string that contains multiple consecutive dashes?

I want to grep for a string that contains dashes, like this:
---0 [58ms, 100%, 100%]
There's at least one dash.
I found this question: How can I grep for a string that begins with a dash/hyphen?
So I want to use:
grep -- -+ test.txt
But I get nothing.
Finally, my colleague tells me that this will work:
grep '\-\+' test.txt
Yes, it works. But neither he nor I know why, even after searching many documents.
This also works:
grep -- -* test.txt
With -+ you mean: one or more -. But grep does not understand this automatically. You need to tell it that + has a special meaning.
You can do it by using an extended regex -E:
grep -E -- "-+" file
or by escaping the +:
grep -- "-\+" file
Test
$ cat a
---0 [58ms, 100%, 100%]
hell
$ grep -E -- "-+" a
---0 [58ms, 100%, 100%]
$ grep -- "-\+" a
---0 [58ms, 100%, 100%]
From man grep:
REGULAR EXPRESSIONS
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
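The following sketch shows what each spelling actually searches for. It adds one line, a-+b, to the test input purely for illustration, and the variable names are hypothetical.

```shell
input=$(printf '%s\n' '---0 [58ms, 100%, 100%]' 'hell' 'a-+b')

# BRE: + is an ordinary character, so this looks for the literal text "-+".
bre_plus=$(printf '%s\n' "$input" | grep -- '-+')

# ERE: + means "one or more", so every line containing a dash matches.
ere_plus=$(printf '%s\n' "$input" | grep -E -c -- '-+')

# BRE: * means "zero or more", so -* matches every line, even "hell".
bre_star=$(printf '%s\n' "$input" | grep -c -- '-*')

printf 'BRE -+ matched: %s\nERE -+ count: %s\nBRE -* count: %s\n' \
    "$bre_plus" "$ere_plus" "$bre_star"
```

This also explains why grep -- -* "works" in the question: it matches zero dashes just as happily, so it does not filter anything.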

How do you use a plus symbol with a character class as part of a regular expression?

In Cygwin, this does not return a match:
$ echo "aaab" | grep '^[ab]+$'
But this does return a match:
$ echo "aaab" | grep '^[ab][ab]*$'
aaab
Are the two expressions not identical?
Is there any way to express "one or more characters of the character class" without typing the character class twice (like in the second example)?
According to this link the two expressions should be the same, but perhaps Regular-Expressions.info does not cover bash in cygwin.
grep has multiple "modes" of matching, and by default only uses a basic set, which does not recognize a number of metacharacters unless they're escaped. You can put grep into extended or perl modes to let + be evaluated.
From man grep:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly experimental and grep -P may warn of unimplemented features.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
Alternately, you can use egrep instead of grep -E.
In basic regular expressions the metacharacters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
So use the backslashed version:
$ echo aaab | grep '^[ab]\+$'
aaab
Or activate extended syntax:
$ echo aaab | egrep '^[ab]+$'
aaab
Masking by backslash, or egrep as extended grep, alias grep -E:
echo "aaab" | egrep '^[ab]+$'
aaab
echo "aaab" | grep '^[ab]\+$'
aaab
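As a side-by-side sketch of the spellings discussed above, the snippet below counts how many of two test lines each pattern accepts. The helper name count and the second test line aaac are assumptions added for illustration; \+ in a basic regular expression is a GNU extension, while the \{1,\} interval is plain POSIX.

```shell
# Count how many of the two test lines a given grep invocation accepts.
# "aaab" should match; "aaac" should not.
count() { printf '%s\n' aaab aaac | grep -c "$@"; }

posix_bre=$(count '^[ab]\{1,\}$')   # interval: portable BRE
gnu_bre=$(count '^[ab]\+$')         # \+ : GNU extension to BRE
ere=$(count -E '^[ab]+$')           # +  : standard in ERE

printf 'posix_bre=%s gnu_bre=%s ere=%s\n' "$posix_bre" "$gnu_bre" "$ere"
```

None of these requires typing the character class twice; which one to prefer mostly depends on whether the script has to run with a non-GNU grep.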