How to use square brackets in grep for MINGW64? - regex

Currently, I have a following regex. It should match a string that I am echoing:
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -E 'TBGFSGFI\d\d\d\d\d[A-Za-z_]{1,100}'
It works as expected in OsX on my Mac and in Notepad++, but in Bash for windows (MINGW64) I get an empty string. How can I use the grep with flags, or how should I rewrite the regex to match the pattern?
My grep version is 3.1. Bash: 4.4.23(1)
Thanks for help in advance!

You are using a POSIX ERE regex with the -E option, and that flavor does not support \d construct. You also need -o option to actually extract the matches.
Note you do not need to repeat \d five times, you can use a range quantifier, \d{5}.
You can use
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -Po "TBGFSGFI\d{5}[A-Za-z_]{1,100}"
Where
-P means the regex is of a PCRE flavor
-o extracts matches only
TBGFSGFI\d{5}[A-Za-z_]{1,100} - a regex that matches TBGFSGFI, then any five digits and then 1-100 ASCII letters or _.

Related

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not preview (doesn't have text after version number) , so result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
what I could be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If you input file is unsorted then use grep | sort -V | head as:
grep -E '^v[0-9]+(\.[0-9]+){2}$' f | sort -rV | head -1
When you use ^ or $ inside [...] they are treated a literal character not the anchors.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ dots. Repeat this group 2 times
$: End
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] in your pattern, and to escape the dot to match it literally.
With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of awk code: Simple explanation of awk program would be, using match function of awk to match regex to match version, once match is found print the matched value and exit from program.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ has special meaning in double quoted strings in shell scripts and needs to be escaped.
The boundary characters need to be outside of any group ([]).

Grep hashes via regex in bash

I want to grep for hexadecimal hashes in strings and only extract those hashes.
I've tested a regex in online regex testing tools that does the trick:
\b[0-9a-f][0-9a-f]+[0-9a-f]\b
The \b is used to set word boundaries (start & end) that should be any character 0-9 or a-f. Since I do not know if the hashes are 128bit or higher, I do not know the length of the hashes in advance. Therefore I set [0-9a-f]+ in the middle in order match any number of [0-9a-f], but at least one (since no hash consists just of two characters that are checked with the boundaries \b).
However, I noticed that
grep --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
does not work in the shell, while the regex \b[0-9a-f][0-9a-f]*[0-9a-f]\b works in online regex testing tools.
In fact, the shell version does only work if I escape the quantifier + with a backslash:
grep --only-matching -e "\b[0-9a-f][0-9a-f]\+[0-9a-f]\b"
^
|_ escaped +
Why does grep needs this escaping in the shell?
Is there any downside of my rather simple approach?
I don't know why a metacharacter would need to be escaped in the bash, but your regex could be rewritten as this:
grep --only-matching -e "\b[0-9a-f]{3,}\b"
The + quantifier is not part of the POSIX Basic Regular Expressions (aka BRE) so you must escape it with grep in BRE mode.
As an alternative, you can:
add the -E flag to grep:
grep -E --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
use [0-9a-f][0-9a-f]* or [0-9a-f]{1,}
Grep runs basic regular expressions by default. You need to escape the + quantifier with a backslash as it is said in the documentation:
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
Also, there is no need for -e option, just
grep -o '\b[0-9a-f]\+\b' file

Matching string with regex

I keep banging my head against the wall looking for a regex that matches a string like any of these:
--7928ae02-A--
--7928ae02-B--
--7928ae02-F--
--7928ae02-H--
--7928ae02-Z--
the string is two dashes, 8 characters of any letter or number, a dash, an uppercase A-Z and two dashes.
Here's any example of where I'm at:
grep '^--[a-fA-F0-9]{8}-[A-Z]--$'
This might work
grep -E -- '^--[[:alnum:]]{8}-[[:upper:]]--$'
You probably should use the
grep -P '^--[a-fA-F0-9]{8}-[A-Z]--$'
with the -P or -E flag. -P interprets the regex as a Perl regex, -E runs the regex in extended mode. This regex allows all kinds of bells and whistles Like the {}-operator.
Running this on your given tests, they all pass.

Grep does not show results, online regex tester does

I am fairly unexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
Your regex doesn't match the input. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* doesn't match .
Also, I believe grep and family don't like the \d syntax. Try either [0-9] or [:digit:]
Finally, when using regular expressions, prefer egrep to grep. I don't remember the exact details, but egrep supports more regex operators. Also, in many shells (including bash on OS X as you mentioned, use single quotes instead of double quotes, otherwise * will be expanded by the shell to a list of files in the current directory before grep sees it (and other shell meta-characters will get expanded too). Bash won't touch anything in single quotes.
grep doesn't support \d by defaul. To match a digit, use [0-9], or allow Perl compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
grep uses "basic" regular expressions : (excerpt from man pages )
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also depending on which shell you are executing in the '*' character might get expanded.
You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] inplace of \d and use -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*

Regex to match unique substrings

Here's a basic regex technique that I've never managed to remember. Let's say I'm using a fairly generic regex implementation (e.g., grep or grep -E). If I were to do a list of files and match any that end in either .sty or .cls, how would I do that?
ls | grep -E "\.(sty|cls)$"
\. matches literally a "." - an unescaped . matches any character
(sty|cls) - match "sty" or "cls" - the | is an or and the brackets limit the expression.
$ forces the match to be at the end of the line
Note, you want grep -E or egrep, not grep -e as that's a different option for lists of patterns.
egrep "\.sty$|\.cls$"
This regex:
\.(sty|cls)\z
will match any string ends with .sty or .cls
EDIT:
for grep \z should be replaced with $ i.e.
\.(sty|cls)$
as jelovirt suggested.