Regex error "Repetition not preceded by valid expression" in Grep

I have a file with strings such as {?ENV1} {?ENV2}
I want to use grep to find these using
grep -o '\{\?\S+?\}' myfile
but I get
grep: bad regex '\{\?\S+?\}': Repetition not preceded by valid expression
On the regex101 website the regex works. Does grep work differently?

The default regex flavor of POSIX grep is POSIX BRE. You shouldn't escape the braces or use \S. In BRE, escaped braces have a special meaning (they delimit an interval quantifier, which is the cause of the error you see), and \S isn't supported. Try:
grep -o '{?[^{}]*}' file
Or, to keep the intent of \S:
grep -o '{?[^[:space:]]*}' file
Or even, to emulate the + quantifier:
grep -o '{?[^[:space:]][^[:space:]]*}' file
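As a quick check of the BRE form, here is a minimal sketch run against a made-up input line (the `sample` text is an assumption built from the strings in the question, not the asker's real file):

```shell
# Hypothetical sample line; only the {?...} tokens come from the question.
sample='export A={?ENV1} B={?ENV2} C={plain}'

# In BRE, unescaped { } and ? are literal characters.
# [^{}]* consumes everything up to the closing brace.
matches=$(printf '%s\n' "$sample" | grep -o '{?[^{}]*}')

printf '%s\n' "$matches"
# {?ENV1}
# {?ENV2}
```

{plain} is not reported because the pattern requires a literal ? right after the opening brace.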

Use the -P flag for Perl-based regular expressions (though this is still in an experimental phase):
grep -oP '\{\?\S*\}' myfile
OR
Try this with updated regex. You can use *
grep -o '\{\?\S*\}' myfile

Related

How to use square brackets in grep for MINGW64?

Currently, I have the following regex. It should match a string that I am echoing:
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -E 'TBGFSGFI\d\d\d\d\d[A-Za-z_]{1,100}'
It works as expected in OS X on my Mac and in Notepad++, but in Bash for Windows (MINGW64) I get an empty string. How can I use grep with flags, or how should I rewrite the regex to match the pattern?
My grep version is 3.1. Bash: 4.4.23(1)
Thanks for help in advance!
You are using a POSIX ERE regex with the -E option, and that flavor does not support the \d construct. You also need the -o option to actually extract the matches.
Note you do not need to repeat \d five times, you can use a range quantifier, \d{5}.
You can use
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -Po "TBGFSGFI\d{5}[A-Za-z_]{1,100}"
Where
-P means the regex is of a PCRE flavor
-o extracts matches only
TBGFSGFI\d{5}[A-Za-z_]{1,100} - a regex that matches TBGFSGFI, then any five digits and then 1-100 ASCII letters or _.
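If -P is not available in a given grep build, the same match can probably be expressed in plain ERE by swapping \d for [0-9]. This is a sketch of that alternative, not the answer's own command; the variable names are placeholders:

```shell
# The string echoed in the question.
input='TBGFSGFI22800_D_REP_D_RISIKOEINHEIT'

# ERE (-E) supports the {5} interval but not \d, so use [0-9] instead.
result=$(printf '%s\n' "$input" | grep -Eo 'TBGFSGFI[0-9]{5}[A-Za-z_]{1,100}')

printf '%s\n' "$result"
# TBGFSGFI22800_D_REP_D_RISIKOEINHEIT
```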

Matching decimal number in grep

I have a file that has the line:
Time 97.7518 seconds
I want to get the decimal time. Why is the following simple grep command not working?
grep -Ei "\d+\.\d+" Nasa-1024-256.txt
You seem to need the -o option to extract the match, and using the [0-9] bracket expression is safer with ERE regex flavor (it is set by the -E option):
grep -Eo "[0-9]+\.[0-9]+" Nasa-1024-256.txt
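To see the effect without the original file, the line from the question can be piped in directly. This is only an illustration; the `seconds` variable name is an assumption:

```shell
# Feed the sample line from the question to grep on stdin.
seconds=$(printf 'Time 97.7518 seconds\n' | grep -Eo '[0-9]+\.[0-9]+')

printf '%s\n' "$seconds"
# 97.7518
```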

Grep first group regexp

Is there a way to specify what regexp group I want to append to my file?
In the example below I only want to store (\d{8}) in my file:
grep -P1 -o kamilla(\d{8}) >> whatever.txt
You'll need to use a Positive Lookbehind assertion or alternative so that it isn't included in the match.
Positive Lookbehind:
grep -Poi '(?<=kamilla)\d{8}'
The look-behind asserts that at the current position in the string, what precedes is "kamilla". If the assertion succeeds, the regular expression engine matches eight digits.
Alternative \K escape sequence:
grep -Poi 'kamilla\K\d{8}'
The \K escape sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence.
The -o option prints only the part of the line that matches the pattern.
You can use the -o switch and a \K, which removes the preceding part of the match:
$ grep -Poi 'kamilla\K\d{8}' <<<"kamilla83222237"
83222237
As you're using Perl-style regular expressions, you could also just use Perl:
$ perl -nE 'say $1 if /kamilla(\d{8})/' <<<"kamilla83222237"
83222237
Another way:
$ grep -P -o '(?<=kamilla)\d{8}' <<< kamilla12345678
12345678
You can use sed instead:
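sed -nE 's/.*kamilla([0-9]{8}).*/\1/p' input.txt >> output.txt
This replaces each matching input line with the first capture group, \1, and prints it. The -n option together with the p flag keeps non-matching lines out of the output. Note that sed -E does not understand \d, so [0-9] is used instead.
This also allows you to manipulate the input file in some non-trivial ways. For example, you can match two groups and output them in non-default order, like \2\1 and so on.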
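The group reordering mentioned above could look roughly like this. It is a sketch under the assumption that the prefix itself is also captured as a group; the `line` and `swapped` names are placeholders:

```shell
# Sample input borrowed from the other answers in this thread.
line='kamilla83222237'

# Group 1 captures the prefix, group 2 the eight digits; \2\1 swaps them.
# -n plus the p flag prints only lines where the substitution happened.
swapped=$(printf '%s\n' "$line" | sed -nE 's/.*(kamilla)([0-9]{8}).*/\2\1/p')

printf '%s\n' "$swapped"
# 83222237kamilla
```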

grep not returning expected result with regex on xml

I'm running a grep command on some xml, and it appears to be misinterpreting the regular expression I'm trying to use.
Here's the command
grep '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
What appears to be happening is that the ?> part of the regex causes no match at all, rather than matching up to the first occurrence of >
Any ideas?
If you want to get the text up to the first occurrence of the > character, then try the command below:
grep -o '<ernm:NewReleaseMessage[^>]*>' file
If you want the whole line then remove -o parameter.
Example:
$ cat aa1.txt
<ernm:NewReleaseMessage blah> foo bar>
$ grep -o '<ernm:NewReleaseMessage[^>]*>' aa1.txt
<ernm:NewReleaseMessage blah>
grep with -o prints only the matched text.
[^>]* - zero or more characters other than >, so the match extends only up to the first occurrence of the > character.
By default, grep uses basic regular expressions and treats ? as a literal question mark. For it to be treated as regular expression syntax, you need to escape that character.
grep '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
You can use the -E option which interprets the pattern as an extended regular expression.
grep -E '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
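Note: The above returns the whole line that matches your pattern. If you only want the matched text, use the -o option, which prints only the matched parts of matching lines. Also note that neither BRE nor ERE has lazy quantifiers, so .* stays greedy here and the match runs up to the last > on the line, not the first.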
grep -o '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
OR
grep -Eo '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
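The difference between a greedy .* and the negated bracket expression can be seen on the sample line used earlier in this thread. This is a small sketch; the `greedy` and `first` variable names are assumptions:

```shell
# Sample line from the aa1.txt example in the first answer.
line='<ernm:NewReleaseMessage blah> foo bar>'

# .* is greedy in ERE, so the match extends to the LAST > on the line.
greedy=$(printf '%s\n' "$line" | grep -Eo '<ernm:NewReleaseMessage.*>')

# [^>]* cannot cross a >, so the match stops at the FIRST >.
first=$(printf '%s\n' "$line" | grep -o '<ernm:NewReleaseMessage[^>]*>')

printf '%s\n' "$greedy"
# <ernm:NewReleaseMessage blah> foo bar>
printf '%s\n' "$first"
# <ernm:NewReleaseMessage blah>
```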

Grep does not show results, online regex tester does

I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
Your regex doesn't match the input. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* doesn't match .
Also, I believe grep and family don't like the \d syntax. Try either [0-9] or [[:digit:]]
Finally, when using regular expressions, prefer egrep to grep. I don't remember the exact details, but egrep supports more regex operators. Also, in many shells (including bash on OS X, as you mentioned), use single quotes around the pattern. An unquoted * can be expanded by the shell to a list of files in the current directory before grep sees it, and inside double quotes the shell still interprets $, backticks, and some backslashes. Bash won't touch anything in single quotes.
grep doesn't support \d by default. To match a digit, use [0-9], or enable Perl-compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
grep uses "basic" regular expressions : (excerpt from man pages )
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also depending on which shell you are executing in the '*' character might get expanded.
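To make the basic vs. extended distinction from the man page excerpt concrete, here is a sketch that runs the same search in both flavors on the two lines from the question. The variable names are placeholders, and \d is replaced by [0-9] or [[:digit:]] since neither flavor supports it:

```shell
# The two sample lines from the question.
xml='<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>'

# BRE (default): the interval needs backslashed braces, \{4\}.
bre=$(printf '%s\n' "$xml" | grep -o '[a-z][0-9]\{4\}[a-z]*\.[a-z]*[0-9]*')

# ERE (-E): plain {4}; [[:digit:]] is the POSIX class equivalent of [0-9].
ere=$(printf '%s\n' "$xml" | grep -Eo '[a-z][[:digit:]]{4}[a-z]*\.[a-z]*[[:digit:]]*')

printf '%s\n' "$bre"
# g1234.ab012345
# g5678m.ab678901
```

Both variables end up holding the same two identifiers; only the quantifier syntax differs between the flavors.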
You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*