Why can't I match the string
"1234567-1234567890"
with the given regular expression
\d{7}-\d{10}
with egrep from the shell like this:
egrep \d{7}-\d{10} file
?
egrep doesn't recognize the \d shorthand for the digit character class, so you need to use e.g. [0-9].
Moreover, while it's not absolutely necessary in this case, it's a good habit to quote the regex to prevent misinterpretation by the shell. Thus, something like this should work:
egrep '[0-9]{7}-[0-9]{10}' file
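A minimal sketch of the fix above, run against a throwaway file (the temporary file and its three lines are assumptions made up for illustration). It uses grep -E, which GNU grep documents as the preferred spelling of egrep.

```shell
# Hypothetical sample data: one line in the wanted format, two that are not.
tmp=$(mktemp)
printf '%s\n' '1234567-1234567890' '123-456' 'no digits here' > "$tmp"

# Quoted ERE with [0-9] instead of \d.
matches=$(grep -E '[0-9]{7}-[0-9]{10}' "$tmp")
printf '%s\n' "$matches"   # prints 1234567-1234567890

rm -f "$tmp"
```

Only the first line is printed, because it is the only one with seven digits, a hyphen, and ten digits.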
See also
egrep mini tutorial
References
regular-expressions.info/Flavor comparison
Flavor note for GNU grep, ed, sed, egrep, awk, emacs
Lists the differences between grep vs egrep vs other regex flavors
For completeness:
Egrep does in fact have support for character classes. The classes are:
[:alnum:]
[:alpha:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
Example (note the double brackets):
egrep '[[:digit:]]{7}-[[:digit:]]{10}' file
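As a small sketch of why the double brackets matter: the outer pair opens a bracket expression, and each [:name:] inside it adds one class, so several classes can share one set. The three sample strings are made up for illustration.

```shell
# One bracket expression holding two classes: digits and uppercase letters.
class_hits=$(printf '%s\n' 'AB12' 'ab12' '12-34' |
  grep -E '^[[:digit:][:upper:]]+$')
printf '%s\n' "$class_hits"   # prints AB12
```

The second line is rejected because of its lowercase letters, the third because of the hyphen.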
You can use \d if you pass grep the -P ("Perl regex") option, e.g.:
grep -P "\d{9}"
Use [0-9] instead of \d. egrep doesn't know \d.
Try this one:
egrep '([0-9]{7}-[0-9]{10})' file
Related
I am having a tricky regex issue
I have the string like below
some_Name _ _Bday Date Comm.txt
And here is my regex to match the spaces and underscore
\_?\s\_?
Now when I try to replace the string using sed and the above regex
echo "some_Name _ _Bday Date Comm.txt" | sed 's/\_?\s\_?/\_/g'
The output i want is
some_Name_Bday_Date_Comm.txt
Any ideas on how I should go about this?
You are using a POSIX BRE regex engine with the \_?\s\_? pattern, which matches a _?, a whitespace (if your sed supports the \s shorthand) and a _? substring, i.e. the ? are treated as literal question mark symbols.
You may use
sed -E 's/[[:space:]_]+/_/g'
sed 's/[[:space:]_]\{1,\}/_/g'
See online sed demo
The [[:space:]_]+ POSIX ERE pattern (enabled with -E option) will match one or more whitespace or underscore characters.
The POSIX ERE + quantifier can be written as \{1,\} in POSIX BRE. Also, if you use GNU sed, you may use \+ in the second sed command.
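Both commands can be checked against the string from the question. This is only a sketch; the variable names are made up for illustration.

```shell
input='some_Name _ _Bday Date Comm.txt'

# ERE form: + means "one or more".
ere_out=$(printf '%s\n' "$input" | sed -E 's/[[:space:]_]+/_/g')

# BRE form: the same quantifier spelled as \{1,\}.
bre_out=$(printf '%s\n' "$input" | sed 's/[[:space:]_]\{1,\}/_/g')

printf '%s\n' "$ere_out" "$bre_out"
# prints some_Name_Bday_Date_Comm.txt twice
```

Note that a lone underscore is also "replaced" by an underscore, which is harmless here.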
This might work for you (GNU sed):
sed -E 's/\s(\s*_)*/_/g' file
This will replace a space followed by zero or more of the following: zero or more spaces followed by an underscore.
I started learning regex using grep and sed on Linux, and I don't understand why I have to "save" curly braces. By saving I mean escaping a character so that it is matched literally. But when I type grep 'test{2}' it only matches test{2}, and when I type 'test\{2\}' it matches testtest. That's fine, but why does the backslash work the other way round with other modifiers? Take . (dot): when I type test. it matches test followed by any character. In this case we need the backslash to make it a literal character, so test\. only matches the literal text "test.".
To summarize: why does the backslash turn { into a special character, while for other modifiers like . it turns the special character into a literal one?
I know it sounds hilarious but I don't understand it...
When grep is used with no -E you need to escape ("save") braces that are quantifiers because the regex flavor used is POSIX BRE:
grep 'test\{2\}' file # => Finds lines having testt (only the last t is repeated), not necessarily testtest
and
grep '\(test\)\{2\}' file # => Finds lines having testtest
The equivalent POSIX ERE variants are
grep -E 'test{2}' file
grep -E '(test){2}' file
Another example is to match curly braces:
grep '{2}' file # => matches lines having {2} in them
grep -E '\{2}' file # => same; note the } is not special
See more about BRE and ERE POSIX regex standard.
The differences between POSIX BRE and ERE syntax are just historical; there seems to be no specific idea behind them.
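The escaping rules above can be compared side by side on three sample lines (the sample lines and variable names are assumptions chosen for illustration):

```shell
sample=$(printf '%s\n' 'testt' 'testtest' 'test{2}')

# BRE: \{2\} is an interval applied to the preceding t only.
bre_interval=$(printf '%s\n' "$sample" | grep 'test\{2\}')      # testt, testtest
# BRE: escaped parentheses group, so the whole word is repeated.
bre_group=$(printf '%s\n' "$sample" | grep '\(test\)\{2\}')     # testtest
# BRE: unescaped braces are literal characters.
bre_literal=$(printf '%s\n' "$sample" | grep 'test{2}')         # test{2}
# ERE: unescaped braces are the interval, as in the first BRE case.
ere_interval=$(printf '%s\n' "$sample" | grep -E 'test{2}')     # testt, testtest

printf '%s\n' "$bre_interval" "$bre_group" "$bre_literal" "$ere_interval"
```

The line testtest shows up for test\{2\} as well, simply because it contains testt.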
I want to search recursively in files for a given pattern and replace it. The search is for a string like "['DB']['1']['HOST'] = 'localhost'". When testing the regex, the following doesn't print anything. I can't see the error in this regex. Could anyone help?
sed -n '/\[\'HOST\'\]\s?=\s?(?:\'|")(.+)(?:\'|")/p' /path/to/file
POSIX regex does not support non-capturing groups. Besides, you have not specified the -E option, so the pattern is parsed as a POSIX BRE pattern, where capturing parentheses must be escaped. Also, a single quote cannot be escaped inside a single-quoted shell string, so the \' sequences end the sed script early; with GNU sed, use \x27 instead.
Use
sed -En '/\[\x27HOST\x27\]\s?=\s?[\x27"][^\x27"]+[\x27"]/p'
See an online demo:
s="a string like ['DB']['1']['HOST'] = 'localhost'."
sed -En '/\[\x27HOST\x27\]\s?=\s?[\x27"][^\x27"]+[\x27"]/p' <<< "$s"
Besides, instead of \s, it might be a good idea to use [[:space:]].
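If \x27 and \s are not available (both are GNU extensions, as far as I know), one possible workaround is to build the pattern in a double-quoted shell variable. This is only a sketch; the names q and re are made up for illustration.

```shell
s="a string like ['DB']['1']['HOST'] = 'localhost'."

q="'"   # a literal single quote, kept in a variable to avoid quoting trouble
# Inside double quotes \[ and \] stay as they are, and \" becomes a plain ".
re="\[${q}HOST${q}\][[:space:]]?=[[:space:]]?[${q}\"][^${q}\"]+[${q}\"]"

host_line=$(printf '%s\n' "$s" | sed -En "/$re/p")
printf '%s\n' "$host_line"   # prints the whole input line
```

The pattern contains no / character, so it can be dropped between the sed delimiters unchanged.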
I want to grep for hexadecimal hashes in strings and only extract those hashes.
I've tested a regex in online regex testing tools that does the trick:
\b[0-9a-f][0-9a-f]+[0-9a-f]\b
The \b is used to set word boundaries (start & end) that should be any character 0-9 or a-f. Since I do not know if the hashes are 128bit or higher, I do not know the length of the hashes in advance. Therefore I set [0-9a-f]+ in the middle in order match any number of [0-9a-f], but at least one (since no hash consists just of two characters that are checked with the boundaries \b).
However, I noticed that
grep --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
does not work in the shell, while the regex \b[0-9a-f][0-9a-f]+[0-9a-f]\b works in online regex testing tools.
In fact, the shell version does only work if I escape the quantifier + with a backslash:
grep --only-matching -e "\b[0-9a-f][0-9a-f]\+[0-9a-f]\b"
                                            ^
                                            |_ escaped +
Why does grep need this escaping in the shell?
Is there any downside to my rather simple approach?
I don't know why a metacharacter would need to be escaped in bash, but your regex could be rewritten like this (with -E, so that {3,} is treated as an interval):
grep -E --only-matching -e "\b[0-9a-f]{3,}\b"
The + quantifier is not part of the POSIX Basic Regular Expressions (aka BRE) so you must escape it with grep in BRE mode.
As an alternative, you can:
add the -E flag to grep:
grep -E --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
use [0-9a-f][0-9a-f]* or [0-9a-f]\{1,\}
Grep uses basic regular expressions by default. You need to escape the + quantifier with a backslash, as the documentation says:
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
Also, there is no need for -e option, just
grep -o '\b[0-9a-f]\+\b' file
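A short sketch of the hash extraction in both modes, on made-up input (the 32-character value and the other words are sample data only). Note that \b and -o are extensions rather than POSIX features, although GNU and BSD grep both provide them.

```shell
text=$(printf '%s\n' 'file1 d41d8cd98f00b204e9800998ecf8427e' 'no hash here')

# ERE: the interval is written {3,}.
ere_hits=$(printf '%s\n' "$text" | grep -Eo '\b[0-9a-f]{3,}\b')

# BRE: the same interval needs backslashes.
bre_hits=$(printf '%s\n' "$text" | grep -o '\b[0-9a-f]\{3,\}\b')

printf '%s\n' "$ere_hits" "$bre_hits"
# prints d41d8cd98f00b204e9800998ecf8427e twice
```

One downside of the simple approach: ordinary words made only of the letters a to f, such as cafe or deadbeef, match as well, so a minimum length closer to the expected hash size is safer.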
I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
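Your regex itself fits the sample input, as long as the tool understands \d. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* matches nothing here (zero letters), \. matches the dot, and [a-z]*\d* matches ab012345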
Also, I believe grep and family don't support the \d syntax. Try either [0-9] or [[:digit:]].
Finally, when using regular expressions, prefer egrep to grep. I don't remember the exact details, but egrep supports more regex operators. Also, in many shells (including bash on OS X, as you mentioned), use single quotes rather than double quotes or no quotes: unquoted, * can be expanded by the shell to a list of matching files in the current directory before grep sees it, and inside double quotes the shell still processes $, backticks, and some backslashes. Bash won't touch anything in single quotes.
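Besides file name expansion, an unquoted pattern also loses its backslashes before grep ever sees it. A quick way to check what a command receives is to hand the same text to printf; the fragment \d{4}\. below is just a piece of the pattern from the question.

```shell
# printf '%s' prints its argument exactly as the shell passed it.
unquoted=$(printf '%s' \d{4}\.)     # the shell removes both backslashes
quoted=$(printf '%s' '\d{4}\.')     # single quotes keep every character

printf '%s\n' "$unquoted" "$quoted"
# prints d{4}. and then \d{4}\.
```

So the first attempt in the question never passed \d to grep at all, independent of whether grep understands it.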
grep doesn't support \d by default. To match a digit, use [0-9], or allow Perl compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
grep uses "basic" regular expressions by default (excerpt from the man page):
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
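The portable spelling recommended in the excerpt, [{] for a literal brace, can be tried like this (the two sample lines are made up for illustration):

```shell
# A bracket expression with a single member matches that character literally.
brace_hits=$(printf '%s\n' 'a{1' 'b1' | grep -E '[{]1')
printf '%s\n' "$brace_hits"   # prints a{1
```

Because the brace sits inside a bracket expression, no implementation can mistake it for the start of an interval.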
Also depending on which shell you are executing in the '*' character might get expanded.
You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*