Grep hashes via regex in bash

I want to grep for hexadecimal hashes in strings and only extract those hashes.
I've tested a regex in online regex testing tools that does the trick:
\b[0-9a-f][0-9a-f]+[0-9a-f]\b
The \b anchors set word boundaries at the start and end, and the characters next to them must be in 0-9 or a-f. Since I do not know whether the hashes are 128-bit or longer, I do not know their length in advance. Therefore I put [0-9a-f]+ in the middle in order to match any number of [0-9a-f], but at least one (since no hash consists of just the two characters that are checked next to the boundaries \b).
However, I noticed that
grep --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
does not work in the shell, while the regex \b[0-9a-f][0-9a-f]+[0-9a-f]\b works in online regex testing tools.
In fact, the shell version only works if I escape the quantifier + with a backslash:
grep --only-matching -e "\b[0-9a-f][0-9a-f]\+[0-9a-f]\b"
                                            ^
                                            |_ escaped +
Why does grep need this escaping in the shell?
Is there any downside of my rather simple approach?

I don't know why a metacharacter would need to be escaped in bash, but your regex could be rewritten like this (with -E, since unescaped interval braces are also extended syntax):
grep -E --only-matching -e "\b[0-9a-f]{3,}\b"

The + quantifier is not part of the POSIX Basic Regular Expressions (aka BRE) so you must escape it with grep in BRE mode.
As an alternative, you can:
add the -E flag to grep:
grep -E --only-matching -e "\b[0-9a-f][0-9a-f]+[0-9a-f]\b"
use [0-9a-f][0-9a-f]* or [0-9a-f]\{1,\}
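A minimal sketch of the difference, assuming GNU grep (the \b word-boundary escape is an extension, not POSIX) and a made-up sample line; the variable names are only for illustration:

```shell
# Hypothetical sample line; the hash value is invented for this demo.
line='commit 9f86d081884c7d65 by alice'

# BRE (grep's default): an unescaped + is a literal plus sign, so nothing matches.
bre_unescaped=$(printf '%s\n' "$line" | grep -o '\b[0-9a-f][0-9a-f]+[0-9a-f]\b' || true)

# BRE with an interval expression: \{1,\} means "one or more".
bre_interval=$(printf '%s\n' "$line" | grep -o '\b[0-9a-f][0-9a-f]\{1,\}[0-9a-f]\b')

# ERE (-E): + works as a quantifier without a backslash.
ere_plus=$(printf '%s\n' "$line" | grep -E -o '\b[0-9a-f][0-9a-f]+[0-9a-f]\b')

printf 'BRE unescaped: [%s]\n' "$bre_unescaped"
printf 'BRE interval:  [%s]\n' "$bre_interval"
printf 'ERE plus:      [%s]\n' "$ere_plus"
```

The first result is empty because the line contains no literal plus sign; the other two both extract the hash.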

Grep runs basic regular expressions by default. You need to escape the + quantifier with a backslash as it is said in the documentation:
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
Also, there is no need for -e option, just
grep -o '\b[0-9a-f]\+\b' file
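On the question of downsides: a loose pattern also picks up ordinary words and IDs that happen to consist of hex characters. The sketch below is one possible way to tighten it, assuming GNU grep and assuming the hashes of interest are 32, 40, or 64 hex characters long (for example MD5, SHA-1, SHA-256); the sample line and variable names are hypothetical.

```shell
# Hypothetical log line: the MD5 digest of the empty string plus hex-looking words.
line='md5 d41d8cd98f00b204e9800998ecf8427e for file cafe.txt and id 123abc'

# Loose pattern: any run of three or more hex characters between word boundaries.
loose=$(printf '%s\n' "$line" | grep -E -o '\b[0-9a-f]{3,}\b')

# Stricter pattern: only runs of exactly 32, 40, or 64 hex characters.
strict=$(printf '%s\n' "$line" | grep -E -o '\b([0-9a-f]{32}|[0-9a-f]{40}|[0-9a-f]{64})\b')

printf 'loose matches:\n%s\n' "$loose"
printf 'strict matches:\n%s\n' "$strict"
```

Here the loose pattern also returns cafe and 123abc, while the stricter one returns only the digest. If the input may contain uppercase hex digits, adding -i is one option.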

Related

Regex matches but sed fails replace

I am having a tricky regex issue
I have the string like below
some_Name _ _Bday Date Comm.txt
And here is my regex to match the spaces and underscore
\_?\s\_?
Now when i try to replace the string using sed and the above regex
echo "some_Name _ _Bday Date Comm.txt" | sed 's/\_?\s\_?/\_/g'
The output i want is
some_Name_Bday_Date_Comm.txt
Any ideas on how do i go about this ?

You are using a POSIX BRE regex engine, where the \_?\s\_? pattern matches a _? substring, a whitespace character (if your sed supports the \s shorthand) and another _? substring, i.e. the ? characters are treated as literal question marks.
You may use
sed -E 's/[[:space:]_]+/_/g'
sed 's/[[:space:]_]\{1,\}/_/g'
The [[:space:]_]+ POSIX ERE pattern (enabled with -E option) will match one or more whitespace or underscore characters.
The POSIX ERE + quantifier can be written as \{1,\} in POSIX BRE. Also, if you use GNU sed, you may use \+ in the second sed command.
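As a quick check, both commands can be run against the file name from the question. This is only a sketch; the variable names are made up, and it assumes a sed that accepts -E (GNU and BSD sed both do).

```shell
name='some_Name _ _Bday Date Comm.txt'

# ERE: + means "one or more" of whitespace or underscore.
ere=$(printf '%s\n' "$name" | sed -E 's/[[:space:]_]+/_/g')

# BRE: the same quantifier spelled as an interval, \{1,\}.
bre=$(printf '%s\n' "$name" | sed 's/[[:space:]_]\{1,\}/_/g')

printf '%s\n' "$ere"   # some_Name_Bday_Date_Comm.txt
printf '%s\n' "$bre"   # some_Name_Bday_Date_Comm.txt
```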

This might work for you (GNU sed):
sed -E 's/\s(\s*_)*/_/g' file
This will replace a space followed by zero or more of the following: zero or more spaces followed by an underscore.

How can I grep for a string that contains multiple consecutive dashes?

I want to grep for a string that contains dashes like this:
---0 [58ms, 100%, 100%]
There's at least one dash.
I found this question: How can I grep for a string that begins with a dash/hyphen?
So I want to use:
grep -- -+ test.txt
But I get nothing.
Finally, my colleague tells me that this will work:
grep '\-\+' test.txt
Yes, it works. But neither he nor I know why, even after searching many documents.
This also works:
grep -- -* test.txt

With -+ you are saying: multiple -. But this is not understood automatically by grep. You need to tell it that + has a special meaning.
You can do it by using an extended regex -E:
grep -E -- "-+" file
or by escaping the +:
grep -- "-\+" file
Test
$ cat a
---0 [58ms, 100%, 100%]
hell
$ grep -E -- "-+" a
---0 [58ms, 100%, 100%]
$ grep -- "-\+" a
---0 [58ms, 100%, 100%]
From man grep:
REGULAR EXPRESSIONS
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |,
(, and ) lose their special meaning; instead use the backslashed
versions \?, \+, \{, \|, \(, and \).
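The grep -- -* variant from the question deserves a caution: in a basic regular expression, -* means "zero or more dashes", so it matches every line, including lines with no dash at all. A small sketch of the difference, using the two sample lines from the test above (the variable names are illustrative, and -e is used here instead of -- to pass a pattern that starts with a dash):

```shell
sample=$(printf '%s\n' '---0 [58ms, 100%, 100%]' 'hell')

# BRE: -* allows zero dashes, so both lines count as matches.
star_count=$(printf '%s\n' "$sample" | grep -c -e '-*')

# BRE: --* is one dash followed by zero or more dashes, i.e. at least one dash.
dash_star_count=$(printf '%s\n' "$sample" | grep -c -e '--*')

# ERE: -+ is one or more dashes.
plus_count=$(printf '%s\n' "$sample" | grep -c -E -e '-+')

printf '%s %s %s\n' "$star_count" "$dash_star_count" "$plus_count"   # 2 1 1
```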

How do you use a plus symbol with a character class as part of a regular expression?

In Cygwin, this does not return a match:
$ echo "aaab" | grep '^[ab]+$'
But this does return a match:
$ echo "aaab" | grep '^[ab][ab]*$'
aaab
Are the two expressions not identical?
Is there any way to express "one or more characters of the character class" without typing the character class twice (like in the second example)?
According to this link the two expressions should be the same, but perhaps Regular-Expressions.info does not cover bash in cygwin.

grep has multiple "modes" of matching, and by default only uses a basic set, which does not recognize a number of metacharacters unless they're escaped. You can put grep into extended or perl modes to let + be evaluated.
From man grep:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly experimental and grep -P may warn of unimplemented features.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax
error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
Alternately, you can use egrep instead of grep -E.

In basic regular expressions the metacharacters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
So use the backslashed version:
$ echo aaab | grep '^[ab]\+$'
aaab
Or activate extended syntax:
$ echo aaab | egrep '^[ab]+$'
aaab

Masking by backslash, or egrep as extended grep, alias grep -E:
echo "aaab" | egrep '^[ab]+$'
aaab
echo "aaab" | grep '^[ab]\+$'
aaab
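Note that \+ in a basic regular expression is a GNU extension; the interval \{1,\} is the POSIX spelling. The sketch below (variable names are made up) shows both working forms, and also what the unescaped + really matches in basic mode: a literal plus sign.

```shell
# POSIX BRE: "one or more" written as an interval.
portable_bre=$(printf '%s\n' 'aaab' | grep '^[ab]\{1,\}$')

# ERE: + is a quantifier.
ere=$(printf '%s\n' 'aaab' | grep -E '^[ab]+$')

# BRE with a bare +: one character from [ab] followed by a literal "+".
literal_plus=$(printf '%s\n' 'a+' | grep '^[ab]+$')

printf '%s %s %s\n' "$portable_bre" "$ere" "$literal_plus"   # aaab aaab a+
```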

Grep does not show results, online regex tester does

I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?

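Let's break your regex down against the first line:
[a-z] matches g
\d{4} matches 1234
[a-z]* matches the empty string (zero letters), and \. then matches the dot
So the pattern itself is fine in a Perl-style engine; what breaks under grep is \d, which grep's basic and extended modes don't support. Try either [0-9] or [[:digit:]].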
Finally, when using regular expressions, prefer egrep (grep -E) to plain grep, since it supports more regex operators without backslashes. Also, in many shells (including bash on OS X, as you mentioned), quote the pattern, preferably with single quotes: an unquoted * can be expanded by the shell to a list of files in the current directory before grep sees it, and double quotes still let the shell process $, backticks, and some backslashes. Bash won't touch anything in single quotes.

grep doesn't support \d by default. To match a digit, use [0-9], or enable Perl-compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml

grep uses "basic" regular expressions : (excerpt from man pages )
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also depending on which shell you are executing in the '*' character might get expanded.

You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345

Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}

Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*

Pattern matching digits does not work in egrep?

Why can't I match the string
"1234567-1234567890"
with the given regular expression
\d{7}-\d{10}
with egrep from the shell like this:
egrep \d{7}-\d{10} file
?

egrep doesn't recognize \d shorthand for digit character class, so you need to use e.g. [0-9].
Moreover, while it's not absolutely necessary in this case, it's a good habit to quote the regex to prevent misinterpretation by the shell. Thus, something like this should work:
egrep '[0-9]{7}-[0-9]{10}' file
See also
egrep mini tutorial
References
regular-expressions.info/Flavor comparison
Flavor note for GNU grep, ed, sed, egrep, awk, emacs
Lists the differences between grep vs egrep vs other regex flavors
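To see why quoting matters for a pattern like this, it helps to print what the shell actually hands to the command. A small sketch, with illustrative variable names, using printf as a stand-in for egrep's argument list:

```shell
# Unquoted: the shell removes the backslashes before the command sees the pattern.
unquoted=$(printf '%s\n' \d{7}-\d{10})

# Single-quoted: the pattern is passed through unchanged.
quoted=$(printf '%s\n' '\d{7}-\d{10}')

printf 'unquoted: %s\n' "$unquoted"   # unquoted: d{7}-d{10}
printf 'quoted:   %s\n' "$quoted"     # quoted:   \d{7}-\d{10}

# With a bracket expression and quotes, the sample string matches under -E.
match=$(printf '%s\n' '1234567-1234567890' | grep -E '[0-9]{7}-[0-9]{10}')
printf 'match:    %s\n' "$match"
```

So the unquoted command in the question never even passed a backslash to egrep; it searched for the letter d repeated seven times, a dash, and d repeated ten times.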

For completeness:
Egrep does in fact have support for character classes. The classes are:
[:alnum:]
[:alpha:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
Example (note the double brackets):
egrep '[[:digit:]]{7}-[[:digit:]]{10}' file
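A short sketch of two of these classes in use, with made-up input strings and variable names. It assumes a grep that supports -o, and note that [[:xdigit:]] accepts both upper- and lowercase hex digits:

```shell
# [[:digit:]] inside an ERE interval, extracting only the matching part.
digits=$(printf '%s\n' 'id 1234567-1234567890 ok' | grep -E -o '[[:digit:]]{7}-[[:digit:]]{10}')

# [[:xdigit:]] matches 0-9, a-f, and A-F.
hex=$(printf '%s\n' 'sum DEADbeef01' | grep -E -o '[[:xdigit:]]{8,}')

printf '%s\n' "$digits"   # 1234567-1234567890
printf '%s\n' "$hex"      # DEADbeef01
```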

You can use \d if you pass grep the "perl regex" option, e.g.:
grep -P "\d{9}"

Use [0-9] instead of \d. egrep doesn't know \d.

Try this one:
egrep '([0-9]{7}-[0-9]{10})' file
