With PCRE you'd do ax?a to find strings like aa and axa.
How would you write a regex for grep that'd do that?
By default grep uses BRE; you can switch with the -P (PCRE) or -E (ERE) option.
For example:
kent$ echo "aa
axa
axxxxa"|grep -E 'ax?a'
aa
axa
With BRE, you have to escape characters such as ( ? + to give them special meaning (\? and \+ are GNU extensions to BRE).
In grep, you need to escape the quantifier:
ax\?a
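A minimal sketch of the same match in both syntaxes, assuming a POSIX shell and grep. It uses the interval \{0,1\}, which is the standard BRE spelling of "zero or one", in place of the GNU-only \?. The variable names are illustrative.

```shell
# Sample input: only "aa" and "axa" should match.
input='aa
axa
axxxxa'

# POSIX BRE: \{0,1\} means "zero or one of the preceding item".
bre_matches=$(printf '%s\n' "$input" | grep 'ax\{0,1\}a')

# ERE: ? is a quantifier without any backslash.
ere_matches=$(printf '%s\n' "$input" | grep -E 'ax?a')

printf 'BRE matches:\n%s\n' "$bre_matches"
printf 'ERE matches:\n%s\n' "$ere_matches"
```

Both commands should select the same two lines; the interval form is the one to prefer when the script may run against a non-GNU grep.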
Related
Both of the regexes below work in my case.
grep \s
grep ^[[:space:]]
However all those below fail. I tried both in git bash and putty.
grep ^\s
grep ^\s*
grep -E ^\s
grep -P ^\s
grep ^[\s]
grep ^(\s)
The last one even produces a syntax error.
If I try ^\s in debuggex it works.
How do I find lines starting with whitespace characters with grep? Do I have to use [[:space:]]?
grep \s works for you because your input contains the letter s. Unquoted, the shell removes the backslash before grep runs, so grep receives the pattern s and matches a literal s; it never sees a whitespace escape. If you use grep ^\\s, the shell turns \\ into a single literal \, so grep receives ^\s and matches lines starting with whitespace.
A better idea is to quote the pattern so the shell leaves it alone, here with ERE syntax enabled via -E (\s itself is a GNU extension, not part of POSIX ERE):
grep -E '^\s' <<< "$s"
For example:
s=' word'
grep ^\\s <<< "$s"
# => word
grep -E '^\s' <<< "$s"
# => word
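To see what the shell actually hands to grep, it can help to print the argument instead of running grep. This is a small sketch assuming a POSIX shell; the variable names are made up for illustration, and the final check uses [[:space:]] so it does not depend on GNU's \s.

```shell
# printf '%s' prints its argument verbatim, so it shows what grep would receive.
unquoted=$(printf '%s' \s)    # the shell drops the backslash      -> s
escaped=$(printf '%s' \\s)    # \\ collapses to a single backslash -> \s
quoted=$(printf '%s' '\s')    # single quotes keep it untouched    -> \s

printf 'unquoted: %s\nescaped:  %s\nquoted:   %s\n' "$unquoted" "$escaped" "$quoted"

# Portable check for a line that starts with whitespace (counts matching lines).
hits=$(printf '%s\n' ' word' | grep -c '^[[:space:]]')
printf 'lines starting with whitespace: %s\n' "$hits"
```

Quoting the pattern in single quotes is usually the least surprising choice, because nothing inside them is rewritten by the shell.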
I'd like to use grep to match all characters before the first whitespace.
grep "^[^\s]*" filename.txt
did not work. Instead, all characters before the first s are matched. Is there no \s available in grep?
You can also use the Perl regex flag -P together with -o, which prints only the matched part:
grep -oP "^\S+" filename.txt
With a POSIX character class:
grep -o '^[^[:blank:]]*' filename.txt
As for where \s is available:
POSIX grep supports only Basic Regular Expressions or, when called grep -E, Extended Regular Expressions, both of which have no \s
GNU grep supports \s as a synonym for [[:space:]]
BSD grep doesn't seem to support \s
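Alternatively, you could use awk. Note that -F ' ' is the same as awk's default splitting, which skips leading blanks; to split on every single space so that leading blanks aren't ignored, use a bracket expression as the separator:
awk -F '[ ]' '{ print $1 }'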
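The difference between awk's separators is easiest to see on a line with leading blanks. The sketch below assumes a POSIX awk; the sample line and variable names are illustrative only.

```shell
line='  alpha beta'

# Default splitting: runs of blanks separate fields, leading blanks are skipped.
default_first=$(printf '%s\n' "$line" | awk '{ print $1 }')

# A single space as FS is the special default case, so this behaves the same.
space_first=$(printf '%s\n' "$line" | awk -F ' ' '{ print $1 }')

# A bracket expression is treated as a regex: every single space is a separator,
# so the first field of a line that starts with a space is empty.
literal_first=$(printf '%s\n' "$line" | awk -F '[ ]' '{ print $1 }')

printf 'default: [%s]\nspace:   [%s]\nliteral: [%s]\n' \
    "$default_first" "$space_first" "$literal_first"
```

Which behaviour is wanted depends on whether "before the first whitespace" should yield an empty result for lines that begin with a blank.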
I would like to replace all terms that start with a hashtag with a new term
I'm using sed but there seems to be a syntax error
sed 's/#[a-zA-Z0-9]+/replacement/g' terms
How can I correct my syntax?
sed supports a "basic regular expression" (BRE) which does not offer the + as a special operator.
A correct replacement for + would be
sed 's/#[[:alnum:]]\{1,\}/replacement/g'
or
sed 's/#[[:alnum:]][[:alnum:]]*/replacement/g'
GNU sed and recent BSD sed offer "extended regular expression" (ERE) matching:
sed -E 's/#[[:alnum:]]+/replacement/g'
(older GNU sed releases accepted -E but documented only -r, so -r is the safer spelling there)
and they also offer \+ as an extension to BRE,
sed 's/#[[:alnum:]]\+/replacement/g'
If you require portability you should stick with the BRE of regular sed.
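As a quick check of the portable forms, here is a sketch assuming a POSIX shell and sed; the sample text and variable names are invented for illustration. It also shows that the original command is not a syntax error: in BRE the + is simply a literal plus sign, so nothing is replaced.

```shell
terms='#foo bar #baz42'

# BRE interval: one or more alphanumerics after the hash.
interval=$(printf '%s\n' "$terms" | sed 's/#[[:alnum:]]\{1,\}/replacement/g')

# Same idea without intervals: one mandatory character, then zero or more.
doubled=$(printf '%s\n' "$terms" | sed 's/#[[:alnum:]][[:alnum:]]*/replacement/g')

# Unescaped + in BRE is a literal "+", so this leaves the input unchanged.
literal_plus=$(printf '%s\n' "$terms" | sed 's/#[a-zA-Z0-9]+/replacement/g')

printf '%s\n' "$interval" "$doubled" "$literal_plus"
```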
@user784637 I used [[:alnum:]] instead of [a-zA-Z0-9]. This also matches letters with diacritics, for example.
$ printf "%s\n" ë è é | grep '[a-zA-Z0-9]'
$
vs.
$ printf "%s\n" ë è é | grep '[[:alnum:]]'
ë
è
é
$
Use whichever suits your needs.
On my version of sed, + doesn't do anything useful. You should use * instead, repeating the class once so that at least one character is required.
in cygwin, this does not return a match:
$ echo "aaab" | grep '^[ab]+$'
But this does return a match:
$ echo "aaab" | grep '^[ab][ab]*$'
aaab
Are the two expressions not identical?
Is there any way to express "one or more characters of the character class" without typing the character class twice (like in the second example)?
According to this link the two expressions should be the same, but perhaps Regular-Expressions.info does not cover bash in cygwin.
grep has multiple "modes" of matching, and by default only uses a basic set, which does not recognize a number of metacharacters unless they're escaped. You can put grep into extended or perl modes to let + be evaluated.
From man grep:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly experimental and grep -P may warn of unimplemented features.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
Alternately, you can use egrep instead of grep -E.
In basic regular expressions the metacharacters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
So use the backslashed version:
$ echo aaab | grep '^[ab]\+$'
aaab
Or activate extended syntax:
$ echo aaab | egrep '^[ab]+$'
aaab
Escape the + with a backslash, or use egrep, which is extended grep (an alias for grep -E):
echo "aaab" | egrep '^[ab]+$'
aaab
echo "aaab" | grep '^[ab]\+$'
aaab
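The three spellings can be compared side by side by counting matches. This is a sketch assuming a POSIX shell and grep; the variable names are illustrative. The `|| true` is there because grep exits with status 1 when nothing matches.

```shell
sample='aaab'

# BRE with a bare +: the plus is a literal character, so no line matches.
bare_plus=$(printf '%s\n' "$sample" | grep -c '^[ab]+$') || true

# BRE with an interval: portable "one or more" without repeating the class.
bre_interval=$(printf '%s\n' "$sample" | grep -c '^[ab]\{1,\}$') || true

# ERE: + is a quantifier as written.
ere_plus=$(printf '%s\n' "$sample" | grep -E -c '^[ab]+$') || true

printf 'bare +: %s\ninterval: %s\nERE +: %s\n' "$bare_plus" "$bre_interval" "$ere_plus"
```

The interval form answers the original question for basic mode: it expresses "one or more" while naming the character class only once.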
Why can't I match the string
"1234567-1234567890"
with the given regular expression
\d{7}-\d{10}
with egrep from the shell like this:
egrep \d{7}-\d{10} file
?
egrep doesn't recognize \d shorthand for digit character class, so you need to use e.g. [0-9].
Moreover, while it's not absolutely necessary in this case, it's good habit to quote the regex to prevent misinterpretation by the shell. Thus, something like this should work:
egrep '[0-9]{7}-[0-9]{10}' file
For completeness:
Egrep does in fact have support for character classes. The classes are:
[:alnum:]
[:alpha:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
Example (note the double brackets):
egrep '[[:digit:]]{7}-[[:digit:]]{10}' file
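A short sketch of both fixes together, assuming a POSIX shell and grep -E; the variable names are illustrative. The last line prints what the shell would pass to egrep for the unquoted pattern from the question, which shows that the backslashes never reach grep in the first place.

```shell
record='1234567-1234567890'

# Explicit range and POSIX class are two ways to match digits in ERE.
range_hit=$(printf '%s\n' "$record" | grep -E '[0-9]{7}-[0-9]{10}')
class_hit=$(printf '%s\n' "$record" | grep -E '[[:digit:]]{7}-[[:digit:]]{10}')

# Unquoted, the shell strips the backslashes: grep would receive d{7}-d{10}.
shell_view=$(printf '%s' \d{7}-\d{10})

printf 'range: %s\nclass: %s\nunquoted pattern as seen by grep: %s\n' \
    "$range_hit" "$class_hit" "$shell_view"
```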
You can use \d if you pass grep the "Perl regex" option, e.g.:
grep -P "\d{9}"
Use [0-9] instead of \d. egrep doesn't know \d.
Try this one (with [0-9], since egrep does not understand \d):
egrep '([0-9]{7}-[0-9]{10})' file