How to replace regex pattern with a new term in sed? - regex

I would like to replace all terms that start with a hashtag with a new term
I'm using sed but there seems to be a syntax error
sed 's/#[a-zA-Z0-9]+/replacement/g' terms
How can I correct my syntax?

sed supports a "basic regular expression" (BRE) which does not offer the + as a special operator.
A correct replacement for + would be
sed 's/#[[:alnum:]]\{1,\}/replacement/g'
or
sed 's/#[[:alnum:]][[:alnum:]]*/replacement/g'
GNU sed and recent BSD sed offer "extended regular expression" (ERE) matching:
sed -E 's/#[[:alnum:]]+/replacement/g'
(although with GNU sed you should probably use -r since -E is currently undocumented)
and they also offer \+ as an extension to BRE,
sed 's/#[[:alnum:]]\+/replacement/g'
If you require portability you should stick with the BRE of regular sed.
#user784637 I used [[:alnum:]] instead of [a-zA-Z0-9]. This would also match letters with diacriticals for example.
$ printf "%s\n" ë è é | grep '[a-zA-Z0-9]'
$
vs.
$ printf "%s\n" ë è é | grep '[[:alnum:]]'
ë
è
é
$
You could use either that suits your needs..

On my version of sed, + doesn't do anything useful. You should use * instead.

Related

sed regular expression failed on solaris

Under Solaris 5.10, Why this regexp doesn't match a line like tag="12447"
sed "s/tag=\"[0-9]+\"/emptytag/" test.xml
(I noticed that -r is not implemented in the sed version)
In strict posix mode, the + sign cannot be used to represent "one or more" of something. You can use a range of {1,} instead (escaped of course):
echo 'tag="12447"' | sed --posix "s/tag=\"[0-9]\{1,\}\"/emptytag/"
emptytag
Note that you don't actually need the --posix, I was just using it to disable all GNU extensions in my version of sed:
echo 'tag="12447"' | sed "s/tag=\"[0-9]\{1,\}\"/emptytag/"
emptytag

grep equivalent to pcre's `?` operator

With PCRE you'd do ax?a to find strings like aa and axa.
How would you write a regex for grep that'd do that?
grep default uses BRE, you could use -P (PCRE) or -E (ERE) option.
for example:
kent$ echo "aa
axa
axxxxa"|grep -E 'ax?a'
aa
axa
with BRE, you have to escape chars like ( ? + ... to give them special meaning.
In grep, you need to escape the quantifier:
ax\?a

Why doesn't this simple RegEx work with sed?

This is a really simple RegEx that isn't working, and I can't figure out why. According to this, it should work.
I'm on a Mac (OS X 10.8.2).
script.sh
#!/bin/bash
ZIP="software-1.3-licensetypeone.zip"
VERSION=$(sed 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g' <<< $ZIP)
echo $VERSION
terminal
$ sh script.sh
1.3-licensetypeone.zip
Looking at the regex documentation for OS X 10.7.4 (but should apply to OP's 10.8.2), it is mentioned in the last paragraph that
Obsolete (basic) regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality...
... The parentheses for nested subexpressions are \(' and )'...
sed, without any options, uses basic regular expression (BRE).
To use | in OS X or BSD's sed, you need to enable extended regular expression (ERE) via -E option, i.e.
sed -E 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g'
p/s: \| in BRE is a GNU extension.
Alternative ways to extract version number
chop-chop (parameter expansion)
VERSION=${ZIP#software-}
VERSION=${VERSION%-license*.zip}
sed
VERSION=$(sed 's/software-\(.*\)-license.*/\1/' <<< "$ZIP")
You don't necessarily have to match strings word-by-word with shell patterns or regex.
sed works with simple regular expressions. You have to backslash parentheses and a vertical bar to make it work.
sed 's/software-//g;s/-\(licensetypeone\|licensetypetwo\)\.zip//g'
Note that I backslashed the dot, too. Otherwise, it would have matched any character.
You can do this in the shell, don't need sed, parameter expansion suffices:
shopt -s extglob
ZIP="software-1.3-licensetypeone.zip"
tmp=${ZIP#software-}
VERSION=${tmp%-licensetype#(one|two).zip}
With a recent version of bash (may not ship with OSX) you can use regular expressions
if [[ $ZIP =~ software-([0-9.]+)-licensetype(one|two).zip ]]; then
VERSION=${BASH_REMATCH[1]}
fi
or, if you just want the 2nd word in a hyphen-separated string
VERSION=$(IFS=-; set -- $ZIP; echo $2)
$ man sed | grep "regexp-extended" -A2
-r, --regexp-extended
use extended regular expressions in the script.

How do you use a plus symbol with a character class as part of a regular expression?

in cygwin, this does not return a match:
$ echo "aaab" | grep '^[ab]+$'
But this does return a match:
$ echo "aaab" | grep '^[ab][ab]*$'
aaab
Are the two expressions not identical?
Is there any way to express "one or more characters of the character class" without typing the character class twice (like in the seconds example)?
According to this link the two expressions should be the same, but perhaps Regular-Expressions.info does not cover bash in cygwin.
grep has multiple "modes" of matching, and by default only uses a basic set, which does not recognize a number of metacharacters unless they're escaped. You can put grep into extended or perl modes to let + be evaluated.
From man grep:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly experimental and grep -P may warn of unimplemented features.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax
error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
Alternately, you can use egrep instead of grep -E.
In basic regular expressions the metacharacters ?, +, {, |, (, and )
lose their special meaning; instead use the backslashed versions \?,
\+, \{, \|, \(, and \).
So use the backslashed version:
$ echo aaab | grep '^[ab]\+$'
aaab
Or activate extended syntax:
$ echo aaab | egrep '^[ab]+$'
aaab
Masking by backslash, or egrep as extended grep, alias grep -e:
echo "aaab" | egrep '^[ab]+$'
aaab
echo "aaab" | grep '^[ab]\+$'
aaab

Grep does not show results, online regex tester does

I am fairly unexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
Your regex doesn't match the input. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* doesn't match .
Also, I believe grep and family don't like the \d syntax. Try either [0-9] or [:digit:]
Finally, when using regular expressions, prefer egrep to grep. I don't remember the exact details, but egrep supports more regex operators. Also, in many shells (including bash on OS X as you mentioned, use single quotes instead of double quotes, otherwise * will be expanded by the shell to a list of files in the current directory before grep sees it (and other shell meta-characters will get expanded too). Bash won't touch anything in single quotes.
grep doesn't support \d by defaul. To match a digit, use [0-9], or allow Perl compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
grep uses "basic" regular expressions : (excerpt from man pages )
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also depending on which shell you are executing in the '*' character might get expanded.
You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] inplace of \d and use -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*