Why doesn't this simple RegEx work with sed?

This is a really simple RegEx that isn't working, and I can't figure out why. According to this, it should work.
I'm on a Mac (OS X 10.8.2).
script.sh
#!/bin/bash
ZIP="software-1.3-licensetypeone.zip"
VERSION=$(sed 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g' <<< $ZIP)
echo $VERSION
terminal
$ sh script.sh
1.3-licensetypeone.zip

Looking at the regex documentation for OS X 10.7.4 (but should apply to OP's 10.8.2), it is mentioned in the last paragraph that
Obsolete (basic) regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality...
... The parentheses for nested subexpressions are `\(' and `\)'...
sed, without any options, uses basic regular expression (BRE).
To use | in OS X or BSD's sed, you need to enable extended regular expression (ERE) via -E option, i.e.
sed -E 's/software-//g;s/-(licensetypeone|licensetypetwo)\.zip//g'
p/s: \| in BRE is a GNU extension.
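To see the difference side by side, here is a minimal sketch that runs the same script once as BRE and once as ERE. The variable names bre and ere are only for illustration, and printf with a pipe stands in for the here-string so the snippet also runs in a plain POSIX sh.

```shell
# Input taken from the question.
ZIP="software-1.3-licensetypeone.zip"

# BRE (default): ( ) and | are ordinary characters, so the second
# substitution looks for a literal "(licensetypeone|licensetypetwo)".
bre=$(printf '%s\n' "$ZIP" | sed 's/software-//;s/-(licensetypeone|licensetypetwo)\.zip//')

# ERE (-E): ( ) group and | alternates, so the suffix is removed.
ere=$(printf '%s\n' "$ZIP" | sed -E 's/software-//;s/-(licensetypeone|licensetypetwo)\.zip//')

echo "BRE: $bre"
echo "ERE: $ere"
```

Only the ERE run should strip the license suffix; the BRE run leaves it in place, which matches the output shown in the question.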
Alternative ways to extract version number
chop-chop (parameter expansion)
VERSION=${ZIP#software-}
VERSION=${VERSION%-license*.zip}
sed
VERSION=$(sed 's/software-\(.*\)-license.*/\1/' <<< "$ZIP")
You don't necessarily have to match strings word-by-word with shell patterns or regex.
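As a rough sketch, the parameter expansion approach can be wrapped in a small helper. The function name extract_version and the second file name are made up for this example.

```shell
# extract_version is a hypothetical helper, not part of the original script.
extract_version() {
    v=${1#software-}                      # drop the leading "software-"
    printf '%s\n' "${v%-license*.zip}"    # drop the trailing "-license<anything>.zip"
}

extract_version "software-1.3-licensetypeone.zip"
extract_version "software-2.0.1-licensetypetwo.zip"   # hypothetical second file
```

Because the suffix pattern is a shell glob rather than a regex, both license types are covered without spelling them out.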

sed uses basic regular expressions by default. With GNU sed, you have to backslash the parentheses and the vertical bar to make it work (\| is a GNU extension, so this will not work with the BSD sed on OS X).
sed 's/software-//g;s/-\(licensetypeone\|licensetypetwo\)\.zip//g'
Note that I backslashed the dot, too. Otherwise, it would have matched any character.

You can do this in the shell, don't need sed, parameter expansion suffices:
shopt -s extglob
ZIP="software-1.3-licensetypeone.zip"
tmp=${ZIP#software-}
VERSION=${tmp%-licensetype@(one|two).zip}
With a recent version of bash (may not ship with OSX) you can use regular expressions
if [[ $ZIP =~ software-([0-9.]+)-licensetype(one|two).zip ]]; then
VERSION=${BASH_REMATCH[1]}
fi
or, if you just want the 2nd word in a hyphen-separated string
VERSION=$(IFS=-; set -- $ZIP; echo $2)

$ man sed | grep "regexp-extended" -A2
-r, --regexp-extended
use extended regular expressions in the script.

Related

Regex code tested but not matching when used with sed

Why does the following
$ echo "test\|String" | sed 's/(?!\||\\)./&./g'
test\|String
Not produce:
t.e.s.t.\|S.t.r.i.n.g.
I have tested the regex code and it picks the string up correctly in my tests
I want it to do the following:
$ echo "test\|String" | sed 's/(?!\||\\)./&./g'
t.e.s.t.\|S.t.r.i.n.g.
I would have written what you describe like this:
echo "test\|String" | sed 's/[^\\|]/&./g'
The following produces the effect you want:
echo "test\|String" | perl -npe 's/(?!\||\\)./$&./g'
t.e.s.t.\|S.t.r.i.n.g.
Not all regular expression engines are the same. You are using something close to PCRE (Perl-Compatible Regular Expressions), but the one used in sed is older, more limited, and has some quirks, like using \( and \) to match groups, rather than ( and ). It doesn't know about negative lookahead assertions. Perl's -n and -p flags give it sed-like behavior. Note finally that Perl uses $& to indicate "entire matched expression", not & alone.
The regular expression engines used by the shell, by grep, by sed, by Emacs, by POSIX, and by PCRE are all different; unless your regex is quite simple, you will need to target the specific engine.

Retrieve value of attribute in bash

I have a list of lines:
<some_random_text="someval" my_val_="0.4" some_random_text_1="someval_">
<some_random_text="someval" my_val_="0.8" some_random_text_1="someval_">
<some_random_text="someval" my_val_="1.2" some_random_text_1="someval_">
and so on.
From each line, I want to return the numeric value given after my_val_. How can I do this in bash?
Within this very rigid structure, what you want to do is quite easy using sed:
sed 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/' file
or using extended regular expressions:
sed -r 's/.*my_val_="([0-9.]+)".*/\1/' file
This captures the part you're interested in (the digits and dots between the quotes) and uses them to replace the contents of the line.
As mentioned in the comments (thanks), the switch to enable extended regular expressions differs between versions of sed. Out of habit, I tend to use -r but some implementations (such as BSD sed on OSX) work with -E instead. Others work with either -r or -E but neither option is defined by the standard.
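If a script has to run against both kinds of sed, one option is to probe for the switch first. This is only a sketch: ERE_FLAG and val are names made up here, and the sample line is a shortened, hypothetical version of the input above.

```shell
# Probe which ERE switch the local sed accepts (ERE_FLAG is a hypothetical name).
if printf 'x\n' | sed -E 's/(x)/\1/' >/dev/null 2>&1; then
    ERE_FLAG=-E
elif printf 'x\n' | sed -r 's/(x)/\1/' >/dev/null 2>&1; then
    ERE_FLAG=-r
else
    ERE_FLAG=
fi

line='<a="b" my_val_="0.4" c="d">'   # shortened sample line
if [ -n "$ERE_FLAG" ]; then
    val=$(printf '%s\n' "$line" | sed "$ERE_FLAG" 's/.*my_val_="([0-9.]+)".*/\1/')
else
    # Fall back to the basic-regex form from the first command above.
    val=$(printf '%s\n' "$line" | sed 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/')
fi
echo "$val"
```

In practice the basic-regex form alone is often the simpler choice, since it needs no probing at all.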
This could also be done in native bash (although I wouldn't recommend it...):
re='my_val_="([0-9.]+)"'
while read -r line; do
[[ $line =~ $re ]] && echo "${BASH_REMATCH[1]}"
done < file
=~ is the regex match operator. The captured digits and dots are stored in element 1 of the special array BASH_REMATCH.
The sed and bash approaches are subtly different, as the sed version will print all lines in the file, even if they don't match the pattern. If this is a problem, you can add the -n switch and a p at the end of the command to print only the matching lines:
sed -nr 's/.*my_val_="([0-9.]+)".*/\1/p' file
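A small sketch of that difference, using two lines from the question plus one made-up line that has no my_val_ attribute. It uses -E rather than -r, on the assumption that the local sed accepts it (recent GNU and BSD versions do); the variable names are illustrative.

```shell
# Two lines from the question and one hypothetical line without my_val_.
input='<some_random_text="someval" my_val_="0.4" some_random_text_1="someval_">
<no_value_here="someval">
<some_random_text="someval" my_val_="1.2" some_random_text_1="someval_">'

# Without -n, the non-matching line passes through unchanged.
all_lines=$(printf '%s\n' "$input" | sed -E 's/.*my_val_="([0-9.]+)".*/\1/')

# With -n and the p flag, only lines where the substitution succeeded are printed.
matches_only=$(printf '%s\n' "$input" | sed -nE 's/.*my_val_="([0-9.]+)".*/\1/p')

printf 'without -n:\n%s\n\nwith -n and p:\n%s\n' "$all_lines" "$matches_only"
```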
With grep:
grep -oP 'my_val_="\K[^"]*' filename
-o so that grep prints only the match, -P so that Perl-compatible regexes are used.
The \K in the regex removes from the match everything that was matched by the part of the regex that came before it; this has the effect of a lookbehind: only non-quote characters that come directly after my_val_=" are matched.

the use of "+" in sed regular expression

echo "  lfj"|sed -e "s/^\s+//g"
  lfj
I remember + indicates one or more, so with two spaces in front of lfj, why can't it strip them?
You need to escape the +.
$ echo "  lfj"|sed -e "s/^\s\+//g"
lfj
By default, sed uses BRE (Basic Regular Expressions). In BRE an unescaped + is a literal character; to make it repeat the previous character one or more times, you need to escape it (\+ is a GNU extension).
To make sed use ERE instead of BRE, enable the extended regexp option -r.
$ echo "  lfj"|sed -r "s/^\s+//g"
lfj
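For scripts that should not depend on GNU extensions, a possible sketch is to spell "one or more" as \{1,\} in BRE and to use [[:space:]] instead of \s. The variable names are illustrative, and -E is assumed to be accepted by the local sed.

```shell
input="  lfj"   # two leading spaces, as in the question

# BRE with an unescaped +: the + is literal, so nothing matches.
unescaped=$(printf '%s\n' "$input" | sed 's/^[[:space:]]+//')

# BRE with a POSIX interval expression instead of \+.
bre=$(printf '%s\n' "$input" | sed 's/^[[:space:]]\{1,\}//')

# ERE: + works without a backslash.
ere=$(printf '%s\n' "$input" | sed -E 's/^[[:space:]]+//')

# Brackets make any remaining leading whitespace visible.
echo "unescaped: [$unescaped]"
echo "bre:       [$bre]"
echo "ere:       [$ere]"
```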
Use -r
echo "  lfj"|sed -re "s/^\s+//g"
From http://www.grymoire.com/Unix/Sed.html
A quick comment. The original sed did not support the "+" metacharacter. GNU sed does if
you use the "-r" command line option, which enables extended regular expressions. The "+"
means "one or more matches".
also from https://www.gnu.org/software/sed/manual/sed.html
-r
--regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps
are those that egrep accepts; they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not portable.

PCRE Regex to SED

I am trying to take PCRE regex and use it in SED, but I'm running into some issues. Please note that this question is representative of a bigger issue (how to convert PCRE regex to work with SED) so the question is not simply about the example below, but about how to use PCRE regex in SED regex as a whole.
This example is extracting an email address from a line, and replacing it with "[emailaddr]".
echo "My email is abc#example.com" | sed -e 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g'
I've tried the following replace regex:
([a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4})
[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}
([a-zA-Z0-9]+[#][a-zA-Z0-9]+[.][A-Za-z]{2,4})
[a-zA-Z0-9]+[#][a-zA-Z0-9]+[.][A-Za-z]{2,4}
I've tried changing the delimiter of sed from s/find/replace/g to s|find|replace|g as outlined here (stack overflow: pcre regex to sed regex).
I am still not able to figure out how to use PCRE regex in SED, or how to convert PCRE regex to SED. Any help would be great.
Want PCRE (Perl Compatible Regular Expressions)? Why don't you use perl instead?
perl -pe 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g' \
<<< "My email is abc#example.com"
Output:
My email is [emailaddr]
Write output to a file with tee:
perl -pe 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g' \
<<< "My email is abc#example.com" | tee /path/to/file.txt > /dev/null
Use the -r flag to enable extended regular expressions (on OS X, use -E instead of -r).
echo "My email is abc#example.com" | sed -r 's/[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[A-Za-z]{2,4}/[emailaddr]/g'
GNU sed uses basic regular expressions or, with the -r flag, extended regular expressions.
Your regex as a POSIX basic regex (thanks mklement0):
[[:alnum:]]\{1,\}#[[:alnum:]]\{1,\}\.[[:alpha:]]\{2,4\}
Note that this expression will not match all email addresses (not by a long shot).
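As a sketch of what the conversion looks like in practice, the following runs the sample line through both the basic and the extended form. The variable names are made up, and -E is assumed to be supported by the local sed. PCRE features with no POSIX counterpart, such as lookaround or lazy quantifiers, cannot be translated this way.

```shell
# Sample line from the question.
line='My email is abc#example.com'

# POSIX basic regex: intervals are written \{m,n\}.
bre=$(printf '%s\n' "$line" |
    sed 's/[[:alnum:]]\{1,\}#[[:alnum:]]\{1,\}\.[[:alpha:]]\{2,4\}/[emailaddr]/g')

# Extended regex (-E): + and {m,n} need no backslashes.
ere=$(printf '%s\n' "$line" |
    sed -E 's/[[:alnum:]]+#[[:alnum:]]+\.[[:alpha:]]{2,4}/[emailaddr]/g')

echo "$bre"
echo "$ere"
```

Both commands should print the same line with the address replaced by [emailaddr].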
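For multiline matches, use Perl's -0 switch, which makes it read the input as a single record (as long as it contains no NUL bytes):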
perl -0pe 's/search/replace/gms' file
Sometimes this might be helpful too as a work-around:
str=$(grep -Poh "pcre-pattern" file)
sed -i "s/$str/$something_else/" file
-o, --only-matching:
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Java regex and sed aren't the same...?

Get these strings:
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
Apply this exp in java: ^(06700|067|00)([0-9]*).
My intention is to remove the leading "06700", "067" and "00" from the beginning of each string.
It all works in Java, where group 2 always holds the number I want, but in sed the result isn't the same:
$ cat strings|sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
What the heck am I missing?
Cheers,
f.
When using extended regular expressions, you also need to omit the \ before ( and ). This works for me:
sed -r 's/^(06700|067|00)([0-9]*)/\2/g' strings
Note also that there's no need for a separate call to cat.
I believe your problem is this:
sed defaults to BRE: The default
behaviour of sed is to support Basic
Regular Expressions (BRE). To use all
the features described on this page
set the -r (Linux) or -E (BSD) flag to
use Extended Regular Expressions
Source
Without this flag, the | character is interpreted literally. Try this example:
echo "06700|067|0055555" | sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
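For comparison, here is a sketch of the same substitution with extended regular expressions, run on a few of the numbers from the question. It uses -E on the assumption that the local sed accepts it; the variable names are illustrative.

```shell
# A few of the sample numbers from the question.
numbers='00543515703528
0034911320020
0902345623
067913187056'

# With -E, ( ) group and | alternates, so the prefixes are stripped.
stripped=$(printf '%s\n' "$numbers" | sed -E 's/^(06700|067|00)([0-9]*)/\2/')

printf '%s\n' "$stripped"
```

The third number has none of the three prefixes, so it should come through unchanged.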