Regex code tested but not matching when used with sed - regex

Why does the following
$ echo "test\|String" | sed 's/(?!\||\\)./&./g'
test\|String
Not produce:
t.e.s.t.\|S.t.r.i.n.g.
I have tested the regex code and it picks the string up correctly in my tests
I want it to do the following:
$ echo "test\|String" | sed 's/(?!\||\\)./&./g'
t.e.s.t.\|S.t.r.i.n.g.

I would have written what you describe like this:
echo "test\|String" | sed 's/[^\\|]/&./g'

The following produces the effect you want:
echo "test\|String" | perl -npe 's/(?!\||\\)./$&./g'
t.e.s.t.\|S.t.r.i.n.g.
Not all regular expression engines are the same. You are using something close to PCRE (Perl-Compatible Regular Expressions), but the one used in sed is older, more limited, and has some quirks, like using \( and \) to match groups, rather than ( and ). It doesn't know about negative lookahead assertions. Perl's -n and -p flags give it sed-like behavior. Note finally that Perl uses $& to indicate "entire matched expression", not & alone.
The regular expression engines used by the shell, by grep, by sed, by Emacs, by POSIX, and by PCRE are all different; unless your regex is quite simple, you will need to target the specific engine.
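As a small illustration of the grouping quirk described above, the sketch below runs the same substitution in both of sed's dialects. The key=value input is made up for the example, and -E is assumed to be available (it is the spelling accepted by both GNU and BSD sed for extended regular expressions).

```shell
# Hypothetical input, not taken from the question.
line='key=value'

# Basic regular expressions (sed's default): groups are written \( and \).
bre=$(printf '%s\n' "$line" | sed 's/\(.*\)=\(.*\)/\2=\1/')

# Extended regular expressions (-E): groups are written ( and ).
ere=$(printf '%s\n' "$line" | sed -E 's/(.*)=(.*)/\2=\1/')

# Both commands swap the two halves around the equals sign.
printf '%s\n%s\n' "$bre" "$ere"
```

Neither dialect understands (?!...), which is why the original command left the input unchanged.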

Related

PCRE Regex to SED

I am trying to take PCRE regex and use it in SED, but I'm running into some issues. Please note that this question is representative of a bigger issue (how to convert PCRE regex to work with SED) so the question is not simply about the example below, but about how to use PCRE regex in SED regex as a whole.
This example is extracting an email address from a line, and replacing it with "[emailaddr]".
echo "My email is abc#example.com" | sed -e 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g'
I've tried the following replace regex:
([a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4})
[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}
([a-zA-Z0-9]+[#][a-zA-Z0-9]+[.][A-Za-z]{2,4})
[a-zA-Z0-9]+[#][a-zA-Z0-9]+[.][A-Za-z]{2,4}
I've tried changing the delimiter of sed from s/find/replace/g to s|find|replace|g as outlined here (stack overflow: pcre regex to sed regex).
I am still not able to figure out how to use PCRE regex in SED, or how to convert PCRE regex to SED. Any help would be great.
Want PCRE (Perl Compatible Regular Expressions)? Why don't you use perl instead?
perl -pe 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g' \
<<< "My email is abc#example.com"
Output:
My email is [emailaddr]
Write output to a file with tee:
perl -pe 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g' \
<<< "My email is abc#example.com" | tee /path/to/file.txt > /dev/null
Use the -r flag to enable extended regular expressions (-E instead of -r on OS X):
echo "My email is abc#example.com" | sed -r 's/[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[A-Za-z]{2,4}/[emailaddr]/g'
Ideone Demo
GNU sed uses basic regular expressions or, with the -r flag, extended regular expressions.
Your regex as a POSIX basic regex (thanks mklement0):
[[:alnum:]]\{1,\}#[[:alnum:]]\{1,\}\.[[:alpha:]]\{2,4\}
Note that this expression will not match all email addresses (not by a long shot).
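To compare the two dialects side by side, here is a minimal sketch that applies the basic form above with plain sed and an equivalent extended form with -E. The variable names are only for illustration, and the sample line is the one from the question.

```shell
line='My email is abc#example.com'

# POSIX basic regex: intervals are written \{m,n\}, and there is no + operator.
bre=$(printf '%s\n' "$line" |
  sed 's/[[:alnum:]]\{1,\}#[[:alnum:]]\{1,\}\.[[:alpha:]]\{2,4\}/[emailaddr]/g')

# Extended regex (-E): + and {m,n} work without backslashes.
ere=$(printf '%s\n' "$line" |
  sed -E 's/[[:alnum:]]+#[[:alnum:]]+\.[[:alpha:]]{2,4}/[emailaddr]/g')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands print "My email is [emailaddr]".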
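For multiline matching, use the -0 switch so that the whole input is read as one record (assuming it contains no NUL bytes):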
perl -0pe 's/search/replace/gms' file
Sometimes this might be helpful too as a work-around:
str=$(grep -Poh "pcre-pattern" file)
sed -i "s/$str/$something_else/" file
-o, --only-matching:
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Why doesn't this simple RegEx work with sed?

This is a really simple RegEx that isn't working, and I can't figure out why. According to this, it should work.
I'm on a Mac (OS X 10.8.2).
script.sh
#!/bin/bash
ZIP="software-1.3-licensetypeone.zip"
VERSION=$(sed 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g' <<< $ZIP)
echo $VERSION
terminal
$ sh script.sh
1.3-licensetypeone.zip
Looking at the regex documentation for OS X 10.7.4 (but should apply to OP's 10.8.2), it is mentioned in the last paragraph that
Obsolete (basic) regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality...
... The parentheses for nested subexpressions are `\(' and `\)'...
sed, without any options, uses basic regular expression (BRE).
To use | in OS X or BSD's sed, you need to enable extended regular expression (ERE) via -E option, i.e.
sed -E 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g'
p/s: \| in BRE is a GNU extension.
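Putting the above together, a minimal sketch of the script's substitution with -E. Escaping the dot and anchoring both patterns are small additions made here for the example, not part of the original command.

```shell
ZIP='software-1.3-licensetypeone.zip'

# -E enables extended regular expressions, so ( ) and | are operators.
# The dot is escaped so that it only matches a literal ".".
VERSION=$(printf '%s\n' "$ZIP" |
  sed -E 's/^software-//; s/-(licensetypeone|licensetypetwo)\.zip$//')

printf '%s\n' "$VERSION"
```

This prints 1.3 with both GNU and BSD sed, since neither relies on the \| extension here.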
Alternative ways to extract version number
chop-chop (parameter expansion)
VERSION=${ZIP#software-}
VERSION=${VERSION%-license*.zip}
sed
VERSION=$(sed 's/software-\(.*\)-license.*/\1/' <<< "$ZIP")
You don't necessarily have to match strings word-by-word with shell patterns or regex.
By default sed works with basic regular expressions. With GNU sed you have to backslash the parentheses and the vertical bar to make it work (\| is a GNU extension, so this will not work with BSD sed on OS X):
sed 's/software-//g;s/-\(licensetypeone\|licensetypetwo\)\.zip//g'
Note that I backslashed the dot, too. Otherwise, it would have matched any character.
You can do this in the shell, don't need sed, parameter expansion suffices:
shopt -s extglob
ZIP="software-1.3-licensetypeone.zip"
tmp=${ZIP#software-}
VERSION=${tmp%-licensetype@(one|two).zip}
With a recent version of bash (may not ship with OSX) you can use regular expressions
if [[ $ZIP =~ software-([0-9.]+)-licensetype(one|two).zip ]]; then
VERSION=${BASH_REMATCH[1]}
fi
or, if you just want the 2nd word in a hyphen-separated string
VERSION=$(IFS=-; set -- $ZIP; echo $2)
$ man sed | grep "regexp-extended" -A2
-r, --regexp-extended
use extended regular expressions in the script.

Regular Expression and sed command: echo 1234567890|sed 's/((?<=\d)\d{3})*\b/TEST/'

The source is 1234567890. The regular expression ((?<=\d)\d{3})*\b can match 234567890, so I think sed should replace 234567890 with TEST, but the result is 1234567890. Why?
Look-behind is not supported by sed.
You could try ssed (super sed); it supports a Perl mode (-R), so you can pass Perl-style regexes to it, e.g. look-ahead/look-behind.
See the feature list:
https://launchpad.net/ssed
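You need to rewrite the regex without lookbehind. sed also lacks \d and (?:...), so with GNU sed -E (which does support \b) it becomes:
sed -E 's/([0-9])([0-9]{3})*\b/\1TEST/'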
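The general technique is to capture what the look-behind would only have asserted and then put it back in the replacement. The sketch below assumes the number ends the line, so it swaps in the portable $ anchor for \b, and writes [0-9] instead of \d.

```shell
number='1234567890'

# Capture the digit that the look-behind (?<=\d) would have checked for,
# then re-emit it as \1 so that only the following groups of three digits
# are replaced. $ stands in for \b because the number ends the line.
result=$(printf '%s\n' "$number" | sed -E 's/([0-9])([0-9]{3})*$/\1TEST/')

printf '%s\n' "$result"
```

This prints 1TEST: the leading 1 is kept and 234567890 is replaced.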

Java regex and sed aren't the same...?

Get these strings:
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
Apply this exp in java: ^(06700|067|00)([0-9]*).
My intention is to remove a leading "06700", "067" or "00" from the beginning of the string.
It is all cool in Java, group 2 always has the number I intend, but in sed it isn't the same:
$ cat strings|sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
What the heck am I missing?
Cheers,
f.
When using extended regular expressions, you also need to omit the \ before ( and ). This works for me:
sed -r 's/^(06700|067|00)([0-9]*)/\2/g' strings
note also that there's no need for a separate call to cat
I believe your problem is this:
sed defaults to BRE: The default behaviour of sed is to support Basic Regular Expressions (BRE). To use all the features described on this page set the -r (Linux) or -E (BSD) flag to use Extended Regular Expressions
Source
Without this flag, the | character is interpreted literally. Try this example:
echo "06700|067|0055555" | sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
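To make the contrast concrete, here is a short sketch that runs the extended form on three of the sample strings and the basic form on the literal-pipe input above. The variable names are only for illustration.

```shell
# Three of the sample strings from the question.
strings='00543515703528
0902345623
067913187056'

# With -E, | means alternation, so the leading 06700, 067 or 00 is removed.
stripped=$(printf '%s\n' "$strings" |
  sed -E 's/^(06700|067|00)([0-9]*)/\2/')

# Without -E, | is an ordinary character: the first group only matches the
# literal text "06700|067|00", which leaves 55555 in the second group.
literal=$(printf '%s\n' '06700|067|0055555' |
  sed 's/^\(06700|067|00\)\([0-9]*\)/\2/')

printf '%s\n---\n%s\n' "$stripped" "$literal"
```

The second string has none of the prefixes, so it passes through unchanged.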

Grep does not show results, online regex tester does

I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
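Your regex does match the input in an engine that understands \d. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* matches zero letters in the first line (and m in the second), so \. can then match the .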
Also, I believe grep and family don't like the \d syntax. Try either [0-9] or [[:digit:]]
Finally, when using regular expressions, prefer egrep (or its modern spelling, grep -E) to grep: it uses extended regular expressions, so operators such as +, ?, {, | and ( work without backslashes. Also, in many shells (including bash on OS X, as you mentioned), use single quotes around the pattern. An unquoted * can be expanded by the shell to a list of files in the current directory before grep sees it, and inside double quotes the shell still interprets $, backticks and some backslashes. Bash won't touch anything in single quotes.
grep doesn't support \d by default. To match a digit, use [0-9], or enable Perl-compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
grep uses "basic" regular expressions (excerpt from the man page):
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also depending on which shell you are executing in the '*' character might get expanded.
You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] inplace of \d and use -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
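The same check can be run against both sample lines from the question without relying on -P. This is a sketch assuming a grep that supports -o (GNU and BSD grep both do); the xml and ids variable names are only for illustration.

```shell
# The two sample lines from the question.
xml='<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>'

# -E: extended regular expressions, so {4} needs no backslashes.
# -o: print only the matched part of each line.
# [0-9] replaces \d, which grep does not understand without -P.
ids=$(printf '%s\n' "$xml" | grep -Eo '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*')

printf '%s\n' "$ids"
```

This prints g1234.ab012345 and g5678m.ab678901, one per line. The single quotes keep the shell from touching the * and \ characters in the pattern.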
Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*