Java regex and sed aren't the same...?

Get these strings:
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
Apply this exp in java: ^(06700|067|00)([0-9]*).
My intention is to remove a leading "06700", "067" or "00" from the beginning of the string.
It all works in Java, where group 2 always holds the number I want, but in sed it isn't the same:
$ cat strings|sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
What the heck am I missing?
Cheers,
f.

When using extended regular expressions, you also need to omit the \ before ( and ). This works for me:
sed -r 's/^(06700|067|00)([0-9]*)/\2/g' strings
note also that there's no need for a separate call to cat
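As a minimal sketch of the extended-syntax form applied to a few of the sample strings (the function name strip_prefix is only illustrative, and -E is assumed to be accepted by your sed; older GNU versions document only -r):

```shell
# Strip a leading 06700, 067 or 00 from each input line, using ERE syntax.
# strip_prefix is a hypothetical helper name, not part of sed.
strip_prefix() {
    sed -E 's/^(06700|067|00)([0-9]*)/\2/'
}

printf '%s\n' 00543515703528 0902345623 067913187056 | strip_prefix
# prints:
# 543515703528
# 0902345623
# 913187056
```

The g flag is dropped here because a pattern anchored with ^ can match at most once per line.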

I believe your problem is this:
sed defaults to BRE: The default behaviour of sed is to support Basic Regular Expressions (BRE). To use all the features described on this page set the -r (Linux) or -E (BSD) flag to use Extended Regular Expressions
Source
Without this flag, the | character is interpreted literally. Try this example:
echo "06700|067|0055555" | sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
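To make the result of that experiment explicit, here is a small sketch (the function name bre_literal_pipe is made up for illustration). In basic syntax the group only matches when the input really contains the text 06700|067|00, pipes included:

```shell
# In BRE a bare | is an ordinary character, so the first group matches the
# literal text "06700|067|00" and \2 captures the digits that follow it.
bre_literal_pipe() {
    sed 's/^\(06700|067|00\)\([0-9]*\)/\2/'
}

echo '06700|067|0055555' | bre_literal_pipe   # prints 55555
echo '00543515703528' | bre_literal_pipe      # no match, line printed unchanged
```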

Related

Replace the separator between pairs of numbers

I want to replace all strings like [0-9][0-9]-[0-9][0-9] with [0-9][0-9]/[0-9][0-9] using sed.
In other words, I want to replace - with /.
If I have somewhere in my text:
09-36
32-43
54-65
I want this change:
09/36
32/43
54/65
Using GNU sed:
$ echo '09-36 32-43 54-65' | sed -r 's|\<([0-9]{2})-([0-9]{2})\>|\1/\2|g'
09/36 32/43 54/65
-r turns on extended regular expressions, which:
don't require \-escaping of the ( ) { } characters.
\< and \> (GNU extensions) are used to match only at word boundaries (if the expression should only match full lines, use ^ and $ instead, and omit the g option)
| is used as an alternative regex delimiter so that / can be used without \-escaping.
A BSD/macOS sed solution would look slightly different:
echo '09-36 32-43 54-65' | sed -E 's|[[:<:]]([0-9]{2})-([0-9]{2})[[:>:]]|\1/\2|g'
sed -e 's/\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1\/\2/g'
Might not be the most elegant version, but works for me. The gazillion backslashes make this rather unreadable in my opinion. You might improve the readability by not using / to separate the pattern and the replacement maybe?
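A sketch of that suggestion, keeping basic syntax but using | as the delimiter so the slash in the replacement needs no escaping (swap_separator is a hypothetical name):

```shell
# Same BRE pattern as above, with | instead of / as the s-command delimiter.
swap_separator() {
    sed 's|\([0-9]\{2\}\)-\([0-9]\{2\}\)|\1/\2|g'
}

echo '09-36 32-43 54-65' | swap_separator   # prints 09/36 32/43 54/65
```

Like the original command, this has no word-boundary check, so an input such as 444-44 would also be rewritten (to 444/44).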
perl -C -npe 's/(?<!\d)(\d\d)-(\d\d)(?!\d)/\1\/\2/g' file
Input
维基 1-11 22-33 444-44 55-555 66-66百科
77-77
8 88-88
Output
维基 1-11 22/33 444-44 55-555 66/66百科
77/77
8 88/88
In the command above
-C enables Unicode;
-n causes Perl to process the script for each input line;
-p causes Perl to print the result of the script to the standard output;
-e accepts a Perl expression (particularly, it is a substitution).
In this mode (-npe), Perl works just like sed. The script substitutes each pair of digits separated with - to the same pair separated with a slash.
(?<!\d) and (?!\d) are negative lookaround expressions.
To edit the file in place use -i option: perl -C -i.backup -npe ....
If the input is not a file, you can pass the input to Perl via pipe, e.g.:
echo '维基 1-11 22-33 444-44 55-555 66-66百科' | \
perl -C -npe 's/(?<!\d)(\d\d)-(\d\d)(?!\d)/\1\/\2/g'

-E flag for sed

I seem to remember a -E flag for sed to enable extended regex, then looking in the man page today I see it's absent, however
echo development.properties | sed -E 's/(development.|staging.|qa.|production.)//'
properties
echo development.properties | sed 's/(development.|staging.|qa.|production.)//'
development.properties
echo development.properties | sed -r 's/(development.|staging.|qa.|production.)//'
properties
So it looks to me like -E is doing something, furthermore I'd venture to say it's now an alias for -r, at least that's what I thought -E was for (extended regex). Did it change at some point? It looks like it may still be supported for backwards compatibility, or no?
Also, it sounds like extended regex in sed has nothing to do with ranges like the pipe character inside of parens, so why doesn't the substitution work without either of those flags (-E, -r)?
-E is an alias for -r in your sed version i.e. extended regex support
Without extended regex support, parentheses are matched as literal parentheses and are not used for regex grouping (and an unescaped | is likewise a literal character).
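A small sketch of the difference, using a made-up input string f(x) and illustrative variable names:

```shell
# BRE: ( and ) are literal, so the pattern matches the text "(x)".
bre_result=$(echo 'f(x)' | sed 's/(x)/[x]/')

# ERE: ( and ) form a group, so the pattern matches only the "x".
ere_result=$(echo 'f(x)' | sed -E 's/(x)/[x]/')

echo "$bre_result"   # prints f[x]
echo "$ere_result"   # prints f([x])
```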

Sed regex not matching 'either or' inner group

I would like to match multiple file extensions passed through a pipe using sed and regex.
The following works:
sed '/.\(rb\)\$/!d'
But if I want to allow multiple file extensions, the following does not work.
sed '/.\(rb\|js\)\$/!d'
sed '/.\(rb|js\)\$/!d'
sed '/.(rb|js)\$/!d'
Any ideas on how to do either/or inner groups?
Here is the whole block of code:
#!/bin/sh
files=`git diff-index --check --cached $against | # Find all changed files
sed '/.\(rb\|js\)\$/!d' | # Only process .rb and .js files
uniq` # Remove duplicate files
I am using a Mac OSX 10.8.3 and the previous answer does not work for me, but this does:
sed -E '/\.(rb|js)$/!d'
Note: use -E to
Interpret regular expressions as extended (modern) regular expressions
rather than basic regular expressions (BRE's).
and this enables the OR function |; other versions seem to want the -r flag to enable extended regular expressions.
Note that the initial . must be escaped and the trailing $ must not be.
Try something like this:
sed '/\.\(rb\|js\)$/!d'
or, if your sed supports it, use the -r option to enable extended regular expressions and avoid escaping the special characters.
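For illustration, a sketch of the extended-syntax filter applied to a few invented file names (keep_rb_js is a hypothetical helper; -E is assumed to be available, as it is on BSD/macOS sed and current GNU sed):

```shell
# Delete every line that does not end in .rb or .js.
keep_rb_js() {
    sed -E '/\.(rb|js)$/!d'
}

printf '%s\n' app.rb main.js notes.txt rb Gemfile.rbx | keep_rb_js
# prints:
# app.rb
# main.js
```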

Why doesn't this simple RegEx work with sed?

This is a really simple RegEx that isn't working, and I can't figure out why. According to this, it should work.
I'm on a Mac (OS X 10.8.2).
script.sh
#!/bin/bash
ZIP="software-1.3-licensetypeone.zip"
VERSION=$(sed 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g' <<< $ZIP)
echo $VERSION
terminal
$ sh script.sh
1.3-licensetypeone.zip
Looking at the regex documentation for OS X 10.7.4 (but should apply to OP's 10.8.2), it is mentioned in the last paragraph that
Obsolete (basic) regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality...
... The parentheses for nested subexpressions are `\(' and `\)'...
sed, without any options, uses basic regular expression (BRE).
To use | in OS X or BSD's sed, you need to enable extended regular expression (ERE) via -E option, i.e.
sed -E 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g'
p/s: \| in BRE is a GNU extension.
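Putting this together, a minimal sketch of the script body with -E, also escaping the dot before zip and anchoring both substitutions (the anchors are an addition, not part of the original command):

```shell
ZIP="software-1.3-licensetypeone.zip"

# ERE alternation needs -E; \. matches a literal dot rather than any character.
VERSION=$(printf '%s\n' "$ZIP" |
    sed -E 's/^software-//; s/-(licensetypeone|licensetypetwo)\.zip$//')

echo "$VERSION"   # prints 1.3
```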
Alternative ways to extract version number
chop-chop (parameter expansion)
VERSION=${ZIP#software-}
VERSION=${VERSION%-license*.zip}
sed
VERSION=$(sed 's/software-\(.*\)-license.*/\1/' <<< "$ZIP")
You don't necessarily have to match strings word-by-word with shell patterns or regex.
By default, sed works with basic regular expressions. You have to backslash the parentheses and the vertical bar to make it work (note that \| is a GNU extension):
sed 's/software-//g;s/-\(licensetypeone\|licensetypetwo\)\.zip//g'
Note that I backslashed the dot, too. Otherwise, it would have matched any character.
You can do this in the shell, don't need sed, parameter expansion suffices:
shopt -s extglob
ZIP="software-1.3-licensetypeone.zip"
tmp=${ZIP#software-}
VERSION=${tmp%-licensetype@(one|two).zip}
With a recent version of bash (may not ship with OSX) you can use regular expressions
if [[ $ZIP =~ software-([0-9.]+)-licensetype(one|two).zip ]]; then
VERSION=${BASH_REMATCH[1]}
fi
or, if you just want the 2nd word in a hyphen-separated string
VERSION=$(IFS=-; set -- $ZIP; echo $2)
$ man sed | grep "regexp-extended" -A2
-r, --regexp-extended
use extended regular expressions in the script.

Converting Files with regexp Pattern in sed

I want to turn this (Mitarbeiter.csv):
Max;Mustermann;02.03.1964;501;GL;Prokurist
Monika;Mueller;02.02.1972;500;Sek;Chefsekretaerin
Michael;Maier;06.07.1985;617;Aquise;-
into this (header-content.html):
<tr><td>Max</td><td>Mustermann</td><td>501</td></tr>
<tr><td>Monika</td><td>Mueller</td><td>500</td></tr>
<tr><td>Michael</td><td>Maier</td><td>617</td></tr>
by using sed
I've tried:
sed 's#^\([^\]+\);\([^\]+\);[^\]+;\([^\]+\);.*$#<tr><td>\2</td><td>\1</td><td>\3</td></tr>\n#g' <Mitarbeiter.csv >header-content.html
but that does nothing. Output is same as Mitarbeiter.csv
awk might be a little better suited to what you're trying to do:
awk -F\; '{printf "<tr><td>%s</td><td>%s</td><td>%s</td></tr>\n",$1,$2,$4}'
sed -r -ne 's:^([^;]+);([^;]+);[^;]+;([^;]+);.*:<tr><td>\1</td><td>\2</td><td>\3</td></tr>:p'
Or if you're using OSX or an older version of FreeBSD or NetBSD, replace the -r with -E to use extended regular expressions.
If you want to skip using ERE for portability (i.e. you're using Solaris or HP/UX or somesuch), the regexp might be:
^\([^;][^;]*\);\([^;][^;]*\);[^;]*;\([^;][^;]*\);.*
Note that these both require at least 1 character per field. If fields are allowed to be empty ... well, update your question before we spend more time on things that might not be necessary. :-)
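If empty fields did turn out to be allowed, one possible variant is to replace + with *. This is only a sketch under that assumption; the function name csv_to_rows and the second sample row with an empty date field are invented for illustration:

```shell
# [^;]* also matches an empty field; -n plus the p flag prints only
# the lines on which the substitution succeeded.
csv_to_rows() {
    sed -E -n 's:^([^;]*);([^;]*);[^;]*;([^;]*);.*:<tr><td>\1</td><td>\2</td><td>\3</td></tr>:p'
}

printf '%s\n' 'Max;Mustermann;02.03.1964;501;GL;Prokurist' \
              'Michael;Maier;;617;Aquise;-' | csv_to_rows
# prints:
# <tr><td>Max</td><td>Mustermann</td><td>501</td></tr>
# <tr><td>Michael</td><td>Maier</td><td>617</td></tr>
```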
A few points,
you need the -r switch for extended regex patterns
Sed is greedy, and even -r does not support non greedy matching
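The g flag means "global": it replaces every match on a line instead of only the first. Since your pattern is anchored with ^ and $, it can match only once per line, so you don't need it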
So your command should be:
sed -r 's#^([^\;]+);([^\;]+);[^\;]+;([^\;]+);.*$#<tr><td>\1</td><td>\2</td><td>\3</td></tr>#' < Mitarbeiter.csv > header-content.html
Note that your items cannot have a semicolon in them, as that is the field separator. If you have a true CSV file, this won't work, as it will not ignore an escaped semicolon, whether it is wrapped in quotes or preceded by an escape character.
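As a side note on the g flag from the list above, a tiny sketch on a made-up input shows what it changes in an s command (variable names are illustrative):

```shell
# Without g only the first match on the line is replaced.
first_only=$(echo 'a;b;c' | sed 's/;/,/')

# With g every match on the line is replaced.
all_matches=$(echo 'a;b;c' | sed 's/;/,/g')

echo "$first_only"    # prints a,b;c
echo "$all_matches"   # prints a,b,c
```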
Why would you want to use sed?
awk '{print "<tr><td>"$1"</td><td>"$2"</td><td>"$4"</td></tr>"}' FS=';' Mitarbeiter.csv > header-content.html
If you insist on using sed, you can try:
$ p='\([^;]*\);'
$ sed "s#$p$p$p$p.*#<tr><td>\1</td><td>\2</td><td>\4</td></tr>#" \
Mitarbeiter.csv > header-content.html