The use of "+" in sed regular expressions

echo "  lfj"|sed -e "s/^\s+//g"
  lfj
I remember that + means one or more. There are two white spaces in front of lfj, so why doesn't it strip them?

You need to escape the +.
$ echo "  lfj"|sed -e "s/^\s\+//g"
lfj
By default, sed uses BRE (Basic Regular Expressions). In BRE an unescaped + is a literal plus sign; to make it repeat the previous item one or more times in GNU sed, you need to escape it as \+.
Alternatively, enable the extended regexp option -r to make sed use ERE instead of BRE, where + works unescaped.
$ echo "  lfj"|sed -r "s/^\s+//g"
lfj
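As a rough sketch of the difference, the snippet below runs the same substitution three ways. It assumes [[:space:]] as a stand-in for \s (which is a GNU extension) and uses the interval \{1,\} and the -E flag so that it should also run on non-GNU sed; the variable names are only for illustration.

```shell
# Strip leading whitespace from a sample string in three ways.
input='  lfj'

# Portable BRE: the interval \{1,\} means "one or more".
bre=$(printf '%s\n' "$input" | sed 's/^[[:space:]]\{1,\}//')

# ERE: + is a metacharacter once -E is given.
ere=$(printf '%s\n' "$input" | sed -E 's/^[[:space:]]+//')

# Unescaped + in BRE is a literal plus sign, so nothing is stripped here.
literal=$(printf '%s\n' "$input" | sed 's/^[[:space:]]+//')

# Brackets make any remaining leading whitespace visible.
printf '[%s]\n[%s]\n[%s]\n' "$bre" "$ere" "$literal"
```

The third line of output keeps its two leading spaces, which is the behaviour the question ran into.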

Use -r
echo "  lfj"|sed -re "s/^\s+//g"
From http://www.grymoire.com/Unix/Sed.html
A quick comment. The original sed did not support the "+" metacharacter. GNU sed does if
you use the "-r" command line option, which enables extended regular expressions. The "+"
means "one or more matches".
also from https://www.gnu.org/software/sed/manual/sed.html
-r
--regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps
are those that egrep accepts; they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not portable.

Related

substitute single quotes in sed and perl

Could someone please explain what was happening with these two commands? Why do sed and perl give different results running the same regular expression pattern:
# echo "'" | sed -e "s/\'/\'/"
''
# echo "'" | perl -pe "s/\'/\'/"
'
# sed --version
sed (GNU sed) 4.5
You're using GNU sed, right? \' is an extension that acts as an anchor for end-of-string in GNU's implementation of basic regular expressions. So you're seeing two quotes in the output because the s matches the end of the line and adds a quote, after the one that was already in the line.
To make it a bit more obvious:
echo foo | sed -e "s/\'/#/"
produces
foo#
Documented here, and in the GNU sed manual
Edit: The equivalent in perl is \Z (or maybe \z depending on how you want to handle a trailing newline). Since \' isn't a special sequence in perl regular expressions, it just matches a literal quote. As mentioned in the other answer and comments, escaping a single quote inside a double quoted string isn't necessary, and as you've found, can potentially result in unintended behavior.
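A minimal sketch of the two intents, assuming the goal was either to match a literal quote or to anchor at the end of the line. It avoids the GNU-only \' by using the standard $ anchor; the variable names are made up for the example.

```shell
# A literal single quote needs no backslash inside a double-quoted shell string.
quote_replaced=$(printf "%s\n" "'" | sed "s/'/X/")

# To append at the end of the line portably, anchor with $ instead of GNU's \'.
appended=$(printf '%s\n' foo | sed 's/$/#/')

printf '%s\n%s\n' "$quote_replaced" "$appended"
```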

Retrieve value of attribute in bash

I have a list of lines:
<some_random_text="someval" my_val_="0.4" some_random_text_1="someval_">
<some_random_text="someval" my_val_="0.8" some_random_text_1="someval_">
<some_random_text="someval" my_val_="1.2" some_random_text_1="someval_">
and so on.
From each line, I want to return the numeric value given after my_val_. How can I do this in bash?
Within this very rigid structure, what you want to do is quite easy using sed:
sed 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/' file
or using extended regular expressions:
sed -r 's/.*my_val_="([0-9.]+)".*/\1/' file
This captures the part you're interested in (the digits and dots between the quotes) and uses them to replace the contents of the line.
As mentioned in the comments (thanks), the switch to enable extended regular expressions differs between versions of sed. Out of habit, I tend to use -r but some implementations (such as BSD sed on OSX) work with -E instead. Others work with either -r or -E but neither option is defined by the standard.
This could also be done in native bash (although I wouldn't recommend it...):
re='my_val_="([0-9.]+)"'
while read -r line; do
[[ $line =~ $re ]] && echo "${BASH_REMATCH[1]}"
done < file
=~ is the regex match operator. The captured digits and dots are stored in element 1 of the special array BASH_REMATCH.
The sed and bash approaches are subtly different, as the sed version will print all lines in the file, even if they don't match the pattern. If this is a problem, you can add the -n switch and a p flag at the end of the substitution, so that only the lines where the substitution succeeded are printed:
sed -nr 's/.*my_val_="([0-9.]+)".*/\1/p' file
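To see the effect of -n and p on input that contains a non-matching line, here is a small sketch. It uses the BRE form so that no -r or -E flag is needed; extract_my_val is a hypothetical helper name and the middle input line is invented for the test.

```shell
# Hypothetical helper: print only the my_val_ values read from stdin.
# -n suppresses default output; the p flag prints lines where s/// succeeded.
extract_my_val() {
    sed -n 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/p'
}

values=$(printf '%s\n' \
    '<some_random_text="someval" my_val_="0.4" some_random_text_1="someval_">' \
    'a line without the attribute' \
    '<some_random_text="someval" my_val_="1.2" some_random_text_1="someval_">' \
    | extract_my_val)

printf '%s\n' "$values"
```

The line without the attribute is dropped rather than echoed back unchanged.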
With grep:
grep -oP 'my_val_="\K[^"]*' filename
-o makes grep print only the match; -P enables Perl-compatible regexes.
The \K in the regex removes from the match everything that was matched by the part of the regex that came before it; this has the effect of a lookbehind: only non-quote characters that come directly after my_val_=" are matched.

sed plus sign doesn't work

I'm trying to replace /./ or /././ or /./././ with a single / in a bash script. I've managed to create a regex for sed, but it doesn't work.
variable="something/./././"
variable=$(echo $variable | sed "s/\/(\.\/)+/\//g")
echo $variable # this should output "something/"
When I tried to replace only the /./ substring, it worked with the sed regex \/\.\/. Does sed need extra flags to repeat a substring with + or *?
Use the -r option to make sed use extended regular expressions:
$ variable="something/./././"
$ echo $variable | sed -r "s/\/(\.\/)+/\//g"
something/
Any sed:
sed 's|/\(\./\)\{1,\}|/|g'
But a + or \{1,\} would not even be required in this case, a * would do nicely, so
sed 's|/\(\./\)*|/|g'
should suffice
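A quick sketch to check the portable form on more than one input. collapse_dot_dirs is a hypothetical name, and the second path is an invented example with ./ runs in the middle of the string.

```shell
# Hypothetical helper: collapse every run of "./" that follows a slash.
# Uses | as the delimiter and a BRE interval, so no -r or -E is needed.
collapse_dot_dirs() {
    sed 's|/\(\./\)\{1,\}|/|g'
}

trailing=$(printf '%s\n' 'something/./././' | collapse_dot_dirs)
middle=$(printf '%s\n' 'a/./b/././c' | collapse_dot_dirs)

printf '%s\n%s\n' "$trailing" "$middle"
```

The g flag matters here because the second input contains two separate runs.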
Two things to make it simple:
$ variable="something/./././"
$ sed -r 's#(\./){1,}##' <<< "$variable"
something/
Use {1,} to indicate one or more repetitions. You won't need g with this.
Use a different delimiter (# in the case above) to make it readable.
+ is ERE syntax, so you need to enable the -E or -r option to use it.
You can also do this with bash's built-in parameter substitution. This doesn't require sed, which doesn't accept -r on a Mac under OS X:
variable="something/./././"
a=${variable/\/*/}/ # Remove slash and everything after it, then re-apply slash afterwards
echo $a
something/
See here for explanation and other examples.

Why doesn't this simple RegEx work with sed?

This is a really simple RegEx that isn't working, and I can't figure out why. According to this, it should work.
I'm on a Mac (OS X 10.8.2).
script.sh
#!/bin/bash
ZIP="software-1.3-licensetypeone.zip"
VERSION=$(sed 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g' <<< $ZIP)
echo $VERSION
terminal
$ sh script.sh
1.3-licensetypeone.zip
Looking at the regex documentation for OS X 10.7.4 (but should apply to OP's 10.8.2), it is mentioned in the last paragraph that
Obsolete (basic) regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality...
... The parentheses for nested subexpressions are \( and \)...
sed, without any options, uses basic regular expression (BRE).
To use | in OS X or BSD's sed, you need to enable extended regular expression (ERE) via -E option, i.e.
sed -E 's/software-//g;s/-(licensetypeone|licensetypetwo).zip//g'
p/s: \| in BRE is a GNU extension.
Alternative ways to extract version number
chop-chop (parameter expansion)
VERSION=${ZIP#software-}
VERSION=${VERSION%-license*.zip}
sed
VERSION=$(sed 's/software-\(.*\)-license.*/\1/' <<< "$ZIP")
You don't necessarily have to match strings word-by-word with shell patterns or regex.
sed works with basic regular expressions by default. You have to backslash the parentheses and the vertical bar to make it work.
sed 's/software-//g;s/-\(licensetypeone\|licensetypetwo\)\.zip//g'
Note that I backslashed the dot, too. Otherwise, it would have matched any character.
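For comparison, a sketch of the ERE form with the dot escaped, using -E, which the answers above report working on OS X and BSD sed and which GNU sed also accepts. extract_version and the second file name are hypothetical, and the ^ and $ anchors are an addition not present in the original command.

```shell
# Hypothetical helper: strip the "software-" prefix and the license suffix.
# With -E, ( | ) are metacharacters; \. matches a literal dot.
extract_version() {
    printf '%s\n' "$1" |
        sed -E 's/^software-//; s/-(licensetypeone|licensetypetwo)\.zip$//'
}

v1=$(extract_version 'software-1.3-licensetypeone.zip')
v2=$(extract_version 'software-2.0.1-licensetypetwo.zip')

printf '%s\n%s\n' "$v1" "$v2"
```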
You can do this in the shell, don't need sed, parameter expansion suffices:
shopt -s extglob
ZIP="software-1.3-licensetypeone.zip"
tmp=${ZIP#software-}
VERSION=${tmp%-licensetype@(one|two).zip}
With a recent version of bash (may not ship with OSX) you can use regular expressions
if [[ $ZIP =~ software-([0-9.]+)-licensetype(one|two).zip ]]; then
VERSION=${BASH_REMATCH[1]}
fi
or, if you just want the 2nd word in a hyphen-separated string
VERSION=$(IFS=-; set -- $ZIP; echo $2)
$ man sed | grep "regexp-extended" -A2
-r, --regexp-extended
use extended regular expressions in the script.

Java regex and sed aren't the same...?

Get these strings:
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
Apply this expression in Java: ^(06700|067|00)([0-9]*).
My intention is to remove the leading 06700, 067 or 00 from the beginning of the string.
This works fine in Java, where group 2 always holds the number I want, but in sed it isn't the same:
$ cat strings|sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
What the heck am I missing?
Cheers,
f.
When using extended regular expressions, you also need to omit the \ before ( and ). This works for me:
sed -r 's/^(06700|067|00)([0-9]*)/\2/g' strings
note also that there's no need for a separate call to cat
I believe your problem is this:
sed defaults to BRE: The default
behaviour of sed is to support Basic
Regular Expressions (BRE). To use all
the features described on this page
set the -r (Linux) or -E (BSD) flag to
use Extended Regular Expressions
Source
Without this flag, the | character is interpreted literally. Try this example:
echo "06700|067|0055555" | sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
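The two behaviours can be put side by side in a short sketch. It assumes a sed that accepts -E and uses a few of the sample numbers from the question; strip_prefix is a hypothetical helper name.

```shell
# BRE: | is an ordinary character, so group 1 only matches the literal
# text "06700|067|00" and group 2 captures the digits after it.
literal_pipe=$(printf '%s\n' '06700|067|0055555' |
    sed 's/^\(06700|067|00\)\([0-9]*\)/\2/')

# ERE (-E): | is alternation, so any one of the three prefixes is removed.
strip_prefix() {
    sed -E 's/^(06700|067|00)([0-9]*)/\2/'
}
stripped=$(printf '%s\n' 00543515703528 0902345623 067913187056 | strip_prefix)

printf '%s\n%s\n' "$literal_pipe" "$stripped"
```

The number starting with 09 passes through unchanged because none of the three prefixes matches it.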