Simple replacement with sed inside bash not working - regex

Why is this simple replacement with sed inside bash not working?
echo '[a](!)' | sed 's/[a](!)/[a]/'
It returns [a](!) instead of [a]. But why, given that only three characters need to be escaped in a sed replacement string?
If I assume that additional characters need to be escaped in the regex and try
echo '[a](!)' | sed 's/\[a\]\(!\)/[a]/'
it is still not working.

The point is that [a] in the regex pattern does not match literal square brackets: it forms a bracket expression that matches the single character a. Escape the first [ so it is parsed as a literal [ symbol, and your replacement will work:
echo '[a](!)' | sed 's/\[a](!)/[a]/'
                       ^^

sed uses BREs by default, and EREs can be enabled by escaping individual ERE metacharacters or by using the -E argument. [ and ] are BRE metacharacters, ( and ) are ERE metacharacters. When you wrote:
echo '[a](!)' | sed 's/\[a\]\(!\)/[a]/'
you were turning the [ and ] BRE metacharacters into literals, which is good, but you were turning the literal ( and ) into ERE metacharacters, which is bad. This is what you were trying to do:
echo '[a](!)' | sed 's/\[a\](!)/[a]/'
which you'd probably really want to write using a capture group:
echo '[a](!)' | sed 's/\(\[a\]\)(!)/\1/'
to avoid duplicating [a] on both sides of the substitution. With EREs enabled using the -E argument that last would be:
echo '[a](!)' | sed -E 's/(\[a\])\(!\)/\1/'
Read the sed man page and a regexp tutorial.
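As a rough side-by-side of the forms discussed above (a sketch, assuming a sed that supports -E such as GNU or BSD sed; the variable names are only for illustration):

```shell
input='[a](!)'

# BRE: \[ and \] are literal brackets, bare ( and ) are literal parentheses.
bre=$(printf '%s\n' "$input" | sed 's/\[a\](!)/[a]/')

# ERE: ( and ) are grouping operators, so they must be escaped to be literal.
ere=$(printf '%s\n' "$input" | sed -E 's/\[a\]\(!\)/[a]/')

# BRE with \( \): this is a group around "!", so the pattern looks for "[a]!"
# and the input is left unchanged.
bad=$(printf '%s\n' "$input" | sed 's/\[a\]\(!\)/[a]/')

printf '%s\n' "$bre" "$ere" "$bad"
```

The first two print [a]; the third prints the input back unchanged.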

man echo says that echo displays a line of text, so [ and ( and their closing counterparts are just text to echo.
If you open man grep and search for /^\ *Character Classes and Bracket Expressions and /^\ *Basic vs Extended Regular Expressions, you can read about the difference. sed and other tools that use regex interpret [a] as a bracket expression.
You can try this
$ echo '[a](!)' | sed 's/(!)//'

Related

Correct (?) regex not understood by sed

According to https://regex101.com/r/NLSymf/3, the following regex:
\[\[(foo)([^\]]+)\]\]
(full) matches the string [[foo>test1|test2]], but this seems to not be understood by sed, since:
echo "[[foo>test1|test2]]" | sed -E -e '/\[\[(foo)([^\]]+)\]\]/d'
(which should return an empty string) returns:
[[foo>test1|test2]]
What is the regex that matches [[foo>test1|test2]] from sed's point of view?
The backslash character loses its escaping capability within a bracket expression. And stray closing brackets in a RE need not be escaped, that's why grep doesn't fail the first pipeline below. See RE Bracket Expression for reference.
$ echo 'a]' | grep -Eo '[^\]]'
a]
$ echo 'a]' | grep -Eo '[^]]'
a
The correct regex would be:
\[\[(foo)([^]]+)]]
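To see the difference in sed itself, here is a small sketch (assuming a sed with -E; the variable names are made up for the example) that runs both the original and the corrected pattern against the same line:

```shell
line='[[foo>test1|test2]]'

# Original pattern: [^\] is a bracket expression meaning "not a backslash",
# and the ]+ after it requires literal closing brackets right after "foo" plus
# one character, so the line does not match and is not deleted.
kept=$(printf '%s\n' "$line" | sed -E '/\[\[(foo)([^\]]+)\]\]/d')

# Corrected pattern: ] placed first inside [^...] is a literal ].
deleted=$(printf '%s\n' "$line" | sed -E '/\[\[(foo)([^]]+)]]/d')

printf 'original pattern left: "%s"\n' "$kept"
printf 'corrected pattern left: "%s"\n' "$deleted"
```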

Printing only text from group

I have working example of substitution in online regex tester https://regex101.com/r/3FKdLL/1 and I want to use it as a substitution in sed editor.
echo "repo-2019-12-31-14-30-11.gz" | sed -r 's/^([\w-]+)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}.gz$.*/\1/p'
It always prints whole string: repo-2019-12-31-14-30-11.gz, but not matched group [\w-]+.
I expect to get only text from group which is repo string in this example.
Try this:
echo "repo-2019-12-31-14-30-11.gz" |
sed -rn 's/^([A-Za-z]+)-[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}\.gz.*$/\1/p'
Explanations:
\w will work (not [\w], which matches either a backslash or w), but you should use [[:alnum:]], which is POSIX (note that \w also matches an underscore)
For GNU sed, \d isn't a digit class: \dNNN denotes the character whose decimal value is NNN, so use [[:digit:]] or [0-9] instead
Add -n to mute sed, with /p to explicitly print matched lines
Additionally, you could refactor your regex by removing duplication:
echo "repo-2019-12-31-14-30-11.gz" |
sed -rn 's/^([[:alnum:]]+)-[[:digit:]]{4}(-[[:digit:]]{2}){5}.gz.*$/\1/p'
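Here is a minimal sketch of the -n plus /p behaviour wrapped in a function, so you can see that non-matching lines print nothing. The name extract_prefix is hypothetical, and I have escaped the dot before gz and anchored the end, which is my assumption about what the file names look like:

```shell
# extract_prefix: hypothetical helper; reads names on stdin and prints only
# the leading alphanumeric part of names shaped like NAME-YYYY-MM-DD-hh-mm-ss.gz
extract_prefix() {
  sed -n -E 's/^([[:alnum:]]+)-[[:digit:]]{4}(-[[:digit:]]{2}){5}\.gz$/\1/p'
}

printf '%s\n' 'repo-2019-12-31-14-30-11.gz' | extract_prefix   # prints: repo
printf '%s\n' 'notes.txt' | extract_prefix                     # prints nothing
```

Without -n, sed would print the non-matching line unchanged, which is why the original command appeared to echo the whole string.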
Looks like a job for GNU grep :
echo "repo-2019-12-31-14-30-11.gz" | grep -oP '^\K[[:alpha:]-]+'
Displays :
repo-
On this example :
echo "repo-repo-2019-12-31-14-30-11.gz" | grep -oP '^\K[[:alpha:]-]+'
Displays :
repo-repo-
Which I think is what you want because you tried with [\w-]+ on your regex.
If I'm wrong, just replace the grep command with : grep -oP '^\K\w+'

Get substring using either perl or sed

I can't seem to get a substring correctly.
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g')
That still returns bugfix/US3280841-something-duh.
If I try to use perl instead:
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9]|[A-Z0-9])+/; print $1');
That outputs nothing.
What am I doing wrong?
Using bash parameter expansion only:
$: # don't use caps; see below.
$: declare branch="bugfix/US3280841-something-duh"
$: tmp="${branch##*/}"
$: echo "$tmp"
US3280841-something-duh
$: trimmed="${tmp%%-*}"
$: echo "$trimmed"
US3280841
Which means:
$: tmp="${branch##*/}"
$: trimmed="${tmp%%-*}"
does the job in two steps without spawning extra processes.
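If you need this in more than one place, the two expansions can be sketched as a small function. The name ticket_id and the extra branch names are made up for illustration; note what happens when there is no slash or dash to strip:

```shell
# ticket_id: hypothetical helper name, not from the question.
# tmp is a plain global here because POSIX sh has no "local".
ticket_id() {
  tmp=${1##*/}                 # drop everything through the last slash
  printf '%s\n' "${tmp%%-*}"   # drop everything from the first dash onward
}

ticket_id 'bugfix/US3280841-something-duh'   # prints: US3280841
ticket_id 'feature/team/US42-x'              # prints: US42
ticket_id 'main'                             # prints: main (nothing to strip)
```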
In sed,
$: sed -E 's#^.*/([^/-]+)-.*$#\1#' <<< "$branch"
This says "after any or no characters followed by a slash, remember one or more that are not slashes or dashes, followed by a not-remembered dash and then any or no characters, then replace the whole input with the remembered part."
Your original pattern was
's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g'
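Read as a POSIX basic regular expression, this says "remember any number of anything followed by a slash, then a lowercase letter or a digit, then a pipe character, then a capital letter or digit, then a literal plus sign, and then replace it all with what you remembered." Alternation and "one or more" are only standard with -E. GNU sed does accept \| and \+ as those operators in basic mode, but even then the pattern is split into two alternatives at the \|, which is not what you intended.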
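A quick way to convince yourself of the basic versus extended difference is to try unescaped + and | both ways. This is only a sketch with throwaway inputs (it deliberately uses the unescaped forms, since the escaped \+ and \| behave differently between sed implementations):

```shell
# Without -E, + and | are ordinary characters.
bre_plus=$(printf '%s\n' 'a+b' | sed 's/a+b/X/')       # matches the literal text a+b
bre_nomatch=$(printf '%s\n' 'aab' | sed 's/a+b/X/')    # no literal "a+b" here, unchanged

# With -E, + means "one or more" and | means "or".
ere_plus=$(printf '%s\n' 'aab' | sed -E 's/a+b/X/')    # "aab" is one or more a, then b
bre_pipe=$(printf '%s\n' 'a|b' | sed 's/a|b/X/')       # matches the literal text a|b
ere_pipe=$(printf '%s\n' 'a|b' | sed -E 's/a|b/X/')    # replaces only the first "a"

printf '%s\n' "$bre_plus" "$bre_nomatch" "$ere_plus" "$bre_pipe" "$ere_pipe"
```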
GNU's manual is your friend. I look stuff up all the time to make sure I'm doing it right. Sometimes it still takes me a few tries, lol.
An aside - try not to use all-capital variable names. By convention those are reserved for variables that are special to the shell or the environment, like RANDOM or IFS.
You may use this sed:
sed -E 's~^.*/|-.*$~~g' <<< "$BRANCH_NAME"
US3280841
Or this awk:
awk -F '[/-]' '{print $2}' <<< "$BRANCH_NAME"
US3280841
sed 's:[^/]*/\([^-]*\)-.*:\1:'<<<"bugfix/US3280841-something-duh"
Perl version just has + in wrong place. It should be inside the capture brackets:
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9A-Z]+)/; print $1');
In your sed command, just add a ^ before A-Z0-9:
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[^A-Z0-9]\+/\1/g')
Alternatively and briefly, you can use
TRIMMED=$(echo $BRANCH_NAME | sed "s/[a-z\/\-]//g" )
too.
Type this in a shell terminal:
$ BRANCH_NAME="bugfix/US3280841-something-duh"
$ echo $BRANCH_NAME| perl -pe 's/.*\/(\w\w[0-9]+).+/\1/'
Use the s (substitute) command instead of m (match).
The same expression also works with GNU sed: use sed -E in place of perl -pe.
Another variant using Perl Regular Expression Character Classes (see perldoc perlrecharclass).
echo $BRANCH_NAME | perl -nE 'say m/^.*\/([[:alnum:]]+)/;'

Grep regex not working with square brackets

So I was trying to write a regex in grep to match square brackets, i.e., [ad] should match [ and ]. But I get different results with capturing groups and with character classes. The result also differs depending on whether the regex string is wrapped in single quotes.
These are the different results I am getting.
Using capturing groups works fine
echo "[ad]" | grep -E '(\[|\])'
[ad]
Using capturing groups without ' gives syntax error
echo "[ad]" | grep -E (\[|\])
bash: syntax error near unexpected token `('
using character class with [ followed by ] gives no output
echo "[ad]" | grep -E [\[\]]
Using character class with ] followed by [ works correctly
echo "[ad]" | grep -E [\]\[]
[ad]
Using character class with ] followed by [ and using ' does not work
echo "[ad]" | grep -E '[\]\[]'
It'd be great if someone could explain the difference between them.
You should know about:
BRE (Basic Regular Expression)
ERE (Extended Regular Expression)
In BRE, several metacharacters require a backslash to give them their special meaning, and grep uses BRE by default.
The ERE flavor standardizes a flavor similar to the one used by the UNIX egrep command.
Pay attention to -E and -G:
grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
...
...
POSIX Basic Regular Expressions
POSIX Extended Regular Expressions
POSIX Bracket Expressions
You should also know about bash, since some of your results come from the bash interpreter rather than from grep or anything else:
echo "[ad]" | grep -E (\[|\])
Here bash parses the unquoted ( as its own syntax, the way it does in something like:
echo $(( 10 * 10 ))
By using single quotes you tell bash not to treat the enclosed characters as special operators. So
echo "[ad]" | grep -E '(\[|\])'
is correct.
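The same reasoning explains the character class cases in the question: without quotes, bash removes the backslashes before grep ever sees the pattern. A small sketch (set -f is only there so the unquoted words are not treated as file name globs; the variable names are made up):

```shell
set -f                               # disable pathname expansion for the demo
unquoted1=$(printf '%s' [\[\]])      # what grep receives for: grep -E [\[\]]
unquoted2=$(printf '%s' [\]\[])      # what grep receives for: grep -E [\]\[]
set +f
quoted='[\]\[]'                      # single quotes keep the backslashes

printf '%s\n' "$unquoted1" "$unquoted2" "$quoted"

# Feed each resulting pattern to grep; "|| true" keeps a non-match from
# aborting scripts that run under "set -e".
m1=$(printf '%s\n' '[ad]' | grep -E -- "$unquoted1" || true)
m2=$(printf '%s\n' '[ad]' | grep -E -- "$unquoted2" || true)
m3=$(printf '%s\n' '[ad]' | grep -E -- "$quoted" || true)
```

The first pattern arrives as [[]], which is a bracket expression containing [ followed by a literal ], so it needs the two brackets side by side. The second arrives as [][], which matches either bracket. The quoted one keeps its backslashes, and inside a bracket expression a backslash is just a backslash, so it no longer means what you intended.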
Firstly, always quote the regex pattern so the shell does not interpret it first:
$ echo "[ad]" | grep -E '(\[|\])'
[ad]
Secondly, within a quoted bracket expression you don't need to escape [ or ]. Just write them as is inside the outer [], with ] placed first so that it is taken literally instead of closing the expression:
$ echo "[ad]" | grep -E '[][]'
[ad]
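To make the "] goes first" rule a bit more visible, here is a sketch using grep -o, which prints each match on its own line (-o is not in POSIX but is available in GNU and BSD grep; the variable names are only for the example):

```shell
# [][] : a bracket expression containing ] (listed first) and [
brackets=$(printf '%s\n' '[ad]' | grep -o '[][]')

# [^][]+ : the negated form, again with ] listed first after the ^
inside=$(printf '%s\n' '[ad]' | grep -Eo '[^][]+')

printf '%s\n' "$brackets"   # prints [ and ] on separate lines
printf '%s\n' "$inside"     # prints: ad
```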
Maybe you provided such a simple example on purpose (after all, it is minimal), but in case all you really want is to check for existence of square brackets (a fixed string, not regex pattern), you can use grep with -F/--fixed-strings and multiple -e options:
$ echo "[ad]" | grep -F -e '[' -e ']'
[ad]
Or, a little bit shorter with fgrep (recent GNU grep releases mark fgrep as obsolescent in favor of grep -F):
$ echo "[ad]" | fgrep -e '[' -e ']'
[ad]
Or, even:
$ echo "[ad]" | fgrep -e[ -e]
[ad]

Sed subexpressions not working as expected

I am trying to make a simple wikitext parser using sed/bash. When I run
echo "London has [[public transport]]" | sed s/\\[\\[[A-Za-z0-9\ ]*\\]\\]/link/
it gives me London has link
but when I try to use marked subexpressions to get the contents of the brackets using
sed s/\\[\\[\([A-Za-z0-9\ ]*\)\\]\\]/\1/
it just gives me London has [[public transport]]
That's because the regex doesn't match.
Since you're not surrounding your sed expression in quotes, you have to double-escape backslashes for the shell - that's why you have \\[ instead of \[.
Now in sed's default regex flavor (basic regular expressions), capturing groups are denoted by \( and \). Since you're typing this into the shell without surrounding quote marks, you need to escape the backslash. And since bash interprets parentheses, you have to escape them too:
echo "London has [[public transport]]" | sed s/\\[\\[\\\([A-Za-z0-9\ ]*\\\)\\]\\]/\\1/
I strongly recommend you just enclose your sed expression in single quotes for ease of writing:
echo "London has [[public transport]]" | sed 's/\[\[\([A-Za-z0-9 ]*\)\]\]/\1/'
Much easier right?
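For a line with more than one link you would probably also want the g flag. A sketch under that assumption (the second link and the variable names are invented for the example), shown in both basic and extended syntax:

```shell
line='London has [[public transport]] and [[parks]]'

# Basic regex: the capture group is written \( ... \)
bre=$(printf '%s\n' "$line" | sed 's/\[\[\([A-Za-z0-9 ]*\)\]\]/\1/g')

# Extended regex (-E): the capture group is written ( ... )
ere=$(printf '%s\n' "$line" | sed -E 's/\[\[([A-Za-z0-9 ]*)\]\]/\1/g')

printf '%s\n' "$bre" "$ere"
```

Both print London has public transport and parks.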
echo "London has [[public transport]]" | sed 's#[[][[]\([A-Za-z0-9\ ]*\)[]][]]#\1#'
output
London has public transport
works on my machine.
I hope this helps.