Why doesn't sed interpret this regex properly?

echo "This is a test string" | sed 's/This/\0/'
First I match substring This using the regex This. Then I replace the entire string with the first match using \0. So the result should be just the matched string.
But it prints out the entire line. Why is this so?

You don't replace the whole string with \0, just the pattern match, which is This. In other words, you replace This with This.
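One way to see that only the matched text is replaced is to wrap the match in brackets. This is a minimal sketch that uses &, the portable spelling of "the whole match" (\0 is accepted by GNU sed but is not portable); the variable name out is only for illustration.

```shell
# & stands for the text matched by the pattern; the rest of the line is untouched.
out=$(echo "This is a test string" | sed 's/This/[&]/')
echo "$out"   # [This] is a test string
```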
To replace the whole line with This, you can do:
echo "This is a test string" | sed '/This/s/.*/This/'
It looks for a line matching This, and replaces the whole line with This. In this case (since there is only one line) you can also do:
echo "This is a test string" | sed 's/.*/This/'
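The difference between the addressed and the unaddressed form only shows up with more than one line of input. A small sketch, assuming a made-up two-line input:

```shell
input='This is a test string
Another line'

# With the /This/ address, only lines containing "This" are rewritten.
addressed=$(printf '%s\n' "$input" | sed '/This/s/.*/This/')

# Without an address, every line is rewritten.
unaddressed=$(printf '%s\n' "$input" | sed 's/.*/This/')

printf '%s\n' "$addressed"     # This, then: Another line
printf '%s\n' "$unaddressed"   # This, then: This
```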
If you want to reuse the match, then you can do
echo "This is a test string" | sed 's/.*\(This\).*/\1/'
\( and \) are used to remember the match inside them. It can be referenced as \1 (if you have more than one pair of \( and \), then you can also use \2, \3, ...).
In the example above this is not very helpful, since we know that inside \( and \) is the word This, but if we have a regex inside the parentheses that can match different words, this can be very helpful.
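As an illustration of a group that can match different words, here is a sketch where the parentheses hold Th[a-z]* instead of a fixed word. The second input line is invented for the example.

```shell
# Th[a-z]* matches "This", "That", "Those", ...; \1 returns whichever word was found.
first=$(echo "This is a test string" | sed 's/.*\(Th[a-z]*\).*/\1/')
second=$(echo "That is a test string" | sed 's/.*\(Th[a-z]*\).*/\1/')
echo "$first"    # This
echo "$second"   # That
```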

sed 's/.*\(PatThis\).*/PatThat/'
or
sed '/PatThis/ s/.*/PatThat/'
In your question, "PatThis" and "PatThat" have the same content ("This"). In your comment (
I need to select a number using \d\d\d\d and then use it as
replacement
) PatThis and PatThat have two different values.
The \1 is not really needed when you know the exact content. It becomes useful when 'PatThis' is a regex with special characters such as \ & ? . that can match different strings.
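For the four-digit case from the comment, a possible sketch follows. sed does not understand \d, so [0-9] is used instead, and the sample input lines are invented for illustration.

```shell
input='no digits here
port 8080 open'

# -n plus the p flag prints only the lines where the substitution succeeded.
number=$(printf '%s\n' "$input" | sed -n 's/.*\([0-9]\{4\}\).*/\1/p')
echo "$number"   # 8080
```

Here \1 is genuinely needed, because the pattern does not say which four digits will be found.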

Related

capturing each word containing pattern regex

I'm trying to write a sed script that finds every word that contains a certain pattern and then prepends all words that contain that pattern. For example:
foobarbaz barfoobaz barbazfoo barbaz
might turn into:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
I understand the basics of capture groups and backreferences, but I'm still having trouble. Specifically, I can't get it to capture each whole word separately.
s/\(.*\)men\(.*\)/ not just the \1men\2, but the \1women\2 and \1children\2 too /
I tried using \s, for whitespace as many sites recommend, but sed treats \s as the separate characters \ and s
You could use the non-space character \S as follows:
sed 's/\S*foo\S*/qux&/g' <<< "foobarbaz barfoobaz barbazfoo barbaz"
this will match words containing foo. The replacement string qux& will prepend every matched pattern with qux. Output:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
The following works as long as the words themselves contain no spaces:
echo "foobarbaz barfoobaz barbazfoo barbaz" | sed 's/\([^ ]*foo[^ ]*\)/qux\1/g'
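If the words may also be separated by tabs, a portable variant is to use the POSIX class [[:space:]] instead of a literal space or the GNU-only \S. A sketch, with a tab added to the sample input for illustration:

```shell
# [^[:space:]] excludes tabs and other whitespace, not only the space character.
out=$(printf 'foobarbaz\tbarfoobaz barbaz\n' |
  sed 's/[^[:space:]]*foo[^[:space:]]*/qux&/g')
printf '%s\n' "$out"   # quxfoobarbaz<TAB>quxbarfoobaz barbaz
```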

Using sed to replace space delimited strings

echo 'bar=start "bar=second CONFIG="$CONFIG bar=s buz=zar bar=g bar=ggg bar=f bar=foo bar=zoo really?=yes bar=z bar=yes bar=y bar=one bar=o que=idn"' | sed -e 's/^\|\([ "]\)bar=[^ ]*[ ]*/\1/g'
Actual output:
CONFIG="$CONFIG buz=zar bar=ggg bar=foo really?=yes bar=yes bar=one que=idn"
Expected output:
CONFIG="$CONFIG buz=zar really?=yes que=idn"
What am I missing in my regex?
Edit:
This works as expected (with GNU sed):
's/\(^\|\(['\''" ]\)\)bar=[^ ]*/\2/g; s/[ ][ ]\+/ /g; s/[ ]*\(['\''"]\+\)[ ]*/\1/g'
sed regular expressions are pretty limited. POSIX sed doesn't include \w as a synonym for [a-zA-Z0-9_], for example, nor \b, which matches the zero-length string at the beginning or end of a word (which is what you really want in this situation). GNU sed supports both as extensions.
s/ bar=[^ ]* *//
is close, but the problem is that the trailing " *" consumes the space that might precede the next bar=. So, in ... bar=aaa bar=bbb ... the first match is " bar=aaa " (including the space after it), leaving bar=bbb ... for the second match attempt, which fails because the space before bar has already been consumed.
s/ bar=[^ ]*//
is better -- don't consume the trailing spaces, leave them for the next match attempt. If you want to match bar=something even if it's at the beginning of the string, insert a space at the beginning first:
sed 's/^bar=/ bar=/; s/ bar=[^ ]*//g'
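The effect of the trailing " *" is easy to reproduce on a short string. A minimal sketch, using a made-up input and the g flag so that every occurrence is attempted:

```shell
input='x bar=a bar=b y'

# The trailing " *" swallows the space that the next " bar=" needs.
greedy=$(echo "$input" | sed 's/ bar=[^ ]* *//g')

# Leaving the trailing space alone lets every " bar=" match.
careful=$(echo "$input" | sed 's/ bar=[^ ]*//g')

echo "$greedy"    # xbar=b y
echo "$careful"   # x y
```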
If you want to remove all instances of bar=something then you can simplify your regex as such:
\sbar=\w+
This matches every bar= together with the word that follows it. The bar= must be preceded by a whitespace character.
Demonstration:
https://regex101.com/r/xbBhJZ/3
As sed:
s/\sbar=\w\+//g
This correctly leaves foobar=bar untouched, because bar= must directly follow a whitespace character.
As in Waxrat's answer, you have to insert a space at the beginning of the string for a leading bar= to match, since the pattern requires a whitespace character before bar=. This is easy to do since you're quoting your string explicitly.
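\s and \w are GNU extensions. On a sed without them, roughly the same pattern can be spelled with POSIX character classes. A sketch with an invented input that includes foobar=bar:

```shell
input=' bar=one foobar=bar bar=two keep=1'

# [[:space:]] plays the role of \s; [[:alnum:]_][[:alnum:]_]* plays the role of \w\+.
out=$(echo "$input" | sed 's/[[:space:]]bar=[[:alnum:]_][[:alnum:]_]*//g')
echo "$out"   # " foobar=bar keep=1" (the leading space belongs to foobar=bar)
```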

Using sed to replace string matching regex with wildcards

I have a string I'm trying to manipulate with sed
js/plex.js?hash=f1c2b98&version=2.4.23"
Desired output is
js/plex.js"
This is what I'm currently trying
sed -i s'/js\/plex.js[\?.\+\"]/js\/plex.js"/'
But it is only matching the first ? and returns this output
js/plex.js"hash=f1c2b98&version=2.4.23"
I can't see why this isn't working after a few hours
This works
echo 'js/plex.js?hash=f1c2b98&version=2.4.23"' | sed 's:\.js?[^"]*:.js:'
With the original Regex:
Firstly, I would suggest using a different delimiter (such as :) in sed when the regex itself contains /. Secondly, [] matches a single character from the set inside the brackets, so the .\+ inside it is not expanded to the end of the line. You could try putting the \+ after the [] instead.
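Both points can be checked side by side. The sketch below first repeats the bracket expression from the question, which matches a single character, and then a variant (one possible rewrite, not the only one) that uses : as the delimiter and [^"]* to run up to the closing quote.

```shell
input='js/plex.js?hash=f1c2b98&version=2.4.23"'

# The bracket expression matches exactly one character, here the "?".
one_char=$(echo "$input" | sed 's/js\/plex.js[\?.\+\"]/js\/plex.js"/')

# With ":" as the delimiter no slash needs escaping; [^"]* covers the whole query string.
whole_query=$(echo "$input" | sed 's:js/plex\.js?[^"]*":js/plex.js":')

echo "$one_char"      # js/plex.js"hash=f1c2b98&version=2.4.23"
echo "$whole_query"   # js/plex.js"
```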
Perhaps:
sed 's#\(js/plex\.js\)?[^"]*#\1#'
Here # is used as the delimiter, and
\(js/plex\.js\)?[^"]*
finds the file name followed by the query string, which is then replaced with the marked pattern \1.
The marked pattern
In sed you can mark part of a pattern, or the whole pattern, by using \( \).
When part of a pattern is enclosed in parentheses escaped by backslashes, the text it matches is stored.
In my example, this is the pattern without marking:
js/plex\.js?[^"]*
But I only want sed to remember js/plex.js and to replace the whole match with just that piece. In sed the first marked pattern is known as \1, the second as \2, and so forth:
\(js/plex\.js\) ---> is marked as \1
Hence I replace the whole match with \1, and the closing " that follows it is left in place.

Why sed doesn't print an optional group?

I have two strings, say foo_bar and foo_abc_bar. I would like to match both of them, and if the first one is matched I would like to emphasize it with = sign. So, my guess was:
echo 'foo_abc_bar' | sed -r 's/(foo).*(abc)?.*(bar)/\1=\2=\3/g'
> foo==bar
or
echo 'foo_abc_bar' | sed -r 's/(foo).*((abc)?).*(bar)/\1=\2=\3/g'
> foo==
But as output above shows none of them work.
How can I specify an optional group that will match if the string contains it or just skip if not?
The solution:
echo 'foo_abc_bar' | sed -r 's/(foo)_((abc)_)?(bar)/\1=\3=\4/g'
Why your previous attempts didn't work:
.* is greedy, so for the regex (foo).*(abc)?.*(bar) attempting to match 'foo_abc_bar' the (foo) will match 'foo', and then the .* will initially match the rest of the string ('_abc_bar'). The regex will continue until it reaches the required (bar) group and this will fail, at which point the regex will backtrack by giving up characters that had been matched by the .*. This will happen until the first .* is only matching '_abc_', at which point the final group can match 'bar'. So instead of the 'abc' in your string being matched in the capture group it is matched in the non-capturing .*.
Explanation of my solution:
The first and most important thing is to replace the .* with _, there is no need to match any arbitrary string if you know what the separator will be. The next thing we need to do is figure out exactly which portion of the string is optional. If the strings 'foo_abc_bar' and 'foo_bar' are both valid, then the 'abc_' in the middle is optional. We can put this in an optional group using (abc_)?. The last step is to make sure that we still have the string 'abc' in a capturing group, which we can do by wrapping that portion in an additional group, so we end up with ((abc)_)?. We then need to adjust the replacement because there is an extra group, so instead of \1=\2=\3 we use \1=\3=\4, \2 would be the string 'abc_' (if it matched). Note that in most regex implementations you could also have used a non-capturing group and continued to use \1=\2=\3, but sed does not support non-capturing groups.
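To check the behaviour on both inputs, the command can be run against each string. A minimal sketch; -E is used here as the spelling of -r that both GNU and BSD sed accept, and the variable names are only for illustration.

```shell
# -E enables extended regular expressions (GNU sed also accepts -r).
with_abc=$(echo 'foo_abc_bar' | sed -E 's/(foo)_((abc)_)?(bar)/\1=\3=\4/')
without_abc=$(echo 'foo_bar' | sed -E 's/(foo)_((abc)_)?(bar)/\1=\3=\4/')

echo "$with_abc"      # foo=abc=bar
echo "$without_abc"   # foo==bar
```

When the optional group does not take part in the match, \3 expands to the empty string, which is why the second result has two adjacent = signs.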
An alternative:
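I think the regex above is your best bet because it is the most explicit (it will only match the exact strings you are interested in). In regex flavors that support lazy repetition (matching as few characters as possible) instead of greedy repetition (matching as many characters as possible), you could also avoid the issue described above by changing the .* to .*?. Note that sed's POSIX regular expressions have no lazy quantifiers, so in sed the ? after .* does not make it lazy. The expression below captures abc only because the ? after (abc) has been dropped, which makes that group mandatory, so foo_bar no longer matches: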
echo 'foo_abc_bar' | sed -r 's/(foo).*?(abc).*?(bar)/\1=\2=\3/g'
Maybe you could simply use:
echo 'foo_abc_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
echo 'foo_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
> foo=abc=bar=
> foo=bar=
This avoids the foo==bar you get with foo_bar and I found it a bit weird to show emphasis by putting = sometimes before the match, sometimes after the match.

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?
There are several issues here.
Parens in Vim regexes are not for capturing by default; you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0 or more ( and then a literal )". You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (\+ means "1 or more of the previous").
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)
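The same literal reading of the pattern can be reproduced outside Vim, since sed's basic regular expressions treat unescaped parentheses the same way. A sketch with invented input strings:

```shell
# In a basic regex "(" is an ordinary character and "*" repeats it,
# so shop_(*) means: "shop_", zero or more "(", then one ")".
literal=$(echo 'shop_((() x' | sed 's/shop_(*)/MATCH/')
nomatch=$(echo 'shop_items' | sed 's/shop_(*)/MATCH/')

echo "$literal"   # MATCH x
echo "$nomatch"   # shop_items
```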
If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(.*)/shop_\1 wp_\1/
@Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
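For comparison, sed has a rough counterpart to Vim's \v: with -E (extended regular expressions) the unescaped parentheses become capture groups. A sketch, assuming a sed that accepts -E (GNU and BSD sed both do):

```shell
# With -E, ( ) capture without backslashes, so \3\2\1 are valid references.
swapped=$(echo "123abc456" | sed -E 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#')
echo "$swapped"   # 456abc123
```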
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.