Why sed doesn't print an optional group? - regex

I have two strings, say foo_bar and foo_abc_bar. I would like to match both of them, and if the first one is matched I would like to emphasize it with = sign. So, my guess was:
echo 'foo_abc_bar' | sed -r 's/(foo).*(abc)?.*(bar)/\1=\2=\3/g'
> foo==bar
or
echo 'foo_abc_bar' | sed -r 's/(foo).*((abc)?).*(bar)/\1=\2=\3/g'
> foo==
But as output above shows none of them work.
How can I specify an optional group that will match if the string contains it or just skip if not?

The solution:
echo 'foo_abc_bar' | sed -r 's/(foo)_((abc)_)?(bar)/\1=\3=\4/g'
Why your previous attempts didn't work:
.* is greedy, so for the regex (foo).*(abc)?.*(bar) attempting to match 'foo_abc_bar' the (foo) will match 'foo', and then the .* will initially match the rest of the string ('_abc_bar'). The regex will continue until it reaches the required (bar) group and this will fail, at which point the regex will backtrack by giving up characters that had been matched by the .*. This will happen until the first .* is only matching '_abc_', at which point the final group can match 'bar'. So instead of the 'abc' in your string being matched in the capture group it is matched in the non-capturing .*.
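To make the backtracking described above visible, here is a small sketch that wraps each .* in its own capture group and prints what every group ended up holding. The helper name show_groups and the g1..g5 labels are made up for illustration, and sed -E is assumed to be available (GNU and BSD sed both accept it as the portable spelling of -r).

```shell
# Hypothetical helper: same regex as the first attempt, but with each .*
# captured so we can see which part of the input it consumed.
show_groups() {
  printf '%s\n' "$1" |
    sed -E 's/(foo)(.*)(abc)?(.*)(bar)/g1=[\1] g2=[\2] g3=[\3] g4=[\4] g5=[\5]/'
}

show_groups 'foo_abc_bar'
# prints: g1=[foo] g2=[_abc_] g3=[] g4=[] g5=[bar]
```

The first .* (g2) holds _abc_, so the optional (abc)? group (g3) has nothing left to match and stays empty.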
Explanation of my solution:
The first and most important thing is to replace the .* with _, there is no need to match any arbitrary string if you know what the separator will be. The next thing we need to do is figure out exactly which portion of the string is optional. If the strings 'foo_abc_bar' and 'foo_bar' are both valid, then the 'abc_' in the middle is optional. We can put this in an optional group using (abc_)?. The last step is to make sure that we still have the string 'abc' in a capturing group, which we can do by wrapping that portion in an additional group, so we end up with ((abc)_)?. We then need to adjust the replacement because there is an extra group, so instead of \1=\2=\3 we use \1=\3=\4, \2 would be the string 'abc_' (if it matched). Note that in most regex implementations you could also have used a non-capturing group and continued to use \1=\2=\3, but sed does not support non-capturing groups.
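As a quick check of the nested-group approach, the sketch below runs the same substitution on both inputs from the question. The function name emphasize is hypothetical, and -E is used in place of -r on the assumption that your sed accepts it.

```shell
# Hypothetical wrapper around the substitution from the solution.
# \3 is the inner (abc) group; \2 would be 'abc_' including the separator.
emphasize() {
  printf '%s\n' "$1" | sed -E 's/(foo)_((abc)_)?(bar)/\1=\3=\4/'
}

emphasize 'foo_abc_bar'   # prints foo=abc=bar
emphasize 'foo_bar'       # prints foo==bar
```

When the optional group does not take part in the match, \3 expands to the empty string, which is why the second call prints two adjacent = signs.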
An alternative:
I think the regex above is your best bet because it is the most explicit (it will only match the exact strings you are interested in). In a regex engine that supports it, you could also avoid the issue described above by using lazy repetition (matches as few characters as possible) instead of greedy repetition (matches as many characters as possible), by changing the .* to .*?. Note that sed's POSIX regular expressions have no lazy quantifiers, so this needs a Perl-compatible engine, and that with a mandatory (abc) group the expression only matches strings that actually contain abc:
echo 'foo_abc_bar' | perl -pe 's/(foo).*?(abc).*?(bar)/$1=$2=$3/g'

Maybe you could simply use:
echo 'foo_abc_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
echo 'foo_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
> foo=abc=bar=
> foo=bar=
This avoids the foo==bar you get with foo_bar and I found it a bit weird to show emphasis by putting = sometimes before the match, sometimes after the match.

Related

regex for capturing path from a string with optional character ~ (perl|awk|sed|..)

I want to match everything between first and last slash / including optional ~ before first slash.
I used this for the first part:
echo ~~a~/dir1/di r2/b.c \
| perl -pe 's/[^\/]*(\/.*\/).*/\1/'
which produces /dir1/di r2/.
This match includes the tilde:
perl -pe 's/.*(~\/.*\/).*/\1/'
but adding ? for optional character doesn't seem to work like in these cases:
perl -pe 's/.*(~?\/.*\/).*/\1/' -> /di r2/
perl -pe 's/.*((?:~)\/.*\/).*/\1/' -> ~~a/dir1/di r2/b.c
What am I doing wrong?
If I understood the desired output right, this works for me with or without tilde
echo "path /d1/d2/43a/" | perl -nE 'm{ ( ~? (?: /.*/ | /) ) }x; say "$1"'
Prints
/d1/d2/43a/
Same Perl code, with a tilde before the first slash in the input
echo "path ~/d1/d2/43a/" | perl -nE 'm{ ( ~? (?: /.*/ | /) ) }x; say "$1"'
prints
~/d1/d2/43a/
Notes: using \1 in the replacement part of a substitution is discouraged (Perl warns about it when warnings are enabled); use $1 instead. With {} as the delimiters we don't have to escape /, which makes the pattern more readable (though with delimiters other than // we can't leave out the m in front). Otherwise the same works when using / as the delimiter and escaping it inside.
Update
To also catch a lone ~/ (or /), the simplest change was to add that case explicitly: /.*/ | /. In order to capture the (optional) ~ in both cases, there is a (non-capturing) grouping around this alternation. I removed the -w flag so that no warnings are issued when the input string has no slashes at all; in that case only an empty line is printed.
Original requirements
File data
~~a~/dir1/di r2/b.c
/dir1/di r2/z.y
~/dir1/di r3/p.q
gobbledegook~/name/more/still/more/notwanted.c
xxx~//yyy
Script
perl -ple 's%(?:^.*?)((?:^|~)/.*/).*%$1%' data
Example output
~/dir1/di r2/
/dir1/di r2/
~/dir1/di r3/
~/name/more/still/more/
~//
Is that what you needed?
Dissecting the regex
s%(?:^.*?)((?:^|~)/.*/).*%$1%
The first part, (?:^.*?) is a non-capturing non-greedy match for an arbitrary sequence of characters at the start of the line.
The second part, ((?:^|~)/.*/), is a capturing expression that contains a non-capturing term that matches at the start of a line, or a tilde, followed by a slash and a greedy anything up to the last slash on the line.
The trailing .* matches everything after the second part.
The replacement is simply what was captured; the rest is Perl being Perl.
Revised requirements
The original problem statement was incomplete, it seems. Apparently:
for single slash it should output just / (with accompanying tilde if present). For no slashes preferably empty string as there is no match. … And for this case ~a b/c/d.f it returns full string; instead it should return /c/.
So, here is a revised script to deal with the special extra cases (what happened to 'learning how to fish'?). The ~a b/c/d.f case was a missing ? qualifier on a 'start of string or tilde' grouping.
Revised data file
~~a~/dir1/di r2/b.c
/dir1/di r2/z.y
~/dir1/di r3/p.q
gobbledegook~/name/more/still/more/notwanted.c
xxx~//yyy
not-a-slash-in-sight
just-the-one/with-extra-info
just-the~/with-more-info
~/one-slash-at-start-with-tilde
/one-slash-at-start-without-tilde
~a b/c/d.f
Revised script
perl -ple 's%^[^/]*$%%; s%(?:^[^/]*?)((?:^|~)?/)[^/]*$%$1%; s%(?:^[^/]*?)((?:^|~)?/.*/).*%$1%' data
A mildly modified version of the original expression comes last.
The first s/// looks for lines without any / and replaces them with nothing.
The second s/// looks for lines with exactly one slash, possibly preceded by a tilde, and followed by non-slashes to the end of the line; it replaces the line with the optional tilde and the slash.
When either of the first two substitutions matches, its output no longer matches the third s///.
Revised output
~/dir1/di r2/
/dir1/di r2/
~/dir1/di r3/
~/name/more/still/more/
~//
/
~/
~/
/
/c/

Using sed to replace string matching regex with wildcards

I have a string I'm trying to manipulate with sed
js/plex.js?hash=f1c2b98&version=2.4.23"
Desired output is
js/plex.js"
This is what I'm currently trying
sed -i s'/js\/plex.js[\?.\+\"]/js\/plex.js"/'
But it is only matching the first ? and returns this output
js/plex.js"hash=f1c2b98&version=2.4.23"
I can't see why this isn't working after a few hours
This works
echo 'js/plex.js?hash=f1c2b98&version=2.4.23"' | sed 's:\.js?[^"]*:.js:'
With the original Regex:
Firstly, I would suggest using a different delimiter (like :) in sed when the regex itself contains /. Secondly, [] is a bracket expression: it matches exactly one of the characters listed inside the brackets, so the .\+ inside it is not treated as "any characters up to the end of the line" (you could try putting the + after the []).
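The bracket-expression point can be checked with a short sketch. The function names and the first sample input are hypothetical; the second function is one possible way to strip the query string while keeping the closing quote, assuming the value never contains a " before the end.

```shell
# Inside [...] each listed character is just one literal alternative,
# so [?.+] replaces a single '?', '.' or '+' at a time.
one_char_each() { printf '%s\n' "$1" | sed 's/[?.+]/_/g'; }

# Outside brackets: literal '.js', a literal '?', then everything
# that is not a double quote. The closing quote is left in place.
strip_query() { printf '%s\n' "$1" | sed 's:\(\.js\)?[^"]*:\1:'; }

one_char_each 'a?b.c+d'                                   # prints a_b_c_d
strip_query 'js/plex.js?hash=f1c2b98&version=2.4.23"'     # prints js/plex.js"
```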
perhaps
sed 's#\(js/plex\.js\)?[^"]*#\1#'
Here # is used as the delimiter.
\(js/plex\.js\)?[^"]* finds this pattern and replaces all of it with the marked part, \1.
The marked pattern
In sed you can mark part of a pattern, or the whole pattern, by using \( \).
When part of a pattern is enclosed in parentheses escaped by backslashes, the text it matches is stored.
In my example this is the pattern without marking:
js/plex\.js?[^"]*
but I only want sed to remember js/plex.js and replace the whole match (the file name plus the query string, up to but not including the closing ") with only that piece. In sed the first marked pattern is known as \1, the second as \2, and so forth.
\(js/plex\.js\) ---> is marked as \1
Hence I replace the whole match with \1.

Using EGREP to find a substring repeated 3 or more times in a string

I'm trying to find any string that repeats any 4-letter substring 3 times or more, with no overlapping (the substrings can't overlap each other)
Something like this:
grep -E '([A-Za-z]{4})\1\1' test.txt
I know that this is wrong but I'm not really sure what I'm doing wrong or how to use the string repeating feature.
I'm specifically interested in doing this using EGREP, not other ways.
Some examples:
fourfourfour would be okay
fourfourfourfour would not be okay
none of the substrings can overlap, so if I was searching for "hehe" in hehehehe it would return false as there are only two non-overlapping matches.
If it's a four-character string then you could try the below grep command.
grep -oP '^(?:(?!\1).)*\K(.{4})(?=(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$)' file
Try this:
grep -P '^(.*?(....))(?=(.*?\2){2})(?!(.*\2){3}).*'
The key here is using reluctant quantifiers to consume as little as possible before each repeat of the captured group (\2), then a negative lookahead to disallow more than 3.
Since you are requesting specifically a grep -E solution, and all of the earlier answers seem hellbent on using grep -P, here's one more.
grep -E '(....)\1\1' file
This looks for a group of four arbitrary characters (including spaces) repeated three times, adjacent to each other.
If you want to restrict to non-whitespace characters, try this instead.
grep -E '([^[:space:]]{4})\1\1' file
This looks more complicated, but really isn't: we use [^[:space:]] instead of . and specify four repetitions with {4} just because it would really suck to have to write [^[:space:]][^[:space:]][^[:space:]][^[:space:]].
If you want to relax the adjacency requirement and look for a four-character string occurring three times on the same input line with some other characters in between, try this instead.
grep -E '(....).*\1.*\1' file
The parentheses perform grouping, but also capturing; whatever the first set of parentheses matched will be available as \1. You cannot just say (....){3} because that simply says four characters, followed by any other four, followed by any other four.
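The difference between a back-reference and a plain repetition count can be seen on a few sample lines. This is a sketch only: the function names and sample data are made up, and it assumes a grep whose -E mode supports back-references (GNU grep does, but POSIX does not require it for extended regular expressions).

```shell
# Three sample lines: a real triple repeat, twelve unrelated characters,
# and a line that is too short to hold three four-character blocks.
sample() { printf '%s\n' 'fourfourfour' 'abcdefghijkl' 'hehehehe'; }

# \1 must repeat exactly the text that the group captured.
same_block_thrice() { grep -E '(....)\1\1'; }

# {3} only repeats the sub-pattern: any twelve characters will do.
any_twelve_count() { grep -cE '(....){3}'; }

sample | same_block_thrice   # prints fourfourfour
sample | any_twelve_count    # prints 2
```

Note that (....)\1\1 is not anchored, so a line such as fourfourfourfour still matches because it contains three adjacent copies.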
Here is a non-regex solution using awk:
awk '{s=substr($0, 1, 4); print ($0 == s s s)?"match":"no match"}'
Testing:
echo "fourfourfour" | awk '{s=substr($0, 1, 4); print ($0 == s s s)?"match":"no match"}'
match
echo "hehehehe" | awk '{s=substr($0, 1, 4); print ($0 == s s s)?"match":"no match"}'
no match
I have to take back an earlier statement. Getting EXACTLY x-number of matches does appear to work in Perl (and possibly PCRE types too).
It does this because in Perl, variables can exist as multiple types, and as such each has a control state. One of the states is defined or not. So capture buffers can be referenced before they are actually defined.
This might not apply to command line grep (even in Perl mode), but it might be worth a try.
Adding to @AvinashRaj's regex, it can be done like this. I tested it in Perl, and it works there:
# ^(?:(?!\1).)*(.{4})(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$
^
(?:
    (?! \1 )
    .
)*
( .{4} )              # (1)
(?:
    (?! \1 )
    .
)*
\1
(?:
    (?! \1 )
    .
)*
\1
(?:
    (?! \1 )
    .
)*
$

Why doesn't sed interpret this regex properly?

echo "This is a test string" | sed 's/This/\0/'
First I match substring This using the regex This. Then I replace the entire string with the first match using \0. So the result should be just the matched string.
But it prints out the entire line. Why is this so?
You don't replace the whole string with \0, just the pattern match, which is This. In other words, you replace This with This.
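A small sketch makes it visible that only the matched text is replaced. It uses &, the portable spelling of "the whole match" in sed (the \0 form is a GNU extension); the function name mark_match is hypothetical.

```shell
# Wrap the matched text in brackets; everything outside the match is untouched.
mark_match() { printf '%s\n' "$1" | sed 's/This/[&]/'; }

mark_match 'This is a test string'   # prints [This] is a test string
```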
To replace the whole line with This, you can do:
echo "This is a test string" | sed '/This/s/.*/This/'
It looks for a line matching This, and replaces the whole line with This. In this case (since there is only one line) you can also do:
echo "This is a test string" | sed 's/.*/This/'
If you want to reuse the match, then you can do
echo "This is a test string" | sed 's/.*\(This\).*/\1/'
\( and \) are used to remember the match inside them. It can be referenced as \1 (if you have more than one pair of \( and \), then you can also use \2, \3, ...).
In the example above this is not very helpful, since we know that inside \( and \) is the word This, but if we have a regex inside the parentheses that can match different words, this can be very helpful.
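Here is a sketch of a case where the captured text is not known in advance: pulling a four-digit number out of a line. The function name and the sample input are made up, and [0-9] is written out because sed does not understand \d.

```shell
# Replace the whole line with the four digits captured by \( \).
extract_number() {
  printf '%s\n' "$1" | sed 's/.*\([0-9][0-9][0-9][0-9]\).*/\1/'
}

extract_number 'order 4821 shipped'   # prints 4821
```

Because the leading .* is greedy, a longer run of digits would yield its last four digits, so this sketch assumes the number is exactly four digits long.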
sed 's/.*\(PatThis\).*/PatThat/'
or
sed '/PatThis/ s/.*/PatThat/'
In your request, "PatThis" and "PatThat" have the same content ("This"). In the comment ("I need to select a number using \d\d\d\d and then use it as replacement") you have two different values for the patterns PatThis and PatThat.
The \1 is not really needed when you know the exact content (unless 'PatThis' is a regex with special characters like \ & ? .).

regex, search and replace until a certain point

The Problem
I have a file full of lines like
convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it
I want to search and replace such that I get
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
The . are converted to / up until the first forward slash
The Question
How do I write a regex search and replace to solve my problem?
Attempted solution
I tried using look behind with perl, but variable length look behinds are not implemented
$ echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | perl -pe 's/(?<=[^\/]*)\./\//g'
Variable length lookbehind not implemented in regex m/(?<=[^/]*)\./ at -e line 1.
Workaround
Variable length look aheads are implemented, so you can use this dirty trick
$ echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | rev | perl -pe 's/\.(?=[^\/]*$)/\//g' | rev
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
Is there a more direct solution to this problem?
s/\G([^\/.]*)\./\1\//g
\G is an assertion that matches the point at the end of the previous match. This ensures that each successive match immediately follows the last.
Matches:
\G # start matching where the last match ended
([^\/.]*) # capture until you encounter a "/" or a "."
\. # the dot
Replaces with:
\1 # that interstitial text you captured
\/ # a slash
Usage:
echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | perl -pe 's/\G([^\/.]*)\./\1\//g'
# yields: convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
Alternatively, if you're a purist and don't want to add the captured subpattern back in (avoiding that may be more efficient, but I'm not certain), you could make use of \K to restrict the "real" match solely to the ., then simply replace it with a /. \K essentially "forgets" what has been matched up to that point, so the final match ultimately returned is only what comes after the \K.
s/\G[^\/.]*\K\./\//g
Matches:
\G # start matching where the last match ended
[^\/.]* # consume chars until you encounter a "/" or a "."
\K # "forget" what has been consumed so far
\. # the dot
Thus, the entirety of the text matched for replacement is simply ".".
Replaces with:
\/ # a slash
Result is the same.
You can use substr as an lvalue and perform the substitution on it. Or transliteration, like I did below.
$ perl -pe 'substr($_,0,index($_,"/")) =~ tr#.#/#'
convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
This finds the first instance of a slash, extracts the part of the string before it, and performs a transliteration on that part.