ask for explanation of sed regex - regex

I struggle to understand the following two sed regex in a makefile:
sw_version:=software/module/1.11.0
sw:= $(shell echo $(sw_version) | sed -e 's:/[^/]*$$::;s:_[^/]*$$::g')
# sw is "software/module"
version:= $(shell echo $(sw_version) | sed -e 's:.*/::g;s/-.*$$//')
# version is "1.11.0"
I would really appreciate a detailed explanation. Thanks!

make replaces $$ with a single $ in a makefile, so the sed expression the shell actually sees looks like this:
sed -e 's:/[^/]*$::;s:_[^/]*$::g'
s/// is the substitution command. The delimiter doesn't need to be /; in your case it's a colon (:):
s:/[^/]*$::;
s:_[^/]*$::g
It works with matching pattern and replacing with replacement:
s/pattern/replacement/
; is a separator that lets you use multiple commands in the same call to sed.
So basically this is two substitutions: one replaces whatever matches /[^/]*$ with nothing, the other replaces whatever matches _[^/]*$ with nothing.
[...] is a character class that matches exactly one of the characters listed in it, e.g. [abc] matches a, b, or c. If ^ is the first character in the class, the class is negated and matches any single character that is not listed, e.g. [^abc] matches any character except a, b, and c.
* repeats the preceding item zero or more times.
$ is the end of the line.
Let's apply what we know to the examples above (read bottom up):
s:/[^/]*$::;
#^^^^^ ^^ ^^
#||||| || |Delimiter
#||||| || Replace with nothing
#||||| |End of line
#||||| Zero or more times
#||||Literal slash
#|||Match everything but ...
#||Character class
#|Literal slash
#Delimiter used for the substitution command
/[^/]*$ matches a literal slash (/) followed by zero or more non-slash characters up to the end of the line, i.e. the last slash and everything after it.
_[^/]*$ matches a literal underscore (_) followed by zero or more non-slash characters up to the end of the line.
That was the first, the second is left as an exercise.
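To check the explanation, both expressions can be run directly in a shell. This is only a sketch: it assumes the $$ has already been reduced to $ by make, and it reuses the sample value from the question.

```shell
# Sample value from the question; in the shell a single $ is used, not $$.
sw_version='software/module/1.11.0'

# First expression: drop the last /component, then any trailing _suffix
# (there is no underscore in this value, so the second command is a no-op).
sw=$(printf '%s\n' "$sw_version" | sed -e 's:/[^/]*$::;s:_[^/]*$::g')

# Second expression: drop everything up to the last slash, then anything
# from the first hyphen to the end of the line (no hyphen here either).
version=$(printf '%s\n' "$sw_version" | sed -e 's:.*/::g;s/-.*$//')

printf 'sw=%s version=%s\n' "$sw" "$version"
# prints: sw=software/module version=1.11.0
```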

Related

How to grep an exact string with slash in it?

I'm running macOS.
There are the following strings:
/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2
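I want to grep only the bolded words, that is, the lines in which superman is a complete path component: /superman, /superman/batman, /superman/wonderwoman, /batman/superman, and /wonderwoman/superman.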
I figured doing grep -wr 'superman/|/superman' would yield all of them, but it only yields /superman.
Any idea how to go about this?
You may use
grep -E '(^|/)superman($|/)' file
See the online demo:
s="/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2"
grep -E '(^|/)superman($|/)' <<< "$s"
Output:
/superman
/superman/batman
/superman/wonderwoman
/batman/superman
/wonderwoman/superman
The pattern matches
(^|/) - start of string or a slash
superman - a word
($|/) - end of string or a slash.
grep '/superman\>'
\> is the "end of word" marker; in "superman3" the word does not end after "man", so that line does not match.
The problems with your -w solution:
| is not special in a basic regex. You either need to escape it or use grep -E
read the man page about how -w works:
The test is that the
matching substring must either be at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end of the line or followed by a
non-word constituent character
In the case where the line is /batman/superman,
the pattern superman/ does not appear
the pattern /superman is:
at the end of the line, which is OK, but
is preceded by the character "n", which is a word constituent character.
grep -w superman will give you better results, or if you need to have superman preceded by a slash, then my original answer works.
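The effect of -w described above can be seen by running both patterns over the sample lines from the question. This is a small sketch; the variable names are only for illustration.

```shell
lines='/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2'

# With the leading slash in the pattern, -w rejects /batman/superman and
# /wonderwoman/superman: the match "/superman" is preceded by "n".
with_slash=$(printf '%s\n' "$lines" | grep -w '/superman')

# Without the slash, the match "superman" is preceded by "/", a non-word
# character, so all five wanted lines are found.
word_only=$(printf '%s\n' "$lines" | grep -w 'superman')

printf '%s\n' "$with_slash"
printf -- '--\n'
printf '%s\n' "$word_only"
```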

sed & protobuf: need to delete dots

I need to delete dots using sed, but not all dots.
- repeated .CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
+ repeated CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
Here the dot after repeated (which can also be optional | required | extend) should be deleted.
- rpc NotifyBroadcastViewerState (.CBroadcast_BroadcastViewerState_Notification) returns (.NoResponse)
+ rpc NotifyBroadcastViewerState (CBroadcast_BroadcastViewerState_Notification) returns (NoResponse)
And here delete dot after (
It should work on multiple files with different content.
Full code can be found here
A perhaps simpler solution (works with both GNU sed and BSD/macOS sed):
sed -E 's/([[:space:][:punct:]])\./\1/g' file
In case a . can also appear as the first character on a line, use the following variation:
sed -E 's/(^|[[:space:][:punct:]])\./\1/g' file
The assumption is that any . preceded by:
a whitespace character (character class [:space:])
as in:  .
or a punctuation character (character class [:punct:])
as in: (.
should be removed, by replacing the matched sequence with just the character preceding the ., captured via subexpression (...) in the regular expression, and referenced in the replacement string with \1 (the first capture group).
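As a quick check, the variation with the ^ alternative can be applied to the two sample lines from the question. The helper name strip_dots and the third input are made up for this sketch.

```shell
# Hypothetical helper wrapping the command from the answer.
strip_dots() { sed -E 's/(^|[[:space:][:punct:]])\./\1/g'; }

# Dot after a space: removed. Dot after "Response" (a letter): kept.
field=$(printf '%s\n' 'repeated .CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1' | strip_dots)

# Dots after "(": removed.
rpc=$(printf '%s\n' 'rpc NotifyBroadcastViewerState (.CBroadcast_BroadcastViewerState_Notification) returns (.NoResponse)' | strip_dots)

# Dot as the very first character of the line (invented input): removed.
leading=$(printf '%s\n' '.Foo.Bar x' | strip_dots)

printf '%s\n' "$field" "$rpc" "$leading"
```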
If you invert the logic, you can try the simpler:
sed -E 's/([^[:alnum:]])\./\1/g' file
In case a . can also appear as the first character on a line:
sed -E 's/(^|[^[:alnum:]])\./\1/g' file
This removes every period that is not preceded by an alphanumeric character (a letter or digit); the ^ inside the brackets negates the class.
Assuming only the leading . needs removal, here's some GNU sed code:
echo '.a_b.c c.d (.e_f.g) ' |
sed 's/^/& /;s/\([[:space:]{([]\+\)\.\([[:alpha:]][[:alpha:]_.]*\)/\1\2/g;s/^ //'
Output:
a_b.c c.d (e_f.g)
Besides the ., it checks for two fields, which are left intact:
Leading whitespace, or any opening (, [, or {.
Trailing alphabetic characters, _, or . (the first character must be alphabetic).
Unfortunately, while \+ matches one or more spaces or opening brackets, it fails if the . is at the beginning of the line. (Replacing the \+ with a * would match at the beginning, but would incorrectly change c.d to cd.) So there's a kludge: s/^/& / inserts a dummy space at the beginning of the line so that the \+ works as desired, then s/^ // removes the dummy space.

Using sed to replace space delimited strings

echo 'bar=start "bar=second CONFIG="$CONFIG bar=s buz=zar bar=g bar=ggg bar=f bar=foo bar=zoo really?=yes bar=z bar=yes bar=y bar=one bar=o que=idn"' | sed -e 's/^\|\([ "]\)bar=[^ ]*[ ]*/\1/g'
Actual output:
CONFIG="$CONFIG buz=zar bar=ggg bar=foo really?=yes bar=yes bar=one que=idn"
Expected output:
CONFIG="$CONFIG buz=zar really?=yes que=idn"
What am I missing in my regex?
Edit:
This works as expected (with GNU sed):
's/\(^\|\(['\''" ]\)\)bar=[^ ]*/\2/g; s/[ ][ ]\+/ /g; s/[ ]*\(['\''"]\+\)[ ]*/\1/g'
POSIX sed regular expressions are pretty limited. They don't include \w as a synonym for [a-zA-Z0-9_], for example. They also don't include \b, which matches the zero-length string at the beginning or end of a word (which you really want in this situation...). GNU sed supports both as extensions.
s/ bar=[^ ]* *//
is close, but the problem is that the trailing " *" also removes the space that must precede the next bar=. So, in ... bar=aaa bar=bbb ... the first match is " bar=aaa " (including the trailing space), leaving bar=bbb ... for the second attempt, which fails because the space before bar has already been consumed.
s/ bar=[^ ]*//
is better -- don't consume the trailing spaces, leave them for the next match attempt. If you want to match bar=something even if it's at the beginning of the string, insert a space at the beginning first:
sed 's/^bar=/ bar=/; s/ bar=[^ ]*//g'
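The difference between the two patterns shows up as soon as two bar= words are adjacent and the g flag is used. A minimal sketch on an invented input string:

```shell
# Invented input with two adjacent bar= words.
input='x bar=aaa bar=bbb y'

# Trailing " *" swallows the space that the next " bar=" needs,
# so the second bar= survives and is glued to the preceding text.
greedy=$(printf '%s\n' "$input" | sed 's/ bar=[^ ]* *//g')

# Without the trailing " *", the space is left for the next match.
lean=$(printf '%s\n' "$input" | sed 's/ bar=[^ ]*//g')

printf '%s\n' "$greedy" "$lean"
# prints:
# xbar=bbb y
# x y
```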
If you want to remove all instances of bar=something then you can simplify your regex as such:
\sbar=\w+
This matches all bar= plus all whole words. The bar= must be preceded by a whitespace character.
Demonstration:
https://regex101.com/r/xbBhJZ/3
As sed:
s/\sbar=\w\+//g
This correctly accounts for foobar=bar.
Like Waxrat's answer, you have to insert a space at the beginning for it to properly match as it's now matching against a preceding whitespace character before the bar=. This can be easily done since you're quoting your string explicitly.

regex for capturing path from a string with optional character ~ (perl|awk|sed|..)

I want to match everything between first and last slash / including optional ~ before first slash.
I used this for the first part:
echo ~~a~/dir1/di r2/b.c \
| perl -pe 's/[^\/]*(\/.*\/).*/\1/'
which produces /dir1/di r2/.
This match includes the tilde:
perl -pe 's/.*(~\/.*\/).*/\1/'
but adding ? for optional character doesn't seem to work like in these cases:
perl -pe 's/.*(~?\/.*\/).*/\1/' -> /di r2/
perl -pe 's/.*((?:~)\/.*\/).*/\1/' -> ~~a/dir1/di r2/b.c
What am I doing wrong?
If I understood the desired output right, this works for me with or without tilde
echo "path /d1/d2/43a/" | perl -nE 'm{ ( ~? (?: /.*/ | /) ) }x; say "$1"'
Prints
/d1/d2/43a/
Same Perl code, with a tilde before the first slash in the input
echo "path ~/d1/d2/43a/" | perl -nE 'm{ ( ~? (?: /.*/ | /) ) }x; say "$1"'
prints
~/d1/d2/43a/
Notes: Using \1 in the replacement part of a substitution is discouraged (Perl warns about it when warnings are enabled); use $1 instead. With {} as the delimiters we don't have to escape /, which makes the pattern more readable (with delimiters other than // we can't leave out the m in front). Otherwise the same works when using / as the delimiter and escaping it inside the pattern.
Update
To also catch a lone ~/ (or /), the simplest change was to add that case explicitly: /.*/ | /. In order to capture the (optional) ~ in both cases there is a (non-capturing) grouping around this alternation. I removed the -w flag so that no warnings are issued when the input string has no slashes at all; in that case only an empty line is printed.
Original requirements
File data
~~a~/dir1/di r2/b.c
/dir1/di r2/z.y
~/dir1/di r3/p.q
gobbledegook~/name/more/still/more/notwanted.c
xxx~//yyy
Script
perl -ple 's%(?:^.*?)((?:^|~)/.*/).*%$1%' data
Example output
~/dir1/di r2/
/dir1/di r2/
~/dir1/di r3/
~/name/more/still/more/
~//
Is that what you needed?
Dissecting the regex
s%(?:^.*?)((?:^|~)/.*/).*%$1%
The first part, (?:^.*?) is a non-capturing non-greedy match for an arbitrary sequence of characters at the start of the line.
The second part, ((?:^|~)/.*/), is a capturing expression that contains a non-capturing term that matches at the start of a line, or a tilde, followed by a slash and a greedy anything up to the last slash on the line.
The trailing .* matches everything after the second part.
The replacement is simply what was captured; the rest is Perl being Perl.
Revised requirements
The original problem statement was incomplete, it seems. Apparently:
for single slash it should output just / (with accompanying tilde if present). For no slashes preferably empty string as there is no match. … And for this case ~a b/c/d.f it returns full string; instead it should return /c/.
So, here is a revised script to deal with the special extra cases (what happened to 'learning how to fish'?). The ~a b/c/d.f case was caused by a missing ? quantifier on the 'start of string or tilde' grouping.
Revised data file
~~a~/dir1/di r2/b.c
/dir1/di r2/z.y
~/dir1/di r3/p.q
gobbledegook~/name/more/still/more/notwanted.c
xxx~//yyy
not-a-slash-in-sight
just-the-one/with-extra-info
just-the~/with-more-info
~/one-slash-at-start-with-tilde
/one-slash-at-start-without-tilde
~a b/c/d.f
Revised script
perl -ple 's%^[^/]*$%%; s%(?:^[^/]*?)((?:^|~)?/)[^/]*$%$1%; s%(?:^[^/]*?)((?:^|~)?/.*/).*%$1%' data
A mildly modified version of the original expression comes last.
The first s/// looks for lines without any / and replaces them with nothing.
The second s/// looks for lines with exactly one slash, possibly preceded by a tilde or the start of the line and followed by non-slashes to the end of the line, and replaces the line with the optional tilde and the slash.
When either of the first two matches, its output does not match the third s///, so the result is left alone.
Revised output (the empty line printed for not-a-slash-in-sight, between ~// and /, is not shown)
~/dir1/di r2/
/dir1/di r2/
~/dir1/di r3/
~/name/more/still/more/
~//
/
~/
~/
/
/c/

Unable to figure out regex bash or sed or awk

I want to trim jdk-1.6.0_30-fcs.x86_64 down to just jdk-1.6.0_30. I tried sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way; is there a way to start from the end of the word and traverse backwards until I encounter -?
Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation. The engine tries to match the literal - first. That will first work at the first - in the string (obviously). Then [^-]* matches as many non-- characters as possible, so it will consume 1.6.0_30 (because the next character is in fact a hyphen). Now the engine will try to match $, but that does not work because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine will abandon matching at the first - and continue through the string. Then the engine will match the literal - with the second -. Now [^-]* will consume fcs.x86_64. Now we are actually at the end of the string and $ will match, so the full match (which will be removed) is -fcs.x86_64.
Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2
Try doing this:
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If using bash, sh or ash, you can do:
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix 3.
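For the "start from the end and go back to the last -" idea in the question, parameter expansion also has a form that does not depend on the literal fcs. A sketch comparing it with the anchored sed pattern from the first answer (variable names are illustrative):

```shell
var='jdk-1.6.0_30-fcs.x86_64'

# Anchored sed pattern: remove the last hyphen and everything after it.
by_sed=$(printf '%s\n' "$var" | sed 's/-[^-]*$//')

# ${var%-*} removes the shortest suffix matching "-*", which is the text
# from the last hyphen to the end of the value.
by_expansion=${var%-*}

printf '%s\n' "$by_sed" "$by_expansion"
# prints jdk-1.6.0_30 twice
```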