What does "%%" mean in a sed regex? - regex

In the code base, there is a line:
echo x/y/z | sed 's%/[^/]*$%%'
It will remove the /z from the input string.
I can't quite understand it.
s is substitution
%/ is quoting /
[^/]*$ matches any number of characters other than /, up to the end of the line
But what is %% here?

Here's info sed:
The '/' characters may be uniformly replaced by any other single
character within any given 's' command. The '/' character (or whatever
other character is used in its stead) can appear in the REGEXP or
REPLACEMENT only if it is preceded by a '\' character.
So the % is just an arbitrary delimiter. The canonical delimiter is /, but that would collide with the / inside your pattern.
In other words, %/ isn't an escaped /. They're independent characters.
The expression breaks down like this:
s Replace
% Delimiter
/[^/]*$ Search pattern
% Delimiter
Empty replacement string
% Delimiter
Which is completely analogous to a simple s/foo/bar/:
s Replace
/ Delimiter
foo Search pattern
/ Delimiter
bar Replacement string
/ Delimiter
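To see that the delimiter really is arbitrary, here is a small sketch that runs the same substitution with three different delimiters. The function names are made up for illustration; only the sed expressions matter. With / as the delimiter, the / in the pattern would have to be escaped, which is why it is not used here.

```shell
# The same substitution three times; only the delimiter character differs.
strip_last_percent() { sed 's%/[^/]*$%%'; }   # % as delimiter, as in the question
strip_last_pipe()    { sed 's|/[^/]*$||'; }   # | as delimiter
strip_last_comma()   { sed 's,/[^/]*$,,'; }   # , as delimiter

echo x/y/z | strip_last_percent   # prints x/y
echo x/y/z | strip_last_pipe      # prints x/y
echo x/y/z | strip_last_comma     # prints x/y
```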

Related

Substring using Regex in Shell or bash

I've a huge text file having row items like following
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570655"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570656"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1042.html"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1043.html?piid=47570657"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1043.html?piid=47570658"
I want to extract alpha-numeric character after last occurrence of '-' and before '.html' ('agcd1043' only) and save those values to another file.
Kindly help me do this using a regex ( .-(.+).html. is the regex I used in Notepad++ for smaller files) or any other method. TIA
You could extract the string with sed:
sed 's/.*-\([^-]*\)\.html.*/\1/' <<< "https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570655"
If you have all your strings in a file you can iterate on it:
while IFS= read -r line
do
    # Quote "$line" so the shell does not split or glob the URL
    variable=$(sed 's/.*-\([^-]*\)\.html.*/\1/' <<< "$line")
    # ... use the value from $variable
done < /path/to/file
The sed script is a substitution, where:
.*-\([^-]*\)\.html.* is the pattern
\1 is the replacement
The pattern is written so that it captures any sequence of non-hyphen characters, i.e. [^-]*, trapped between a hyphen character - and the .html string. The dot character is escaped for regex purposes, hence the \.html pattern. The leading and trailing .* make sure that anything before the hyphen and after .html is matched too (and therefore replaced); otherwise it would appear in the output.
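Since the goal is to save the values to another file, a loop is not strictly needed: sed can read the whole file itself. A minimal sketch, assuming one URL per line; the names extract_ids, urls.txt and ids.txt are placeholders, and a temporary directory is used only to keep the example self-contained.

```shell
# Hypothetical helper: print the id between the last '-' and '.html' on each line.
extract_ids() {
    sed 's/.*-\([^-]*\)\.html.*/\1/' "$@"
}

workdir=$(mktemp -d)
cat > "$workdir/urls.txt" <<'EOF'
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570655"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1042.html"
EOF

# One sed process handles every line; the result goes to a second file.
extract_ids "$workdir/urls.txt" > "$workdir/ids.txt"
ids=$(cat "$workdir/ids.txt")
echo "$ids"          # prints agcd1041 and agcd1042, one per line
rm -rf "$workdir"
```

The surrounding double quotes disappear as well, because the leading and trailing .* swallow them.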

sed & protobuf: need to delete dots

I need to delete dots using sed, but not all dots.
- repeated .CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
+ repeated CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
Here the dot after repeated should be deleted (instead of repeated, the keyword can also be optional | required | extend).
- rpc NotifyBroadcastViewerState (.CBroadcast_BroadcastViewerState_Notification) returns (.NoResponse)
+ rpc NotifyBroadcastViewerState (CBroadcast_BroadcastViewerState_Notification) returns (NoResponse)
And here delete dot after (
It should work on multiple files with different content.
A perhaps simpler solution (works with both GNU sed and BSD/macOS sed):
sed -E 's/([[:space:][:punct:]])\./\1/g' file
In case a . can also appear as the first character on a line, use the following variation:
sed -E 's/(^|[[:space:][:punct:]])\./\1/g' file
The assumption is that any . preceded by:
a whitespace character (character class [:space:])
as in:  .
or a punctuation character (character class [:punct:])
as in: (.
should be removed, by replacing the matched sequence with just the character preceding the ., captured via subexpression (...) in the regular expression, and referenced in the replacement string with \1 (the first capture group).
If you invert the logic, you can try the simpler:
sed -E 's/([^[:alnum:]])\./\1/g' file
In case a . can also appear as the first character on a line:
sed -E 's/(^|[^[:alnum:]])\./\1/g' file
This removes every period that is not preceded by an alphanumeric character (a letter or digit); the ^ inside the brackets negates the class.
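As a quick check, the inverted-logic command can be run against the two sample lines from the question. The function name is made up for illustration. One caveat worth keeping in mind: _ is not alphanumeric, so a . directly after an underscore would also be removed; that does not occur in the samples.

```shell
# Hypothetical wrapper around the command above.
strip_leading_dots() {
    sed -E 's/(^|[^[:alnum:]])\./\1/g'
}

printf '%s\n' \
    'repeated .CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1' \
    'rpc NotifyBroadcastViewerState (.CBroadcast_BroadcastViewerState_Notification) returns (.NoResponse)' |
    strip_leading_dots
# The dot after "repeated " and after each "(" is gone;
# the dot in "_Response.PersonaName" stays because it follows a letter.
```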
Assuming only the leading . needs removal, here's some GNU sed code:
echo '.a_b.c c.d (.e_f.g) ' |
sed 's/^/& /;s/\([[:space:]{([]\+\)\.\([[:alpha:]][[:alpha:]_.]*\)/\1\2/g;s/^ //'
Output:
a_b.c c.d (e_f.g)
Besides the ., it checks for two fields, which are left intact:
Leading whitespace, or any opening (, [, or {.
Trailing alphabetical chars or also _ or ..
Unfortunately, while the \+ regexp matches one or more spaces (or opening brackets), it fails if the . is at the beginning of the line. (Replacing the \+ with a \* would match at the beginning, but would incorrectly change c.d to cd.) So there's a kludge: s/^/& / inserts a dummy space at the beginning of the line so that the \+ works as desired, then s/^ // removes the dummy space.
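If extended regular expressions are acceptable, the dummy space may not be needed: a (^|...) group, as in the -E answers above, lets the line start stand in for the leading character. This is a sketch of that idea, not the answerer's code; the function name is hypothetical, and it matches a single leading character rather than a run of them.

```shell
# Hypothetical -E variant: a dot is dropped when it sits at the start of the line
# or right after whitespace, {, ( or [, and is followed by a letter.
strip_dots_ere() {
    sed -E 's/(^|[[:space:]{([])\.([[:alpha:]][[:alpha:]_.]*)/\1\2/g'
}

printf '%s\n' '.a_b.c c.d (.e_f.g) ' | strip_dots_ere
# prints: a_b.c c.d (e_f.g)   (the trailing space of the input is kept)
```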

How to regexp match surrounding whitespace or beginning/end of line

I am trying to find lines in a file that contain a / (slash) character which is not part of a word, like this:
grep "\</\>" file
But no luck, even if the file contains the "/" alone, grep does not find it.
I want to be able to match lines such as
some text / pictures
/ text
text /
but not e.g.
/home
Why your approach does not work
\<, \> only match against the beginning (or end, respectively) of a word. That means that they can never match if put adjacent to / (which is not treated as a word-character) – because e.g. \</ basically says "match the beginning of a word directly followed by something other than a word (a 'slash', in this case)", which is impossible.
What will work
This will match / surrounded by whitespace (\s) or beginning/end of line:
egrep '(^|\s)/($|\s)' file
(egrep is equivalent to grep -E, which turns on processing of extended regular expressions; grep -E is the preferred spelling.)
What might also work
The following slightly simpler expression will work if a / is never adjacent to non-word characters (such as *, #, -, and characters outside the ASCII range); it might be of limited usefulness in OP's case:
grep '\B/\B' file
for str in 'some text / pictures' ' /home ' '/ text' ' text /'; do
echo "$str" | egrep '(^|\s)/($|\s)'
done
This will match /:
if the entire input string is /
if the input string starts with / and is followed by at least 1 whitespace
if the input string ends with / and is preceded by at least 1 whitespace
if / is inside the input string surrounded by at least 1 whitespace on either side.
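Note that \s is a GNU extension. Where portability matters, the same test can be sketched with the POSIX class [[:space:]] and grep -E; the function name below is made up for illustration.

```shell
# Hypothetical helper: keep only lines where / stands alone,
# i.e. has whitespace or a line boundary on both sides.
slash_alone() {
    grep -E '(^|[[:space:]])/($|[[:space:]])'
}

printf '%s\n' 'some text / pictures' '/home' '/ text' 'text /' '/' 'a/b' | slash_alone
# prints the four lines with a standalone slash:
#   some text / pictures
#   / text
#   text /
#   /
```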
As for why grep "\</\>" file did not work:
\< and \> match the left/right boundaries between words and non-words. However, / does not qualify as a word, because words are defined as a sequence of one or more instances of characters from the set [[:alnum:]_], i.e.: sequences of at least length 1 composed entirely of letters, digits, or _.
This seems to work for me.
grep -rni " / \| /\|/ " .

regular expression what's the meaning of this regular expression s#^.*/##s

What is the meaning of s#^.*/##s?
I know that in a pattern '.' matches any character except \n,
so '.*' should match any number of arbitrary characters.
But the book says that this deletes the directory part of any Unix-style path.
My question is: does it mean I can replace any number of arbitrary characters with a space?
s -> substitution
# -> pattern delimiter
^.* -> all chars 0 or more times from the beginning
/ -> literal /
## -> replace by nothing (2 delimiters)
s -> single line mode ( the dot can match newline)
Substitutions conventionally use the / character as a delimiter (s/this/that/), but you can use other punctuation characters if it's more convenient. In this case, # is used because the regexp itself contains a / character; if / were used as the delimiter, any / in the pattern would have to be escaped as \/. (# is not the character I would have chosen, but it's perfectly valid.)
^ matches the beginning of the string (or line; see below)
.*/ matches any sequence of characters up to and including a / character. Since * is greedy, it will match all characters up to and including the last / character; any preceding / characters are "eaten" by the .*. (The final / is not, because if .* matched all / characters the final / would fail to match.)
The trailing s modifier treats the string as a single line, i.e., causes . to match any character including a newline. See the m and s modifiers in perldoc perlre for more information.
So this:
s#^.*/##s
replaces everything from the beginning of the string ($_ in this case, since that's the default) up to the last / character by nothing.
If there are no / characters in $_, the match fails and the substitution does nothing.
This might be used to replace all directory components of an absolute or relative path name, for example changing /home/username/dir/file.txt to file.txt.
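For comparison, the same idea can be tried from the shell. This is a sed analog, not the Perl operator itself: sed handles input a line at a time here, so Perl's trailing s modifier has no direct counterpart and is left out. The helper name is hypothetical.

```shell
path=/home/username/dir/file.txt

# sed analog of the substitution, again with # as the delimiter.
basename_sed() {
    printf '%s\n' "$1" | sed 's#^.*/##'
}

basename_sed "$path"       # prints file.txt
basename_sed file.txt      # no slash, so the input is printed unchanged

# Shell-only alternative: remove the longest prefix matching */
echo "${path##*/}"         # prints file.txt
```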
It will delete all characters, including line breaks because of the s modifier, in a string until the last slash included.
Please excuse a little pedantry. But I keep seeing this and I think it's important to get it right.
s#^.*/##s is not a regular expression.
^.*/ is a regular expression.
s/// is the substitution operator.
The substitution operator takes two arguments. The first is a regular expression. The second is a replacement string.
The substitution operator (like many other quote-like operators in Perl) allows you to change the delimiter character that you use.
So s### is also a substitution operator (just using # instead of /).
s#^.*/## means "find the text that matches the regular expression ^.*/ and replace it with an empty string". And the s on the end is an option which changes the regex so that the . matches "\n" as well as all other characters.

Specify a "-" in a sed pattern

I'm trying to find a '-' character in a line, but this character is used for specifying a range. Can I get an example of a sed pattern that will contain the '-' character?
Also, it would be quicker if I could use a pattern that includes all characters except a space and a tab.
'-' specifies a range only between square brackets.
For example, this:
sed -n '/-/p'
prints all lines containing a '-' character. If you want a '-' to represent itself between square brackets, put it immediately after the [ or before the ]. This:
sed -n '/[-x]/p'
prints all lines containing either a '-' or a 'x'.
This pattern:
[^ <tab>]
matches all characters other than a space and a tab (note that you need a literal tab character, not "<tab>").
If you want to find a dash, specify it outside a character class:
/-/
If you want to include it in a character class, make it the first or last character:
/[-a-z]/
/[a-z-]/
If you want to find anything except a blank or tab (or newline), then:
/[^ \t]/
where, I hasten to add, the '\t' is a literal tab character and not backslash-t.
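A small sketch of both points, using made-up sample lines. The literal tab is produced with printf and spliced into the pattern; [[:blank:]] is shown as a POSIX class that covers space and tab without needing a literal tab at all.

```shell
tab=$(printf '\t')   # a literal tab, since \t inside brackets is not portable

# Hypothetical sample input: three lines with text, one with only blanks.
sample() {
    printf '%s\n' 'a-b' 'xyz' 'plain' "  $tab "
}

# Lines containing '-' or 'x': the '-' comes first in the brackets, so it is literal.
sample | sed -n '/[-x]/p'           # prints a-b and xyz

# Lines containing at least one character that is neither a space nor a tab.
sample | sed -n "/[^ $tab]/p"       # prints a-b, xyz and plain

# The same test with a character class instead of a literal tab.
sample | sed -n '/[^[:blank:]]/p'   # prints a-b, xyz and plain
```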
To find "-" in a file:
awk '/-/' file