How to regexp match surrounding whitespace or beginning/end of line

I am trying to find lines in a file that contain a / (slash) character which is not part of a word, like this:
grep "\</\>" file
But no luck, even if the file contains the "/" alone, grep does not find it.
I want to be able to match lines such as
some text / pictures
/ text
text /
but not e.g.
/home

Why your approach does not work
\<, \> only match against the beginning (or end, respectively) of a word. That means that they can never match if put adjacent to / (which is not treated as a word-character) – because e.g. \</ basically says "match the beginning of a word directly followed by something other than a word (a 'slash', in this case)", which is impossible.
What will work
This will match / surrounded by whitespace (\s) or beginning/end of line:
egrep '(^|\s)/($|\s)' file
(egrep implies the -E option, which turns on processing of extended regular expressions.)
What might also work
The following slightly simpler expression will work if a / is never adjacent to a non-word character other than whitespace (such as *, #, -, and characters outside the ASCII range); it might be of limited usefulness in OP's case:
grep '\B/\B' file
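As a rough check of the two expressions above, here is a sketch that runs both against the sample lines from the question plus one extra line, a */* b, which is my own addition for illustration. It uses [[:space:]] in place of \s, since \s is a GNU extension; the \B form also assumes GNU grep.

```shell
# Sample lines from the question, plus one line (added here for illustration)
# where / sits between punctuation characters.
samples='some text / pictures
/ text
text /
/home
a */* b'

# Strict form: / must be surrounded by whitespace or the start/end of the line.
strict=$(printf '%s\n' "$samples" | grep -E '(^|[[:space:]])/($|[[:space:]])')

# Looser form (assumes GNU grep): / must not touch a word character.
loose=$(printf '%s\n' "$samples" | grep '\B/\B')

printf 'strict:\n%s\n\nloose:\n%s\n' "$strict" "$loose"
```

The strict pattern should keep only the three lines the question asks for, while the \B pattern also lets the */* line through, which is the limitation described above.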

for str in 'some text / pictures' ' /home ' '/ text' ' text /'; do
echo "$str" | egrep '(^|\s)/($|\s)'
done
This will match /:
if the entire input string is /
if the input string starts with / and is followed by at least 1 whitespace
if the input string ends with / and is preceded by at least 1 whitespace
if / is inside the input string surrounded by at least 1 whitespace on either side.
As for why grep "\</\>" file did not work:
\< and \> match the left/right boundaries between words and non-words. However, / does not qualify as a word, because words are defined as a sequence of one or more instances of characters from the set [[:alnum:]_], i.e.: sequences of at least length 1 composed entirely of letters, digits, or _.
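To see the difference in practice, here is a small sketch, assuming GNU grep (where \< and \> are supported). The two input strings are taken from the question; the variable names are just for illustration.

```shell
line='some text / pictures'
path='/home'

# \< and \> need a word character right next to them, so a whole word matches...
word_hits=$(printf '%s\n' "$path" | grep -c '\<home\>' || true)

# ...but a lone slash is not a word character, so \</\> can never match.
slash_hits=$(printf '%s\n' "$line" | grep -c '\</\>' || true)

echo "word_hits=$word_hits slash_hits=$slash_hits"
```

grep -c prints the number of matching lines, so the first count should be 1 and the second 0, even though the second input contains a / on its own.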

This seems to work for me.
grep -rni " / \| /\|/ " .

Related

What does "%%" mean in a sed regex?

In the code base, there is a line:
echo x/y/z | sed 's%/[^/]*$%%'
It will remove the /z from the input string.
I can't quite understand it.
s is substitution
%/ is quoting /
[^/]*$ means matching any characters except / any times from the end of line
But what is %% here?

Here's info sed:
The '/' characters may be uniformly replaced by any other single
character within any given 's' command. The '/' character (or whatever
other character is used in its stead) can appear in the REGEXP or
REPLACEMENT only if it is preceded by a '\' character.
So the % is just an arbitrary delimiter. The canonical delimiter is /, but that collides with your pattern which is also /.
In other words, %/ isn't an escaped /. They're independent characters.
The expression breaks down like this:
s Replace
% Delimiter
/[^/]*$ Search pattern
% Delimiter
Empty replacement string
% Delimiter
Which is completely analogous to a simple s/foo/bar/:
s Replace
/ Delimiter
foo Search pattern
/ Delimiter
bar Replacement string
/ Delimiter
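To check that the delimiter really is arbitrary, here is a minimal sketch that runs the same substitution with three different delimiters. The comma variant and the variable names are my own additions for illustration.

```shell
input='x/y/z'

# Original form: % as the delimiter, so / needs no escaping.
with_percent=$(printf '%s\n' "$input" | sed 's%/[^/]*$%%')

# Same command with the canonical / delimiter: every / in the pattern is escaped.
with_slash=$(printf '%s\n' "$input" | sed 's/\/[^\/]*$//')

# Any other single character works too, for example a comma.
with_comma=$(printf '%s\n' "$input" | sed 's,/[^/]*$,,')

printf '%s\n%s\n%s\n' "$with_percent" "$with_slash" "$with_comma"
```

All three commands should print x/y; the % version is simply the easiest to read because the pattern itself contains slashes.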

sed & protobuf: need to delete dots

I need to delete dots using sed, but not all dots.
- repeated .CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
+ repeated CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
Here the dot after repeated should be deleted (repeated can also be optional | required | extend).
- rpc NotifyBroadcastViewerState (.CBroadcast_BroadcastViewerState_Notification) returns (.NoResponse)
+ rpc NotifyBroadcastViewerState (CBroadcast_BroadcastViewerState_Notification) returns (NoResponse)
And here delete dot after (
It should work on multiple files with different content.
Full code can be found here

A perhaps simpler solution (works with both GNU sed and BSD/macOS sed):
sed -E 's/([[:space:][:punct:]])\./\1/g' file
In case a . can also appear as the first character on a line, use the following variation:
sed -E 's/(^|[[:space:][:punct:]])\./\1/g' file
The assumption is that any . preceded by:
a whitespace character (character class [:space:])
as in:  .
or a punctuation character (character class [:punct:])
as in: (.
should be removed, by replacing the matched sequence with just the character preceding the ., captured via subexpression (...) in the regular expression, and referenced in the replacement string with \1 (the first capture group).
If you invert the logic, you can try the simpler:
sed -E 's/([^[:alnum:]])\./\1/g' file
In case a . can also appear as the first character on a line:
sed -E 's/(^|[^[:alnum:]])\./\1/g' file
This replaces all periods that are not (^) preceded by an alphanumeric character (a letter or digit).
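As a quick sanity check, here is a sketch that feeds the two "before" lines from the question through the last command. The variable names are only for illustration, and it covers nothing beyond these two sample lines.

```shell
# The two "before" lines from the question.
before='repeated .CBroadcast_GetBroadcastChatUserNames_Response.PersonaName persona_names = 1
rpc NotifyBroadcastViewerState (.CBroadcast_BroadcastViewerState_Notification) returns (.NoResponse)'

# Drop any . that starts the line or follows a non-alphanumeric character;
# \1 puts back whatever character preceded the dot.
after=$(printf '%s\n' "$before" | sed -E 's/(^|[^[:alnum:]])\./\1/g')

printf '%s\n' "$after"
```

The dot inside _Response.PersonaName follows a letter, so it is left alone, while the dots after the space and after ( are removed. Note that _ is not alphanumeric, so a dot directly after an underscore would be removed as well.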

Assuming only the leading . needs removal, here's some GNU sed code:
echo '.a_b.c c.d (.e_f.g) ' |
sed 's/^/& /;s/\([[:space:]{([]\+\)\.\([[:alpha:]][[:alpha:]_.]*\)/\1\2/g;s/^ //'
Output:
a_b.c c.d (e_f.g)
Besides the ., it checks for two fields, which are left intact:
Leading whitespace, or any opening (, [, or {.
Trailing alphabetical chars or also _ or ..
Unfortunately, while the \+ regexp matches one or more spaces et al., it fails if the . is at the beginning of the line. (Replacing the \+ with a \* would match at the beginning, but would incorrectly change c.d to cd.) So there's a kludge: s/^/& / inserts a dummy space at the beginning of the line so that the \+ works as desired, then s/^ // removes the dummy space.

What's special about a "space" character in an "expr match" regexp?

In a bash shell, I set line like so:
line="total active bytes: 256"
Now, I just want to get the digits from that line so I do:
echo $(expr match "$line" '.*\([[:digit:]]*\)' )
and I don't get anything. But, if I add a space character before the first backslash in the regexp, then it works:
echo $(expr match "$line" '.* \([[:digit:]]*\)' )
Why?

The space isn't special at all. What's happening is that in the first case, the .* matches the entire string (i.e., it matches "greedily"), including the numbers, and since you've quantified the digits with * (as opposed to \+), that part of the regex is allowed to match 0 characters.
By putting a space before the digit match, the first part can only match up to but not including the last space in the string, leaving the digits to be matched by \([[:digit:]]*\).
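Here is a small sketch of the same effect. It uses the expr STRING : REGEX spelling, which is the POSIX form of expr match; the third pattern is an alternative of my own that avoids relying on the space at all.

```shell
line="total active bytes: 256"

# Greedy .* swallows the digits too, so the group is left matching nothing.
greedy=$(expr "$line" : '.*\([[:digit:]]*\)' || true)

# The literal space forces .* to stop before the last space.
with_space=$(expr "$line" : '.* \([[:digit:]]*\)' || true)

# Alternative: consume only non-digits before the group.
non_digits=$(expr "$line" : '[^[:digit:]]*\([[:digit:]]*\)' || true)

printf 'greedy=[%s] with_space=[%s] non_digits=[%s]\n' "$greedy" "$with_space" "$non_digits"
```

The || true is there because expr exits with a non-zero status when the result is empty. The first result should come back empty and the other two should be 256.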

Decoding a regular expression in Perl

I am trying to decode the following Perl regular expression:
$fname =~ /\/([^\/]+)\.txt$/
What are we trying to match for here?

Here's how you break it down.
\/ - the literal character /
(...) - followed by a group that will be captured to $1
[ ... ] - a character class
^ - in a character class, this means take the inversion of the specified set
\/ - the literal character /
+ - one or more times
\. - the literal character .
txt - the literal string txt
$ - the end of the string
So, in other words, this is trying to match "anything with a / followed by one or more characters that are not /, followed by .txt, followed by the end of the string, and put the part before .txt into $1"
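For anyone who wants to poke at this from a shell, here is a rough analogue using grep and sed with extended regular expressions instead of Perl. The helper names whole_match and captured are hypothetical, and the sed version needs a leading .* because it rewrites the whole line rather than exposing $1.

```shell
# Whole match: grep -o prints only the part of the line the pattern matched.
whole_match() { printf '%s\n' "$1" | grep -oE '/[^/]+\.txt$' || true; }

# Capture group 1: sed prints what the parentheses captured,
# or nothing at all when the pattern does not match.
captured() { printf '%s\n' "$1" | sed -nE 's#.*/([^/]+)\.txt$#\1#p'; }

whole_match '/folder/path/file.txt'
captured '/folder/path/file.txt'
captured 'folder/path/file.txt'
captured '\folder\path\file.txt'
```

The first call shows that the match itself includes the leading slash, while the captured group is just the name without the extension. The last call prints nothing, because a path written with backslashes contains no / for the pattern to anchor on.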

\/([^\/]+)\.txt
This regular expression matches a file name as it appears in a path,
capturing it minus the extension, and
only when (or starting where) a forward slash directly precedes the file name.
Examples:
\folder\path\file.txt
Nothing is matched.
folder/path/file.txt
/file.txt is matched (and file is placed in capture group 1: $1).
/folder/path/file.txt
Again, /file.txt is matched (and file captured).
You can try it yourself at Debuggex

regular expression what's the meaning of this regular expression s#^.*/##s

What is the meaning of s#^.*/##s?
I know that in a pattern, '.' matches any single character except \n,
so '.*' should match any number of arbitrary characters.
But the book says that this deletes the directory part of any Unix-style path.
My question is: does this mean I can replace any number of arbitrary characters with a space?

s -> substitution
# -> pattern delimiter
^.* -> any characters, 0 or more times, from the beginning
/ -> literal /
## -> replace with nothing (2 delimiters)
s -> single-line mode (the dot can match a newline)

Substitutions conventionally use the / character as a delimiter (s/this/that/), but you can use other punctuation characters if it's more convenient. In this case, # is used because the regexp itself contains a / character; if / were used as the delimiter, any / in the pattern would have to be escaped as \/. (# is not the character I would have chosen, but it's perfectly valid.)
^ matches the beginning of the string (or line; see below)
.*/ matches any sequence of characters up to and including a / character. Since * is greedy, it will match all characters up to and including the last / character; any preceding / characters are "eaten" by the .*. (The final / is not, because if .* matched all / characters the final / would fail to match.)
The trailing s modifier treats the string as a single line, i.e., causes . to match any character including a newline. See the m and s modifiers in perldoc perlre for more information.
So this:
s#^.*/##s
replaces everything from the beginning of the string ($_ in this case, since that's the default) up to the last / character by nothing.
If there are no / characters in $_, the match fails and the substitution does nothing.
This might be used to replace all directory components of an absolute or relative path name, for example changing /home/username/dir/file.txt to file.txt.
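The same idea can be tried from a shell. This is only a sketch using sed rather than Perl: sed also accepts # as a delimiter, but it has no trailing s modifier because it works one line at a time. The ${path##*/} form is a plain shell alternative added for comparison.

```shell
path='/home/username/dir/file.txt'

# Greedy .* runs up to the last /, so everything before the file name is removed.
via_sed=$(printf '%s\n' "$path" | sed 's#^.*/##')

# Shell parameter expansion: strip the longest prefix that ends in /.
via_shell=${path##*/}

# With no / in the input, the pattern does not match and the text is unchanged.
no_slash=$(printf '%s\n' 'file.txt' | sed 's#^.*/##')

printf '%s\n%s\n%s\n' "$via_sed" "$via_shell" "$no_slash"
```

All three lines of output should read file.txt.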

It will delete all characters, including line breaks because of the s modifier, in a string until the last slash included.

Please excuse a little pedantry. But I keep seeing this and I think it's important to get it right.
s#^.*/##s is not a regular expression.
^.*/ is a regular expression.
s/// is the substitution operator.
The substitution operator takes two arguments. The first is a regular expression. The second is a replacement string.
The substitution operator (like many other quote-like operators in Perl) allows you to change the delimiter character that you use.
So s### is also a substitution operator (just using # instead of /).
s#^.*/## means "find the text that matches the regular expression ^.*/ and replace it with an empty string". And the s on the end is an option which changes the regex so that the . matches "\n" as well as all other characters.
