Regex find/replace stops after first word

Why is only one "the" replaced in the string even though I set the g (global) flag?
sed -E 's/(^|_)(the|an|is)(_|$)/\1/g' <<< "the_the_river"
= the_river

The problem is that the trailing _ is consumed by the first match, so it is no longer available to act as the leading _ of the next match. To let adjacent words match, you need either lookarounds or word boundaries. Word boundaries such as \<, \> or (in some versions) \b cannot be used in your case because the underscore is a word character.
An alternative is a Perl one-liner, since Perl regular expressions support lookarounds.
perl -pe 's/(?<![^_])(?:the|an|is)(?:_|$)//g' <<< "the_the_river"
river
(?<![^_]) is a negative lookbehind that fails if the word is preceded by any character other than an underscore. It therefore matches at the start of the string or at any position right after an underscore.
(?:the|an|is) is a non-capturing group alternating the different words.
(?:_|$) matches the underscore after the word or the end of the string, assuming you want to remove (consume) that underscore too.
See regex101 for testing the pattern
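
The difference between the consuming pattern and the lookbehind pattern can be reproduced in Python, whose re module also supports lookarounds. This is only a sketch: the function names are made up for illustration, and the patterns are copied from the sed and Perl commands above.

```python
import re

# Consuming pattern (same shape as the sed expression): the trailing "_" is
# part of the first match, so the second "the" has no "_" or "^" before it.
CONSUMING = re.compile(r"(^|_)(the|an|is)(_|$)")

# Lookbehind pattern (same as the Perl one-liner): the left-hand check
# consumes nothing, so adjacent words can each match.
LOOKBEHIND = re.compile(r"(?<![^_])(?:the|an|is)(?:_|$)")


def strip_consuming(s: str) -> str:
    """Remove the listed words, keeping the leading delimiter (group 1)."""
    return CONSUMING.sub(r"\1", s)


def strip_lookbehind(s: str) -> str:
    """Remove the listed words together with the underscore that follows."""
    return LOOKBEHIND.sub("", s)


if __name__ == "__main__":
    print(strip_consuming("the_the_river"))   # the_river
    print(strip_lookbehind("the_the_river"))  # river
```

A word embedded in a longer one, such as the "the" in "bathe_river", is left alone because the lookbehind sees a non-underscore character in front of it.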

Related

Match regex for given statement

I want to write regex for the following statement and match the bolded characters "The following strings must be matched
xyz.90001DUS.annotations and xyz.765896DUS.courses".
I tried to write one using regex but it is not matching above strings, can someone please help me?
It should match whole of bolded strings, this is the only criteria.
^xyz.([0-9])?DUS.annotations(.*)?\.annotations$

Your ^xyz.([0-9])?DUS.annotations(.*)?\.annotations$ cannot match the strings inside a longer string due to anchors, ^ and $. Besides, . matches any char other than line break chars, ([0-9])? matches a single optional digit (while you have five in 90001). The (.*)?\.annotations part would match any zero or more chars other than line break chars as many as possible consuming chars up to the last occurrence of .annotations.
What you can use is
xyz\.\d+DUS\.\w+
Or, with word boundaries:
\bxyz\.\d+DUS\.\w+ <<< In most NFA regex flavors
\yxyz\.\d+DUS\.\w+ <<< In PostgreSQL, Tcl
\mxyz\.\d+DUS\.\w+ <<< R (TRE), Tcl
\<xyz\.\d+DUS\.\w+ <<< GNU word boundary
[[:<:]]xyz\.\d+DUS\.\w+ <<< POSIX word boundary
See the regex demo. You do not need a word boundary after \w+: because \w+ is greedy, it always stops at a non-word char or at the end of the string, which is a word boundary anyway.
Details:
xyz\. - xyz.
\d+ - one or more digits
DUS\. - DUS.
\w+ - one or more word chars.
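
As a quick check, the sketch below runs both the original anchored pattern and the suggested \b variant against the sentence from the question, using Python's re module. The constant and function names are illustrative only.

```python
import re

TEXT = ("The following strings must be matched "
        "xyz.90001DUS.annotations and xyz.765896DUS.courses")

# The pattern from the question: anchored at both ends, one optional digit,
# and ".annotations" required twice.
ANCHORED = re.compile(r"^xyz.([0-9])?DUS.annotations(.*)?\.annotations$")

# The suggested pattern with a leading word boundary.
UNANCHORED = re.compile(r"\bxyz\.\d+DUS\.\w+")


def find_ids(s: str) -> list:
    """Return every xyz.<digits>DUS.<word> token found in s."""
    return UNANCHORED.findall(s)


if __name__ == "__main__":
    print(ANCHORED.search(TEXT))  # None
    print(find_ids(TEXT))
    # ['xyz.90001DUS.annotations', 'xyz.765896DUS.courses']
```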

Removing multiple consecutive words separated with whitespace

In the code below the pattern / man / occurs twice in a row. When I substitute that pattern, only the first occurrence is replaced and the second is left alone.
As I understand the problem, the first match extends up to the start of the second one (i.e., the space after man is both the end of the first match and the start of the second), so the second occurrence is not matched. How can I match this pattern globally when it occurs consecutively?
use strict;
use warnings;
#my $name =" man sky man "; #this works
my $name =" man man sky"; #this doesn't
$name =~s/ man / nam /g; #expected= ' nam nam sky'
print $name,"\n";

The regex consumes the characters it matches, so the space that ends the first match cannot also start the second one. To avoid this, use lookahead and lookbehind, which check the surrounding whitespace without consuming it. See perlre.
$name =~ s/(?<=\s)man(?=\s)/nam/g;
Quoting from perlre
Look Ahead:
(?=pattern)
A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches
a word followed by a tab, without including the tab in $&.
Look Behind:
(?<=pattern) \K
A zero-width positive lookbehind assertion. For example, /(?<=\t)\w+/ matches
a word that follows a tab, without including the tab in $&. Works only for
fixed-width lookbehind.

I understand you want to replace man in between whitespace characters or start/end of string.
In this case, you may use two approaches, with positive lookarounds containing alternation operator checking for string boundaries and/or whitespaces, or negative lookarounds checking for the non-whitespace chars on both ends of the search word.
Use either of the two:
$name =~ s/(?<=^|\s)man(?=\z|\s)/nam/g;
$name =~ s/(?<!\S)man(?!\S)/nam/g;
From the point of view of efficiency, the second option is better since alternation is a bit "expensive".
The (?<=^|\s) positive lookbehind matches a location in the string that is preceded by the start of string (^) or (|) a whitespace (\s), and the (?=\z|\s) positive lookahead makes sure there is a whitespace or the end of string (\z) immediately after man. Note that Perl versions before 5.30 reject a lookbehind whose alternatives differ in length, as ^ and \s do, so the first form may fail there; the second form has no such restriction.
The (?<!\S) negative lookbehind matches a location in the string that is not immediately preceded by a non-whitespace char (i.e. if there is a non-whitespace char there, there will be no match), and the (?!\S) negative lookahead asserts there is no non-whitespace char right after man.
See more details about Lookaround Assertions at perlre.
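
The three behaviours discussed in this thread can be compared side by side. The sketch below uses Python's re module rather than Perl, on the assumption that these particular lookarounds behave the same way there; the function names are hypothetical.

```python
import re


def replace_consuming(s: str) -> str:
    """Original approach: the spaces are part of the match and get consumed."""
    return re.sub(r" man ", " nam ", s)


def replace_positive(s: str) -> str:
    """Positive lookarounds: a whitespace char is required on both sides."""
    return re.sub(r"(?<=\s)man(?=\s)", "nam", s)


def replace_negative(s: str) -> str:
    """Negative lookarounds: no non-whitespace char allowed on either side,
    so the start and end of the string also count as boundaries."""
    return re.sub(r"(?<!\S)man(?!\S)", "nam", s)


if __name__ == "__main__":
    print(repr(replace_consuming(" man man sky")))  # ' nam man sky'
    print(repr(replace_positive(" man man sky")))   # ' nam nam sky'
    print(repr(replace_positive("man man")))        # 'man man'
    print(repr(replace_negative("man man")))        # 'nam nam'
```

The last two lines show why the negative form is the more general one: with no surrounding spaces, the positive lookarounds find nothing to anchor on.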

A regex for the last use of a word in a string

I'm trying to figure out how to grab the tail end of a string using a word as a delimiter, but that word can be used anywhere in the string. So only the last use would start the grab.
example: Go by the office and pickup milk by the safeway BY tomorrow
I want to grab the by tomorrow and not the other bys
This is the regex I'm trying to make robust:
$pattern = '/^(.*?)(#.*?)?(\sBY\s.*?)?(#.*)?$/i';
I think a negative lookahead would do it, but I've never used one before
Thanks!

I'm not sure what the other parts of your regex are for, but here's the one I would use:
$pattern = '/\bby\s(?!.*\bby\b).*?$/i';
regex101 demo
\b is a word boundary: it matches between a \w and a \W character, or at the beginning/end of the string when the adjacent character is a \w.
by matches by literally.
\s matches a space (also matches newlines, tabs, form feeds, carriage returns)
(?!.*\bby\b) is the negative lookahead and will prevent a match if there is another word by ahead.
.*?$ is to get the remaining part of the string till the end of the string.
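
To see the negative lookahead at work, here is a small sketch that applies the same pattern with Python's re module (the PHP delimiters and the i flag become re.IGNORECASE). The helper name is made up, and a single-line input is assumed because . does not match newlines.

```python
import re
from typing import Optional

# Same pattern as above: "by" as a whole word, followed by whitespace, with
# no further whole-word "by" anywhere later on the line.
LAST_BY = re.compile(r"\bby\s(?!.*\bby\b).*?$", re.IGNORECASE)


def tail_after_last_by(s: str) -> Optional[str]:
    """Return the text from the last whole-word 'by' to the end, or None."""
    m = LAST_BY.search(s)
    return m.group(0) if m else None


if __name__ == "__main__":
    text = "Go by the office and pickup milk by the safeway BY tomorrow"
    print(tail_after_last_by(text))  # BY tomorrow
```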

To match the last BY (uppercase and lowercase letters) try this regex:
\b[bB][yY]\b(?!.*\b[bB][yY]\b.*)
see demo here http://regex101.com/r/uA2rL0
This uses the \b word boundary to avoid matching things like nearby and as you said a negative lookahead.

Regex: word boundary but for white space, beginning of line or end of line only

I am looking for some word boundary to cover those 3 cases:
beginning of string
end of string
white space
Is there something like that since \b covers also -,/ etc.?
Would like to replace \b in this pattern by something described above:
(\b\d*\sx\s|\b\d*x|\b)

Try replacing \b with (?:^|\s|$)
That means
(
?: make this a non-capturing group (don't capture what it matches)
^ match beginning of line
| or
\s match whitespace
| or
$ match end of line
)
Works for me in Python and JavaScript.

OK, so your real question is:
How do I match a unit, optionally preceded by a quantity, but only if there is either nothing or a space right before the match?
Use
(?<!\S)\b(?:\d+\s*x\s*)?\d+(?:\.\d+)?\s*ml\b
Explanation
(?<!\S): Assert that it's impossible to match a non-space character before the match.
\b: Match a word boundary
(?:\d+\s*x\s*)?: Optionally match a quantifier (integers only)
\d+(?:\.\d+)?: Match a number (decimals optional)
\s*ml\b: Match ml, optionally preceded by whitespace.

Boundaries that you get with \b are not whitespace sensitive. They are complicated conditional assertions related to the transition between \w\W or \W\w. See this answer for how to write your anchor more precisely, so that you can deal with whitespace the way you want.
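
The answers in this thread differ in whether the boundary consumes the whitespace. The sketch below compares \b, the consuming group, and the (?<!\S) / (?!\S) lookarounds in Python; the sample text and the simplified \d+x pattern are assumptions chosen for illustration and are not the asker's full pattern.

```python
import re

# \b treats "-" as a boundary, so "3x" in "3x-4" is accepted.
WORD_BOUNDARY = re.compile(r"\b\d+x\b")

# Consuming group: works, but the matched whitespace is part of the result
# and is not available to the next match.
CONSUMING_BOUNDARY = re.compile(r"(?:^|\s)\d+x(?:\s|$)")

# Zero-width whitespace boundaries: only start, end, or whitespace qualify.
SPACE_BOUNDARY = re.compile(r"(?<!\S)\d+x(?!\S)")


if __name__ == "__main__":
    text = "2x 3x-4 5x"
    print(WORD_BOUNDARY.findall(text))          # ['2x', '3x', '5x']
    print(SPACE_BOUNDARY.findall(text))         # ['2x', '5x']
    print(CONSUMING_BOUNDARY.findall("2x 5x"))  # ['2x ']
    print(SPACE_BOUNDARY.findall("2x 5x"))      # ['2x', '5x']
```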

Trim string using regex match

I have to use a crippled tool which doesn't provide any way to trim leading and trailing spaces from a string. It does have .NET style regex, but only Match is implemented, not Replace. So I came up (surprisingly by myself) with this regex that seems to work, but I don't completely understand why it works :-)
$trimmed = regex/[^ ].*[^ ]/ ($original_string)
Why does this work, does it really work in all cases, and is there a better way if you only have regex Match ( even group matches can't be captured :( ) ?

It should work fine unless there's only a single character surrounded by spaces.
Your pattern searches for:
1. A non-space character [^ ]
2. Zero or more characters of any kind, as many as possible (greedy match) .*
3. A non-space character [^ ]
So, if there aren't at least two non-space characters (items 1 and 3), the pattern won't match at all.
You could use \b instead of [^ ]. It matches any 'word boundary', is of zero length, and doesn't require two non-space characters. Note that \b only sits next to word characters (letters, digits, underscore), so leading or trailing punctuation is trimmed as well:
\b.*\b

It works like this: [^ ] matches the first non-space character, .* matches anything, and the final [^ ] again matches a non-space character. Since .* is greedy, the longest possible match is returned. In this case that is the longest possible string with a non-space at each end, which effectively trims off the spaces at the beginning and end of $original_string.
A good tutorial on regex is here, it teaches you about greedy and lazy matching which are key to understanding and optimizing regexes. It also teaches you about matching between characters which is what you would want to do here (see the answer about \b by Martin).
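
The edge cases mentioned in these answers can be checked with a short sketch. It uses Python's re module as a stand-in for the tool's .NET-style Match, on the assumption that these simple patterns behave the same in both. The third pattern, [^ ](?:.*[^ ])?, is not from the answers above; it is a hypothetical variant that makes the tail optional so a single character still matches.

```python
import re
from typing import Optional

TWO_ENDS = re.compile(r"[^ ].*[^ ]")         # pattern from the question
WORD_EDGES = re.compile(r"\b.*\b")           # pattern suggested above
ONE_OR_MORE = re.compile(r"[^ ](?:.*[^ ])?")  # hypothetical alternative


def trim(pattern: "re.Pattern", s: str) -> Optional[str]:
    """Return the first match of pattern in s, or None if nothing matches."""
    m = pattern.search(s)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(repr(trim(TWO_ENDS, "  hello world  ")))  # 'hello world'
    print(repr(trim(TWO_ENDS, " a ")))              # None
    print(repr(trim(WORD_EDGES, " a ")))            # 'a'
    print(repr(trim(WORD_EDGES, " (hi) ")))         # 'hi'
    print(repr(trim(ONE_OR_MORE, " a ")))           # 'a'
    print(repr(trim(ONE_OR_MORE, " (hi) ")))        # '(hi)'
```

The \b version loses the parentheses because they are not word characters, while the optional-tail version keeps them and still handles the single-character case.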
