Removing multiple consecutive words separated with whitespace

In the code below, the pattern / man / occurs twice in a row, but the substitution replaces only the first occurrence and leaves the second untouched.
As I understand it, the first match consumes the space after man, and that same space would have to be the start of the second match, so the second occurrence is never matched. How can I match this pattern globally when it occurs consecutively?
use strict;
use warnings;
#my $name =" man sky man "; #this works
my $name =" man man sky"; #this doesn't
$name =~s/ man / nam /g; #expected= ' nam nam sky'
print $name,"\n";

A match consumes the characters it matches, so the space after the first man is no longer available to start the second match. To avoid this, assert the surrounding whitespace with a lookbehind and a lookahead instead of matching it. Check perlre
$name =~ s/(?<=\s)man(?=\s)/nam/g;
Quoting from perlre
Look Ahead:
(?=pattern)
A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches
a word followed by a tab, without including the tab in $&.
Look Behind:
(?<=pattern) \K
A zero-width positive lookbehind assertion. For example, /(?<=\t)\w+/ matches
a word that follows a tab, without including the tab in $&. Works only for
fixed-width lookbehind.
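To see the difference in action, here is a minimal sketch in Python rather than Perl (this particular lookaround syntax is the same in both). The variable names are illustrative only.

```python
import re

name = " man man sky"

# Consuming pattern: the space after the first "man" becomes part of the
# first match, so the second "man" has no leading space left to match.
consumed = re.sub(r" man ", " nam ", name)

# Lookaround pattern: the surrounding whitespace is asserted, not consumed,
# so both occurrences are found.
asserted = re.sub(r"(?<=\s)man(?=\s)", "nam", name)

print(repr(consumed))  # ' nam man sky'
print(repr(asserted))  # ' nam nam sky'
```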

I understand you want to replace man when it sits between whitespace characters or at the start/end of the string.
In this case, you may use one of two approaches: positive lookarounds containing an alternation that checks for string boundaries or whitespace, or negative lookarounds that check for non-whitespace chars on both sides of the search word.
Use either of the two:
$name =~ s/(?<=^|\s)man(?=\z|\s)/nam/g;
$name =~ s/(?<!\S)man(?!\S)/nam/g;
From the point of view of efficiency, the second option is better since alternation is a bit "expensive".
The (?<=^|\s) positive lookbehind matches a location in the string that is preceded with the start of string (^) or (|) a whitespace (\s), and the (?=\z|\s) positive lookahead makes sure there is a whitespace or the end of string (\z) immediately after man.
The (?<!\S) negative lookbehind matches a location in the string that is not immediately preceded with a non-whitespace char (i.e. if there is a non-whitespace char there, there will be no match), and the (?!\S) negative lookahead asserts there is no non-whitespace char right after man.
See more details about Lookaround Assertions at perlre.
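The negative-lookaround variant can be tried out with the short Python sketch below. Python's re module requires fixed-width lookbehind, so only the (?<!\S)...(?!\S) form is used here. The helper name swap_word is made up for illustration.

```python
import re

def swap_word(text):
    # (?<!\S): not preceded by a non-whitespace char (start of string passes too)
    # (?!\S):  not followed by a non-whitespace char (end of string passes too)
    return re.sub(r"(?<!\S)man(?!\S)", "nam", text)

at_edges = swap_word("man man sky man")
inside_words = swap_word("woman manual man")

print(at_edges)      # nam nam sky nam
print(inside_words)  # woman manual nam
```

Because both assertions also succeed at the string boundaries, no extra alternation for ^ or \z is needed.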

Related

Perl Regexp::Common package not matching certain real numbers when used with word boundary

The code below prints "34" instead of the expected ".34":
use strict;
use warnings;
use Regexp::Common;
my $regex = qr/\b($RE{num}{real})\s*/;
my $str = "This is .34 meters of cable";
if ($str =~ /$regex/) {
print $1;
}
Do I need to fix my regex? (The word boundary is needed because without it the regex would match inside a string like xx34, which I don't want.)
Or is it a bug in Regexp::Common? I always thought that the longest match should win.

The word boundary is a context-dependent regex construct. When it is followed with a word char (letter, digit or _) this location should be preceded either with the start of a string or a non-word char. In this concrete case, the word boundary is followed with a non-word char and thus requires a word char to appear right before this character.
You may use a non-ambiguous word boundary expressed with a negative lookbehind:
my $regex = qr/(?<!\w)($RE{num}{real})/;
               ^^^^^^^
The (?<!\w) negative lookbehind always denotes one thing: fail the match if there
is a word character immediately to the left of the current location.
Or, use a whitespace boundary if you want your matches to only occur after whitespace or start of string:
my $regex = qr/(?<!\S)($RE{num}{real})/;
               ^^^^^^^
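The effect of swapping \b for a lookbehind can be reproduced with a small Python sketch. NUM below is a hypothetical, simplified stand-in for $RE{num}{real}; the actual Regexp::Common pattern accepts more forms (signs, exponents and so on).

```python
import re

# Hypothetical, simplified replacement for $RE{num}{real}.
NUM = r"\d*\.?\d+"

text = "This is .34 meters of cable"

# \b cannot match between the space and the dot (both are non-word chars),
# so the first position where it succeeds is between "." and "3".
with_boundary = re.search(r"\b(" + NUM + r")", text).group(1)

# (?<!\w) only requires that no word char precedes the number.
with_lookbehind = re.search(r"(?<!\w)(" + NUM + r")", text).group(1)

# The lookbehind still rejects digits glued to a preceding word.
glued = re.search(r"(?<!\w)(" + NUM + r")", "xx34")

print(with_boundary)    # 34
print(with_lookbehind)  # .34
print(glued)            # None
```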

Try this pattern: (?:^| )(\d*\.?\d+)
Explanation:
(?:...) - non-capturing group
^| - match either ^ (beginning of a string) or a space
\d* - match zero or more digits
\.? - match dot literally - zero or one
\d+ - match one or more digits
Matched number will be stored in first capturing group.

Regex find/replace stops after first word

Why is only one "the" replaced in the string if I set the g flag for global?
sed -E 's/(^|_)(the|an|is)(_|$)/\1/g' <<< "the_the_river"
= the_river

As mentioned, the problem is that the latter _ is consumed by the first match. To handle such overlapping matches you need either lookarounds or word boundaries. Word boundaries like \<, \> or, in some versions, \b cannot be used in your case because the underscore counts as a word character.
An alternative is a perl one-liner, since Perl's regex engine supports lookarounds.
perl -pe 's/(?<![^_])(?:the|an|is)(?:_|$)//g' <<< "the_the_river"
river
(?<![^_]) is a negative lookbehind that checks that the word is not preceded by any character other than an underscore. It matches at the start of the string or at any position right after an underscore.
(?:the|an|is) is a non-capturing group alternating the different words.
(?:_|$) matches the underscore after the word or the end of the string, assuming you want to remove (consume) that underscore.
See regex101 for testing the pattern
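For comparison, both patterns can be run side by side in Python. This is only a sketch: Python's re is used as a stand-in for sed -E and perl, and the variable names are illustrative.

```python
import re

s = "the_the_river"

# Equivalent of the sed command: the trailing "_" of the first match is
# consumed, so the second "the" has neither ^ nor "_" in front of it.
consuming = re.sub(r"(^|_)(the|an|is)(_|$)", r"\1", s)

# Lookbehind version: the preceding "_" is only asserted, so every
# occurrence is found.
lookbehind = re.sub(r"(?<![^_])(?:the|an|is)(?:_|$)", "", s)

print(consuming)   # the_river
print(lookbehind)  # river
```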

Optional regular expression operator in PowerShell

In $string, I'm trying to strip out the first "-1" so the output will be "test test test-Long.xml".
$string = 'test test test-1-Long.xml'
$string -replace '^(.*)-?\d?(-?.*)\.xml$', '$1$2'
My issue is that I need to make that same first "-1" pattern optional, as both the hyphen and number could not be there as well.
Why is the "?" operator not working? I've also tried {0,1} after each as well with no luck.

Quantifiers are greedy by default, so the first (.*) grabs as much as it can and the optional -?\d? that follows is left matching nothing.
I am not sure it's the best solution, but I could make it work this way:
$string -replace '^([^\-]*)-?\d?(-?.*)\.xml$', '$1$2'
The only change is that the first group must not contain a hyphen. That stops the greedy first group from swallowing the optional parts, and yields:
test test test-Long
Note: the output is not test test test-Long.xml as required in your question, because \.xml$ is matched but not captured. To keep the suffix, simply drop \.xml$ from the pattern:
$string -replace '^([^\-]*)-?\d?(-?.*)', '$1$2'

The $string -replace '^(.*?)(?:-\d+)?(-.*?)\.xml$', '$1$2' should work if the hyphen is obligatory in the input. Or $string -replace '^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$', '$1$2' in case the input may have no hyphen.
See the regex demo 1 and regex demo 2.
Pattern details:
^ - start of string
(.*?) - Group 1 capturing any 0+ characters other than a newline, as few as possible (as the *? quantifier is lazy), up to the first match of the subpatterns that follow (NOTE: to increase regex performance, you may use a tempered greedy token based pattern instead of (.*?): ((?:(?!-\d+).)*) matches any text that does not start a - followed by 1 or more digits, thus acting similarly to a negated character class, but for a sequence of symbols)
(?:-\d+)? - non-capturing group with a greedy ? quantifier (so this group has priority for the regex engine, and the previous capture will end before this pattern) matching a hyphen followed with one or more digits
(-.*?) - Group 2 capturing an obligatory - and any 0+ chars other than a newline, as few as possible, up to
\.xml - literal text .xml
$ - end of string.
Why is the "?" operator not working?
It is not true. The quantifier ? works well as it matches one or zero occurrences of the quantified subpattern. However, the issue arises in combination with the first .* greedy dot matching subpattern. See your regex in action: the first capture group grabs the whole substring up to the last .xml, and the second group is empty. Why?
Because of backtracking and how greedy quantifiers work. The .* matches any characters but a newline, as many as possible. Thus, it grabs the whole string up to the end. Then, backtracking starts: one character at a time is given back and tested against the subsequent subpatterns.
What are they? -?\d?(-?.*) - all of them can match an empty string. The -? matches an empty string before .xml, ok, \d? matches there as well, and -? and .* also match there.
The second .* first tries to grab the rest of the string, but there is the \.xml pattern to accommodate, so it gives everything back and the second capture group ends up empty. In fact, there are more steps the regex engine performs (see the regex debugger page), but the main idea is like that.
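The backtracking outcome described above can be checked with a short Python sketch (Python's re is used here instead of PowerShell's -replace; the group behaviour for these patterns is the same). Re-appending .xml in the replacement is an assumption made for this example, since the pattern matches the suffix outside the capture groups.

```python
import re

string = "test test test-1-Long.xml"

# Original pattern: the greedy first group takes everything before ".xml",
# and the optional parts plus the second group all match empty strings.
greedy = re.match(r"^(.*)-?\d?(-?.*)\.xml$", string).groups()

# Tempered first group: stop before a hyphen that is followed by digits.
tempered = r"^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$"
fixed = re.sub(tempered, r"\1\2.xml", string)

# Input without the "-1" part passes through unchanged.
unchanged = re.sub(tempered, r"\1\2.xml", "test test test-Long.xml")

print(greedy)     # ('test test test-1-Long', '')
print(fixed)      # test test test-Long.xml
print(unchanged)  # test test test-Long.xml
```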

Regex numbers from string

I am trying to write a regex that can find only numbers from given string. What I mean is:
Input: My number is +12 345 678. I have galaxy s3, its symbol 34abc.
Output: 345 and 678 (but not +12, 3 from word s3 or 34 from 34abc)
I tried just numbers (\d+) and combinations with whitespace and word characters. The closest was ^\d$, but that doesn't work as my numbers are part of a bigger string, not the whole string themselves. Can you give me a hint?
------- EDIT
Looks like I just don't know how to check a character without actually getting it into result. Like "digit that follow space character (without this space)".

In the general case, you can make use of lookbehind and lookahead:
(?<=^|\s)\d+(?=$|\s)
The part which makes it into the captured output is \d+.
Lookbehind and lookahead are not included in the match.
I just included spaces as delimiters in the regex, but you may replace \s with any character class, as defined by your requirements. For example, to allow dots as separators (both in front and after the digits), use the following regex:
(?<=^|[\s.])\d+(?=$|[\s.])
The (?<=^|\s) should be read as follows:
(?<= ... ) defines the lookbehind group.
The expression which must precede the \d+ is ^|\s, meaning "either start of the line (^) or whitespace".
Similarly, (?=$|\s) defines the lookahead group (it must follow the captured digits), which is either end of the line ($) or whitespace.
A note on \b mentioned in other answers: it is a nice feature, means "word boundary", but the "word characters" are not customizable. This means that, for example, the "+" character is considered to be a separator and you can't change this if you use \b. With lookaround, you can customize the separators to your needs.

What you seem to want is a sequence of digits (\d+) that is preceded by a whitespace (\s) or the start of the string (^), and followed by a whitespace or punctuation character ([\s.,:;!?]) or the end of the string ($), but the preceding/following whitespace or punctuation character should not be included in the match, so you need positive lookahead ((?=xxx)) and lookbehind ((?<=xxx)).
(?<=^|\s)\d+(?=[\s.,:;!?]|$)
See regex101 for demo.
Remember to double the backslashes in a Java literal.
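As a quick check against the sample input from the question, here is a sketch in Python. Python's re needs a fixed-width lookbehind, so (?<!\S) is used in place of (?<=^|\s); that substitution is mine, not part of the answer above.

```python
import re

text = "My number is +12 345 678. I have galaxy s3, its symbol 34abc."

# (?<!\S): preceded by whitespace or the start of the string.
# (?=[\s.,:;!?]|$): followed by whitespace, punctuation or the end.
numbers = re.findall(r"(?<!\S)\d+(?=[\s.,:;!?]|$)", text)

print(numbers)  # ['345', '678']
```

The "+" before 12 and the "s" before 3 fail the lookbehind, and the "abc" after 34 fails the lookahead, so only the standalone numbers remain.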

Safer RegEx
Try this:
(?<=\s|^)\d+(?=\s|\b)
Live Demo on Regex101
How it works:
(?<=\s|^) # Start of String OR Whitespace (will not select +)
# Positive Lookbehind ensures the data is not included in the match
\d+ # Digit(s)
(?=\s|\b) # Whitespace OR Word Boundary
# Positive Lookahead ensures the data is not included in the match
Lookarounds do not consume any characters in the match, so you can use them instead of capture groups to keep the surrounding text out of the result. For example:
# Regex /.*barbaz/
barbaz # Matched Data Result: barbaz
foobarbaz # Matched Data Result: foobarbaz
# Regex (with Positive Lookahead) /.*bar(?=baz)/
barbaz # Matched Data Result: bar
foobarbaz # Matched Data Result: foobar
As you can see with the second RegEx, baz is never included in the matched data result, even though it is required in the string for the RegEx to match. The RegEx above works on the same principle.
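The barbaz comparison is easy to reproduce; the sketch below uses Python's re module, with variable names chosen only for illustration.

```python
import re

# Plain pattern: "baz" is part of the match.
plain = re.search(r".*barbaz", "foobarbaz").group()

# Lookahead pattern: "baz" must be present, but is not part of the match.
ahead = re.search(r".*bar(?=baz)", "foobarbaz").group()

# Without "baz" after "bar" the lookahead pattern does not match at all.
missing = re.search(r".*bar(?=baz)", "foobarqux")

print(plain)    # foobarbaz
print(ahead)    # foobar
print(missing)  # None
```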
Not as Safe (Old) RegEx
You can try this RegEx:
\b\d+\b
\b is a Word Boundary. This will, however, select 12 from +12.
You can change the RegEx to this to stop 12 from being selected:
(?<!\+)\b\d+\b
This uses a Negative Lookbehind and will fail if there is a + before the digits.
Live Demo on Regex101
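A short Python sketch shows the difference between the two \b-based patterns on the input from the question (variable names are illustrative):

```python
import re

text = "My number is +12 345 678. I have galaxy s3, its symbol 34abc."

# "+" is a non-word char, so \b matches between "+" and "1".
with_boundary = re.findall(r"\b\d+\b", text)

# The extra negative lookbehind rejects digits that directly follow "+".
no_plus = re.findall(r"(?<!\+)\b\d+\b", text)

print(with_boundary)  # ['12', '345', '678']
print(no_plus)        # ['345', '678']
```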

A regex for the last use of a word in a string

I'm trying to figure out how to grab the tail end of a string using a word as a delimiter, but that word can be used anywhere in the string. So only the last use would start the grab.
example: Go by the office and pickup milk by the safeway BY tomorrow
I want to grab the by tomorrow and not the other bys
This is the regex I'm trying to make robust:
$pattern = '/^(.*?)(#.*?)?(\sBY\s.*?)?(#.*)?$/i';
I think a negative lookahead would do it, but I've never used one before
Thanks!

I'm not sure what the other parts of your regex are for, but here's the one I would use:
$pattern = '/\bby\s(?!.*\bby\b).*?$/i';
regex101 demo
\b is a word boundary and will match only between a \w and a \W character, or at the string beginning/end when a \w character is next to it.
by matches by literally.
\s matches a space (also matches newlines, tabs, form feeds, carriage returns)
(?!.*\bby\b) is the negative lookahead and will prevent a match if there is another word by ahead.
.*?$ is to get the remaining part of the string till the end of the string.
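The same pattern can be tried in Python as a sketch; re.IGNORECASE plays the role of the /i flag, and the second input string is made up for this example.

```python
import re

pattern = re.compile(r"\bby\s(?!.*\bby\b).*?$", re.IGNORECASE)

text = "Go by the office and pickup milk by the safeway BY tomorrow"
tail = pattern.search(text).group()

# "nearby" is skipped because \b fails between "r" and "b".
other = pattern.search("stand nearby BY noon by then").group()

print(tail)   # BY tomorrow
print(other)  # by then
```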

To match the last BY (uppercase and lowercase letters) try this regex:
\b[bB][yY]\b(?!.*\b[bB][yY]\b.*)
see demo here http://regex101.com/r/uA2rL0
This uses the \b word boundary to avoid matching things like nearby and as you said a negative lookahead.
