Match pattern with exceptions - regex

I want to match a pattern using regular expressions, but I need some exceptions to the match. For instance, match every occurrence of "John Doe" except for those occurrences where "John Doe" is enclosed by bold tags, i.e. "<b>John Doe</b>".
Match: John Doe
Don't match: <b>John Doe</b>
How can I achieve this with regular expressions?
Clarification: I want to exclude everything between the bold tags. This excluded content may contain a wide variety of characters, line breaks and so on.

If your regex dialect allows lookarounds you may use a negative lookbehind and a negative lookahead to achieve that task:
(?<!<b>)John Doe(?!</b>)

You could use negative look-arounds for this:
(?<!<b>)John Doe(?!</b>)
That wouldn't match <b>John Doe or John Doe</b> either though.
If you only want to not match instances with both the opening and closing tag you could do something like:
John Doe(?!(?<=<b>John Doe)</b>)
Or slightly shorter (but less understandable - 8 is the length of John Doe):
John Doe(?!(?<=<b>.{8})</b>)
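To see how the two variants above differ, here is a minimal sketch in Python. The language is an assumption, since the question names none; Python's re module supports the fixed-width lookbehind these patterns need. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample: one plain name, one fully bolded name, one preceded
# only by an opening tag, and one followed only by a closing tag.
text = "John Doe, <b>John Doe</b>, <b>John Doe, John Doe</b>"

# Rejects any occurrence preceded by <b> OR followed by </b>.
strict = re.compile(r"(?<!<b>)John Doe(?!</b>)")

# Rejects only occurrences wrapped as exactly <b>John Doe</b>.
both_tags = re.compile(r"John Doe(?!(?<=<b>John Doe)</b>)")

strict_hits = [m.start() for m in strict.finditer(text)]
both_hits = [m.start() for m in both_tags.finditer(text)]

print(len(strict_hits))  # only the untagged occurrence survives
print(len(both_hits))    # everything except the fully wrapped occurrence
```

The first pattern is the safer choice when a stray tag on either side should already disqualify a match; the second only excludes the exact wrapped form.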

Using Perl you can use negative lookbehind:
$ echo "<b>John Doe</b>" | perl -ne 'print if /(?<!<b>)John Doe/'
(above prints nothing - does not match).
$ echo "John Doe" | perl -ne 'print if /(?<!<b>)John Doe/'
John Doe
(above matches).
The construct (?<!<b>) is a negative lookbehind: the pattern matches only if it is not preceded by what's inside the lookbehind (<b> in this case).
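The lookaround patterns above only inspect the text directly adjacent to the name, so they do not cover the clarification in the question (arbitrary content, including line breaks, between the bold tags). One common workaround, which none of the answers use, is to match whole bold elements first and capture the name only outside them. A minimal sketch in Python, assuming the tags are literally <b> and </b> (no attributes) and are never nested; find_unbolded is a hypothetical helper name:

```python
import re

def find_unbolded(text, needle="John Doe"):
    """Return the start offsets of needle occurrences outside <b>...</b>."""
    # The first alternative swallows a whole bold element (re.S lets "."
    # cross line breaks); the second captures the needle everywhere else.
    pattern = re.compile(r"<b>.*?</b>|(" + re.escape(needle) + r")", re.S)
    return [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]

sample = "John Doe <b>some\nJohn Doe text</b> John Doe"
print(find_unbolded(sample))
```

Only matches where group 1 took part are kept; the bold elements are matched and then discarded.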

Related

Regex AND search inside block which enclosed by something delimiter

I want to do a regex AND search, (?=)(?=), inside a block that is enclosed by a delimiter such as #.
In the following sample, I expected the pattern to match cat and ugly inside the block that runs from # cat B up to just before # cat C.
But the regex matches nothing.
regex
^#(?=[\s\S]*(cat))(?=[\s\S]*(ugly))^#
text
# cat A
the cat is
very cute.
# cat B
the cat is
very ugly.
# cat C
the cat is
very good.
#
You can test the regex on https://regexr.com/
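In your pattern ^#(?=[\s\S]*(cat))(?=[\s\S]*(ugly))^# you match a # at the start of a line (^#), followed by two positive lookaheads, and then match ^# again. Lookaheads consume no characters, so the second ^# would have to match directly after the first #, which is not the start of a line. That is why you don't get a match.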
To get a more exact match, you could start the pattern with ^# cat B
If you want to use lookaheads, you might use 2 capturing groups in the positive lookahead. If you want to search for cat and ugly as whole words you might use word boundaries \b.
The (?s) modifier makes the dot match newlines; you can use the s flag instead. For ^ to match at the start of each line rather than only at the start of the string, multiline mode (the m flag) must also be on.
(?s)(?=^# cat B.*?(cat).*?(ugly).*?^# cat C)
But it might be easier to not use the lookahead and match instead:
(?s)^# cat B.*?(cat).*?(ugly).*?^# cat C$
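As a rough check of the matching pattern above, here is a sketch in Python (an assumption; the question only links a regex tester). re.M and re.S stand in for the m and s flags. The second pattern is a generalisation that is not part of the answer: it finds any #-delimited block containing both words, assuming # never appears inside a block body.

```python
import re

text = """# cat A
the cat is
very cute.
# cat B
the cat is
very ugly.
# cat C
the cat is
very good.
#
"""

# re.M lets ^ and $ match at line boundaries; re.S lets "." cross newlines.
fixed_block = re.compile(r"^# cat B.*?(cat).*?(ugly).*?^# cat C$", re.M | re.S)
m = fixed_block.search(text)
print(m.group(1), m.group(2))

# Generalisation (not from the answer): both lookaheads are limited to the
# current block because [^#]* cannot run past the next delimiter.
any_block = re.compile(r"^#(?=[^#]*\bcat\b)(?=[^#]*\bugly\b)[^#]*", re.M)
blocks = [b.group(0) for b in any_block.finditer(text)]
print(blocks)
```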
This RegEx might help you to design/match your target words by bounding them using \n.
((.+)(cat)(.+))\n((.+)(ugly)(.+))
To keep it simple, it creates four groups for each target keyword, cat and ugly, and the keywords themselves can be referenced as $3 and $7.
You could additionally bound it with start ^ and end $, if you wish.
This expression only works when your target keywords are in the middle of both lines.
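A small Python sketch of that expression and its limitation, using a shortened version of the sample text (the shortening and the variable names are assumptions for illustration):

```python
import re

text = "# cat B\nthe cat is\nvery ugly.\n# cat C\n"

# "." does not match a newline here, so each half stays on its own line,
# and every (.+) needs at least one character around the keyword.
line_pair = re.compile(r"((.+)(cat)(.+))\n((.+)(ugly)(.+))")
m = line_pair.search(text)
print(m.group(3), m.group(7))

# The keywords must sit in the middle of their lines, so this finds nothing.
edge_case = line_pair.search("cat\nugly")
print(edge_case)
```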

How do I add characters to RegEx results?

I am not a programmer but I am having to use RegEx for a particular purpose. How do I add specific characters to what is being returned from RegEx?
For example, if I have a list as follows:
XYZ ABC 123
How do I use RegEx to add something specific to the end of each? For example, if I want all three to end with .com for example?
You can try this script, replacing XYZ abc 123 with your full list:
echo "XYZ abc 123" | sed -E 's/([a-zA-Z0-9]+)/\1.com/g'
Explanation:
s/ Starts a substitution regex
([a-zA-Z0-9]+) Capture at least one alphanumeric
/ End regex
\1.com replaces the capture with itself plus adds .com
/g Global modifier (for all matches)
Without knowing which regex engine you want to use, there are many other ways to do this. In the future, please give more information.
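For comparison, the same substitution as a minimal Python sketch (Python is an assumption here, since the question names no engine). \g<0> refers to the whole match, so no capture group is needed:

```python
import re

items = "XYZ ABC 123"
# Append ".com" to every run of alphanumeric characters.
result = re.sub(r"[A-Za-z0-9]+", r"\g<0>.com", items)
print(result)  # XYZ.com ABC.com 123.com
```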

Regex: find last occurrence of word in string

I need to find the last occurrence of a word in a string (and replace it). So in the following sentence I would be looking for the second "chocolate".
I love milk chocolate but I hate white chocolate.
How can that be achieved with regular expression? Could you please give me some explanation?
Thanks.
If you want to use a regex you could use something like this:
(.*)chocolate
And the replacement string would be:
$1banana
(replace banana with whatever you want)
Update: as Lucas pointed out in his comment, you can improve the regex by using:
(.*)\bchocolate\b
This allows you to avoid false positives like chocolateeejojo
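The reason this works is that the greedy (.*) first grabs the whole string and then backtracks only as far as needed, which lands on the last occurrence. A minimal sketch in Python (an assumed language; the replacement syntax there is \g<1> rather than $1):

```python
import re

sentence = "I love milk chocolate but I hate white chocolate."
# Greedy (.*) backtracks from the end, so the match stops at the LAST
# "chocolate"; re.S lets the prefix span line breaks if there are any.
result = re.sub(r"(.*)\bchocolate\b", r"\g<1>banana", sentence, count=1, flags=re.S)
print(result)  # I love milk chocolate but I hate white banana.
```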
PCRE would look like this:
s/^(.*)chocolate/$1replace/sm
If you want to match the second occurrence of any distinct word, you may be able to use a backreference, depending on the language and regex implementation you're in.
For example, in sed, you might do the following:
sed 's/\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\)/\1russians\4/'
Breaking this down for easier reading, it looks like this:
s/ - substitute in sed
\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\) - the search RE. Not really so complex....
[[:<:]] and [[:>:]] are word boundaries in BSD sed (GNU sed uses \< and \> instead),
[[:alpha:]] is the class of alphabetical characters (words)
\( and \) surround atoms for use in backreferences, in BRE (this is sed, remember)
\1russians\4 - replacement string consists of the first (outer) parenthesized backreference from the RE, followed by the replacement word, followed by the trailing characters.
For example:
$ t="I love milk chocolate but I hate white chocolate."
$ sed 's/\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\)/\1russians\4/' <<<"$t"
I love milk chocolate but I hate white russians.
$ t="In a few years, your twenty may be worth twenty bucks."
$ sed 's/\(.*\([[:<:]][[:alpha:]]*[[:>:]]\).*\)\(\2\)\(.*\)/\1fifty\4/' <<<"$t"
In a few years, your twenty may be worth fifty bucks.
$
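The same backreference idea can be sketched in Python, where \b plays the role of the word-boundary classes. This is an approximation of the sed command, not a translation verified against it, and replace_repeat is a hypothetical helper name. It assumes single-line input and ASCII letters only.

```python
import re

def replace_repeat(text, new_word):
    """Replace the later occurrence of a repeated word with new_word."""
    # group 1: everything before the repeat (it contains the first occurrence)
    # group 2: the word itself, referenced again as \2
    # group 3: whatever follows the repeat
    pattern = r"(.*\b([A-Za-z]+)\b.*)\b\2\b(.*)"
    return re.sub(pattern, lambda m: m.group(1) + new_word + m.group(3), text, count=1)

print(replace_repeat("I love milk chocolate but I hate white chocolate.", "russians"))
print(replace_repeat("In a few years, your twenty may be worth twenty bucks.", "fifty"))
```

A function is used for the replacement so that new_word is inserted literally, whatever characters it contains.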

Powershell REGEX extract last word

Say I have a PowerShell string, for example "John Doe Bloggs" or "John Bloggs".
I want to extract the last word after the space, so in the above examples it would be "Bloggs". What REGEX would I use? The solution must be a REGEX. I've googled my mind away and am still not any closer.
Any help would be appreciated.
It's really too bad that the answer "must" be a regex (I'm guessing this is some kind of homework assignment?) because it's pretty simple without.
$string = 'John Doe Bloggs';
$string.split(' ')[-1];
Here's a simple example:
$string = 'John Doe Bloggs'
$regex = '.+\s(.+)'
$string -replace $regex,'$1'
Bloggs
This regular expression will find the last word in the input:
(?<word>\w+)[\s\,\.\?\!]*$
The match is in the group named word - the entire expression matches the final word and optional whitespace / (some) punctuation. Any trailing whitespace / punctuation will not be part of the word group.
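To illustrate how the named group behaves, here is a sketch in Python rather than PowerShell (a swapped-in language for illustration only). Python spells named groups (?P<word>...) instead of (?<word>...); extract_last_word is a hypothetical helper name.

```python
import re

# Final run of word characters, followed only by whitespace or punctuation.
LAST_WORD = re.compile(r"(?P<word>\w+)[\s,.?!]*$")

def extract_last_word(s):
    m = LAST_WORD.search(s)
    return m.group("word") if m else None

for s in ("John Doe Bloggs", "John Bloggs", "John Doe Bloggs?! "):
    print(extract_last_word(s))
```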

Overlapping text substitution with Perl regular expression

I have a text file that contains a bunch of sentences. The sentences contain white space (spaces, tabs, new lines) to separate out words consisting of letter and/or digits.
I want to find the word "123" or "-123" and insert a dot (.) before the digits begin. So all occurrences of "123" and "-123" will be converted to ".123" and "-.123".
I was trying this with the following:
$line =~ s/(\s+-*123\s+)/getNewWord($1)/ge
Where $line contains a line read from the file and the function getNewWord puts the dot (.) at the appropriate place in the matched word.
But it's not working for cases where there are two consecutive "123" like " 123 123 ". As the first "123" is replaced by a " .123 " the space following the word has already been matched and the second "123" is not matched since the regex engine can't match the preceding space with that word.
Can anyone help me with this? Thanks!
I agree with MRAB (and have +1'd his/her answer), but there's no real need for the getNewWord function. I'd change the entire statement to something like one of these:
$line =~ s/((?:^|\s)-?)(123)(?=\s|$)/$1.$2/g;
$line =~ s/(?:^|(?<=\s))(-?)(123)(?=\s|$)/$1.$2/g;
$line =~ s/(?:^|(?<=\s)|(?<=\s-))(?=123(?:\s|$))/./g;
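The difference between consuming the trailing whitespace and merely asserting it can be sketched as follows. Python is used as an assumed stand-in for Perl, with \g<1> in place of $1; the first pattern approximates the question's attempt without its getNewWord function.

```python
import re

line = " 123 123 "

# Consuming the trailing whitespace leaves no leading whitespace for the
# second "123", so only the first one is rewritten.
consuming = re.sub(r"(\s+-*)(123)(\s+)", r"\g<1>.\g<2>\g<3>", line)

# A lookahead checks the trailing whitespace without consuming it, so the
# same space can serve as the leading whitespace of the next match.
lookahead = re.sub(r"((?:^|\s)-?)(123)(?=\s|$)", r"\g<1>.\g<2>", line)

print(repr(consuming))  # ' .123 123 '
print(repr(lookahead))  # ' .123 .123 '
```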
It might be slightly faster (no explicit capture) and it allows a file without leading/trailing whitespace:
$ echo '123 -123 -123 123' | perl -pe's/(?:^|\s+)\K(?=-?123\b)/./g'
.123 .-123 .-123 .123
To put . after -:
$ echo '123 -123 -123 123' | perl -pe's/(?:^|\s+)-*\K(?=123\b)/./g'
.123 -.123 -.123 .123
Try using a positive lookahead like this: (\s+-*123)(?=\s).
This reminded me of this question: Search html file for random string using regex, where I found (was shown) a good use for negative lookaround assertions, i.e. matching optional delimiters and avoiding partial matches.
Matching -?123 is simple; the problems are:
1. Not matching partial strings
2. Avoiding start/end of line mismatches
3. Avoiding moving the \G anchor
4. Doing a lookbehind assertion of an optional dash -?
I did not manage to solve #4, as variable length lookbehind assertions are not supported, so the fix is using a capture group.
Do note that some of the other answers to this question do not address these problems.
Explanation:
Negative lookbehind assertion for non-whitespace matches both whitespace and beginning of string, and assures we do not match partial strings. Then follows an optional dash in a capture group. The end of the match is a nested lookahead, where we must match 123 followed by anything that is not non-whitespace.
Code:
use strict;
use warnings;
while(<DATA>) {
s/(?<!\S)(-?)(?=123(?!\S))/$1./g;
print;
}
__DATA__
r 123 z123 "123" -1233 d123 123-123
123 -123 -123 123 123
Output:
r .123 z123 "123" -1233 d123 123-123
.123 -.123 -.123 .123 .123
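The same substitution carries over to other engines with fixed-width lookbehind. A minimal sketch in Python (an assumed language, not part of the answer), run against the two data lines above; add_dot is a hypothetical helper name:

```python
import re

def add_dot(line):
    # (?<!\S)        preceded by whitespace or the start of the string
    # (-?)           optional dash, re-emitted through group 1
    # (?=123(?!\S))  "123" must follow and end at whitespace or end of string
    return re.sub(r"(?<!\S)(-?)(?=123(?!\S))", r"\g<1>.", line)

print(add_dot('r 123 z123 "123" -1233 d123 123-123'))
print(add_dot("123 -123 -123 123 123"))
```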
Or simply this? This does not bother about the whitespaces, and works on perl 5.8.
echo '123 -123 -123 123' | perl -pe's/(-)?(123)/$1.$2/g'