Regex to match any words except words with given pattern - regex

I want to match all words except following words :
1) any-random-word
2) any-random-word/
3) any-random-word/123
4) any-random-word/abcdef
so that following similar words can be matched.
1) any-random-word123
2) any-random-word(any non-word character other than '/')123
3) any-random-wordabcdef
4) any-random-word(any non-word character other than '/')abcdef
In fact, any number or any word can be appended after 'any-random-word/'.
I tried with
^(?!any-random-word(\/?)(\w+)$|any-random-word$)
but it is excluding every word that has any-random-word in it.
Thanks.

You can change your current regex a little:
^(?!.*\bany-random-word\b)
And if you want to actually match something, add .+ at the end:
^(?!.*\bany-random-word\b).+
regex101 demo
\b (a word boundary) ensures that there's no other \w character around the word you don't want to match.
Edit: As per your further clarification, I would suggest this regex:
^(?!.*\bany-random-word(?:/|$)).+
The main part of the regex is the negative lookahead: (?!.*\bany-random-word(?:/|$)). It will cause the whole match to fail if what's inside matches.
.*\bany-random-word(?:/|$) will match any-random-word at the end of a string or followed by /, anywhere in the string that it is being tested against.
So, if you have any-random-word/, it will match, and cause the whole match to fail. If you have the string ending in any-random-word, again, the whole match will fail.
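A quick way to check the edited pattern against the examples from the question is to run it through Python's re module. Python is an assumption here (the question does not name a language), and the two sample lists are simply the question's examples plus a couple of illustrative variants:

```python
import re

# Pattern from the answer: reject any string that contains
# "any-random-word" followed by "/" or by the end of the string.
PATTERN = re.compile(r'^(?!.*\bany-random-word(?:/|$)).+')

# Strings the question wants excluded.
rejected = ["any-random-word", "any-random-word/",
            "any-random-word/123", "any-random-word/abcdef"]
# Strings the question wants matched (separators are illustrative).
accepted = ["any-random-word123", "any-random-word-123",
            "any-random-wordabcdef", "any-random-word.abcdef"]

for s in rejected + accepted:
    print(s, "->", "match" if PATTERN.match(s) else "no match")
```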

Related

Regex Match - How to match a string that does not contain a specific char

I'm looking for a regex that will match a word (or anything split by spaces) that does not contain a specific character.
For example, given foo anot%her bar idk%what #IJ#N, I need to match each word that does not contain the % character, so the result is foo, bar and #IJ#N.
I tried something like this, but it doesn't work:
Instead of using a word boundary and an anchor at the end, you can write the pattern using lookarounds if those are supported and then match any non whitespace character except for the % using a negated character class:
(?<!\S)[^\s%]+(?!\S)
See a regex101 demo
Since your words contain symbols, you can't use \w and \b here. \S matches anything other than a space, tab or newline. The negative lookahead (?!\S*%\S*) ensures that % is not contained in the string. Finally, add the anchors ^ and $. The pattern is ^(?!\S*%\S*)\S+$, see https://regex101.com/r/DjZFZj/1. If your input is a long string of words separated by whitespace, just change the boundaries: replace ^ with (?<=^|\s) and delete the $. The pattern becomes (?<=^|\s)(?!\S*%\S*)\S+, see https://regex101.com/r/0rBhF2/1.
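As a rough check, both approaches can be tried in Python (an assumption; the question does not say which engine is used). The (?<=^|\s) variant is left out of this sketch because Python's re module only accepts fixed-width lookbehind, so the anchored pattern is applied to one whitespace-separated token at a time instead:

```python
import re

text = "foo anot%her bar idk%what #IJ#N"

# Lookaround version: a run of non-space, non-% characters with
# whitespace (or the edge of the string) on both sides.
words = re.findall(r'(?<!\S)[^\s%]+(?!\S)', text)
print(words)

# Anchored version, tested against each token separately.
single = re.compile(r'^(?!\S*%\S*)\S+$')
filtered = [w for w in text.split() if single.match(w)]
print(filtered)
```

Both lists should contain the three words the question asks for.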

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to write a regex that matches words of any length, except that a word followed by a : should not be matched at all. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before the colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
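The backtracking behaviour is easy to reproduce. The sketch below uses Python's re module (an assumption; the question uses regex101 with /gm flags) and the three sample strings mentioned in this thread:

```python
import re

# The pattern from the question: letters not followed by a colon.
naive = re.compile(r'[a-zA-Z]+(?!:)')

results = {s: naive.findall(s) for s in ["lame:joker", "c:real", "real:c"]}
for s, found in results.items():
    # For "lame:joker" the engine gives back the final "e" and reports "lam".
    print(s, found)
```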
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
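To see the effect of the extra condition, the same sample strings can be run through the single-character-class version. This is a small Python sketch (Python being an assumption, not something the question specifies):

```python
import re

# Letters not followed by another letter or a colon: the lookahead
# now also blocks the shorter, backtracked candidates.
fixed = re.compile(r'[A-Za-z]+(?![A-Za-z:])')

fixed_results = {s: fixed.findall(s) for s in ["lame:joker", "c:real", "real:c"]}
for s, found in fixed_results.items():
    print(s, found)
```

Only the word that is not followed by a colon should remain in each list.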
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12, but not in 12:, and also not in 12c or 12_
\d+(?![\d:]) matches 12 in 12, as well as in 12c and 12_; it fails only in 12:
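The difference between the two number patterns can be checked directly. A minimal Python sketch (the language is an assumption) over the four inputs named above:

```python
import re

with_boundary = re.compile(r'\d+\b(?!:)')     # needs a word boundary after the digits
with_lookahead = re.compile(r'\d+(?![\d:])')  # only forbids a digit or colon next

samples = ["12,", "12:", "12c", "12_"]
boundary_hits = [with_boundary.findall(s) for s in samples]
lookahead_hits = [with_lookahead.findall(s) for s in samples]

for s, b, l in zip(samples, boundary_hits, lookahead_hits):
    print(repr(s), "boundary:", b, "lookahead:", l)
```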
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Regex in middle of text doesn't match

I have a regex to find URLs in text:
^(?!:\/\/)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}?$
However it fails when it is surrounded by text:
https://regex101.com/r/0vZy6h/1
I can't seem to grasp why it's not working.
Possible reasons why the pattern does not work:
^ and $ make it match the entire string
(?!:\/\/) is a negative lookahead that fails the match if, immediately to the right of the current location, there is :// substring. But [a-zA-Z0-9-_]+ means there can't be any ://, so, you most probably wanted to fail the match if :// is present to the left of the current location, i.e. you want a negative lookbehind, (?<!:\/\/).
[a-zA-Z]{2,11}? - matches only 2 chars once $ is removed, because {2,11}? is a lazy quantifier, and a lazy quantifier at the very end of a pattern always matches the minimum number of characters (here, 2).
Use
(?<!:\/\/)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}
See the regex demo. Add \b word boundaries if you need to match the substrings as whole words.
Note in Python regex there is no need to escape /, you may replace (?<!:\/\/) with (?<!://).
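The points above can be tried out in Python. This is only a sketch: the sample text is made up for illustration, and the \b boundaries mentioned in the answer are added on both sides so that the engine cannot restart in the middle of a host name:

```python
import re

# A lazy quantifier at the very end of a pattern takes the minimum (2 letters).
lazy = re.search(r'[a-z]+\.[a-zA-Z]{2,11}?', "example.com").group()
greedy = re.search(r'[a-z]+\.[a-zA-Z]{2,11}', "example.com").group()
print(lazy, greedy)

# Suggested pattern with an unescaped lookbehind and \b on both sides.
url_like = re.compile(
    r'\b(?<!://)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}\b'
)

text = "see example.com and http://example.org/path now"  # illustrative input
hosts = [m.group(0) for m in url_like.finditer(text)]
print(hosts)
```

With this input, only the bare host name should be reported; the one preceded by :// is skipped by the lookbehind.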
The spaces are not being matched. Try adding space to the character sets checking for leading or trailing text.

Perl: Matching string not containing PATTERN

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
Regular expression to match a line that doesn't contain a word? helps a lot in understanding, but why is the . in my example and the ?: and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?
Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
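Both forms carry over to other engines with the same lookahead syntax. Here is a small Python sketch (Python is an assumption; the thread is about Perl) with made-up sample strings:

```python
import re

# "foo", later "baz", with no "bar" anywhere in between.
between = re.compile(r'foo(?:(?!bar).)*baz')
ok = bool(between.search("foo and baz"))
blocked = bool(between.search("foo bar baz"))
print(ok, blocked)

# Simple whole-string check: fails if PATTERN occurs anywhere.
no_pattern = re.compile(r'^(?!.*PATTERN)')
clean = bool(no_pattern.search("nothing to see"))
dirty = bool(no_pattern.search("has PATTERN inside"))
print(clean, dirty)
```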
About newlines: there are two problems with your regex and newlines. First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches. Second, $ doesn't match just at the end of the string, it also can match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to only match strings with no whitespace; \z always only matches at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s. So the correct general purpose regex is:
/^(?:(?!PATTERN).)*\z/s
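The two newline pitfalls can be reproduced outside Perl as well. In this Python sketch (an assumption, not part of the original answer), re.S plays the role of the /s flag and Python's \Z is used as the counterpart of Perl's \z, since both match only at the very end of the string:

```python
import re

# Without DOTALL, "." cannot cross the newline, so the match fails.
no_flag = re.search(r'^(?:(?!baz).)*$', "foo\nbar")
# With DOTALL (Perl's /s) the string is accepted.
with_flag = re.search(r'^(?:(?!baz).)*$', "foo\nbar", re.S)

# "$" also matches just before a trailing newline, so this wrongly succeeds.
dollar = re.search(r'^(?:(?!\s).)*$', "foo\n", re.S)
# \Z only matches at the true end of the string, so this correctly fails.
strict_end = re.search(r'^(?:(?!\s).)*\Z', "foo\n", re.S)

print(no_flag, bool(with_flag), bool(dollar), strict_end)
```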
jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *
What does this group do? It contains
a negative lookahead (you may want to read up on lookarounds here) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.
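The position-by-position reading can be spelled out as an ordinary loop. The helper below, lacks_pattern, is a hypothetical name introduced only for illustration; it is written in Python (an assumption) and assumes a literal, non-empty pattern and a string without newlines:

```python
import re

def lacks_pattern(s: str, pattern: str) -> bool:
    """Hand-written reading of ^(?:(?!pattern).)*$ for a literal pattern."""
    for i in range(len(s)):
        # The negative lookahead: pattern must not start at position i.
        if s.startswith(pattern, i):
            return False
        # The dot: position i is accepted, move on to the next character.
    return True

regex = re.compile(r'^(?:(?!' + re.escape("PATTERN") + r').)*$')
samples = ["", "abc", "PATTERN", "xxPATTERNyy", "PATTER N"]
comparison = [(lacks_pattern(s, "PATTERN"), bool(regex.search(s))) for s in samples]
print(comparison)
```

For these samples, the loop and the regex should agree on every string.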

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which check the neighbouring characters without consuming them, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
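The missing fourth match and the fixes can be reproduced with the query string from the question. This sketch uses Python's re.findall (Python is an assumption; the question does not name an engine):

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Original pattern: the character after each + is consumed,
# so the "e" in "s+e+S" cannot start the next match.
consuming = re.findall(r'(?:\S)\++(?:\S)', query)
# Lookahead version: the following character is checked but not consumed.
lookahead = re.findall(r'\S\++(?=\S)', query)
# Plain version: just the runs of plus signs.
plain = re.findall(r'\++', query)

print(consuming)
print(lookahead)
print(plain)
```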
You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it - JavaScript engines that predate ES2018 don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are preceded or followed by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
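The negative and positive lookaround variants differ at the edges of the string: (?<!\s) and (?!\s) are satisfied when there is no neighbouring character at all, while (?<=\S) and (?=\S) require one. A short Python sketch (the language and the sample string are assumptions for illustration) shows the difference:

```python
import re

loose = re.compile(r'(?<!\s)\++(?!\s)')    # no whitespace next to the pluses
strict = re.compile(r'(?<=\S)\++(?=\S)')   # a non-whitespace char on both sides

s = "+a b + c+"  # plus at the start, between spaces, and at the end
loose_positions = [m.start() for m in loose.finditer(s)]
strict_positions = [m.start() for m in strict.finditer(s)]
print(loose_positions, strict_positions)

# In the middle of a word both patterns agree.
inner = (loose.findall("a+b"), strict.findall("a+b"))
print(inner)
```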