RegEx negative lookahead on pattern

I want to find all expressions that don't end with ":"
I tried to do it like that:
[a-z]{2,}(?!:)
On this text:
foobar foobaz:
foobaz
foobaz:
The problem is that it only drops the last character before the ":" instead of rejecting the whole match.
Here is the example: https://regex101.com/r/jtLRvz/1
How can I get the negative lookahead to work for the whole regular expression?

When [a-z]{2,}(?!:) is applied to baz:, [a-z]{2,} first grabs all three lowercase ASCII letters (baz) and the negative lookahead (?!:) checks the char immediately to the right. That char is :, so the lookahead fails and the engine looks for another way to match. Since {2,} is also satisfied by two chars instead of the three currently matched, it backtracks to ba. The next char is now z, not :, so the lookahead succeeds and ba is returned as a valid match.
Add a-z to the lookahead pattern to make sure the char right after 2 or more lowercase ASCII letters is not a letter and not a colon:
[a-z]{2,}(?![a-z:])
             ^^^
See the regex demo
If your regex engine supports possessive quantifiers or atomic groups, you may use them to prevent backtracking into the [a-z]{2,} subpattern:
[a-z]{2,}+(?!:)
(?>[a-z]{2,})(?!:)
See another regex demo.
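The backtracking described above can be reproduced with a small Python sketch. It uses the sample text from the question; the variable names are only illustrative, and it sticks to the lookahead fix because possessive quantifiers and atomic groups are, as far as I know, only available in Python's re module from version 3.11 on.

```python
import re

# Sample text from the question.
text = "foobar foobaz:\nfoobaz\nfoobaz:"

# Original attempt: the engine gives back one letter to satisfy (?!:),
# so "fooba" is reported for each "foobaz:".
naive = re.findall(r"[a-z]{2,}(?!:)", text)

# Lookahead fix: a letter is also forbidden after the match,
# so giving back letters never produces a valid match.
fixed = re.findall(r"[a-z]{2,}(?![a-z:])", text)

print(naive)
print(fixed)
```

With the first pattern the truncated fooba shows up in the results; with the second, only the words that are not followed by a colon remain.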

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am fairly new to RegEx and I am having difficulty creating the expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are present, they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" pick up words such as "Abc-Abc", but I don't believe there is a way to avoid these.

If lookaheads are supported and you don't want to match a double hyphen (--), you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally the allowed characters, followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also prevent a match when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
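Since the question is about Python, here is a minimal sketch that runs both patterns from this answer against the examples in the question. The variable names and the second test string are made up for illustration.

```python
import re

# First pattern: at least two uppercase letters in a word made of
# letters/digits, with parts optionally joined by single hyphens.
abbrev = re.compile(r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b")

# Second pattern: additionally refuse a hyphen directly before or after.
abbrev_strict = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

# Wanted words from the question, followed by the two unwanted ones.
sample = "ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC, A-bc, a-b-c"
found = abbrev.findall(sample)

# Illustrative string: only the last ABC has no hyphen next to it.
found_strict = abbrev_strict.findall("-ABC ABC- ABC")

print(found)
print(found_strict)
```

A-bc and a-b-c are skipped because the lookahead cannot find two uppercase letters in them.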

Regex - How to prevent any string that starts with "de" but cannot use lookahead or lookbehind?

I have a regex
[a-zA-Z][a-z]
I have to change this regex so that it does not accept strings that start with "de", "DE", "dE" or "De". I cannot use lookbehind or lookahead because my system does not support them.

There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would be written as:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, diZ, for instance, would not match what is before the | and would therefore fall into group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, well, there will be no groups to get at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
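A minimal Python sketch of the honeypot idea, assuming <your-regex> is the [a-zA-Z][a-z] from the question. The names YOUR_REGEX, honeypot and accept are hypothetical and only serve the illustration.

```python
import re

# Hypothetical stand-in for <your-regex>: the pattern from the question.
YOUR_REGEX = r"[a-zA-Z][a-z]"

# The left branch swallows anything starting with "de" (any case);
# only the right branch fills group 1.
honeypot = re.compile(r"[dD][eE].*|(" + YOUR_REGEX + r")")

def accept(candidate):
    """Return the text captured by group 1, or None if the input is rejected."""
    m = honeypot.match(candidate)  # match() only tries at the start of the string
    return m.group(1) if m else None

print(accept("diZ"))    # handled by the right branch
print(accept("Debug"))  # caught by the honeypot, group 1 stays empty
print(accept("12"))     # no branch matches at all
```

The caller only ever looks at group 1, so an input caught by the left branch and an input that matches nothing are both treated as rejected.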

Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+

This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
See live demo.
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.

You can use an alternation excluding matching the d and D from the first character, or exclude matching the e as the second character.
Note that the pattern [a-zA-Z][a-z] requires 2 characters, and so does the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase char a-z without e
) Close non capture group
.* Match 0+ times any char except a newline
Regex demo
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
Regex demo
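Both alternation patterns can be checked quickly in Python. This is only a sketch: the word list and the variable names are invented for the test.

```python
import re

# Anchored version: the first char is not d/D, or the second char is not e.
anchored = re.compile(r"^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*")

# Word-boundary version for finding words inside a longer text.
bounded = re.compile(r"\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b")

# Invented test words: three that should pass, two that start with "de".
results = {w: bool(anchored.match(w)) for w in ["dog", "echo", "be", "delta", "Delta"]}
words = bounded.findall("dog delta Delta echo be")

print(results)
print(words)
```

Neither pattern uses a lookaround: the exclusion is expressed purely through the two character-class alternatives.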

Regex match last occurrence of substring among the same substrings in the string

For example we have a string:
asd/asd/asd/asd/1#s_
I need to match this part: /asd/1#s_ or asd/1#s_
How can this be done with plain regex?
I've tried a negative lookahead like this, but it didn't work:
\/(?:.(?!\/))?(asd)(\/(([\W\d\w]){1,})|)$
it matches this '/asd/asd/asd/asd/asd/asd/1#s_'
from this 'prefix/asd/asd/asd/asd/asd/asd/1#s_'
and I need to match '/asd/1#s_' without all preceding /asd/'s
The match should work with plain regex, without any helper functions of a programming language.
I use https://regexr.com/ to check whether the regex matches or not.
here's the possible strings:
prefix/asd/asd/asd/1#s
prefix/asd/asd/asd/1s#
prefix/asd/asd/asd/s1#
prefix/asd/asd/asd/s#1
prefix/asd/asd/asd/#1s
prefix/asd/asd/asd/#s1
and asd part could be replaced with any word like
prefix/a1sd/a1sd/a1sd/1#s
prefix/a1sd/a1sd/a1sd/1s#
...
So I need to match the last repeating part together with everything to its right.
Everything to the right can be word characters, non-word characters and digits, in any order.
A more complicated string example:
prefix/a1sd/a1sd/a1sd/1s#/ds/dsse/a1sd/22$$#!/123/321/asd
this should match that part:
/a1sd/22$$#!/123/321/asd

If you want the match only, you can use \K to reset the match buffer right before the parts that you want to match:
^.*\K/a\d?sd/\S+
The pattern will match
^ Start of string
.* Match any char except a newline until end of the line
\K Forget what is matched until now
/a\d?sd/ Match a, an optional digit and sd between forward slashes
\S+ Match 1+ non whitespace chars
See a regex demo
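The question asks for plain regex, so the following is only a way to check the idea in Python. Python's built-in re module does not support \K, so this sketch swaps in a capture group: the greedy .* still pushes the match to the last /a\d?sd/ and group 1 holds the part that \K would have kept. The names last_part and tail are hypothetical.

```python
import re

# Greedy .* consumes as much as possible, then backtracks to the LAST
# position where /a\d?sd/ followed by non-whitespace can still match.
last_part = re.compile(r"^.*(/a\d?sd/\S+)")

def tail(s):
    """Return the last /a\\d?sd/... segment with everything to its right, or None."""
    m = last_part.match(s)
    return m.group(1) if m else None

print(tail("prefix/asd/asd/asd/asd/asd/asd/1#s_"))
print(tail("prefix/a1sd/a1sd/a1sd/1s#/ds/dsse/a1sd/22$$#!/123/321/asd"))
```

In the second string the trailing /asd is not followed by a slash, so the last full /a1sd/ segment is the one that gets captured.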

This regex to match a word surrounded by {} does not work

So here's my regex to match a word after "define" or "define:"
((?<=define |define: )\w+)
That part works well and all. But when I add the part where it also should match word between {} if it can, it matches everything.
((?<=define |define: )\w+)|([^{][A-Z]+[^}])
The regex with the examples
What I noticed is that as soon as I add ^ to the first [{], it ruins everything, and I don't understand why.

Why does using [^{] not work?
By using [^{], your regex becomes:
[^{][A-Z]+[^}]
In words, this translates to:
character that's not a {
a bunch of letters
character that's not a }
Note how nothing in your regex enforces the idea that the "a bunch of letters" part has to be between {}s. It just says that it has to be after a character that is not {, and before a character that is not }. By this logic, even something like ABC would match because A is not {, B is the bunch of letters, and C is not }.
How to match a word between {}?
You can use this regex:
{([A-Z]+)}
And get group 1.
I don't think that you should combine this with the regex that matches a word after define. You should use 2 separate regexes because these are two completely different things.
So split it into two regexes:
(?<=define |define: )\w+
and
{([A-Z]+)}
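A short Python sketch of the points above; the sample strings are made up. Two adjustments are assumptions of this sketch rather than part of the answer: the braces are escaped as a precaution, and the define lookbehind is split in two because Python's re module only accepts fixed-width lookbehinds.

```python
import re

# The pattern from the question: nothing here requires actual braces.
loose = re.compile(r"[^{][A-Z]+[^}]")

# The suggested pattern, with the braces escaped as a precaution.
braced = re.compile(r"\{([A-Z]+)\}")

# "define " and "define: " differ in length, so each gets its own lookbehind.
after_define = re.compile(r"(?:(?<=define )|(?<=define: ))\w+")

print(loose.findall("ABC"))                  # matches although there are no braces
print(loose.findall("{OO}"))                 # the braces themselves block the match
print(braced.findall("see {WORD} and {OO}"))
print(after_define.findall("define foo and define: bar"))
```

Keeping the two patterns separate also means that group 1 of the brace pattern never has to be told apart from the define match.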

You are using negated character classes the way we would use positive lookbehind (?<=) and positive lookahead (?=). They are fundamentally different and, as opposed to lookbehind or lookahead, character classes consume characters.
Hence:
[^{][A-Z] matches a capital letter that is preceded by a character other than {.
[A-Z][^}] matches a capital letter that is followed by a character other than }.
So if you try to match the letters in {OO} with the regex [^{][A-Z]+[^}], it is totally normal that your regex won't match anything because you have two letters, one preceded by a {, the other followed by a }.

Regex pattern to match string that's not followed by a colon

Using regex, I'm trying to match any string of characters that meets the following conditions (in the order displayed):
Contains a dollar sign $; then
at least one letter [a-zA-Z]; then
zero or more letters, numbers, underscores, periods (dots), opening brackets, and/or closing brackets [a-zA-Z0-9_.\[\]]*; then
one pipe character |; then
one hash sign #; then
at least one letter [a-zA-Z]; then
zero or more letters, numbers, and/or underscores [a-zA-Z0-9_]*; then
zero colons :
In other words, if a colon is found at the end of the string, then it should not count as a match.
Here are some examples of valid matches:
$tmp1|#hello
$x2.h|#hi_th3re
Valid match$here|#in_the middle of other characters
And here are some examples of invalid matches:
$tmp2|#not_a_match:"because there is a colon"
$c.4a|#also_no_match:
Here are some of the patterns I've tried:
(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]*)(\|#)([a-zA-Z][a-zA-Z0-9_]*(?!.[:]))
(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]+)?(\|#)([a-zA-Z][a-zA-Z0-9_]*(?![:]))
(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]+)?(\|#)([a-zA-Z][a-zA-Z0-9_]*)([^:])

This pattern will do what you need
\$[A-Za-z]+[\w.\[\]]*[|]#[A-Za-z]+[\w]*+(?!:)
Regex Demo
I am using the possessive quantifier in [\w]*+ to cut down on backtracking. You can also use an atomic group instead of the possessive quantifier:
\$[A-Za-z]+[\w.\[\]]*[|]#[A-Za-z]+(?>[\w]*)(?!:)
NOTE
\w => [A-Za-z0-9_]
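For engines without possessive quantifiers or atomic groups (in Python's re module they are, as far as I know, only available from version 3.11 on), a similar effect can be sketched by widening the lookahead so that backtracking never lands on a valid position. This is a swapped-in technique, not the pattern above, and the names naive, guarded and first_match are illustrative.

```python
import re

# Without any guard, \w* gives back one char and (?!:) is satisfied.
naive = re.compile(r"\$[A-Za-z]+[\w.\[\]]*[|]#[A-Za-z]+\w*(?!:)")

# Widened lookahead: the next char may be neither a word char nor a colon,
# so a shortened match is always followed by a forbidden char.
guarded = re.compile(r"\$[A-Za-z]+[\w.\[\]]*[|]#[A-Za-z]+\w*(?![\w:])")

def first_match(pattern, s):
    """Return the first matched substring, or None if there is none."""
    m = pattern.search(s)
    return m.group(0) if m else None

print(first_match(naive, "$c.4a|#also_no_match:"))    # truncated match slips through
print(first_match(guarded, "$c.4a|#also_no_match:"))  # rejected
print(first_match(guarded, "Valid match$here|#in_the middle of other characters"))
```

The widened lookahead costs a little more backtracking than the possessive version, but it works in engines that lack both features.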

I tested your third pattern in Regex 101 and it appears to be working correctly:
^.*(\$[a-zA-Z])([a-zA-Z0-9_.\[\]]+)?(\|#)([a-zA-Z][a-zA-Z0-9_]*)([^:]).*$
The only change I needed to make to the regex to make it work was to add anchors ^ and $ to the start and end of the regex. I also allowed for your pattern to occur as a substring in the middle of a larger string.
By the way, you had the following example as a string which should not match:
$tmp2|#not_a_match:"because there is a colon"
However, even if we remove the colon from this string it will still not match because it contains quotes which are not allowed.
Regex101
