Regex must contain specific letters in any order - regex

I have been trying to validate a string in VB.NET that must contain the three letters A, B, and C, in no particular order and not necessarily next to one another.
I can do this easily using LINQ
MessageBox.Show(("ABC").All(Function(n) ("AAAABBBBBCCCC").Contains(n)).ToString)
However, after searching Google and SO for over a week, I am completely stumped on how to do this with a regex. My closest pattern is ".*[A|B|C]+.*[A|B|C]+.*[A|B|C]+.*"; however, AAA would also return true. I know I can do this using other methods, but after trying for a week I really want to know whether it's possible using one regular expression.

Your original pattern won't work because inside a character class | is a literal character, so [A|B|C]+ matches one or more A, B, C, or | characters. The pattern therefore matches any number of characters, followed by one or more of those characters, followed by any number of characters, and so on for all three classes. Nothing requires the three classes to match different letters, which is why AAA passes.
I'd probably go with the code you've already written, but if you really want to use a regular expression, you can use a series of lookahead assertions, like this:
(?=.*A)(?=.*B)(?=.*C)
This will match any string that contains A, B, and C in any order.
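A quick way to check this is to run both patterns against a few inputs. The sketch below uses Python's re module rather than VB.NET (the lookahead syntax is the same in both engines); the sample strings and the helper name contains_abc are made up for illustration.

```python
import re

# The pattern from the question: "|" inside [...] is a literal character.
ORIGINAL = re.compile(r".*[A|B|C]+.*[A|B|C]+.*[A|B|C]+.*")
# Three independent lookaheads, all tested from the same starting position.
LOOKAHEADS = re.compile(r"(?=.*A)(?=.*B)(?=.*C)")

def contains_abc(text):
    """Return True if text contains at least one A, one B and one C."""
    return LOOKAHEADS.search(text) is not None

samples = ["AAAABBBBBCCCC", "xCxBxA", "AAA", "AB"]
for s in samples:
    # Columns: input, result of the original pattern, result of the lookaheads.
    print(s, ORIGINAL.search(s) is not None, contains_abc(s))
```

For AAA the original pattern reports a match while the lookahead version does not, which is the difference the question was after.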

You can make use of positive lookaheads:
^(?=.*A)(?=.*B)(?=.*C).+
(?=.*A) makes sure there's an A somewhere in the string and the same logic applies to the other lookaheads.

You can use zero-width lookaheads. Lookaheads are great for eliminating match possibilities that don't meet certain criteria.
For example, let's use the words
untie queue unique block unity
Start with a basic word match:
\b\w+\b
To require that the word matched by \w+ begins with un, we could use a positive lookahead:
\b(?=un)\w+\b
What this says is
\b Assert a word boundary (zero-width; no character is consumed)
(?=un) Are the next letters "un"? If not, NO MATCH. If so, then possible match.
\w+ One or more word characters
\b Assert a word boundary
A positive lookahead eliminates a match possibility if the text does NOT meet the expression inside. It is zero-width: it checks the text starting at the current position without consuming anything. So the (?=un) above is tested at the same position where \w+ starts, and requires that the word BEGINS WITH un. If it does not, there is no match at that position.
How about matching any words that do not begin with un? Simply use a "negative lookahead"
\b(?!un)\w+\b
\b Assert a word boundary (zero-width; no character is consumed)
(?!un) Are the next letters "un"? If SO, NO MATCH. If not, then possible match.
\w+ One or more word characters
\b Assert a word boundary
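As a rough check of the two breakdowns above, here is a small Python sketch that runs both patterns over the sample words. The variable names are arbitrary.

```python
import re

text = "untie queue unique block unity"

# Positive lookahead: the word must start with "un".
starts_with_un = re.findall(r"\b(?=un)\w+\b", text)
# Negative lookahead: the word must NOT start with "un".
not_un = re.findall(r"\b(?!un)\w+\b", text)

print(starts_with_un)  # ['untie', 'unique', 'unity']
print(not_un)          # ['queue', 'block']
```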
So for your requirement of having at least 1 A, 1 B and 1 C in the string, a pattern like
(?=.*A)(?=.*B)(?=.*C).+
Works because it says:
(?=.*A) - Does it have .* (any characters) followed by A? If so, possible match; if not, no match.
(?=.*B) - Does it have .* (any characters) followed by B? If so, possible match; if not, no match.
(?=.*C) - Does it have .* (any characters) followed by C? If so, possible match; if not, no match.
.+ - If the above 3 lookahead requirements were met, match any characters. If not, then match no characters (and so there isn't a match).

Does it have to be a regex? That's something that can easily be solved without one.
I've never programmed in VB, but I'm sure there are helper functions that let you take a string, and query whether or not a character occurs in it.
If str is your string, maybe something like:
str.Contains("A") AndAlso str.Contains("B") AndAlso str.Contains("C")
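For comparison, the same regex-free idea can be sketched in Python, mirroring the LINQ one-liner from the question. The function name has_all_letters and its default argument are illustrative only.

```python
def has_all_letters(text, required="ABC"):
    """Return True if every character in `required` occurs in `text`."""
    return all(letter in text for letter in required)

print(has_all_letters("AAAABBBBBCCCC"))  # True
print(has_all_letters("AAA"))            # False
```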

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to regex and am having difficulty creating an expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words containing a dash and report the whole word (both before and after the dash). If numbers are present, they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to give me only words that have at least two capitalised letters. I understand that it will also "mistakenly" match words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match, 2 times, zero or more of the allowed characters (lowercase letters, digits, hyphen) followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also avoid matching when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
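Since the question is about Python, both patterns can be tried directly with the re module. This is only a sketch: the names BASIC and STRICT and the hyphen test string are invented here, and the word lists are the examples from the question.

```python
import re

# Both patterns are copied from the answer above.
BASIC = re.compile(r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b")
STRICT = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

wanted = "ABC AbC ABc A-ABC a-ABC ABC-a ABC123 ABC-123 123-ABC"
unwanted = "A-bc a-b-c"

print(BASIC.findall(wanted))    # every word in `wanted`
print(BASIC.findall(unwanted))  # []

# The stricter pattern also rejects words with a leading or trailing hyphen.
hyphenated = "AB-- --AB AB"
print(BASIC.findall(hyphenated))   # ['AB', 'AB', 'AB']
print(STRICT.findall(hyphenated))  # ['AB']
```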

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to write a regex that matches words, but ignores any word that is followed by a :. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and ignores only the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead, and finds a match whenever there are at least two letters before a colon, returning the letters up to the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
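The backtracking effect is easy to reproduce. The sketch below uses Python's re module on the string from the question; the variable names are arbitrary.

```python
import re

text = "lame:joker"

# Original: backtracking gives up the final "e" so the lookahead can pass.
original = re.findall(r"[a-zA-Z]+(?!:)", text)
# Fixed: the run of letters may not be followed by another letter or by ":".
fixed = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(original)  # ['lam', 'joker']
print(fixed)     # ['joker']
```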
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: the meaning of a word boundary is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c", or "12_" (there is no word boundary between 2 and c or _).
\d+(?![\d:]) matches 12 in "12,", "12c", and "12_"; it fails only in "12:".
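To make the difference concrete, here is a small Python sketch that applies both number patterns to the four inputs mentioned above. The helper first_match is a made-up convenience function.

```python
import re

samples = ["12,", "12:", "12c", "12_"]

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

with_boundary = {s: first_match(r"\d+\b(?!:)", s) for s in samples}
with_lookahead = {s: first_match(r"\d+(?![\d:])", s) for s in samples}

print(with_boundary)   # only "12," yields "12"
print(with_lookahead)  # "12,", "12c" and "12_" yield "12"
```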
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

This regex to match a word surrounded by {} does not work

So here's my regex to match a word after "define" or "define:"
((?<=define |define: )\w+)
That part works well and all. But when I add the part where it also should match word between {} if it can, it matches everything.
((?<=define |define: )\w+)|([^{][A-Z]+[^}])
The regex with the examples
The thing I noticed is that adding ^ to the first [{] ruins everything, and I don't understand why.
Why does using [^{] not work?
By using [^{], your regex becomes:
[^{][A-Z]+[^}]
In words, this translates to:
character that's not a {
a bunch of letters
character that's not a }
Note how nothing in your regex enforces the idea that the "a bunch of letters" part has to be between {}s. It just says that it has to be after a character that is not {, and before a character that is not }. By this logic, even something like ABC would match because A is not {, B is the bunch of letters, and C is not }.
How to match a word between {}?
You can use this regex:
{([A-Z]+)}
And get group 1.
I don't think that you should combine this with the regex that matches a word after define. You should use 2 separate regexes because these are two completely different things.
So split it into two regexes:
(?<=define |define: )\w+
and
{([A-Z]+)}
You are using negated character classes the way we would use positive lookbehind (?<=) and positive lookahead (?=). They are fundamentally different and, as opposed to lookbehind or lookahead, character classes consume characters.
Hence:
[^{][A-Z] matches a capital letter that is preceded by a character other than {.
[A-Z][^}] matches a capital letter that is followed by a character other than }.
So if you try to match the letters in {OO} with the regex [^{][A-Z]+[^}], it is totally normal that your regex won't match anything because you have two letters, one preceded by a {, the other followed by a }.
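Both observations can be checked with a short Python sketch. The names NEGATED and BRACES are invented here, and the braces are escaped for readability.

```python
import re

NEGATED = re.compile(r"[^{][A-Z]+[^}]")  # second alternative from the question
BRACES = re.compile(r"\{([A-Z]+)\}")     # literal braces around a capture group

# The negated classes consume characters, so plain "ABC" matches in full.
plain = NEGATED.search("ABC")
print(plain.group(0))  # ABC

# With real braces there is no room for the two extra characters.
print(NEGATED.search("{OO}"))  # None

# The brace pattern captures only the word between the braces.
inside = BRACES.search("{OO}").group(1)
print(inside)  # OO
```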

RegEx more than multiple characters before number

I really don't use RegEx that much. You could say I am RegEx n00b. I have been working on this issue for a half a day.
I am trying to write a pattern that looks backward from a number character. For example:
1. bob1 => bob
2. cat3 => cat
3. Mary34 => Mary
So far I have this (?![A-Z][a-z]{1,})([A-Za-z_])
It only matches for individual characters, I want all the characters before the number character. I tried to add the ^ and $ into my pattern and using an online simulator. I am unsure where to put the ^ and $.
NOTE: I am using RegEx for the .NET Framework
You may use a regex like
[\p{L}_]+(?=\d)
or
[\w-[\d]]+(?=\d)
See the regex demo
Pattern details
[\p{L}_]+ - any 1 or more letters (both lower- and uppercase) and/or _
OR
[\w-[\d]]+ - 1 or more word chars except digits (the -[] inside a character class is a character class subtraction construct)
(?=\d) - a positive lookahead that requires a digit to appear immediately to the right of the current location
If we break down your RegEx, we see:
(?![A-Z][a-z]{1,}) which says "at this position, what follows must NOT be one uppercase letter followed by one or more lowercase letters" and ([A-Za-z_]) which says "match one letter or underscore". Together they match a single letter or underscore at a time (any one that does not begin a capitalized word), which is why you only get individual characters.
If I understand what you want to achieve, then you want all of the letters before a number. I would write something like that as:
\b([a-zA-Z]+)[0-9]
This will start at a word boundary \b, match one or more letters, and require a digit right after the matched string.
(The syntax I used seems to match this document about .NET RegEx: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions)
In light of Wiktor Stribizew's comment, here is a pure match RegEx:
\b[a-zA-Z_]+(?=[0-9])
This matches the pattern and then looks ahead for the digit. This is better than my first lookahead attempt. (Thank you Wiktor.)
http://www.rexegg.com/regex-lookarounds.html
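The question targets .NET, but the two ASCII patterns from the last answer behave the same way in Python, so they can be sketched there. Note that \p{L} and character class subtraction are not available in Python's built-in re module, which is why they are left out. The names LOOKAHEAD and CAPTURE are illustrative.

```python
import re

inputs = ["bob1", "cat3", "Mary34"]

# Lookahead version: the digit is required but is not part of the match.
LOOKAHEAD = re.compile(r"\b[a-zA-Z_]+(?=[0-9])")
# Capture-group version: the digit is consumed, so read group 1 instead.
CAPTURE = re.compile(r"\b([a-zA-Z]+)[0-9]")

via_lookahead = [LOOKAHEAD.search(s).group(0) for s in inputs]
via_group = [CAPTURE.search(s).group(1) for s in inputs]

print(via_lookahead)  # ['bob', 'cat', 'Mary']
print(via_group)      # ['bob', 'cat', 'Mary']
```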

How to only match a single instance of a character?

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
\K discards the previously matched characters, and the lookahead asserts whether a match is possible or not. So the above matches only the letter a that satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
See the regex demo.
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is an a char.
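The lookaround pattern is also supported by Python's re module, so the examples from the question can be checked with a short sketch. The name LONE_A is made up; the dictionary maps each sample to the start indexes of its matches.

```python
import re

LONE_A = re.compile(r"(?<!a)a(?!a)")

samples = ["aa", "aaa", "fooaaxyz", "a", "fooaxyz"]
positions = {s: [m.start() for m in LONE_A.finditer(s)] for s in samples}

for s, found in positions.items():
    # An empty list means no lone "a" was found in that string.
    print(repr(s), found)
```

Only "a" (index 0) and "fooaxyz" (index 3) produce a match; the three strings with doubled or tripled a produce none.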
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace, or remove it in whatever way suits the language you use.
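A minimal Python sketch of this whole-string check follows; the function name has_single_a is hypothetical. Note that this approach answers a slightly different question from (?<!a)a(?!a): it requires exactly one a in the entire string, so a string such as abca, which contains two separate lone a characters, is rejected.

```python
import re

EXACTLY_ONE_A = re.compile(r"^[^a]*a[^a]*$")

def has_single_a(text):
    """Return True if text contains exactly one "a" overall."""
    return EXACTLY_ONE_A.search(text) is not None

for s in ["a", "fooaxyz", "aa", "fooaaxyz", "abca"]:
    print(repr(s), has_single_a(s))
```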