How do you specify multiples in negative character classes in regular expressions? - regex

I am trying to write a regular expression to search for anything but digits or the * or - characters, with one caveat. Where I'm hitting a wall is that I need to be able to allow three or less digits to be found but not four or more, though even one * or - shouldn't be found.
This is what I have so far (for three matches):
.*?([^0-9\*-]+).*?([^0-9\*-]+).*?([^0-9\*-]+).*?
I have no idea where to insert {4,} for the digits (I've tried and it doesn't seem to work anywhere) or how to change it to do as I want.
For instance, in "Jack has* 777 1883874 -sheep-" I'd like it to return "Jack has 777 sheep". Or in "2343klj-3***.net" I'd like it to return "klj 3 .net"

You may use the following regex (replacing with a literal space, " "):
(?:[-*\s]|\d{4,})+
See the regex demo. The pattern has no capturing group, so the replacement is just the single literal space.
Details
(?:[-*\s]|\d{4,})+ - a non-capturing group matching one or more consecutive repetitions of
[-*\s] - a single whitespace, - or * char
| - or
\d{4,} - 4+ digits.
Next, to remove all leading and trailing whitespace you may use
^\s+|\s+$
and replace with an empty string. ^\s+ matches 1+ whitespaces at the start of the string and \s+$ matches 1+ whitespaces at the end of the string.
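The two replacements above can be sketched in Python as follows. This is only an illustration: the helper name clean is hypothetical, and Python's re module is assumed as the engine (it supports \s and \d{4,} as used here).

```python
import re

# One or more consecutive separators: a whitespace, '-' or '*' char,
# or a run of four or more digits.
SEPARATORS = re.compile(r"(?:[-*\s]|\d{4,})+")
# Whitespace left at either end of the string after the first pass.
EDGES = re.compile(r"^\s+|\s+$")


def clean(text: str) -> str:
    """Hypothetical helper: apply the two replacements in order."""
    collapsed = SEPARATORS.sub(" ", text)  # step 1: replace with one space
    return EDGES.sub("", collapsed)        # step 2: trim both ends


print(clean("Jack has* 777 1883874 -sheep-"))  # Jack has 777 sheep
print(clean("2343klj-3***.net"))               # klj 3 .net
```

Runs of one to three digits survive because neither alternative can match them: the character class has no digits and \d{4,} needs at least four.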

With the help here, this is what works. It may be impossible to do it all in one regex because of the conflict of needing no spaces at the beginning and end but spaces in between each remaining grouping.
First, a find and replace using ([-*\h]|\d{4,})+ and replacing with a space.
Second, using ^\s*(.*?)\s*$ and replacing with $1 (the lazy .*? keeps the trailing whitespace out of the captured group).

Related

Regex for 5-7 characters, or 6-8 if including a space (no special characters allowed)

I am trying to create a regex for some basic postcode validation. It doesn't need to provide full validation (in my usage it's fine to miss out the space, for example), but it does need to check for the number of characters being used, and also make sure there are no special characters other than spaces.
This is what I have so far:
^[\s.]*([^\s.][\s.]*){5,7}$
This mostly works, but it has two flaws:
It allows for ANY character, rather than just alphanumeric characters + spaces
It allows for multiple spaces to be inserted:
I have tried updating it as follows:
^[\s.]*([a-zA-Z0-9\s.][\s.]*){5,7}$
This seems to have fixed the character issue, but still allows multiple spaces to be inserted. For example, this should be allowed:
AB14 4BA
But this shouldn't:
AB1 4 4BA
How can I modify the code to limit the number of spaces to a maximum of one (it's fine to have none at all)?
With your current set of rules you could say:
^(?:[A-Za-z0-9]{5,7}|(?=.{6,8}$)[A-Za-z0-9]+\s[A-Za-z0-9]+)$
See an online demo
^ - Start-line anchor;
(?: - Open non-capture group for alternations;
[A-Za-z0-9]{5,7} - Just match 5-7 alphanumeric chars;
| - Or;
(?=.{6,8}$) - Positive lookahead to assert position is followed by 6 to 8 characters until the end-line anchor;
[A-Za-z0-9]+\s[A-Za-z0-9]+ - Match 1+ alphanumeric chars on either side of the whitespace character;
)$ - Close non-capture group and match the end-line anchor.
Alternatively, maybe a negative lookahead to prevent multiple spaces from occurring (or a space at the start):
^(?!\S*\s\S*\s|\s)(?:\s?[A-Za-z0-9]){5,7}$
See an online demo where I replaced \s with [^\S\n] for demonstration purposes. Also, though being the shorter expression, the latter will take more steps to evaluate the input.
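As a quick way to compare the two patterns, here is a small Python sketch. The constant names and the sample list are made up for illustration; the patterns themselves are copied from the answer unchanged.

```python
import re

# Variant 1: 5-7 alphanumerics, or 6-8 chars with exactly one inner whitespace.
ALTERNATION = re.compile(
    r"^(?:[A-Za-z0-9]{5,7}|(?=.{6,8}$)[A-Za-z0-9]+\s[A-Za-z0-9]+)$"
)
# Variant 2: reject two whitespaces (or a leading one) up front,
# then count the alphanumerics.
LOOKAHEAD = re.compile(r"^(?!\S*\s\S*\s|\s)(?:\s?[A-Za-z0-9]){5,7}$")

SAMPLES = ["AB14 4BA", "AB144BA", "AB1 4 4BA", "AB1!4BA", "AB14"]

for sample in SAMPLES:
    # Print whether each variant accepts the sample.
    print(
        repr(sample),
        bool(ALTERNATION.search(sample)),
        bool(LOOKAHEAD.search(sample)),
    )
```

On these samples both variants agree: the first two are accepted and the last three are rejected.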

Regex With Conditional - Not Desired Output

Was actually glossing over a question and found myself struggling to perform something really simple.
If a string contains % I want to use a particular regex, else I want to use a different one.
I tried the following: https://regex101.com/r/UvFZpo/1/
Regex: (%)(?(1)[^$]+|[^%]+).
Test string: abc%
But I'm not getting the expected results.
I was expecting to see abc% matched as it contains %.
If the string was, abc$, I'd expect it to use the second expression.
Where am I going wrong?
Regex parses strings from left to right, position by position.
Once your pattern matches %, the regex index is at the end of the string, so the match fails: there are no more chars left for the subsequent [^$]+ pattern to match.
You can use a mere alternation here:
^(?:([^$]*%[^$]*)|([^%]+))$
See the regex demo
If the string contains %, the Group 1 will be populated, else, Group 2 will.
Details
^ - start of string
(?:([^$]*%[^$]*)|([^%]+)) - either of the two alternatives:
([^$]*%[^$]*) - Group 1: any 0+ chars other than $, as many as possible, then %, then any 0+ chars other than $, as many as possible,
| - or
([^%]+) - any 1+ chars other than %, as many as possible
$ - end of string.
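The difference between the conditional and the alternation can be checked with a short Python sketch. Python's re module is assumed here (it also supports the (?(1)...|...) conditional syntax), and the helper name classify is hypothetical.

```python
import re

# The pattern from the question: (%) is mandatory, so the "else" branch
# of the conditional can never be reached.
CONDITIONAL = re.compile(r"(%)(?(1)[^$]+|[^%]+)")
# The alternation from the answer.
ALTERNATION = re.compile(r"^(?:([^$]*%[^$]*)|([^%]+))$")


def classify(text: str):
    """Hypothetical helper: report which group matched, or None."""
    m = ALTERNATION.match(text)
    if m is None:
        return None
    return "group 1" if m.group(1) is not None else "group 2"


print(CONDITIONAL.search("abc%"))  # None: nothing is left after % for [^$]+
print(classify("abc%"))            # group 1
print(classify("abc$"))            # group 2
```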

.net Regex to look ahead and eliminate strings in advance that dont contain certain characters

I am Using .Net Flavor of Regex.
Suppose i have a string 123456789AB
and i want to match AB (Could be any two Capital letters) only if the string part containing numbers(123456789) has 5 and 8 in it.
So what i came up with was
(?=5)(?=8)([A-Z]{2})
But this is not working.
After some trial and error on RegexStorm
I got to
(?=(.*5))(?=(.*8))[A-Z]{2}
What i am expecting is it will start matching from the start of the string as look ahead does not consume any characters.
But the part "[A-Z]{2}" does not move ahead to match AB in the input string.
My question is why is that so?
i know replacing it with .*[A-Z]{2} will make it move ahead but then the string matched has entire string in it.
What is the solution in this case other than putting word part ([A-Z]{2}) in a separate group and then catching only that group.
Lookaheads check for a pattern match immediately to the right of the current position in the string, without consuming any text. (?=(.*5))(?=(.*8)) matches a location that is immediately followed with any 0 or more chars other than line break chars, as many as possible, and then 5; then, at the same position, another similar check is performed, but requiring 8 after any zero or more chars, as many as possible. Since the regex index does not move, [A-Z]{2} must match right at that same location, and in 123456789AB every location that still has both 5 and 8 to its right starts with a digit, so the match fails.
You may use as many lookbehinds as there are required substrings before the two letters:
(?s)(?<=5.*?)(?<=8.*?)[A-Z]{2}
See the regex demo
Details
(?s) - makes the . match newline characters, too
(?<=5.*?) - a location that is immediately preceded with 5 and then 0 or more chars as few as possible
(?<=8.*?) - a location that is immediately preceded with 8 and then 0 or more chars as few as possible
[A-Z]{2} - two ASCII uppercase letters.
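The lookbehinds above rely on .NET's support for variable-width lookbehind, which Python's built-in re module does not have. The sketch below therefore does two different things: it reproduces the failure of the lookahead-only pattern, and it shows a capture-group variant (the approach the question wanted to avoid) for comparison. The helper name two_letters is hypothetical, and the pattern assumes the input is a run of digits followed by the two letters, as in the sample string.

```python
import re

# Lookaheads only: after they succeed the index has not moved, so
# [A-Z]{2} is tried against digits and no position can match.
LOOKAHEAD_ONLY = re.compile(r"(?=.*5)(?=.*8)[A-Z]{2}")

# Capture-group variant (assumption: digits first, then the letters).
# The lookaheads require 5 and 8 inside the digit run; \d+ then
# consumes the digits so the group can capture the letters.
CAPTURE = re.compile(r"^(?=\d*5)(?=\d*8)\d+([A-Z]{2})")


def two_letters(text: str):
    """Hypothetical helper: return the two letters, or None."""
    m = CAPTURE.search(text)
    return m.group(1) if m else None


print(LOOKAHEAD_ONLY.search("123456789AB"))  # None
print(two_letters("123456789AB"))            # AB
print(two_letters("12346789AB"))             # None (no 5 in the digits)
```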
An alternative would be to "unfold" what you expect to match using exclusionary character classes and alternation of match order. Not pretty, but pretty fast:
(?<=\b[^58]*?(?:5[^8]*8|8[^5]*5)[^A-Z]*?)[A-Z]{2}

Regex to find a starting pattern including either of 2 strings but not contain a specific text

I want to use Regex to find a line containing a particular pattern.
The pattern should be a string starting with 2 characters (a-zA-Z0-9) followed by a dash then either "FAL" or "SAL" and does not include the term "OJT" at all.
Just want to make sure I have the right or am I missing something as it doesn't appear to work as expected
^[a-zA-z0-9]{1,2}(?=.*?\-SAL|-FAL\b)((?!OJT).)*$
You may use
^[a-zA-Z0-9]{1,2}(?!.*OJT).*?(?:-SAL|-FAL)\b.*
See the regex demo
Details
^ - start of string
[a-zA-Z0-9]{1,2} - one or two alphanumeric chars
(?!.*OJT) - a negative lookahead that fails the match if there are any 0+ chars other than line break chars, as many as possible, followed with the OJT char sequence to the right of the current location
.*? - any 0+ chars other than line break chars as few as possible
(?:-SAL|-FAL)\b - -SAL or -FAL not followed with a word char
.* - the rest of string.
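A minimal Python sketch to try the pattern on a few lines. The helper name is_wanted and the sample lines are invented for illustration; the pattern is the one from the answer.

```python
import re

PATTERN = re.compile(r"^[a-zA-Z0-9]{1,2}(?!.*OJT).*?(?:-SAL|-FAL)\b.*")


def is_wanted(line: str) -> bool:
    """Hypothetical helper: True if the line matches the pattern."""
    return PATTERN.match(line) is not None


for line in ["AB-SAL report", "A-FAL", "AB-FAL OJT", "AB-SALT", "AB-XYZ"]:
    print(repr(line), is_wanted(line))
```

The first two lines are accepted. "AB-FAL OJT" is rejected by the negative lookahead, "AB-SALT" by the word boundary, and "AB-XYZ" because neither -SAL nor -FAL is present.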

Regex for finding words with no or only one word between them

I need to find into multiple strings two words with no words or only one word between them. I created the regex for the case to find if those two words exist in string:
^(?=[\s\S]*\bFirst\b)(?=[\s\S]*\bSecond\b)[\s\S]+
and it works correctly.
Then I tried to insert in this regex additional code:
^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+
but it didn't work. It selects text with two or more words between searched words. It is not what I need.
First Second - must be selected
First word1 Second - must be selected
First word1 word2 Second - must not be selected by regex, but my regex selects it.
Can I get advice on how to solve this problem?
Root cause
You should bear in mind that lookarounds match strings without moving along the string, they "stand their ground". Once you write ^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b), the execution is as follows:
^ - the regex engine checks if the current position is the start of string
(?=[\s\S]*\bFirst\b) - the positive lookahead requires the presence of any 0+ chars followed with a whole word First - note that the regex index is still at the start of the string after the lookahead returns true or false
(\b\w+\b){0,1} - this subpattern is checked only if the above check was true (i.e. there is a whole word First somewhere) and matches (consumes, moves the regex index) 1 or 0 occurrences of a whole word (i.e. there must be 1 or more word chars right at the string start)
(?=[\s\S]*\bSecond\b) - another positive lookahead that makes sure there is a whole word Second somewhere after the first whole word consumed with \b\w+\b - if any. Even if the word Second is the first word in the string, this will return true since backtracking will step back the word matched with (\b\w+\b){0,1} (see, it is optional), and the Second will get asserted, and [\s\S]+ will grab the whole string (Group 1 will be empty). See the regex demo with Second word word2 First string.
So, your approach cannot guarantee the order of First and Second in the string, they are just required to be present but not necessarily in the order you expect.
Solution
If you need to check the order of First and Second in the string, you need to combine all the checks into one single lookahead. The approach might turn out very inefficient with longer strings and multiple alternatives in the lookaround; consider either unrolling the patterns or trying multiple regex patterns (like this pseudo-code: if /\bFirst\b/.finds_match().index < /\bSecond\b/.finds_match().index => Good, go on...).
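The pseudo-code idea of comparing match positions could look like this in Python. It is only a sketch: the function name first_precedes_second is hypothetical, and it compares the first occurrence of each word, so it checks the order only, not how many words sit in between.

```python
import re


def first_precedes_second(text: str) -> bool:
    """Hypothetical helper: True if the first whole-word 'First'
    starts before the first whole-word 'Second'."""
    first = re.search(r"\bFirst\b", text)
    second = re.search(r"\bSecond\b", text)
    if first is None or second is None:
        return False  # one of the words is missing
    return first.start() < second.start()


print(first_precedes_second("First word1 Second"))  # True
print(first_precedes_second("Second word1 First"))  # False
```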
If you plan to go on with the regex approach, you may match a string that contains First....Second only in this order:
^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+
See the regex demo
Details:
^ - start of string
(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b) - there must be:
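[\s\S]* - any zero or more chars, as many as possible (the engine backtracks from the end to the last place where the rest of the lookahead pattern matches)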
\bFirst - whole word First
(?:\W+\w+)? - optional sequence (1 or 0 occurrences) of 1+ non-word chars and 1+ word chars
\W+ - 1+ non-word chars
Second\b - Second as a whole word
[\s\S]+ - any 1 or more characters (empty string won't match).
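To see the difference between the question's pattern and the single-lookahead one, here is a small Python sketch. The constant names and the sample list are made up; the patterns are copied as written above.

```python
import re

# Pattern from the question: two independent lookaheads, no order check.
ORIGINAL = re.compile(
    r"^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+"
)
# Pattern from the answer: one lookahead that fixes the order and
# allows at most one word between First and Second.
FIXED = re.compile(r"^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+")

SAMPLES = [
    "First Second",
    "First word1 Second",
    "First word1 word2 Second",
    "Second word word2 First",
]

for sample in SAMPLES:
    # Print whether each pattern matches the sample.
    print(repr(sample), bool(ORIGINAL.search(sample)), bool(FIXED.search(sample)))
```

ORIGINAL accepts all four samples, while FIXED accepts only the first two.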