Regex With Conditional - Not Desired Output - regex

Was actually glossing over a question and found myself struggling to perform something really simple.
If a string contains % I want to use a particular regex, else I want to use a different one.
I tried the following: https://regex101.com/r/UvFZpo/1/
Regex: (%)(?(1)[^$]+|[^%]+).
Test string: abc%
But I'm not getting the expected results.
I was expecting to see abc% matched as it contains %.
If the string was, abc$, I'd expect it to use the second expression.
Where am I going wrong?

Regex parses strings from left to right, position by position.
Once your pattern matches %, the regex index is at the end of the string, so the match fails: there are no more chars left for the subsequent [^$]+ pattern to match.
You can use a mere alternation here:
^(?:([^$]*%[^$]*)|([^%]+))$
See the regex demo
If the string contains %, the Group 1 will be populated, else, Group 2 will.
Details
^ - start of string
(?:([^$]*%[^$]*)|([^%]+)) - either of the two alternatives:
([^$]*%[^$]*) - Group 1: any 0+ chars other than $, as many as possible, then %, then any 0+ chars other than $, as many as possible
| - or
([^%]+) - Group 2: any 1+ chars other than %, as many as possible
$ - end of string.
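A minimal sketch of both patterns in Python, whose re module happens to support the (?(1)...) conditional syntax. The helper name classify is hypothetical and only serves the illustration:

```python
import re

# The conditional pattern from the question: (%) is mandatory, so group 1
# always participates whenever this part of the pattern is reached.
conditional = re.compile(r"(%)(?(1)[^$]+|[^%]+)")

# The alternation suggested in the answer.
alternation = re.compile(r"^(?:([^$]*%[^$]*)|([^%]+))$")

def classify(text):
    """Return 1 if Group 1 matched (string has %), 2 if Group 2 matched, else None."""
    m = alternation.match(text)
    if m is None:
        return None
    return 1 if m.group(1) is not None else 2

print(conditional.search("abc%"))  # None: nothing is left after % for [^$]+
print(classify("abc%"))            # 1
print(classify("abc$"))            # 2
```

A string such as a%$ matches neither alternative, because the first branch forbids $ and the second forbids %.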

Related

How to create a regex function to select and extract a query?

I'm trying to extract a query from a string, I tried writing my own function, but it doesn't match my needs totally.
What I need is:
www.website.com/8056432988456?id=5, I need 8056432988456, with or without the / i.e. preceding a ?.
This is the regex I made for it : (?<=\/)(.*?)(?=\?)|(?<=\?)
Can someone help me out?
You can use
(?<=\/)\d+(?=(?:\/?\?.*)?$)
See the regex demo.
Details:
(?<=\/) - there must be a / immediately on the left
\d+ - one or more digits
(?=(?:\/?\?.*)?$) - immediately on the right, there must be an optional occurrence of:
(?:\/?\?.*)? - an optional occurrence of an optional /, then ? and then any zero or more chars other than line break chars as many as possible
$ - end of string.
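As a rough illustration, the pattern can be tried in Python like this. The function name extract_id is an assumption made for the example; the lookbehind is fixed-width, so Python's re accepts it as is:

```python
import re

# Digits preceded by "/" and followed by an optional "/", then "?...",
# up to the end of the string.
ID_BEFORE_QUERY = re.compile(r"(?<=\/)\d+(?=(?:\/?\?.*)?$)")

def extract_id(url):
    """Return the digit run right before the query string, or None."""
    m = ID_BEFORE_QUERY.search(url)
    return m.group(0) if m else None

print(extract_id("www.website.com/8056432988456?id=5"))   # 8056432988456
print(extract_id("www.website.com/8056432988456/?id=5"))  # 8056432988456
```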

.net Regex to look ahead and eliminate strings in advance that dont contain certain characters

I am Using .Net Flavor of Regex.
Suppose i have a string 123456789AB
and i want to match AB (Could be any two Capital letters) only if the string part containing numbers(123456789) has 5 and 8 in it.
So what i came up with was
(?=5)(?=8)([A-Z]{2})
But this is not working.
After some trial and error on RegexStorm
I got to
(?=(.*5))(?=(.*8))[A-Z]{2}
What i am expecting is it will start matching from the start of the string as look ahead does not consume any characters.
But the part "[A-Z]{2}" does not move ahead to match AB in the input string.
My question is why is that so?
i know replacing it with .*[A-Z]{2} will make it move ahead but then the string matched has entire string in it.
What is the solution in this case other than putting word part ([A-Z]{2}) in a separate group and then catching only that group.
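Lookaheads check for a pattern match immediately to the right of the current position in the string. (?=(.*5))(?=(.*8)) matches a location that is immediately followed with any 0 or more chars other than line break chars, as many as possible, and then 5; then, at the same position, another similar check is performed, but requiring 8 after any zero or more chars, as many as possible. Since lookaheads do not consume text, [A-Z]{2} must then match at that very same position. In 123456789AB, every position that passes both checks is followed by digits, and at the position right before AB there is no 5 or 8 left on the right, so the pattern never matches.
You may use as many lookbehinds as there are required substrings before the two letters: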
(?s)(?<=5.*?)(?<=8.*?)[A-Z]{2}
See the regex demo
Details
(?s) - makes the . match newline characters, too
(?<=5.*?) - a location that is immediately preceded with 5 and then 0 or more chars as few as possible
(?<=8.*?) - a location that is immediately preceded with 8 and then 0 or more chars as few as possible
[A-Z]{2} - two ASCII uppercase letters.
An alternative would be to "unfold" what you expect to match using exclusionary character classes and alternation of match order. Not pretty, but pretty fast:
(?<=\b[^58]*?(?:5[^8]*8|8[^5]*5)[^A-Z]*?)[A-Z]{2}
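The behaviour described above can be reproduced outside .NET. The sketch below uses Python, whose built-in re module does not accept variable-width lookbehinds, so the lookbehind patterns from the answers cannot be run there unchanged. The last pattern is not from the answers: it is a capturing-group fallback that assumes the input is a run of digits followed by exactly two uppercase letters.

```python
import re

text = "123456789AB"

# The lookahead attempt from the question: both checks and [A-Z]{2}
# have to succeed at the same position, which never happens here.
attempt = re.compile(r"(?=(.*5))(?=(.*8))[A-Z]{2}")
print(attempt.search(text))  # None

# Python's re raises an error for variable-width lookbehinds,
# so the .NET pattern is only compiled here to show that limitation.
try:
    re.compile(r"(?s)(?<=5.*?)(?<=8.*?)[A-Z]{2}")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Fallback (an assumption, not the answer's method): anchor the lookaheads
# at the start of the string and read the letters from Group 1.
workaround = re.compile(r"^(?=\d*5)(?=\d*8)\d+([A-Z]{2})$")
m = workaround.match(text)
print(m.group(1) if m else None)  # AB
```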

How do you specify multiples in negative character classes in regular expressions?

I am trying to write a regular expression to search for anything but digits or the * or - characters, with one caveat. Where I'm hitting a wall is that I need to be able to allow three or less digits to be found but not four or more, though even one * or - shouldn't be found.
This is what I have so far (for three matches):
.*?([^0-9\*-]+).*?([^0-9\*-]+).*?([^0-9\*-]+).*?
I have no idea where to insert {4,} for the digits (I've tried and it doesn't seem to work anywhere) or how to change it to do as I want.
For instance, in "Jack has* 777 1883874 -sheep-" I'd like it to return "Jack has 777 sheep". Or in "2343klj-3***.net" I'd like it to return "klj 3 .net"
You may use the following regex (replacing with a literal space, " "):
(?:[-*\s]|\d{4,})+
See the regex demo.
Details
(?:[-*\s]|\d{4,})+ - a non-capturing group matching one or more consecutive repetitions of
[-*\s] - a whitespace, - or *
| - or
\d{4,} - 4+ digits.
Next, to remove all leading and trailing whitespace you may use
^\s+|\s+$
and replace with an empty string. ^\s+ matches 1+ whitespaces at the start of the string and \s+$ matches 1+ whitespaces at the end of the string.
With the help here, this is what works. It may be impossible to do it all in one regex because of the conflict of needing no spaces at the beginning and end but spaces in between each remaining grouping.
First, a find and replace using ([-*\h]|\d{4,})+ and replacing with a space.
Second, using ^\s*(.*)\s*$.
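For reference, the two-step replacement can be sketched in Python as follows. The function name clean is hypothetical; the patterns are the ones from the first answer (collapse the unwanted runs into one space, then trim both ends):

```python
import re

# One or more consecutive whitespaces, "-", "*" or runs of 4+ digits.
NOISE = re.compile(r"(?:[-*\s]|\d{4,})+")
# Leading or trailing whitespace.
EDGES = re.compile(r"^\s+|\s+$")

def clean(text):
    """Replace each unwanted run with one space, then trim the ends."""
    collapsed = NOISE.sub(" ", text)
    return EDGES.sub("", collapsed)

print(clean("Jack has* 777 1883874 -sheep-"))  # Jack has 777 sheep
print(clean("2343klj-3***.net"))               # klj 3 .net
```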

How to get the first match in regexp?

I have three strings as list below:
Levofloxacin 500mg/100mL
Levofloxacin 500mg
Procaterol Hydrochloride …………… 25μg
From the first line, I want to get just 'mg', without 'mL', in my result.
From the second line, I want to get 'mg'.
From the third line, I want to get 'μg'.
I have tried a regexp pattern like:
(?!(.*[ ]{1}[0-9]+))[a-zA-Zμ]+
However, the first line always returns 'mg' with 'mL'...
How could I just acquire 'mg' with regexp?
Any suggestions will be appreciated.
As mentioned in the comment section, try this regex:
^\D*[\d.]+\K[a-zμ]+
Click for Demo
Explanation:
^ - asserts the start of the string
\D* - matches 0+ occurrences of any character that is not a digit
[\d.]+ - matches 1+ occurrences of any character that is a digit or a dot
\K - resets the match start, so the text matched so far is left out of the returned match
[a-zμ]+ - this is what you want. This will contain the units like mg, ml appearing after the first number. If there are any other special characters like μ, you can add them too in this character list
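\K is not available in every engine. In Python's re module, for instance, it is not supported, so a capturing group can stand in for it. The sketch below is an adaptation under that assumption, with a hypothetical helper name first_unit; the character class lists both the micro sign (U+00B5) and the Greek mu (U+03BC), since either may appear in such data:

```python
import re

# Same idea as the answer's pattern, with a capturing group instead of \K.
UNIT = re.compile(r"^\D*[\d.]+([a-z\u00b5\u03bc]+)")

def first_unit(line):
    """Return the unit that follows the first number in the line, or None."""
    m = UNIT.search(line)
    return m.group(1) if m else None

print(first_unit("Levofloxacin 500mg/100mL"))  # mg
print(first_unit("Levofloxacin 500mg"))        # mg
```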

Regex for finding words with no or only one word between them

I need to find into multiple strings two words with no words or only one word between them. I created the regex for the case to find if those two words exist in string:
^(?=[\s\S]*\bFirst\b)(?=[\s\S]*\bSecond\b)[\s\S]+
and it works correctly.
Then I tried to insert in this regex additional code:
^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+
but it didn't work. It selects text with two or more words between searched words. It is not what I need.
First Second - must be selected
First word1 Second - must be selected
First word1 word2 Second - must not be selected by the regex, but my regex selects it.
Can I get advice on how to solve this problem?
Root cause
You should bear in mind that lookarounds match strings without moving along the string, they "stand their ground". Once you write ^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b), the execution is as follows:
^ - the regex engine checks if the current position is the start of string
(?=[\s\S]*\bFirst\b) - the positive lookahead requires the presence of any 0+ chars followed with a whole word First - note that the regex index is still at the start of the string after the lookahead returns true or false
(\b\w+\b){0,1} - this subpattern is checked only if the above check was true (i.e. there is a whole word First somewhere) and matches (consumes, moves the regex index) 1 or 0 occurrences of a whole word (i.e. there must be 1 or more word chars right at the string start)
(?=[\s\S]*\bSecond\b) - another positive lookahead that makes sure there is a whole word Second somewhere after the first whole word consumed with \b\w+\b - if any. Even if the word Second is the first word in the string, this will return true since backtracking will step back the word matched with (\b\w+\b){0,1} (see, it is optional), and the Second will get asserted, and [\s\S]+ will grab the whole string (Group 1 will be empty). See the regex demo with Second word word2 First string.
So, your approach cannot guarantee the order of First and Second in the string, they are just required to be present but not necessarily in the order you expect.
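The walkthrough above can be checked with a short Python sketch. The variable names are illustrative; the pattern is the one from the question:

```python
import re

# The pattern from the question: two independent lookaheads with an
# optional word in between.
original = re.compile(
    r"^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+"
)

reversed_order = "Second word word2 First"
too_far_apart = "First word1 word2 Second"

# Both strings match, although neither should be selected.
print(original.match(reversed_order) is not None)  # True
print(original.match(too_far_apart) is not None)   # True
```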
Solution
If you need to check the order of First and Second in the string, you need to combine all the checks into one single lookahead. This approach might turn out very inefficient with longer strings and multiple alternatives in the lookaround, so consider either unrolling the patterns or trying multiple regex patterns (like this pseudo-code: if /\bFirst\b/.finds_match().index < /\bSecond\b/.finds_match().index => Good, go on...).
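The pseudo-code idea of comparing match positions could look roughly like this in Python. The function name first_before_second is hypothetical, and the check only covers the order of the first occurrences, not the number of words in between:

```python
import re

def first_before_second(text):
    """True if the first whole word First starts before the first whole word Second."""
    a = re.search(r"\bFirst\b", text)
    b = re.search(r"\bSecond\b", text)
    return a is not None and b is not None and a.start() < b.start()

print(first_before_second("First word1 Second"))       # True
print(first_before_second("Second word word2 First"))  # False
```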
If you plan to go on with the regex approach, you may match a string that contains First....Second only in this order:
^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+
See the regex demo
Details:
^ - start of string
(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b) - there must be:
[\s\S]* - any zero or more chars, as many as possible
\bFirst - whole word First
(?:\W+\w+)? - optional sequence (1 or 0 occurrences) of 1+ non-word chars and 1+ word chars
\W+ - 1+ non-word chars
Second\b - Second as a whole word
[\s\S]+ - any 1 or more characters (empty string won't match).
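A quick Python sketch of the single-lookahead pattern against the sample strings from the question. The helper name in_order_close is an assumption made for the example:

```python
import re

# First, then at most one word, then Second, all inside one lookahead.
PATTERN = re.compile(r"^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+")

def in_order_close(text):
    """True if First is followed by Second with zero or one word in between."""
    return PATTERN.match(text) is not None

print(in_order_close("First Second"))              # True
print(in_order_close("First word1 Second"))        # True
print(in_order_close("First word1 word2 Second"))  # False
```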