Conditional Regex not working as expected - regex

I'm trying to write a conditional Regex to achieve the following:
If the word "apple" or "orange" is present within a string:
there must be at least 2 occurrences of the word "HORSE" (upper-case)
else
there must be at least 1 occurrence of the word "HORSE" (upper-case)
What I wrote so far:
(?(?=((apple|orange).*))(HORSE.*){2}|(HORSE.*){1})
I was expecting this Regex to work as I'm following the pattern (?(?=regex)then|else).
However, it looks like (HORSE.*){1} is always evaluated instead. Why?
https://regex101.com/r/V5s8hV/1

The conditional is nice for checking a condition in one place and using the outcome in another.
^(?=(?:.*?\b(apple|orange)\b)?)(.*?\bHORSE\b)(?(1)(?2))
The condition is group one, placed inside an optional non-capturing group (?:...)? within the lookahead.
The second group matches the part up to and including HORSE, which is always required.
(?(1)(?2)) is the conditional: if the first group matched, require the pattern of group two once more.
See this demo at regex101 (more explanation on the right side)
The way you planned it works as well, but it needs refactoring, e.g. as in that regex101 demo.
^(?(?=.*?\b(?:apple|orange)\b)(?:.*?\bHORSE\b){2}|.*?\bHORSE\b)
Or another way, without a conditional and with a negative lookahead, like this demo at regex101.
^(?:(?!.*?\b(?:apple|orange)\b).*?\bHORSE\b|(?:.*?\bHORSE\b){2})
FYI: To get the full string in the output, just append .* at the end. Also note that {1} is redundant. I used a lazy quantifier (as few as possible) in the dot parts of all variants to improve efficiency.
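For anyone trying this from Python: as far as I know, the standard re module accepts neither a lookahead as the test of a conditional nor (?2) subroutine calls, so the negative-lookahead variant above is the one that ports directly. A minimal sketch, where is_valid is just a name made up for illustration:

```python
import re

# Negative-lookahead variant from the answer, unchanged.
PATTERN = re.compile(
    r"^(?:(?!.*?\b(?:apple|orange)\b).*?\bHORSE\b|(?:.*?\bHORSE\b){2})"
)

def is_valid(text: str) -> bool:
    """True if text contains enough HORSE occurrences for the fruit rule."""
    return PATTERN.search(text) is not None

for sample in ("apple HORSE HORSE", "apple and one HORSE", "just one HORSE"):
    print(sample, "->", is_valid(sample))
```

The first branch only applies when no fruit word is present, so a string with apple or orange can only succeed through the {2} branch.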

I would keep it simple and use lookaheads to assert the number of occurrences of the word HORSE:
^((?=.*\bHORSE\b.*\bHORSE\b).*\b(?:apple|orange)\b.*|(?=.*\bHORSE\b)(?!.*\b(?:apple|orange)\b).*)$
Demo
Explanation:
^ from the start of the string
( match either of
(?=.*\bHORSE\b.*\bHORSE\b) assert that HORSE appears at least twice
.* match any content
\b(?:apple|orange)\b match apple or orange
.* match any content
| OR
(?=.*\bHORSE\b) assert that HORSE appears at least once
(?!.*\b(?:apple|orange)\b) but apple and orange do not occur
.* match any content
) close alternation
$ end of the string
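One way to gain confidence in the lookahead version is to compare it against a plain restatement of the rule. The sketch below does that in Python; the reference function and the sample strings are my own assumptions, not part of the answer:

```python
import re

# The lookahead-based pattern from the answer, split over two lines.
LOOKAHEAD = re.compile(
    r"^((?=.*\bHORSE\b.*\bHORSE\b).*\b(?:apple|orange)\b.*"
    r"|(?=.*\bHORSE\b)(?!.*\b(?:apple|orange)\b).*)$"
)

def reference(text: str) -> bool:
    """Plain restatement of the rule, used only to check the regex."""
    horses = len(re.findall(r"\bHORSE\b", text))
    has_fruit = re.search(r"\b(?:apple|orange)\b", text) is not None
    return horses >= (2 if has_fruit else 1)

cases = [
    "HORSE apple HORSE",
    "apple HORSE",
    "HORSE",
    "orange",
    "HORSE HORSE",
    "pineapple HORSE",  # "apple" inside "pineapple" is not a whole word
]
results = {case: LOOKAHEAD.match(case) is not None for case in cases}
for case in cases:
    print(case, "->", results[case], "(reference:", reference(case), ")")
```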

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract the occurrences of match-this in a string, except those that are enclosed by a tilde (~) on both sides.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the enclosed ~match-this~ occurrences, but when I tried negating it with (?!(~match-this)), it only produced empty matches.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk wildcard to either negated tilde, as in [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char that does not start match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # that does not start match-this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
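A small Python sketch of the "discard what is not captured" idea, using the string from the question. It records spans instead of text so you can see which occurrences survive:

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep only matches where group 1 took part; the rest are the
# tilde-enclosed occurrences consumed by the first alternative.
kept = [m.span(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)
```

The occurrences starting at offsets 11 and 57 are swallowed by the ~match-this~ alternative and never reach group 1.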
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts here or there must be a character other than ~. We could do the same for the end of the text, but this is a dead end anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (for example in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What puzzles me: if the leading and/or trailing character of a match matters to you, just extract it from the input while processing the matches, since each match comes with a start position and a length: a first match at position 0 with length 10 means you look at input positions -1 and 10.
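Here is a rough Python sketch of both points: the three-way lookaround pattern, and fetching the neighbouring characters from the input afterwards. The helper name neighbours is made up for illustration. In Python a negative index wraps around to the end of the string, so the edge positions have to be guarded explicitly:

```python
import re

text = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

def neighbours(source, match):
    """Return the characters just before and after a match (None at the edges)."""
    before = source[match.start() - 1] if match.start() > 0 else None
    after = source[match.end()] if match.end() < len(source) else None
    return before, after

found = [(m.start(), neighbours(text, m)) for m in pattern.finditer(text)]
for start, (before, after) in found:
    print(start, repr(before), repr(after))
```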

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 ok using (?<=.art\:\s)([0-9A-Z]*). This gave me a one group match allowing me to extract it as a column value.
I want to extract 6726376 the same way, with only one group appearing, because I am parsing that into a column value.
Not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) is giving me the entire line after S.
Help getting it right, with an explanation, would go a long way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we use a positive lookbehind that makes sure we're after "art: ", then we match two numbers, separated by non-digits.
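To make the "doesn't move the head along" point concrete, here is a small Python sketch. Both patterns are simplified from the ones discussed (no inner capture group in the lookaround), so treat them as illustrations rather than the exact originals:

```python
import re

line = "Start: 738392E, 6726376N"

# A lookahead only asserts; the match still begins where the assertion
# was tested, so .* starts at "art:" and runs to the end of the line.
lookahead = re.search(r"(?=art:\s)(.*)", line)
print(lookahead.group(1))

# A lookbehind asserts what precedes, so matching starts after "art: ".
lookbehind = re.search(r"(?<=art:\s)(\d+)\D+(\d+)", line)
print(lookbehind.groups())
```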
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
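Both suggestions work as-is in Python. The last pattern in this sketch is my own assumption about what "only one group" might look like for the northing, not something from the answer:

```python
import re

line = "Start: 738392E, 6726376N"

# Both values in one match, as two groups.
easting, northing = re.search(r"Start: (\d+)E, (\d+)N", line).groups()

# Each value as a separate match, with no groups at all.
separate = re.findall(r"\b\d+(?=[EN]\b)", line)

# Hypothetical single-group variant that yields only the northing.
northing_only = re.search(r"(\d+)N\b", line).group(1)

print(easting, northing, separate, northing_only)
```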
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that the [0-9]* at the end of the pattern never consumes anything: it is optional, and the preceding .* has already matched up to the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPI regex module, which supports variable-length lookbehind:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

Regex don't match over sentence breaks

I want to match certain words in the context of other words, like if I wanted to try and capture a filling when we're talking about sandwiches I could do:
(?:sandwich|toastie).{0,100}(ham|cheese|pickle)
Which would match something like Andy sat down to enjoy his sandwich which, unusually for him, was filled with delicious ham.
However, this would also capture across "context breaks" such as end-of-sentence punctuation or line breaks, e.g. Victorians enjoyed a good sandwich after work. They also enjoyed cheese rolling. In this context I'd want to reject the match, as it crosses a sentence boundary.
So I tried to do (?:sandwich|toastie)(?:\w\. ){0}.{0,100}(ham|cheese|pickle) but that doesn't work. What I'm imagining is something like [^\w\. ] but that isn't right either
To reject the sample string the way you intend, you need a tempered greedy token instead of what you wrote, so your regex becomes:
(?:sandwich|toastie)(?:(?!\w\. ).){0,100}(ham|cheese|pickle)
Regex Demo
Basically, since you want the match to fail on the (?:\w\. ) pattern, write (?:(?!\w\. ).) instead of a plain dot. The dot can then no longer step over a sentence break, so the words in the two groups will not be matched across two different sentences.
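A quick Python check of the tempered greedy token against the two sentences from the question (variable names are just for illustration):

```python
import re

pattern = re.compile(
    r"(?:sandwich|toastie)(?:(?!\w\. ).){0,100}(ham|cheese|pickle)"
)

same_sentence = (
    "Andy sat down to enjoy his sandwich which, unusually for him, "
    "was filled with delicious ham"
)
across_sentences = (
    "Victorians enjoyed a good sandwich after work. "
    "They also enjoyed cheese rolling."
)

hit = pattern.search(same_sentence)      # filling found in the same sentence
miss = pattern.search(across_sentences)  # blocked at "k. " in "work. They"
print(hit.group(1) if hit else None, miss)
```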
You could make use of a tempered greedy token with a negative lookahead to assert that what is on the right is not any of the listed words, a dot followed by a space, or for example a newline:
(?:sandwich|toastie)(?:(?!(?:ham|cheese|pickle|\w\. +|(?:\r?\n|\r))).){1,100}(?:ham|cheese|pickle)
Explanation
(?:sandwich|toastie) Match one of the options
(?: Non capturing group
(?! Negative lookahead to prevent over matching, assert what follows is not
(?:ham|cheese|pickle|\w\. +|(?:\r?\n|\r)) Match any of the options: one of the words, a word character followed by a dot and one or more spaces, or a newline
). Close negative lookahead and match any character
){1,100} Close non capturing group and repeat 1 - 100 times
(?:ham|cheese|pickle) match one of the options
Regex demo
You might consider using word boundaries \b for \b(?:sandwich|toastie)\b and \b(?:ham|cheese|pickle)\b to prevent the words being part of a larger word.
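To see what the word boundaries buy you, here is a sketch comparing a version without them to one with them. Both patterns are simplified from the one above (the filling words are left out of the lookahead), so they are assumptions for illustration only:

```python
import re

# Tempered token that stops at a sentence break or a newline.
LOOSE = re.compile(
    r"(?:sandwich|toastie)(?:(?!\w\. +|\r?\n|\r).){1,100}(ham|cheese|pickle)"
)
# Same idea, but both word lists must match whole words.
STRICT = re.compile(
    r"\b(?:sandwich|toastie)\b(?:(?!\w\. +|\r?\n|\r).){1,100}"
    r"\b(ham|cheese|pickle)\b"
)

text = "She shared her sandwich with the hamster"
loose_hit = LOOSE.search(text)    # picks up "ham" inside "hamster"
strict_hit = STRICT.search(text)  # rejects it, "ham" is not a whole word
print(loose_hit.group(1) if loose_hit else None, strict_hit)
```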

Regex: don't match if the pattern start with /

My regex (PCRE):
\b([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
is a match (the actual match is surrounded by stars) :
**this.is.an.error**
**this.IsAnerror**
**this.is.an.error**.
**this.is.an.error**(
bla **this_is-an-error**
**this.is.an.error**:
this is an (**error**)
not a match:
this.is.an.error.but.dont.match
this.is.an.error-but.dont.match
this.is.an.error/but.dont.match
this.is.an.error/
/this.is.an.error
for this sample: /this.is.an.error
I can't manage to write a condition that rejects the whole match if it starts with the character /.
Every combination I've tried resulted in a partial match (which is not what I want).
Is there any simple or fancy way to do that?
You can try to add lookbehinds at the beginning instead of a word boundary:
(?<!\/)(?<=[^\w-.])([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
Explanation:
(?<!\/) - negative lookbehind assuring there is no / before the first character;
(?<=[^\w-.]) - word boundary implementation taking into account your extended definition of characters accepted for a word [\w-.];
Demo
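If you need this in Python rather than PCRE, two things are worth knowing. As far as I can tell, re rejects [\w-.] as a bad character range, so the class has to be written [\w.-]. Also, a positive lookbehind like (?<=[^\w-.]) needs a character to exist before the match, so it cannot succeed at the very start of the input. The sketch below swaps the two lookbehinds for a single negative lookbehind. That is my own variation, not the answer's pattern, and extract is a made-up helper name:

```python
import re

# Sketch only: one negative lookbehind replaces the two lookbehinds,
# and [\w-.] is rewritten as [\w.-] so Python's re accepts it.
PATTERN = re.compile(r"(?<![\w./-])([\w.-]*error)\b(?:[^-/.]|\.\W|\.$|$)")

def extract(line):
    """Return the captured error token, or None if the line is rejected."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

for line in ("this.is.an.error", "this is an (error)", "/this.is.an.error"):
    print(line, "->", extract(line))
```

Because / is part of the negative lookbehind class, a token that directly follows a slash can never start a match, and every later start position inside the token is blocked by the preceding word character, dot or hyphen.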
Prepend your regex with \/.*|:
\/.*|\b([\w-.]*error)\b(?=[^-\/.]|(?:\.\W?)?$)
Now just like before the first capturing group holds the desired part.
See live demo here
Note: I made some modifications to your regex to remove unnecessary alternations.

Extracting part of a string using regex

I am trying to extract part of the strings below.
I tried (.*)(?:table)?, but it fails in the last case. How can I make the expression capture the entire string when the text "table" is absent?
Text: "diningtable" Expected match: "dining"
Text: "cookingtable" Expected match: "cooking"
Text: "cooking" Expected match: "cooking"
Text: "table" Expected match: ""
Rather than try to match everything but table, you should do a replacement operation that removes the text table.
Depending on the language, this might not even need regex. For example, in Java you could use:
String output = input.replace("table", "");
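The same idea in Python, as a minimal sketch (strip_table is a made-up name). Note that, like the Java call, str.replace removes every occurrence of the literal, not only a trailing one:

```python
def strip_table(text: str) -> str:
    """Remove every literal occurrence of 'table' from text."""
    return text.replace("table", "")

for word in ("diningtable", "cookingtable", "cooking", "table"):
    print(repr(word), "->", repr(strip_table(word)))
```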
If you want to use regex, you can use this one:
(^.*)(?=table)|(?!.*table.*)(^.+)
See demo here: regex101
The idea is: match everything from the beginning of the line ^ up to the word table; if table does not occur in the string, match at least one character (to avoid matching empty lines). Thus, when the string is just table, it returns an empty string (because it matches from the beginning of the line up to the word table).
The (.*)(?:table)? fails with table (it matches the whole word) because the first group (.*) is a greedy dot pattern that grabs the whole string into Group 1. The optional non-capturing group then matches an empty string at the end of the string, so the engine never needs to backtrack and give table up.
The regex trick is to match any text that does not start with table before the optional group:
^((?:(?!table).)+)(?:table)?$
See the regex demo
Now, Group 1 - ((?:(?!table).)+) - contains a tempered greedy token (?:(?!table).)+ that matches 1 or more chars other than a newline that do not start a table sequence. Thus, the first group will never match table.
The anchors make the regex match the whole line.
NOTE: Non-regex solutions might turn out more efficient though, as a tempered greedy token is rather resource consuming.
NOTE2: Unrolling the tempered greedy token usually improves performance considerably:
^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$
See another demo
But usually it looks "cryptic", "unreadable", and "unmaintainable".
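Both patterns run unchanged in Python. A short sketch comparing them on the four inputs from the question (group1 is a made-up helper name). One difference shows up on the bare word table: the tempered version requires at least one character in Group 1 and so does not match at all, while the unrolled version matches with an empty Group 1:

```python
import re

TEMPERED = re.compile(r"^((?:(?!table).)+)(?:table)?$")
UNROLLED = re.compile(r"^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$")

def group1(pattern, text):
    """Return Group 1 of a match at the start of text, or None if no match."""
    m = pattern.match(text)
    return m.group(1) if m else None

for text in ("diningtable", "cookingtable", "cooking", "table"):
    print(repr(text), repr(group1(TEMPERED, text)), repr(group1(UNROLLED, text)))
```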
Despite other great answers, you could also use alternation:
^(?|(.*)table$|(.*))$
This makes use of a branch reset, so your desired content is always stored in group 1. If your language/tool of choice doesn't support it, you would have to check which of groups 1 and 2 contains the string.
See Demo
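Python's standard re module is one of the tools without branch reset, as far as I know, so there you would fall back to a plain alternation and pick whichever group took part in the match. A minimal sketch, with before_table as a made-up name:

```python
import re

# Plain alternation instead of branch reset: group 1 or group 2 is filled.
PATTERN = re.compile(r"^(?:(.*)table|(.*))$")

def before_table(text):
    """Return the text before a trailing 'table', or the whole text."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Group 1 may legitimately be an empty string, so compare against None.
    return m.group(1) if m.group(1) is not None else m.group(2)

for word in ("diningtable", "cookingtable", "cooking", "table"):
    print(repr(word), "->", repr(before_table(word)))
```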