Match a pattern after a pattern? - regex

Can you match a pattern only where it occurs after another pattern? For instance, in:
ssasabafra
match all the a's after the b? I've tried using a lookbehind like so:
(?<=b)[a]+
But it only matches the first a. Is there a way to match all occurrences after b?

If you are using an expression engine that allows repetition in lookbehind expressions, how about:
(?<=b.*?)a
This looks behind for a b followed by any number of characters, and matches a.
For most regex engines, however, I don't think this is possible. What you can do instead is split the string on b, match the second part with /a/, then join the two parts again with b.
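The split-and-match workaround could be sketched in Python roughly as follows. The function name is hypothetical, and the sketch assumes that only the first b in the string matters.

```python
import re

def a_positions_after_first_b(text):
    """Return the indices of every 'a' that occurs after the first 'b'."""
    head, sep, tail = text.partition("b")   # split once, on the first b
    if not sep:                             # no b at all: nothing qualifies
        return []
    offset = len(head) + len(sep)           # index where the tail starts
    return [offset + m.start() for m in re.finditer("a", tail)]

print(a_positions_after_first_b("ssasabafra"))
```

Because the split happens before the regex runs, no lookbehind is needed at all.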

How about this:
a(?=[^b]*$)
However, this doesn't ensure that there is a b before the a. I guess you want to match every a that is not followed by a substring containing b.
See demo on RegexPal
If you want to make sure that there is a b somewhere before the a, you should probably use the string-manipulation functions of your particular programming language.
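As a rough illustration in Python, the lookahead pattern can be combined with a plain substring check to cover the missing "there must be a b" condition. The helper name is hypothetical. Note that the pattern selects every a with no b after it, i.e. the a's after the last b, which is the same thing only when the string contains a single b.

```python
import re

# An 'a' followed only by non-b characters up to the end of the string.
AFTER_LAST_B = re.compile(r"a(?=[^b]*$)")

def a_after_last_b(text):
    """Indices of each 'a' that has no 'b' after it, provided a 'b' exists."""
    if "b" not in text:        # the regex alone cannot enforce this
        return []
    return [m.start() for m in AFTER_LAST_B.finditer(text)]

print(a_after_last_b("ssasabafra"))
# Without the guard, the bare pattern also matches when there is no b:
print([m.start() for m in AFTER_LAST_B.finditer("aaa")])
```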

Related

find a regular expression where a is never immediately followed by b (Theory of formal languages)

I need to find a simplified regular expression for the language of all strings
of a's, b's, and c's where a is never immediately followed by b.
I tried something and got as far as (a+c)*c(b+c)* + (b+c)*(a+c)*
Is this fine and if so can this be simplified?
Thanks in advance.
You are looking for a negative lookbehind:
(?<!a)b
This will find all the b instances that do not immediately follow an a.
Or a negative lookahead:
a(?!b)
This will find all the a instances that are not immediately followed by b.
Here is a regex101 example for the lookbehind:
https://regex101.com/r/RsqXbW/1
Here is a regex101 example for the lookahead:
https://regex101.com/r/qiDIZU/1
Your solution contains only strings from the desired language. However, it does not contain all of them. For example, acbac is not contained. Your basic idea is fine, but you need to be able to iterate the possible factors. In:
(b+c)*(a (a)*(c(b+c)*)*)*
the first part generates all strings without a.
After the first a there comes either nothing, another a, or a c. Another a leaves us with the same three options. A c basically starts the game again. This is what the part after the first a formalizes. The many * are needed to possibly generate the empty string in all of the different options.
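One way to gain some confidence in an expression like this is a brute-force comparison against the plain definition of the language. The sketch below translates both expressions into Python `re` syntax (the formal + becomes a character class or |) and checks every string over {a, b, c} up to a small length; the variable names and the length bound are assumptions of this example, and a finite check is evidence, not a proof.

```python
import re
from itertools import product

# (b+c)*(a (a)*(c(b+c)*)*)* in Python syntax
answer = re.compile(r"[bc]*(?:aa*(?:c[bc]*)*)*")
# (a+c)*c(b+c)* + (b+c)*(a+c)* in Python syntax
attempt = re.compile(r"[ac]*c[bc]*|[bc]*[ac]*")

def strings_over(alphabet, max_len):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# Strings on which the expression and the definition disagree.
mismatches = [
    s for s in strings_over("abc", 6)
    if (answer.fullmatch(s) is not None) != ("ab" not in s)
]
print(mismatches)                 # an empty list means no counterexample found
print(attempt.fullmatch("acbac")) # None: the original attempt misses this string
```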

Regex (.*) without matching the second case

Given the following sample input text:
{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world.
How can I create a regex pattern that matches only the first instance of {{ ... }} (i.e. only {{A1|def|ghi|jkl}})? A1 and A2 are fixed inputs and def, ghi, jkl, and mno could be anything.
I've tried this:
\{\{A1\|(.*)\|(.*)\|(.*)\}\}
But that returns everything ({{A1|def|ghi|jkl}}hello world. {{A2|mno}}).
Note that def or ghi or jkl or mno could be numbers, English letters or other languages (e.g. Chinese/Japanese/Korean).
It's a little unclear what you are trying to accomplish. At first, I thought that your problem was just that you were getting the entire thing when all you really wanted was the A1 or A2 part. If so, here's the answer:
Since you didn't specify which flavor of regex you are using, it's hard to say for sure. If you are using a version which supports look-arounds, you could do something like this:
(?<={{)\w+(?=(\|[^|}]*)+}})
Here's the meaning of the pattern:
(?<={{) - This is a positive look-behind expression which means that it asserts that any match must be preceded by certain characters. In this case, the characters are {{.
\w+ - This is the actual part that we are matching. In this case, it's one or more word characters. \w is a special character class. This varies, though, depending on which regex engine you are using. Something like [A-Z][0-9] may be more appropriate, depending on your needs.
(?=(\|[^|}]*)+}}) - This is a positive look-ahead expression. That means that it asserts that any match must be followed by some particular pattern of characters. In this case, it's looking for matches to be followed by (\|[^|}]*)+}}.
However, if look-arounds are not possible, then you can match it with a capturing group, like this:
{{(\w+)(\|[^|}]*)+}}
If you do it that way, you'll need to read the value of the first group for each match.
As far as only finding the first match goes, that really depends on which tool or language you are using. Most regex engines only find the first match by default and only find additional matches when a global modifier is specified (often /g at the end).
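For illustration, here is how both variants might look in Python, which was not specified in the question and is only an assumption here. The braces are escaped and the repeated group is made non-capturing; those are choices of this sketch. In Python, `search` returns only the first match, while `finditer` plays the role of the global modifier.

```python
import re

text = "{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world."

# Look-around version: the match itself is just the name.
lookaround = re.compile(r"(?<=\{\{)\w+(?=(?:\|[^|}]*)+\}\})")
# Capturing-group version: the name is in group 1.
grouped = re.compile(r"\{\{(\w+)(?:\|[^|}]*)+\}\}")

first_name = lookaround.search(text).group()                # first match only
all_names = [m.group(1) for m in grouped.finditer(text)]    # every match

print(first_name)
print(all_names)
```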
However, now that you have edited your question and I understand better what you meant, I think your real problem is greediness. Repetitions such as * are greedy by default in regex: they match as much text as they possibly can while still allowing the overall pattern to match. In this case, you don't want the longest possible match; you want the shortest. You can get that by making the repetitions lazy (i.e. non-greedy): simply add a ? after the *. For instance:
\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}
However, that's not very efficient. If this pattern is going to be used often or on large inputs it would be better to use a more restrictive character class, such as [^}|] instead of ., so that the lazy modifier is unnecessary. For example:
\{\{A1\|([^}|]*)\|([^}|]*)\|([^}|]*)\}\}
Or, more simply:
{{A1(\|([^}|]*)){3}}}
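The difference between the three approaches can be seen side by side in a short Python sketch (Python is an assumption here; the patterns are the ones discussed above, applied to the sample input from the question).

```python
import re

text = "{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world."

greedy = re.search(r"\{\{A1\|(.*)\|(.*)\|(.*)\}\}", text)
lazy = re.search(r"\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}", text)
negated = re.search(r"\{\{A1\|([^}|]*)\|([^}|]*)\|([^}|]*)\}\}", text)

print(greedy.group())    # runs on to the closing braces of the A2 block
print(lazy.groups())     # the three fields of the A1 block
print(negated.groups())  # same fields, without relying on lazy matching
```

The negated character class cannot cross a | or } at all, so it produces the short match without having to try and extend a lazy match one character at a time.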
The problem with your pattern is simply that you've made all of the * quantifiers greedy. They're matching as much of the string as they can (while still allowing the whole pattern to match). Just make them non-greedy *?:
\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}
https://regex101.com/r/pK4gE7/1

Regex Expressions

I am trying to learn to handle regular expressions and got some exercises, but no solutions to them. One question is: all lower-case words except 'if'.
Can I do this one like this:
[a-z][a-z]^[if] | [a-z][a-z][a-z]+
I expect that a word has at least two characters, so every word with three or more is okay.
Well... the full real solution would be something like that:
\b(?!if\b)\p{Ll}+\b
Demo
But I suppose it's, well, "higher level" regex that you didn't learn yet.
So, let's keep things simple. If you can accept ignoring words of fewer than 3 characters, you can write this:
\b(?:[a-hj-z][a-z]{2,}|i[a-z]{2,})\b
Demo
The first character class is just [a-z] without i; words starting with i are handled by the second branch. Since "if" has only two letters, this is equivalent to \b[a-z]{3,}\b.
If you want to include words of fewer than 3 characters, this will do:
\b(?:i|if[a-z]+|i[a-eg-z][a-z]*|[a-hj-z][a-z]*)\b
Demo
But it gets complicated at this point...
All sequences of two or more lower-case letters, except "if":
[a-hj-z][a-z]+|i(?:[a-eg-z][a-z]*|f[a-z]+)
With negative look-ahead, you can also do:
(?!if\b)[a-z]{2,}
A simple solution would be to place what you want to ignore on the left side of the alternation operator and place what you want to match in a capturing group on the right side of the alternation operator as you were attempting.
\bif\b|([a-z]{2,})
Note: The caret ^ outside of a character class does not mean negation; it asserts the position at the start of the string. And unless you are using the x (free-spacing) modifier, you need to remove the spaces around the alternation operator.
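The approaches above can be compared on a small sample in Python. This is only a sketch: the sample sentence and variable names are made up, [a-z] stands in for \p{Ll} because Python's built-in `re` module does not support \p{...} classes, and \b anchors are added to the lookahead version so that it matches whole words.

```python
import re

text = "if it is iffy, gif or elif"

# Negative lookahead: any lower-case word that is not exactly "if".
with_lookahead = re.compile(r"\b(?!if\b)[a-z]+\b")
# Lookaround-free alternation that spells out the allowed shapes.
without_lookaround = re.compile(
    r"\b(?:i|if[a-z]+|i[a-eg-z][a-z]*|[a-hj-z][a-z]*)\b"
)
# "Discard on the left, capture on the right" trick.
discard_trick = re.compile(r"\bif\b|([a-z]{2,})")

words_a = with_lookahead.findall(text)
words_b = without_lookaround.findall(text)
# Keep only matches where the capturing group took part.
words_c = [m.group(1) for m in discard_trick.finditer(text) if m.group(1)]

print(words_a)
print(words_b)
print(words_c)
```

With the alternation trick, the caller has to skip matches whose group is empty; that filtering step is the price of avoiding lookarounds.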

Regex: How to optionally match something at beginning or end, but not both?

I have a situation where the regular expression is something like this:
^b?A+b?$
So b may match at the start of the string 0 or 1 times, and A must match one or more times. Again b may match at the end of the string 0 or 1 times.
Now I want to modify this regular expression in such a way that it may match b either at the start or at the end of the string, but not both.
How do I do this?
There's a nice "or" operator in regexes that you can use:
^(b?A+|A+b?)$
Try this:
^(bA+|A+b?)$
This allows a b at the start and then at least one A, or some As at the start and optionally a b at the end. This covers all the possibilities and is slightly faster than the accepted answer in the case that it doesn't match as only one of the two options needs to be tested (assuming A cannot begin with b).
Just to be different from the other answers here, if the expression A is quite complex but b is simple, then you might want to do it using a negative lookahead to avoid repeating the entire expression for A in your regular expression:
^(b(?!.*b$))?A+b?$
The second might be more readable if your A is complex, but if performance is an issue I'd recommend the first method.
^(b+A+b?|b?A+b+)$
Why doesn't that work?
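A quick way to compare these candidates is to run them over a handful of strings. The Python sketch below takes A and b to be the literal characters "A" and "b", which is an assumption made purely for illustration; the sample list and dictionary keys are also made up.

```python
import re

candidates = {
    "alternation": re.compile(r"^(b?A+|A+b?)$"),
    "shorter alternation": re.compile(r"^(bA+|A+b?)$"),
    "negative lookahead": re.compile(r"^(b(?!.*b$))?A+b?$"),
    "last suggestion": re.compile(r"^(b+A+b?|b?A+b+)$"),
}

samples = ["A", "AA", "bA", "Ab", "bAb", "bAAb", "b", ""]

# For each pattern, list the samples it accepts.
accepted = {
    name: [s for s in samples if pattern.search(s)]
    for name, pattern in candidates.items()
}

for name, strings in accepted.items():
    print(name, strings)
```

The last pattern accepts bAb because its first branch, b+A+b?, still allows an optional b at the end after a b at the start. It also rejects a plain run of A's, since each branch requires at least one b.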

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
but this doesn't seem to work. If I have the regex
/[a|b]./
the string "abc" returns false with both my regex and the converse suggested by zijab:
/^((?!^[a|b].).)*$/
Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression)) {
    // do something...
}
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
      ^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
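Python's `re` module exposes the same distinction through two separate functions, which makes for a compact illustration (Python is used here only as an example; it is not one of the languages mentioned above).

```python
import re

pattern = re.compile(r"[ab].")

# search() looks for the pattern anywhere in the string, like Perl's =~.
found_anywhere = pattern.search("abc") is not None
# fullmatch() requires the whole string to match, like Java's matches().
whole_string = pattern.fullmatch("abc") is not None

print(found_anywhere, whole_string)
```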
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
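As a sanity check, the inverted pattern can be compared against a plain "no match found" test. The Python sketch below does that for a few sample strings and also shows how the extra caret from the blog entry breaks the inversion. The sample strings are made up, and single-line input is assumed, since the dot does not match a newline by default.

```python
import re

original = re.compile(r"[ab].")
inverse = re.compile(r"^(?:(?![ab].).)*$")
blog_version = re.compile(r"^((?!^[ab].).)*$")   # stray ^ inside the lookahead

samples = ["abc", "xyz", "xa", "xab", ""]

for s in samples:
    print(
        repr(s),
        "original:", original.search(s) is not None,
        "inverse:", inverse.search(s) is not None,
        "blog:", blog_version.search(s) is not None,
    )
```

For "xab" the blog version reports a match even though the string contains "ab": the inner ^ can only succeed at position 0, so the lookahead never rejects anything later in the string.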
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
Say you want to match the literal foo. An expression that matches everything except strings containing foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together, something like this should work:
^[^f]*(f+([^fo][^f]*|o([^fo][^f]*)?)?)*$
But note: this applies only to foo.
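Hand-built complements like this are easy to get subtly wrong, so it is worth comparing a candidate against a plain substring test. The Python sketch below checks one lookaround-free formulation on every string over the letters f, o and x up to a small length. The pattern is an assumption of this example, written by following a three-state automaton (nothing seen, "f" seen, "fo" seen), and the exhaustive check is evidence for short strings, not a proof.

```python
import re
from itertools import product

# After a run of f's: either stop, leave via a non-f/non-o character,
# or read an o that must not be followed by a second o.
NO_FOO = re.compile(r"[^f]*(?:f+(?:[^fo][^f]*|o(?:[^fo][^f]*)?)?)*")

def strings_over(alphabet, max_len):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# Strings where the regex and the substring test disagree.
disagreements = [
    s for s in strings_over("fox", 7)
    if (NO_FOO.fullmatch(s) is not None) != ("foo" not in s)
]
print(disagreements)   # an empty list means no counterexample was found
```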
You can also do this (in Python) by using re.split and splitting on your regular expression, which returns all the parts that don't match the regex. See "how to find the converse of a regex".
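A minimal sketch of the re.split idea, using the [ab]. pattern from the question and a made-up sample string:

```python
import re

pattern = re.compile(r"[ab].")

# Everything between the matches is returned; the matches are dropped.
parts = pattern.split("xxabyyb!zz")
print(parts)
```

If the pattern contains capturing groups, re.split also includes the captured text in the result, so non-capturing groups are the safer choice for this use.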
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java regexes have an interesting way of doing this (can test here): you create a possessive optional match (?+) for the string you want to reject, and then match data after it. If the optional match fails, it doesn't matter, because it is optional; if it succeeds, the pattern needs some extra data to match the second expression, and so it fails when nothing is left.
It looks counter-intuitive, but works.
E.g. (foo)?+.+ matches bar, foox and xfoo, but won't match foo (or an empty string).
It might be possible in other dialects, but I couldn't get it to work myself (they seem more willing to backtrack if the second match fails).
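In dialects without possessive quantifiers, a common workaround is to emulate the "no backtracking" behaviour with a lookahead plus a backreference, since a lookahead is not re-entered once it has succeeded. The Python sketch below uses that emulation instead of the Java syntax; the pattern and names are assumptions of this example, and fullmatch is used so that the whole string has to match.

```python
import re

# (?=((?:foo)?)) captures an optional leading "foo" without consuming it;
# \1 then consumes exactly what was captured, and .+ needs at least one
# more character. The engine cannot go back and un-capture the "foo".
NOT_JUST_FOO = re.compile(r"(?=((?:foo)?))\1.+")

samples = ["bar", "foox", "xfoo", "foo", ""]
results = {s: NOT_JUST_FOO.fullmatch(s) is not None for s in samples}
print(results)
```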