Regex Not Capturing input - regex

I need to match over the alphabet {a,b} (meaning that we can discard any other letter since only a and b will exist):
All strings containing 2 or more a's.
All strings that do not contain the substring bbb.
Why is this RegEx:
((b{0,2}aaa*)+)|((aaa*b{0,2})+)
Not capturing aab?

Because aa got captured by your first pattern. To get the desired output, you need to change the pattern order.
((aaa*b{0,2})+)|((b{0,2}aaa*)+)
Note that the regex engine always tries the alternative on the left first, and only moves on to the alternatives to its right if that one fails at the current position. So the order of evaluation is:
1st|2nd|3rd
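The effect of alternation order can be checked quickly. This is only a sketch in Python's re module (an assumption, since the question does not name an engine); the helper name first_match is made up for illustration:

```python
import re

# The two alternations from the question and the answer, unchanged.
original = r'((b{0,2}aaa*)+)|((aaa*b{0,2})+)'
swapped = r'((aaa*b{0,2})+)|((b{0,2}aaa*)+)'

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Left branch succeeds on "aa", so the right branch is never tried.
original_result = first_match(original, "aab")
# With the branches swapped, the left branch can also take the trailing b.
swapped_result = first_match(swapped, "aab")

print(original_result, swapped_result)
```

With the original order the engine stops at aa; with the swapped order the whole of aab is matched.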
Update:
^(?!.*?bbb).*a.*a.*

Your requirements:
All strings containing 2 or more a's.
All strings that do not contain the substring bbb.
seem to argue for a simpler, lookahead-based approach instead of the trickier consuming pattern (depending on your exact workflow):
(?=a.*a)(?!.*bbb).*
edit: to exclude all letters except a and b:
^(?=.*a.*a)(?!.*bbb)[ab]+$
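A small check of the anchored lookahead pattern, again sketched with Python's re module (an assumption; the sample strings and the helper name accepted are made up for illustration):

```python
import re

# Anchored pattern from the answer: at least two a's, no "bbb", only a/b allowed.
PATTERN = re.compile(r'^(?=.*a.*a)(?!.*bbb)[ab]+$')

def accepted(s):
    """True if the whole string satisfies both requirements."""
    return PATTERN.match(s) is not None

samples = ["aab", "bbabba", "ab", "abbba", "aac"]
results = {s: accepted(s) for s in samples}

for s, ok in results.items():
    print(s, ok)
```

Here aab and bbabba are accepted, while ab (only one a), abbba (contains bbb) and aac (foreign letter) are rejected.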

This seems to work too:
(ab{0,2}a)(ab{0,2})*

Related

Using regex to match with patterns like <some_pattern>Dog<the_exact_previous_pattern>Cat

Is it possible to write a regex in Perl that matches with strings like
<some_pattern>Dog<the_exact_previous_pattern>Cat
or in other words it would match with
CarDogCarCat ChairDogChairCat
but not with ChairDogCarCat
to substitute it with another text. I don't care about the exact characters that form <some_pattern> or <the_exact_previous_pattern>, I only care that both of them have the exact same characters.
I tried something like
s/(.*(Dog)?)Cat/replacement/g;
But I know that would match with ChairDogCarCat, also tried
s/(.*)Dog$1Cat/replacement/g;
but it didn't work. Is that something that I can't do with regex and would need other string processing functions (i.e. splitting) to implement? Regex always seems to be the easier and shorter solution for me but I don't know if it has limitations with some patterns.
Thanks in advance.
Use a relative backreference:
/(.*?) Dog \g{-1} Cat/x
This works if you don't need to capture Dog. If you do, then count one group further back:
/(.*?)(Dog)\g{-2}(Cat)/
where \g{-2} means "match the same text as the second-to-last capture group before this point" (here, the first group).
Create a capture group, and back-reference it with \1. Your example:
/([A-Z][a-z]+)Dog\1Cat/
Examples:
CarDogCarCat ==> match
ChairDogChairCat ==> match
ChairDogCarCat ==> no match
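The same \1 back-reference idea can be tried outside Perl. This sketch uses Python's re module (an assumption, since the question is about Perl), with the literal string "replacement" standing in for whatever text you substitute:

```python
import re

# Group 1 captures a capitalised word; \1 demands the same text again after "Dog".
pattern = re.compile(r'([A-Z][a-z]+)Dog\1Cat')

text = "CarDogCarCat ChairDogChairCat ChairDogCarCat"

# Only the first two items repeat the captured word, so only they are replaced.
replaced = pattern.sub("replacement", text)
print(replaced)
```

The third item is left alone because Car does not equal the captured Chair.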

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of \K, which discards everything matched so far from the reported match, so that only what is matched after it is returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that \w matches only Latin letters (plus digits and the underscore), so if your documents contain text in a different script, those characters will be consumed by \W, which is supposed to match only the separators. If you want to capture text in non-Latin scripts, such as Greek, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you would use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero-width group (such as the positive lookahead in the pattern given in this answer). A regex usually "consumes" the characters it checks, but with a positive lookahead or lookbehind you assert that characters exist without preventing later parts of the search from matching those same characters.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The pattern above matches once for each pair of words (note that it won't match the last word, since it has no partner). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters \W+ followed by another run of word characters \w+. Because this happens inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Note that for each match, each word is in its own capture group (group 1, group 2)
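To see the overlapping pairs this produces, here is a rough sketch using Python's re module (an assumption; RapidMiner's tokenizer uses the pattern as a delimiter and may behave differently). Note that group 2 carries the separator along with the second word:

```python
import re

text = "That was a good movie."

# Group 1 is the first word; group 2 (inside the lookahead) is the separator
# plus the next word, which is not consumed and can start the next match.
matches = re.findall(r'(\w+)(?=(\W+\w+))', text)

# Glue the two groups back together to get each two-word window.
pairs = [first + rest for first, rest in matches]
print(pairs)
```

This yields That was, was a, a good and good movie; the final word has no partner and is skipped.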
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by another SO question.
My question turns out to be badly phrased once you understand that this is really a problem of overlapping regex matches.

RegEx: How to make a greedy quantifier really, really greedy (and never give back)?

For example I have this RegEx:
([0-9]{1,4})([0-9])
Which gives me these matching groups when testing with the string "3041": group 1 is "304" and group 2 is "1".
As you can see, group2 is filled before group1 even if the quantifier is greedy.
How can I instead make sure to fill group1 before group2?
EDIT1: I want to have the same regEx, but have "3041" in group1 and group2 empty.
EDIT2: I want to have "3041" in group1 and group2 empty. And, yes, I want the regEx to not match!
For an input "1234", the pattern: ([0-9]{1,4})([0-9]) is being as greedy as possible.
The first capture group cannot contain four characters, otherwise the last part of the pattern would not match.
Perhaps what you're looking for is:
([0-9]{1,4})([0-9]?)
By making the second group optionally empty, the first group can contain all four characters.
Edit:
I want the regEx to not match!, I want only 5 digits strings to match the whole RegEx.
In this case, your pattern should not really be "1-4 characters" in the first group, since you only want to match a group of 4:
([0-9]{4})([0-9])
In some regex flavours (i.e. not all languages support this), it is also possible to make quantifiers possessive (although this is unnecessary in your case, as shown above). For example:
([0-9]{1,4}+)([0-9])
This will force the first group to match as far as it can (i.e. 4 characters), so a 3-character match does not get attempted and the overall pattern fails to match.
Edit2:
Is "possessiveness" available in Javascript? If not, any workarounds?
Unfortunately, possessive quantifiers are not available in JavaScript.
However, you can emulate the behaviour (in a slightly ugly way) with a lookahead:
(?=([0-9]{1,4}))\1([0-9])
In general, a possessive quantifier a++ can be emulated as: (?=(a+))\1.
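The lookahead emulation is not specific to JavaScript. As a rough illustration, here it is in Python's re module (an assumption; the input strings are just the question's "3041" plus a made-up five-digit one):

```python
import re

# Plain greedy version: group 1 gives one digit back so that group 2 can match.
greedy = re.compile(r'([0-9]{1,4})([0-9])')

# Emulated possessive version: the lookahead grabs up to four digits into
# group 1, and \1 then consumes exactly that text with no way to give any back.
emulated = re.compile(r'(?=([0-9]{1,4}))\1([0-9])')

greedy_groups = greedy.search("3041").groups()
emulated_short = emulated.search("3041")         # four digits: no match at all
emulated_long = emulated.search("30415").groups()

print(greedy_groups, emulated_short, emulated_long)
```

On "3041" the greedy pattern splits the digits as 304 and 1, while the emulated possessive pattern refuses to match; it only succeeds once a fifth digit is present.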
As it stands you only need anchors:
^([0-9]{4})([0-9])$
This will only match five digits strings and will fail on any other string.

Regular Expression for matching a single digit followed by a word exactly in Notepad++

:Statement
Say we have the following three records and want to match only the first one, i.e. exactly one digit followed by a specific word. What regular expression can be used for this (in Notepad++)?
2Cups
11Cups
222Cups
The expressions I tried and their problems are:
Proposal 1:\d{1}Cups
it will find the "1Cups" and "2Cups" substrings in the second and third record respectively, which is what we do not want here
Proposal 2:[^0-9]+[0-9]Cups
same as the above
(PS: the records can be "XX 2Cups", "YY22Cups" and "XYZ 333Cups", i.e., no assumption on the position of the matchable parts)
Any suggestions?
:Reference
[1] The reg definition in NotePad++ (Same as SciTe)
As mentioned in Searching for a complex Regular Expression to use with Notepad++, it is: http://www.scintilla.org/SciTERegEx.html
[2] Matching exact number of digits
Here is an example: regular expression to match exactly 5 digits.
However, we do not want to find the match-able substring in longer records here.
If the string actually has the numbered sequence (1. 2Cups 2. 11Cups), you can use the white space that follows it:
\s\d{1}Cups
If there isn't the numbered list before, but the string will be at the beginning of the line, you can anchor it there:
^\d{1}Cups
Tested in Notepad++ v6.5.1 (Unicode).
It sounds like you want to match the digit only at the start of the string or if it has a space before it, so this would work:
(^|\b)\dCups
Explanation:
(^|\b) Match the start of the string or a word boundary
\d Match a digit ({1} is redundant)
Cups Match Cups
This will work:
\b\dCups
If "Cups" must be a whole word (i.e. not matching 2Cupsizes):
\b\dCups\b
Note that \b matches even if at start or end of input.
I found one possible solution:
Using ^\d{1}Cups to match the "starts with one digit + Cups" cases, as suggested by Ken, Cottrell and Bohemian.
Using [^\d]\dCups to match other cases.
However, haven't found a solution using just one regex to solve the problem yet.
Have a try with:
(?:^|\D)\dCups
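This will match xCups (where x is a digit) only if there is no digit before it. Note that the match includes the preceding non-digit character, if there is one.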
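A quick way to sanity-check the last pattern against the records from the question. This sketch uses Python's re module rather than Notepad++'s engine (an assumption, so results in the editor may differ):

```python
import re

# Records taken from the question, including the "PS" variants.
records = ["2Cups", "11Cups", "222Cups", "XX 2Cups", "YY22Cups", "XYZ 333Cups"]

# Start of string or a non-digit, then exactly one digit, then "Cups".
pattern = re.compile(r'(?:^|\D)\dCups')

matched = [r for r in records if pattern.search(r)]
print(matched)
```

Only 2Cups and XX 2Cups are reported; every record with two or more digits before Cups is skipped.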

Fetch one out of two Numbers out of String

I have a list of strings, such as: Ø20X400
I need to extract the first of the numbers - between Ø and X
I've come so far to match the numbers in general with \d+ - as simple as it is...
But I need an expression to get the first value separated, not both of them...
You can use lookarounds (?<=..) and (?=..):
(?<=Ø)\d+(?=X)
or in Java style:
(?<=Ø)\\d+(?=X)
A second way is to use a capture group:
Ø(\d+)X
or
Ø(\\d+)X
Then you can extract the content of the group.
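Both variants can be compared side by side. This is a minimal sketch in Python's re module (an assumption; the question does not say which language is used), run on the sample string from the question:

```python
import re

s = "Ø20X400"

# Capture-group variant: the digits between Ø and X end up in group 1.
via_group = re.search(r'Ø(\d+)X', s).group(1)

# Lookaround variant: Ø and X are only asserted, so the match itself is the number.
via_lookaround = re.search(r'(?<=Ø)\d+(?=X)', s).group(0)

print(via_group, via_lookaround)
```

Either way the result is 20, and the 400 after the X is ignored.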
The regex engines I know parse \n as a newline. \d is used for numbers.
The following regex gives you the first number between a Ø and a X in a capture group:
^.*?Ø(\d+)X.*
This regex will do it for you: (\d+?)X, and here is a Rubular to prove it. You want to group the digits together but make the quantifier non-greedy, so that the match ends at the X.
Try this one:
\d+(?=\D)
This should find the first number that is followed by a non-digit character.
With normal regular expressions, I would say:
Ø(\d+)X
This finds the Ø character, followed by one or more numbers, followed by an X. Also, the numbers will be stored in the first capture group. Capture groups differ from one regex implementation to another, but this would typically be denoted by \1. Capture group zero, \0, is usually the matched string itself. In this version, \d denotes digits 0-9, but if your regex engine uses \n for that purpose, use:
Ø(\n+)X