Why doesn't the regex ^([0|1]1)+$ match the string "111"?

I'm trying to write a regex to match binary strings where every odd character is a 1.
I came up with this:
^([0|1]1)+$
My logic:
^ matches the start of the line
( starts a capture group
[0|1] match a 0 or 1 (since the 0th position is even)
1 the previous character (0 or 1) must be followed by a 1
+ repeat the previous pattern one or more times
$ matches the end of the line
So by my logic, the above regex should match binary strings where every other character (starting with the second character in the string) is a 1.
However, it doesn't work correctly. As an example, the string 111 is not matched.
Why isn't it working and what should I change to make it work?
Regex101 Test

If you need every odd character to be a 1, then you need something more like this:
^([01]1)*[01]?$
Each repetition of the group matches any binary digit followed by a 1. The group can repeat any number of times, and the optional trailing [01] allows one more digit of either value.
The pipe in your character class is not needed, and it actually makes your regex also match a literal pipe character, so remove it. A pipe denotes alternation only outside a character class, for example inside a group such as (?: ... ) or ( ... ).
The above will also match an empty string, so you could add (?=.) at the beginning to force at least 1 character, i.e. ^(?=.)([01]1)*[01]?$
The above will match where you have (where x is either 0 or 1):
x
x1
x1x
x1x1
x1x1x
x1x1x1
etc.
Your current regex, on the other hand, can only match an even number of characters. You repeat the group ([0|1]1), which matches exactly 2 characters (no more, no less), so the length of the whole match will be a multiple of 2.
Adding the optional [01] at the end allows strings with an odd number of characters to match.
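As a quick check of the points above, here is a minimal sketch in Python. The choice of Python's re module is an assumption (the question names no language), and the helper name accepts is made up for illustration. It compares the pattern from the question with the corrected one.

```python
import re

# Pattern from the question and the corrected pattern from this answer.
ORIGINAL = re.compile(r"^([0|1]1)+$")
FIXED = re.compile(r"^(?=.)([01]1)*[01]?$")


def accepts(pattern, text):
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


# Print one row per sample: the sample, the original verdict, the fixed verdict.
for sample in ["111", "0", "01", "011", "00", "|1", ""]:
    print(repr(sample), accepts(ORIGINAL, sample), accepts(FIXED, sample))
```

The "|1" row shows the stray pipe being accepted by the original character class, and the "111" row shows the even-length restriction.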

Your regex is for even-length strings only. [01] and 1 each match a character, therefore your capturing group matches 2 characters.
This modifies your regex to accept odd-length strings:
^([01](1|$))+$

Firstly, the [0|1] should read [01]. Otherwise you have a character class that matches 0, | or 1.
Now, [01]1 matches exactly two characters. Thus ([01]1)+ cannot match a string whose length is not a multiple of two.
To make it work with inputs of odd length, change the regex to
^(([01]1)+[01]?|1)$

You can use this pattern:
^1?([01]1)+$|^1$
or
^(1?([01]1)+|1)$
To deal with an odd or even number of digits, you need to put an optional 1? at the beginning. To ensure that there is at least one digit, you can't use a * quantifier for the group, otherwise the pattern can match the empty string. This is why you need to use + for the group and add the case of a single 1.
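The four suggested patterns are not interchangeable on edge cases, which a small comparison can make visible. The following is only a sketch, assuming Python's re module; the dictionary labels and the function name verdicts are invented for illustration and do not come from the answers.

```python
import re

# The four patterns proposed in the answers, under made-up labels.
PATTERNS = {
    "optional tail": r"^(?=.)([01]1)*[01]?$",
    "1 or end": r"^([01](1|$))+$",
    "pairs or single 1": r"^(([01]1)+[01]?|1)$",
    "optional leading 1": r"^(1?([01]1)+|1)$",
}


def verdicts(text):
    """Map each label to True/False depending on whether `text` fully matches."""
    return {
        label: re.fullmatch(pattern, text) is not None
        for label, pattern in PATTERNS.items()
    }


for sample in ["111", "0", "011", "101", ""]:
    print(repr(sample), verdicts(sample))
```

All four accept "111", but they disagree on inputs such as "0", "011" and "101", so it is worth deciding which positions must hold a 1 before picking one.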

Related

How to match strings that are entirely composed of a predefined set of substrings with regex

How to match strings that are entirely composed of a predefined set of substrings. For example, I want to see if a string is composed of only the following allowed substrings:
,
034
140
201
In the case when my string is as follows:
034,201
The string is fully composed of the 'allowed' substrings, so I want to positively match it.
However, in the following string:
034,055,201
There is an additional 055, which is not in my 'allowed' substrings set. So I want to not match that string.
What regex would be capable of doing this?

Try this one:
^(034|201|140|,)+$
Here is a demo
Step by step:
^ beginning of a line
(034|201|140|,) capturing group with the alternative possible matches
+ the group appears one or more times
$ end of a line

This regex will match only your values and ensure that the line doesn't start or end with a comma. A valid line is returned as the whole match (group 0); the groups themselves are non-capturing.
^(?:034|140|201)(?:,(?:034|140|201))*$
^: start
(?:034|140|201): non-capturing group for your set of items (no comma)
(?:,(?:034|140|201))*: non-capturing group of a comma followed by a non-capturing group of values, 0 or more times
$: end
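If the set of allowed substrings changes often, the same pattern can be assembled from a list. This is a sketch only, assuming Python's re module; build_pattern and is_allowed are hypothetical names, and the comma separator is taken from the question.

```python
import re


def build_pattern(allowed, separator=","):
    """Build item(separator item)* from a list of literal substrings."""
    # re.escape keeps any regex metacharacters in the items literal.
    item = "(?:" + "|".join(re.escape(s) for s in allowed) + ")"
    return re.compile(item + "(?:" + re.escape(separator) + item + ")*")


ALLOWED = build_pattern(["034", "140", "201"])


def is_allowed(text):
    """True if `text` consists only of allowed items joined by the separator."""
    return ALLOWED.fullmatch(text) is not None


print(is_allowed("034,201"))
print(is_allowed("034,055,201"))
```

Using fullmatch plays the role of the ^ and $ anchors in the pattern above.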

Match all elements with n occurrences

I want to match runs of the same character with exactly n occurrences.
For example, match the letters that repeat exactly 3 times in this string: "aaaaabbbcccccccccdddee"
This should return "bbb" and "ddd".
If I specified what to match, like "b{3}" or "d{3}", this would be easier, but I want to match all characters.
The closest I have come is this regex: (.)\1{2}(?!\1)
It returns "aaa", "bbb", "ccc", "ddd".
And I can't add a negative lookbehind (?<!\1), because it is rejected as "non-fixed width".

One possibility is to use a regex that looks for a character which is not followed by itself (or the beginning of the line), followed by three identical characters, followed by another character which is not the same as those three, i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101
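The difference between the consuming pattern and the lookahead-wrapped one can be reproduced with a short sketch. It assumes Python's re module, which the answer does not mention, and runs_of_three is a made-up helper name.

```python
import re

# Consuming version: swallows the character before each run.
CONSUMING = re.compile(r"(?:(.)(?!\1)|^)((.)\3{2})(?!\3)")
# Same pattern wrapped in a lookahead, so nothing is consumed.
LOOKAHEAD = re.compile(r"(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))")


def runs_of_three(pattern, text):
    """Collect capture group 2 from every match of `pattern` in `text`."""
    return [m.group(2) for m in pattern.finditer(text)]


print(runs_of_three(CONSUMING, "aaabbbcccdddeee"))
print(runs_of_three(LOOKAHEAD, "aaabbbcccdddeee"))
print(runs_of_three(LOOKAHEAD, "aaaaabbbcccccccccdddee"))
```

The first call skips every second run because the preceding character was already consumed; the lookahead version reports adjacent runs as well.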

You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo
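Reading the result out of capture group 2 could look like the following sketch. It assumes Python's re module; exact_triples is a hypothetical helper name. Matches of the first alternative (runs of four or more) leave group 2 unset and are filtered out.

```python
import re

# Runs of 4+ are consumed by the first alternative; runs of exactly 3 land in group 2.
PATTERN = re.compile(r"(.)\1{3,}|((.)\3\3)")


def exact_triples(text):
    """Return the runs of exactly three identical characters in `text`."""
    return [
        m.group(2)
        for m in PATTERN.finditer(text)
        if m.group(2) is not None  # skip matches of the "4 or more" branch
    ]


print(exact_triples("aaaaabbbcccccccccdddee"))
```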

This gets sticky because you cannot put a backreference inside a negated character class, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2)
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.

If your environment permits the use of (*SKIP)(*FAIL), you can return a lean set of matches by consuming substrings of four or more consecutive duplicate characters and then discarding them. In the other branch of the alternation, match the desired 3 consecutive duplicate characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
    preg_match_all(
        '/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
        $string,
        $m
    )
    ? $m[0]
    : 'no matches'
);
Output:
array (
  0 => 'bbb',
  1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward and, by consuming the 4 or more consecutive duplicates, it can rule out long substrings quickly.

Regex to add a 0 to the 2nd digit after comma if missing

I want to add a 0 as the 2nd digit after the comma in case it's missing. For example: turning the values -2,3; 45,5; 3.000,0 into -2,30; 45,50; 3.000,00.
I thought about first matching with .*,\d{1} in an IF statement (i.e. checking whether the value has just 1 digit after the comma) and then replacing with the pattern (.*) and the replacement ${1}0, but this seems to add two zeroes instead of one, e.g. resulting in -2,300; 45,500, etc.
Edit: I just realized that I could also just concatenate the string with a "0" if regex matching returns true.

You do not need to check if a string ends with a comma and one digit. You can use
(,\d)$
Replace with ${1}0.
See the regex demo.
The consuming part of this pattern can never match an empty string. It matches:
(,\d) - a comma and a digit (capturing group 1)
$ - end of string.
${1}0 replaces the match with the Group 1 value followed by a 0.
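For illustration, here is how the same replacement might look in Python. This is an assumption, since the question does not name its environment: Python's re.sub writes the group reference as \g<1> rather than ${1}, and pad_second_decimal is a made-up function name.

```python
import re


def pad_second_decimal(value):
    """Append a 0 when the value ends in a comma followed by a single digit."""
    # \g<1>0 means "group 1, then a literal 0"; a bare \10 would be read as group 10.
    return re.sub(r"(,\d)$", r"\g<1>0", value)


for value in ["-2,3", "45,5", "3.000,0", "45,50", "7"]:
    print(value, "->", pad_second_decimal(value))
```

Values that already have two digits after the comma, or no comma at all, are left unchanged because the pattern does not match them.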

What is the point of having * in a regular expression

Recently I have been wondering why we need * in regular expressions. For example, if we want to represent A0, A1, ..., Z99, we can do:
[A-Z][0-9][0-9]*
But A0A (which is not what we want) is also valid according to the above. What benefit does the * give me?

* is just a quantifier, matching between zero and unlimited times.
[A-Z][0-9][0-9]* matches A0, A1, ..., Z99 and also A10000, Z123456789, ...
Remember that if you don't add the ^ and $ anchors, the engine will match the specified part and report success even if the input contains more characters, because you didn't say that you want a positive result ONLY if the entire input matches the regex.
If your goal is to match just A0,A1..,Z99, the regex should be:
^[A-Z][0-9][0-9]?$
Or simply:
^[A-Z]\d{1,2}$
\d means 'digit', and is the same as [0-9].
{1,2} means at least once and at most twice.
? is also a quantifier, matching 0 or 1 time.
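The effect of the anchors and of the different quantifiers can be seen in a few lines. This is a sketch assuming Python's re module; the constant names are invented for illustration.

```python
import re

UNANCHORED = re.compile(r"[A-Z][0-9][0-9]*")      # pattern from the question
ANCHORED_STAR = re.compile(r"^[A-Z][0-9][0-9]*$")  # whole input, any number of digits
ANCHORED_OPT = re.compile(r"^[A-Z][0-9][0-9]?$")   # whole input, 1 or 2 digits

# Without anchors, search succeeds on a part of "A0A".
print(UNANCHORED.search("A0A").group())

# With anchors, the trailing "A" makes the match fail.
print(ANCHORED_OPT.search("A0A"))

# * accepts any number of extra digits, ? accepts at most one.
print(ANCHORED_STAR.search("A100") is not None)
print(ANCHORED_OPT.search("A100") is not None)
```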

But A0A (which is not what we want) is also valid
No, it is not valid; you just need to use anchors:
^[A-Z][0-9][0-9]*$
^ ensures the match begins at the line start and $ ensures it runs to the line end.
Also, if only the 2nd digit is optional, it is better to use:
^[A-Z][0-9][0-9]?$
since * matches 0 or more times whereas ? matches 0 or 1 time.

It seems you're trying to match strings that start with an uppercase letter followed by a number from 0 to 99.
^[A-Z][1-9]?[0-9]$
^ asserts that we are at the start and $ asserts that we are at the end, so together they force the whole string to match. The pattern can no longer match just a part at the start, middle or end of a string or line. That is, [A-Z][1-9]?[0-9] will match A10 in the string fooA10, but ^[A-Z][1-9]?[0-9]$ won't produce a match in fooA10.

Why does this not match my example?

As I go through the regex101 quiz/lessons, I am supposed to match an IP address (without leading zeros).
Now the following
^[^0]+[0-9]+\\.[^0]+[0-9]+\\.[^0]+[0-9]+\\.[^0]+[0-9]+$
matches 23.34.7433.33
but fails to match single-digit numbers like 1.2.3.4
Why is this so, when my + is supposed to match "1 to infinite" times?

Each number in your pattern has to match at least 2 characters, because you have:
[^0]+[0-9]+
[^0]+ matches at least one character, and [0-9]+ matches at least one more, so together they need at least 2 characters (each within its own character class).
Also, 23.34.7433.3 doesn't match your regex, for the same reason.
You might try this regex for the purpose you stated:
^(?:[1-9][0-9]{0,2}\.){3}[1-9][0-9]{0,2}$
[1-9][0-9]{0,2} will match up to 3 digits with no leading 0.
EDIT: You mentioned in a comment that 0.0.0.0 (single digit zeroes) are to be accepted as well. The modified regex from above would be:
^(?:(?:[1-9][0-9]{0,2}|0)\.){3}(?:[1-9][0-9]{0,2}|0)$
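The behaviour described in this answer can be checked with a short sketch. It assumes Python's re module and assumes that the doubled backslash in the question was string-literal escaping, so the dot is written as \. in a raw string here. The constant names are made up.

```python
import re

# Pattern from the question: every number needs [^0]+ and [0-9]+, i.e. 2+ characters.
QUESTION = re.compile(r"^[^0]+[0-9]+\.[^0]+[0-9]+\.[^0]+[0-9]+\.[^0]+[0-9]+$")
# Pattern suggested in this answer, including the single-zero case from the edit.
SUGGESTED = re.compile(r"^(?:(?:[1-9][0-9]{0,2}|0)\.){3}(?:[1-9][0-9]{0,2}|0)$")

for sample in ["23.34.7433.33", "1.2.3.4", "a1.b2.c3.d4", "0.0.0.0", "01.2.3.4"]:
    print(
        sample,
        QUESTION.search(sample) is not None,
        SUGGESTED.search(sample) is not None,
    )
```

The "a1.b2.c3.d4" row shows that [^0] also accepts letters. The suggested pattern still does not limit each number to 255.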

Assuming you want to check an IPv4 address, I suggest this pattern:
^(?<nb>2(?>[0-4][0-9]|5[0-5])|1[0-9]{2}|[1-9]?[0-9])(?>\.\g<nb>){3}$
I have defined a named subpattern nb to make the pattern shorter, but if you prefer, you can write it out in full, replacing each \g<nb>:
^(?>2(?>[0-4][0-9]|5[0-5])|1[0-9]{2}|[1-9]?[0-9])(?>\.(?>2(?>[0-4][0-9]|5[0-5])|1[0-9]{2}|[1-9]?[0-9])){3}$
Numbers greater than 255 are not allowed.
pattern details:
The goal is to describe what is allowed:
numbers with 3 digits that begin with "2", followed either by a digit in [0-4] and a digit in [0-9], or by 5 and a digit in [0-5], because the number cannot exceed 255.
numbers with 3 digits that begin with "1", followed by any two digits.
any number with 2 digits that doesn't begin with "0"
any number with 1 digit (zero included)
If I add these rules one by one, I obtain
2(?>[0-4][0-9]|5[0-5])
2(?>[0-4][0-9]|5[0-5]) | 1[0-9]{2}
2(?>[0-4][0-9]|5[0-5]) | 1[0-9]{2} | [1-9][0-9]
2(?>[0-4][0-9]|5[0-5]) | 1[0-9]{2} | [1-9][0-9] | [0-9]
Now I have a definition for the allowed numbers. I can shorten the pattern a little by replacing [1-9][0-9] | [0-9] with [1-9]?[0-9]
Then you only have to add the dot and repeat the subpattern four times: x.x.x.x
But since there are only three dots, I write the first number and then repeat 3 times a group that contains a dot and a number:
(?>2(?>[0-4][0-9]|5[0-5])|1[0-9]{2}|[1-9]?[0-9]) # the first number
(?>\.(?>2(?>[0-4][0-9]|5[0-5])|1[0-9]{2}|[1-9]?[0-9])){3} # the group repeated 3 times
To be sure that the string doesn't contain anything other than the IP I described, I add anchors for the start of the string ^ and for the end of the string $, so that the string begins and ends with the IP.
To reduce the size of a pattern, you can define a named group, which allows you to reuse the subpattern it contains.
Then you can rewrite the pattern like this:
^
(?<nb> 2(?>[0-4][0-9]|5[0-5])|1[0-9]{2}|[1-9]?[0-9] ) # named group definition
(?> \. \g<nb> ){3} # \g<nb> is the reference to the subpattern named nb
$
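Engines without subroutine calls such as \g<nb> can get the same reuse by composing the pattern from a string. The sketch below assumes Python's re module, which has no \g<nb> subroutine syntax in patterns; it also swaps the atomic groups for plain non-capturing groups, which changes backtracking behaviour but not the set of accepted strings here. OCTET, IPV4 and is_ipv4 are hypothetical names.

```python
import re

# One number from 0 to 255 without leading zeros (same rules as listed above).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# First number, then a group of "dot + number" repeated 3 times.
IPV4 = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")


def is_ipv4(text):
    """True if the whole of `text` is a dotted quad with each part in 0..255."""
    return IPV4.fullmatch(text) is not None


for sample in ["1.2.3.4", "255.255.255.255", "256.1.1.1", "01.2.3.4", "1.2.3"]:
    print(sample, is_ipv4(sample))
```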

[0-9]+ should be [0-9]*
* matches 0 or more.
+ matches 1 or more.
The [^0] part is also wrong, because it matches any character other than 0, including letters.
So your pattern matches at least one character that is NOT zero, followed by at least one digit.
Each number should be written as
[1-9][0-9]*
This checks that the first character is a digit between 1 and 9 and that it is followed by zero or more digits between 0 and 9.
The whole pattern then becomes
^[1-9][0-9]*\.[1-9][0-9]*\.[1-9][0-9]*\.[1-9][0-9]*$
Cleaned up:
^(?:[1-9][0-9]*\.){3}[1-9][0-9]*$
To also accept a single 0 for any of the numbers:
^(?:[1-9][0-9]*|0)\.(?:[1-9][0-9]*|0)\.(?:[1-9][0-9]*|0)\.(?:[1-9][0-9]*|0)$
Cleaned up:
^(?:(?:[1-9][0-9]*|0)\.){3}(?:[1-9][0-9]*|0)$

Your regex would match ABCDEFG999.FOOBSR888 etc, because [^0] is any character other than a zero, and both character classes are required by the +.
I think you want this:
^[1-9]\d*(\\.[1-9]\d*){3}$
Having replaced various verbose expressions with their equivalents, this is 4 groups of digits, each starting with a non-zero digit.
Actually the problem is far more complicated, because your approach (once corrected) allows 999.999.999.999, which is not a valid IP.

It might be because your pattern needs at least two characters between two dots '.'
Try using this pattern: ^[^0]+[0-9]*\.[^0]+[0-9]*\.[^0]+[0-9]*\.[^0]+[0-9]*$

To match an IP address, you should use this pattern:
\b(?:\d{1,3}\.){3}\d{1,3}\b
Taken from here:
http://www.regular-expressions.info/examples.html
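A pattern with word boundaries instead of ^ and $ is suited to pulling addresses out of longer text. The following sketch assumes Python's re module, with the dots escaped so that they match only a literal dot. The sample sentence and the name find_ip_like are made up.

```python
import re

# Four groups of 1 to 3 digits separated by literal dots, on word boundaries.
IP_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ip_like(text):
    """Return every dotted quad of 1-3 digit numbers found in `text`."""
    return IP_LIKE.findall(text)


print(find_ip_like("ping 192.168.0.1 and 10.0.0.256 ok"))
```

Note that 10.0.0.256 is returned too: this pattern checks only the shape, not that each number is at most 255.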
