Confusion about the Kleene star

I have been struggling to understand one key property of the closure of a union of two expressions. Basically, I need to know exactly how the Kleene star works.
For example, if the regular expression is R = (0+1)*, does a matching string have to look like 000111, 01, or 00001111, or can it have unequal numbers of 0's and 1's, such as 0011111, 000001, 111111, or 0000?

The numbers of 0's and 1's can be unequal; you can even have the 0's and 1's in any order! a* means "zero or more a's, where each a is matched independently"; thus, in a string matching (0+1)*, each character can match (0+1) without regard for how the other characters in the string match it.
Consider the pattern (0+1)(0+1); it matches the strings 00, 01, 10, and 11. As you can see, the 0's and 1's don't have to occur in equal amounts and don't have to occur in any specific order. The Kleene star extends this to strings of any length; after all, (0+1)* just means <empty>+(0+1)+(0+1)(0+1)+(0+1)(0+1)(0+1)+ ....
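A quick way to convince yourself is to try it. The sketch below uses Python's re module, where the formal union 0+1 is written 0|1; the sample strings are the ones from the question plus a few extras chosen for illustration.

```python
import re

# Formal notation (0+1)* becomes (0|1)* in Python's regex syntax.
pattern = re.compile(r"(?:0|1)*")

# Strings with equal counts, unequal counts, a single symbol, and the empty string.
samples = ["", "01", "000111", "0011111", "000001", "111111", "0000", "1010"]

for s in samples:
    # fullmatch requires the whole string to match, as in the formal definition.
    print(repr(s), bool(pattern.fullmatch(s)))

# A string containing any other symbol is rejected.
print(repr("012"), bool(pattern.fullmatch("012")))
```

Every sample is accepted because each character is matched against (0|1) on its own.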

Related

How to match regular expression for lines there is just one number and all words vice versa?

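I have some words. How can I write a regular expression that matches only the lines containing exactly one digit with all other characters letters, or vice versa (exactly one letter with all other characters digits)?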
YV932X6R
V5R67HD1
5R3XPD61
57342D61
CRHXPDV2
12345678
CDHKPQRV
I've tried to use this way, but it's not quite what I want
^(?=.*[0-9])(?=.*[a-zA-Z])([a-zA-Z0-9]+)$
Output
YV932X6R
V5R67HD1
5R3XPD61
57342D61
CRHXPDV2
Expected Output
CRHXPDV2
OR
57342D61

If you don't care about the length, then you may use the following patterns; the first matches exactly one digit among letters, the second exactly one letter among digits:
^[A-Za-z]*[0-9][A-Za-z]*$
^[0-9]*[A-Za-z][0-9]*$
If you also have a length requirement of 8 characters, you could enforce that via a positive lookahead. For example, the pattern for one digit and the rest letters would become:
^(?=.{8}$)[A-Za-z]*[0-9][A-Za-z]*$
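As a rough check, the two patterns can be run against the sample lines from the question. This is a sketch in Python; the variable names are made up for illustration, and the 8-character lookahead is applied to both patterns on the assumption that every line should have that length.

```python
import re

lines = ["YV932X6R", "V5R67HD1", "5R3XPD61", "57342D61",
         "CRHXPDV2", "12345678", "CDHKPQRV"]

# Exactly one digit, everything else letters, 8 characters in total.
one_digit = re.compile(r"^(?=.{8}$)[A-Za-z]*[0-9][A-Za-z]*$")
# Exactly one letter, everything else digits, 8 characters in total.
one_letter = re.compile(r"^(?=.{8}$)[0-9]*[A-Za-z][0-9]*$")

digit_hits = [s for s in lines if one_digit.match(s)]
letter_hits = [s for s in lines if one_letter.match(s)]

print(digit_hits)   # ['CRHXPDV2']
print(letter_hits)  # ['57342D61']
```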

Regular expression - Kleene star of a union expression

I'm trying to write something that randomly returns one possible string generated by a regular expression.
I'm confused about how to tackle this when you have the Kleene star of a union expression.
If you have (a + b)*, does this mean that you keep choosing between a or b and repeat that a definite number of times, or do you just randomly choose between a or b twice?
If it is the former, would it make sense to first generate a random number to determine how many times I'm going to choose between a or b, and then, each time I choose an element, generate another random number and repeat the element that many times?

If you're asking what kind of things match (a | b)*, you might as well think of it in terms of a grammar:
<expression> := <empty> | <parens><expression>
<parens> := a | b
That's what a * operator really means: for any expression x, x* matches either the empty string or (x)(x*) (this is a recursive definition).
If you want to randomly generate a string that matches the expression, then that's a much more complicated matter. You now have to think in terms of which distribution you want to use, because the length of the string is unbounded, and it's impossible to have a uniform distribution over an unbounded range. (In other words, you can't pick a random length between 0 and infinity uniformly, so you'd have to decide how you're going to pick that in the first place.)
Once you have your length problem resolved, expand (a | b)* into (a | b) repeated N times (where N is your randomly chosen length) and resolve each parenthesized subexpression separately. For instance, if you choose to expand the subexpression 3 times, that becomes (a | b)(a | b)(a | b), which matches all of aaa, baa, aba, bba, aab, bab, abb and bbb. There is no need for a second random repeat count per element: each of the N positions is simply an independent choice between a and b.
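One possible way to settle the length question is a geometric distribution: before each repetition, stop with some fixed probability. The sketch below is only an illustration of that choice; the function name random_star_string and the parameter stop_probability are hypothetical, not anything from the question.

```python
import random
import re


def random_star_string(alternatives, stop_probability=0.25, rng=None):
    """Return one random string from (alt1 | alt2 | ...)*.

    The number of repetitions is geometric: before every repetition we stop
    with probability stop_probability (an assumed, tunable value that must be
    greater than 0 so the loop ends).
    """
    if rng is None:
        rng = random.Random()
    parts = []
    while rng.random() >= stop_probability:
        # Each repetition is an independent choice among the alternatives.
        parts.append(rng.choice(alternatives))
    return "".join(parts)


rng = random.Random(2024)  # fixed seed so runs are repeatable
for _ in range(5):
    s = random_star_string(["a", "b"], rng=rng)
    # Every generated string is a member of (a|b)*.
    assert re.fullmatch(r"(?:a|b)*", s)
    print(repr(s))
```

With this scheme the empty string comes out with probability stop_probability, and longer strings become progressively less likely, so no upper bound on the length has to be chosen.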

If you want to test whether a string is a member of a set under the Kleene star, such as:
{"a", "b"}* = {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", "aab", ...}
then the regex ^[ab]*$ will work, and it also accepts the empty string.
If you want to limit the length of the string, say to at most 10 characters, then try ^[ab]{0,10}$ (some flavors also accept the shorthand {,10}, but others treat it literally or reject it).
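A minimal membership test along these lines, sketched in Python. The helper name in_star_set is hypothetical, and the bound is spelled {0,10} here as an assumption made for portability across regex flavors.

```python
import re

# Members of {"a", "b"}* with at most 10 characters.
bounded = re.compile(r"[ab]{0,10}")


def in_star_set(s):
    # fullmatch plays the role of the ^...$ anchors.
    return bounded.fullmatch(s) is not None


print(in_star_set(""))        # True: the empty string is a member
print(in_star_set("abba"))    # True
print(in_star_set("a" * 11))  # False: longer than 10 characters
print(in_star_set("abc"))     # False: 'c' is not in the alphabet
```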

Regular expression not containing 101

I came across the regular expression not containing 101 as follows:
0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
I was unable to understand how the author came up with this regex, so I thought of a string that does not contain 101:
01000100
It seems that the above string will not be matched by the above regex, but I was unsure. So I tried translating it to an equivalent PCRE regex on regex101.com, but failed there too (as can be seen, my regex does not even match a string containing a single 1).
What's wrong with my translation? Is the above regex indeed correct? If not, what would the correct regex be?

Here is a slightly shorter expression: ^0*(1|00+)*0*$
https://www.regex101.com/r/gG3wP5/1
Explanation:
(1|00+)* we can mix zeroes and ones as long as zeroes occur in groups
^0*...0*$ we can have as many zeroes as we want in prefix/suffix
A direct translation of the original regexp would be:
^(0*1*0*|(1|00|000)*|(0+1+0+)*)$
Update
The original looks like an artificially complicated version of the regexp above:
(1|00|000)* is the same as (1|00+)*
it is almost the solution, but it does not match strings 0, 01.., and ..10
0*1*0* doesn't match strings with 101 inside, but matches 0 and some of 01.., and ..10
we still need to match those of 01.., and ..10 which have 0's and 1's mixed inside, e.g. 01001.. or ..10010
(0+1+0+)* matches some of the remaining cases, but there are still some valid strings left unmatched
e.g. 10010 is one of the shortest strings that none of the three alternatives matches.
So the original expression is overly complicated and incomplete.
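Claims like these are easy to check by brute force over all short binary strings. The following is a sketch in Python, with the length limit of 10 picked arbitrarily; it compares the shorter expression and the direct translation against the plain substring test "101" not in s.

```python
import re
from itertools import product

short = re.compile(r"^0*(1|00+)*0*$")
direct = re.compile(r"^(0*1*0*|(1|00|000)*|(0+1+0+)*)$")


def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


valid = [s for s in binary_strings(10) if "101" not in s]
invalid = [s for s in binary_strings(10) if "101" in s]

# The short expression accepts every valid string and no invalid one.
short_ok = (all(short.match(s) for s in valid)
            and not any(short.match(s) for s in invalid))
print(short_ok)

# Valid strings that the direct translation of the original fails to match.
missed = [s for s in valid if not direct.match(s)]
print(missed[:2])
```

The first few entries of missed are the length-5 strings discussed above, such as 10010.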

Read the explanation in the right-hand tab on regex101; it tells you what your regex does. I think you misunderstood what the list operator does: inside a list operator ( [ ), other characters such as ( are no longer metacharacters, so the expression [(0*1*0*)[1(00)(000)] is equivalent to [01()*[], which matches a single 0, 1, (, ), * or [.
The correct translation of the regular expression 0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
will be as follows:
^((?:0*1*0*)|(?:1|00|000)*|(?:0+1+0+)*)$
What your regex [(0*1*0*)[1(00)(000)]*(0+1+0+)*] does:
[(0*1*0*)[1(00)(000)]* -> matches any of the characters 0, 1, (, ), *, [ zero or more times, followed by
(0+1+0+)* --> matches the pattern 0+1+0+ zero or more times, followed by
] --> matches the character ]
so your expression is equivalent to
[01()*[]*(0+1+0+)*] which is not a regular expression that matches the strings not containing 101

0* 1* ( (00+000)* 1*)* (ε+0)
I think this expression covers all cases, because:
any number n ≥ 2 can be broken into 2's and 3's, i.e. n = 2*i + 3*j for some i, j ≥ 0. So (00+000)* allows any number of 0's between two consecutive 1's except exactly one 0. Hence, 101 cannot be obtained.
ε+0 covers strings ending in a single 0.
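The "2's and 3's" argument can be checked directly: a run of n zeros should be matched by (00+000)* for every n except 1. Below is a small sketch in Python, where the formal + is written |; the upper limit of 20 is arbitrary.

```python
import re

# (00+000)* in formal notation: any concatenation of blocks 00 and 000.
blocks = re.compile(r"(?:00|000)*")

# For each run length n, record whether a run of n zeros can be built from the blocks.
matches = {n: blocks.fullmatch("0" * n) is not None for n in range(21)}

unmatched = [n for n, ok in matches.items() if not ok]
print(unmatched)  # [1]
```

Only the single 0 is excluded, which is exactly the run that would produce 101 between two 1's.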

The RE for the language not containing 101 as a substring can also be written as (0*1*00)*.0*.1*.0*
This may be smaller than the one you are using. Try to make use of it.

Regular Expression I got (0+10)1. (looks simple :P)
I just considered all the cases to arrive at this.
If you consider two consecutive 1's, we have to end up with continuous 1's:
case 1: 11111111111111...
case 2: 0000000011111111111111...(once we take two 1's we can't accept 0's, so the one and only option is to continue with 1's)
If you consider only a single 1 that is followed by 0, there is no issue, and after that single 1 we can have any number of 0's.
case 3: 00000000 10100100010000100000100000 1111111111
=>(0*+10*)1
final answer (0+10)1.
Thanks for your patience.

Regex Verification of String in Correct Order with Delimiters in PHP

I'm trying to make an expression to verify that the string supplied is in a valid format, but it seems that if I don't use regex for a few months, I forget everything I learned and have to relearn it.
My expression is supposed to match a format like this: 010L0404FFCCAANFFCC00M000000XXXXXX
The four delimiters are (L, N, K, M), which aren't in the 0-9A-F hexadecimal range, so they are unambiguous. Each one must either appear in that order or be absent. Each delimiter can only exist once!
It breaks down to this:
Starts off with a 3-digit number, which is simply ^([0-9]{3}) and is always required
Second set begins with L, and must be 2 digits + 2 digits + 6 hexadecimal characters and is optional
Third set begins with N and must be 6 hexadecimal characters and is optional
The fourth set K is simply any amount of digits and is optional
The fifth set is M and can be any 6 hexadecimal characters or XXXXXX to indicate nothing; its length must be a non-zero multiple of 6, like 336699 (6) or 336699XXXXXXFFCC00 (18), and it is optional
The hardest part, which I can't figure out, is requiring the sets in that order and in multiples; for example, the L delimiter must always come before K if it's there (this is so I don't get variations of the same string that mean the same thing with delimiters swapped). I can already parse it, I just want to verify the string is in the correct format.
Any help would be appreciated, thanks.

Requiring the order isn't too bad. Just make each set optional. The regex will still match in order, so if the L section, for example, isn't there and the next character is N, it won't let L occur later since it won't match any of the rest of the regex.
I believe a direct translation of your requirements would be:
^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6}|X{6})+)?$
No real tricks, just making each group optional except for the first three digits, and adding an internal alternative for the two patterns of six characters in the M block.
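The pattern above can be tried out quickly. The sketch below uses Python rather than PHP (the regex itself is unchanged, and the same string should work with preg_match once delimiters are added); the helper name is_valid and all test strings other than the one from the question are made up for illustration.

```python
import re

FORMAT = re.compile(
    r"^([0-9]{3})"                # required 3-digit prefix
    r"(L[0-9]{4}[0-9A-F]{6})?"    # optional L set: 2 + 2 digits, then 6 hex
    r"(N[0-9A-F]{6})?"            # optional N set: 6 hex
    r"(K[0-9]+)?"                 # optional K set: one or more digits
    r"(M([0-9A-F]{6}|X{6})+)?$"   # optional M set: groups of 6 hex or XXXXXX
)


def is_valid(code):
    return FORMAT.match(code) is not None


print(is_valid("010L0404FFCCAANFFCC00M000000XXXXXX"))  # True: sample from the question
print(is_valid("010"))                                 # True: only the required prefix
print(is_valid("010K123M336699XXXXXXFFCC00"))          # True: L and N omitted
print(is_valid("010NFFCC00L0404FFCCAA"))               # False: N before L
print(is_valid("010M33669"))                           # False: M block is not a multiple of 6
```

The out-of-order case fails because, once the optional L group has been skipped, nothing later in the pattern can consume an L.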

^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6})+|MX{6})$

R regular expressions: unexpected behavior of "[:digit:]"

I'd like to extract elements beginning with digits from a character vector but there's something about POSIX regular expression syntax that I don't understand.
I would think that
vec <- c("012 foo", "305 bar", "other", "notIt 7")
grep(pattern="[:digit:]", x=vec)
would return 1 2 4 since those are the three elements that have digits somewhere in them. But in fact it returns 3 4.
Likewise grep(pattern="^0", x=vec) returns 1 as I would expect because element 1 starts with a zero. However grep(pattern="^[:digit:]", x=vec) returns integer(0) whereas I would expect it to return 1 2 since those are the elements that start with digits.
How am I misunderstanding the syntax?

Try
grep(pattern="[[:digit:]]", x=vec)
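instead, as the class names between colons are only recognized inside a bracket expression and therefore need double brackets. On its own, [:digit:] is just a bracket expression matching any one of the characters :, d, i, g, t, which is why elements 3 and 4 were returned.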

Another solution
grep(pattern="\\d", x=vec)

man 7 regex
Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class. Standard character class names are:
alnum digit punct
alpha graph space
blank lower upper
cntrl print xdigit
Therefore a character class that is the sole member of a bracket expression will look like double-brackets, such as [[:digit:]]. As another example, consider that [[:alnum:]] is equivalent to [[:alpha:][:digit:]].
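The surprising result 3 4 can be reproduced outside R. Python's re module has no POSIX character classes at all, so it always reads [:digit:] as a plain set of the characters :, d, i, g, t, which is the same interpretation R applies to the single-bracket form. The sketch below relies on that; the grep helper is a hypothetical stand-in for R's grep that returns 1-based indices, and \d is used in place of [[:digit:]].

```python
import re

vec = ["012 foo", "305 bar", "other", "notIt 7"]


def grep(pattern, items):
    """Return the 1-based indices of the items that contain a match."""
    return [i for i, s in enumerate(items, start=1) if re.search(pattern, s)]


# Single brackets: a set containing ':', 'd', 'i', 'g', 't'.
print(grep(r"[:digit:]", vec))  # [3, 4]

# What was intended: any digit, and a digit at the start.
print(grep(r"\d", vec))         # [1, 2, 4]
print(grep(r"^\d", vec))        # [1, 2]
```

Elements 3 and 4 match the single-bracket form only because "other" and "notIt 7" contain the letter t.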
