Regular expression - Kleene star of a union expression

Regular expression - Kleene star of a union expression - regex

I'm trying to code something that returns randomly a possible result after going through a regular expression.
I was sort of confused on how to tackle this when you have kleene star of a union expression.
If you have (a + b)* then does this mean that you indefinitely choose between a or b and repeat it a definite number of times, or do you just randomly choose between a or b twice.
If it is the former, then would it logically make sense to first generate a random number to determine how many times I'm going to randomly choose between a or b, and then for each time I randomly choose the element I generate another random number that then repeats the element that many times?

If you're asking what kind of things match (a | b)*, you might as well think of it in terms of a grammar:
<expression> := <empty> | <parens><expression>
<parens> := a | b
That's what a * operator really means: for any expression x, x* matches either the empty string or (x)(x*) (this is a recursive definition).
If you want to randomly generate a string that matches the expression, then that's a much more complicated matter. You now have to think in terms of which distribution you want to use, because the length of the string is unbounded, and it's impossible to have a uniform distribution over an unbounded range. (In other words, you can't pick a random length between 0 and infinity uniformly, so you'd have to decide how you're going to pick that in the first place.) Once you have your length problem resolved, expand (a | b)* into (a | b) repeated N times (where N is your randomly-chosen length) and resolve each parenthesized subexpression separately — for instance, if you choose to expand the subexpression 3 times, that would become (a | b)(a | b)(a | b), which will match all of aaa, baa, aba, bba, aab, bab, abb and bbb.

If you want to test if the string is a member of a Kleene star applied
set, such as:
{"a", "b"}* = {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", "aab", ...}
then the regex ^[ab]*$ will work including an empty string.
If you want to limit the length of the string, say 10, then try ^[ab]{,10}$.

Related

Confusion about the Kleene star

I have been struggling to understand one key property about the closure of two union'ed expressions. Basically what I need to know exactly how the Kleene star works.
I.E If Regular Expression R = (0+1)* Does the expression have to evaluate to something like 000111/01/00001111, or can we have a unequal amount of 0's & 1's, such as 0011111/000001/111111/0000?

The amount of 0's and 1's can be unequal; you can even have the 0's and 1's in any order! a* means "zero or more as, where each a is evaluated independently"; thus, in a string matching (0+1)*, each character can match (0+1) without regard for how the other characters in the string are matching it.
Consider the pattern (0+1)(0+1); it matches the strings 00, 01, 10, and 11. As you can see, the 0's and 1's don't have to occur in equal amounts and don't have to occur in any specific order. The Kleene star extends this to strings of any length; after all, (0+1)* just means <empty>+(0+1)+(0+1)(0+1)+(0+1)(0+1)(0+1)+ ....

Match list of incrementing integers using regex

Is it possible to match a list of comma-separated decimal integers, where the integers in the list always increment by one?
These should match:
0,1,2,3
8,9,10,11
1999,2000,2001
99,100,101
These should not match (in their entirety - the last two have matching subsequences):
42
3,2,1
1,2,4
10,11,13

Yes, this is possible when using a regex engine that supports backreferences and conditions.
First, the list of consecutive numbers can be decomposed into a list where each pair of numbers are consecutive:
(?=(?&cons))\d+
(?:,(?=(?&cons))\d+)*
,\d+
Here (?=(?&cons)) is a placeholder for a predicate that ensures that two numbers are consecutive. This predicate might look as follows:
(?<cons>\b(?:
(?<x>\d*)
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?:9(?= 9*,\g{x}\d (?<y>\g{y}?+ 0)))*
,\g{x}
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
(?(y)\g{y})
# handle the 999 => 1000 case separately
| (?:9(?= 9*,1 (?<z>\g{z}?+ 0)))+
,1\g{z}
)\b)
For a brief explanation, the second case handling 999,1000 type pairs is easier to understand -- there is a very detailed description of how it works in this answer concerned with matching a^n b^n. The connection between the two is that in this case we need to match 9^n ,1 0^n.
The first case is slightly more complicated. The largest part of it handles the simple case of incrementing a decimal digit, which is relatively verbose due to the number of said digits:
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
The first block will capture whether the digit is N into group aN and the second block will then uses conditionals to check which of these groups was used. If group aN is non-empty, the next digit should be N+1.
The remainder of the first case handles cases like 1999,2000. This again falls into the pattern N 9^n, N+1 0^n, so this is a combination of the method for matching a^n b^n and incrementing a decimal digit. The simple case of 1,2 is handled as the limiting case where n=0.
Complete regex: https://regex101.com/r/zG4zV0/1
Alternatively the (?&cons) predicate can be implemented slightly more directly if recursive subpattern references are supported:
(?<cons>\b(?:
(?<x>\d*)
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?<y>
,\g{x}
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
| 9 (?&y) 0
)
# handle the 999 => 1000 case separately
| (?<z> 9,10 | 9(?&z)0 )
)\b)
In this case the two grammars 9^n ,1 0^n, n>=1 and prefix N 9^n , prefix N+1 0^n, n>=0 are pretty much just written out explicitly.
Complete alternative regex: https://regex101.com/r/zG4zV0/3

Fetch Words Starting/Containing/Ending With Fragment in a Word Dictionary

Assuming that we have a list of all dictionary words from A-Z from the English dictionary.
I have three cases to perform on these list of words:
1) find out all the words that are "starting with" a particular fragment
eg: If my fragment is 'car', word 'card' should be returned
2) find out all the words that "contains" the fragment as the substring
eg: If my fragment is 'ace', word 'facebook' should be returned
3) find out all the words that are "ending with" a particular fragment
eg: If my fragment is 'age', word 'image' should be returned
After some searching exercise on the internet, I found that 1) can be done through trie/compressed trie and 3) can be done through suffix tree.
I am unsure of how 2) can be achieved.
Plus are there any better scenarios wherein all these three cases can be handled? As maintaining both prefix and suffix tree could be a memory intensive task.
Kindly let me know for any other areas to look out for.
Thanks in advance.
PS: I will be using C++ to achieve this
EDIT 1: For the time being I constructed a suffix tree with immense help from here.
Single Word Suffix Tree Generation in C Language
Here, I need to construct a suffix tree for the entire english dictionary words. So should I create a
a) separate suffix tree for each word OR
b) create a generalized suffix tree for all words.
I am not sure how to track individual trees for each word while substring matching in the a) case
Any pointers?

As I pointed out in a comment, the prefix and suffix cases are covered by the general substring case (#2). All prefixes and suffixes are by definition substrings as well. So all we have to solve is the general substring problem.
Since you have a static dictionary, you can preprocess it relatively easily into a form that is fast to query for substrings. You could do this with a suffix tree, but it's far easier to construct and deal with simple sorted flat vectors of data, so that's what I'll describe here.
The end goal, then, is to have a list of sub-words that are sorted so that a binary search can be done to find a match.
First, observe that in order to find the longest substrings that match the query fragment it is not necessary to list all possible substrings of each word, but merely all possible suffixes; this is because all substrings can be merely thought of as prefixes of suffixes. (Got that? It's a little mind-bending the first time you encounter it, but simple in the end and very useful.)
So, if you generate all the suffixes of each dictionary word, then sort them all, you have enough to find any specific substring in any of the dictionary words: Do a binary search on the suffixes to find the lower bound (std::lower_bound) -- the first suffix that starts with the query fragment. Then find the upper bound (std::upper_bound) -- this will be one past the last suffix that starts with the query fragment. All of the suffixes in the range [lower, upper[ must start with the query fragment, and therefore all of the words that those suffixes originally came from contain the query fragment.
Now, obviously actually spelling out all the suffixes would take an awful lot of memory -- but you don't need to. A suffix can be thought of as merely an index into a word -- the offset at which the suffix begins. So only a single pair of integers is required for each possible suffix: one for the (original) word index, and one for the index of the suffix in that word. (You can pack these two together cleverly depending on the size of your dictionary for even greater space savings.)
To sum up, all you need to do is:
Generate an array of all the word-suffix index pairs for all the words.
Sort these according to their semantic meaning as suffixes (not numerical value). I suggest std::stable_sort with a custom comparator. This is the longest step, but it can be done once, offline since your dictionary is static.
For a given query fragment, find the lower and upper bounds in the sorted suffix indices. Each suffix in this range corresponds to a matching substring (of the length of the query, starting at the suffix index in the word at the word index). Note that some words may match more than once, and that matches may even overlap.
To clarify, here's a minuscule example for a dictionary composed of the words "skunk" and "cheese".
The suffixes for "skunk" are "skunk", "kunk", "unk", "nk", and "k". Expressed as indices, they are 0, 1, 2, 3, 4. The suffixes for "cheese" are "cheese", "heese", "eese", "ese", "se", and "e". The indices are 0, 1, 2, 3, 4, 5.
Since "skunk" is the first word in our very limited imaginary dictionary, we'll assign it index 0. "cheese" is at index 1. So the final suffixes are: 0:0, 0:1, 0:2, 0:3, 0:4, 1:0, 1:1, 1:2, 1:3, 1:4, 1:5.
Sorting these suffixes yields the following suffix dictionary (I added the actual corresponding textual substrings for illustration only):
0 | 0:0 | cheese
1 | 0:5 | e
2 | 0:2 | eese
3 | 0:3 | ese
4 | 0:1 | heese
5 | 1:4 | k
6 | 1:1 | kunk
7 | 1:3 | nk
8 | 0:4 | se
9 | 1:0 | skunk
10 | 1:2 | unk
Consider the query fragment "e". The lower bound is 1, since "e" is the first suffix that is greater than or equal to the query "e". The upper bound is 4, since 4 ("heese") is the first suffix that is greater than "e". So the suffixes at 1, 2, and 3 all start with the query, and therefore the words they came from all contain the query as a substring (at the suffix index, for the length of the query). In this case, all three of these suffixes belong to "cheese", at different offsets.
Note that for a query fragment that's not a substring of any of the words (e.g. "a" in this example), there are no matches; in such a case, the lower and upper bounds will be equal.

You can try aho corasick algorithm. The automaton is in fact a trie and the failure function is a breadth-first traversal of the trie.

RE: Number of a's is divisible by 6 and Number of b's is divisible by 8

Find a regular expression which represents strings made of {a, b}, where number of a's is divisible by 6 and number of b's is divisible by 8.
I tried to create a DFA which accepts such strings. My idea was to use all the remainders mod 6 and mod 8 leading to a total of 48 remainders. Thus each state in DFA is a pair (r, s) where r varies from 0 to 6 and s varies from 0 to 7. Start state (as well as accepting state) is (0, 0) and by we can easily give transitions to states by noting that if we input "a" the state (r, s) transitions to (r + 1, s) and on input "b" it transitions to state (r, s + 1).
However it is too difficult to work with a DFA of 48 states and I am not sure if this can be minimized by using the DFA minimization algorithm by hand.
I am really not sure how then we can get to a regular expression representing such strings.

If you are allowed to use lookaheads:
^(?=b*((ab*){6})+$)a*((ba*){8})+$
Debuggex Demo
Example of matched string: bbaabbaabbaabb
Idea is simple: We know how to match string having number of as divisible by 6 - ^((b*ab*){6})+$, also we know how to match string having number of bs divisible by 8 - ^((a*ba*){8})+$. So we just put one regex to lookahead, and another one to matching part.
In case if you also need to match strings consisting only of as or only of bs, then the following regex will do:
^(?=b*((ab*){6})*$)a*((ba*){8})*$
Examples of matched strings: aaaaaa, bbbbbbbb, bbaabbaabbaabb
Debuggex Demo

Regexes for integer constants and for binary numbers

I have tried 2 questions, could you tell me whether I am right or not?
Regular expression of nonnegative integer constants in C, where numbers beginning with 0 are octal constants and other numbers are decimal constants.
I tried 0([1-7][0-7]*)?|[1-9][0-9]*, is it right? And what string could I match? Do you think 034567 will match and 000083 match?
What is a regular expression for binary numbers x such that hx + ix = jx?
I tried (0|1){32}|1|(10)).. do you think a string like 10 will match and 11 won’t match?
Please tell me whether I am right or not.

You can always use http://www.spaweditor.com/scripts/regex/ for a quick test on whether a particular regex works as you intend it to. This along with google can help you nail the regex you want.

0([1-7][0-7])?|[1-9][0-9] is wrong because there's no repetition - it will only match 1 or 2-character strings. What you need is something like 0[0-7]*|[1-9][0-9]*, though that doesn't take hexadecimal into account (as per spec).
This one is not clear. Could you rephrase that or give some more examples?

Your regex for integer constants will not match base-10 numbers longer than two digits and octal numbers longer than three digits (2 if you don't count the leading zero). Since this is a homework, I leave it up to you to figure out what's wrong with it.
Hint: Google for "regular expression repetition quantifiers".

Question 1:
Octal numbers:
A string that start with a [0] , then can be followed by any digit 1, 2, .. 7 [1-7](assuming no leading zeroes) but can also contain zeroes after the first actual digit, so [0-7]* (* is for repetition, zero or more times).
So we get the following RegEx for this part: 0 [1-7][0-7]*
Decimal numbers:
Decimal numbers must not have a leading zero, hence start with all digits from 1 to 9 [1-9], but zeroes are allowed in all other positions as well hence we need to concatenate [0-9]*
So we get the following RegEx for this part: [1-9][0-9]*
Since we have two options (octal and decimal numbers) and either one is possible we can use the Alternation property '|' :
L = 0[1-7][0-7]* | [1-9][0-9]*
Question 2:
Quickly looking at Fermat's Last Theorem:
In number theory, Fermat's Last Theorem (sometimes called Fermat's conjecture, especially in older texts) states that no three positive integers a, b, and c can satisfy the equation an + bn = cn for any integer value of n greater than two.
(http://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem)
Hence the following sets where n<=2 satisfy the equation: {0,1,2}base10 = {0,1,10}base2
If any of those elements satisfy the equation, we use the Alternation | (or)
So the regular expression can be: L = 0 | 1 | 10 but can also be L = 00 | 01 | 10 or even be L = 0 | 1 | 10 | 00 | 01
Or can be generalized into:
{0} we can have infinite number of zeroes: 0*
{1} we can have infinite number of zeroes followed by a 1: 0*1
{10} we can have infinite number of zeroes followed by 10: 0*10
So L = 0* | 0*1 | 0*10

max answered the first question.
the second appears to be the unsolvable diophantine equation of fermat's last theorem. if h,i,j are non-zero integers, x can only be 1 or 2, so you're looking for
^0*10?$
does that help?

There are several tool available to test regular expressions, such as The Regulator.
If you search for "regular expression test" you will find numerous links to online testers.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js