Regular expression for equal number of 0 and 1 - regex

How can I find a regular expression that matches strings with an equal number of 1s and 0s?
I am also interested in how you would arrive at such a solution.
Example:
Should match: 1100, 00100111, 01.
Shouldn't match: 110, 0, 11001.
I need a regular expression that describes the set of all such strings.
If a string in that set has length 2n, then the number of 0s and the number of 1s should both equal n.

It is not possible to write a regular expression for the language L of strings over {0,1} with the same number of 1s and 0s.
This is not a regular language, so it cannot be described with a regular expression. It is not regular because an automaton that accepts it would need an amount of memory that grows with the length of the input (it has to track the difference between the number of 0s and 1s seen so far). A regular language is one that can be recognized with constant memory, regardless of the length of the input.
The language you describe can be generated by a Context Free Grammar, but not a regular expression.
The following CFG generates exactly the strings in which the number of 0s and the number of 1s are equal. Here S is the start symbol, standing for any word in the language:
S -> SS
S -> 0S1
S -> 1S0
S -> ε (the empty word)
For this language you need a stack, and a pushdown automaton could be designed to accept it, or a Turing machine.

Not possible with regular grammar (finite state automaton) : http://en.wikipedia.org/wiki/Regular_language

Here is a regex pattern for the .NET engine that does satisfy your needs. See it in action at ideone.com.
^((?(D)(?!))(?<C>1)|(?(D)(?!))(?<-C>0)|(?(C)(?!))(?<D>0)|(?(C)(?!))(?<-D>1))*(?(C)(?!))(?(D)(?!))$
It works by using two stacks: one (C) is used while there are currently more 1s than 0s, and the other (D) while there are more 0s than 1s.
Not pretty, definitely not usable, but it works. (Ha!)

While this is not possible with a regular grammar as stated in another answer, it should be relatively easy to scan the string, increment a counter for each 1 and decrement it for each 0. If the final count is 0, then the number of 0s and 1s is equal (modulo 2^wordsize - watching out for overflow would make it a little trickier, but depending on whether there are other assumptions that can be made regarding the input, that may not be necessary).
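The counter approach described above can be sketched in a few lines of Python. This is a minimal sketch, not part of the original answer: the function name has_equal_zeros_and_ones is made up for illustration, and rejecting characters other than 0 and 1 is an assumption. Python integers do not overflow, so the wordsize caveat does not apply here.

```python
def has_equal_zeros_and_ones(s: str) -> bool:
    """Return True if s consists only of 0s and 1s, in equal numbers."""
    balance = 0  # (number of 1s seen so far) - (number of 0s seen so far)
    for ch in s:
        if ch == "1":
            balance += 1
        elif ch == "0":
            balance -= 1
        else:
            return False  # assumption: any other character is rejected
    return balance == 0


# The examples from the question: the first three should match, the rest should not.
for word in ["1100", "00100111", "01", "110", "0", "11001"]:
    print(word, has_equal_zeros_and_ones(word))
```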

Related

Convert a regular expression to DFA

I have been trying different ways to solve this problem for over an hour and am getting very frustrated.
The problem is: Give regular expressions and DFAs for each of the following languages over Sigma = {0,1}.
a). {w ∈ Σ* | w contains an even number of 0s or an odd number of 1s}
If anyone could provide hints or get me started on figuring this one out, it would be very appreciated!
I know it is something along the lines of this DFA but this one is for
{w ∈ Σ* | w contains an even number of 0s or exactly two 1's}
so it's a bit different but I can't figure it out.
You can see it as follows: you always have to remember two things:
whether the number of 0s is even or odd; and
whether the number of 1s is even or odd.
Now if we denote even with e and odd with o, we consider four states: ee (both even), eo (even number of 0s and odd number of 1s), oe and oo.
Now when we read a zero (0), we simply swap the first state token, so it means we introduce transitions from:
ee - 0 -> oe;
eo - 0 -> oo;
oe - 0 -> ee; and
oo - 0 -> eo.
The same for ones (1):
ee - 1 -> eo;
eo - 1 -> ee;
oe - 1 -> oo; and
oo - 1 -> oe.
Now we only need to determine the initial state and the accepting state(s). The initial state is ee, since at that moment we have read no zeros and no ones.
Furthermore, the accepting states can be determined by the condition:
w contains an even number of 0s or an odd number of 1s
So that means the accepting states are ee, eo and oo. A drawing of this DFA is shown below:
There exists an algorithmic way to convert a DFA into an equivalent regular expression as is stated here.
You can construct a regular expression by splitting the problem into two easier problems:
a regex that checks if the number of 0s is even; and
a regex that checks if the number of 1s is odd.
For the first, you can use the regex:
(1*01*0)*1*
Indeed: you first have a group (1*01*0). This group ensures that there are two zeros, and 1s can appear everywhere in between. We allow an arbitrary number of repetitions, since the number always remains even. The regex ends with 1* since it is still possible that there are additional ones in the string.
The second problem can be solved with the regex:
0*1(0*10*1)*0*
The solution is more or less the same. The expression between the brackets, (0*10*1), ensures that the additional ones occur in pairs. By adding a single 1 in front, we ensure the total number of 1s is odd.
A regular expression that then solves the problem is:
(1*01*0)*1*|0*1(0*10*1)*0*
Since the "pipe" (|) means "or".
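As a sanity check, the combined expression can be compared against a direct parity count on every short binary string. This is only a sketch using Python's re module; the names EVEN_ZEROS, ODD_ONES, in_language and regex_accepts are made up for illustration, and the length limit of 8 is arbitrary.

```python
import re
from itertools import product

EVEN_ZEROS = r"(1*01*0)*1*"     # even number of 0s
ODD_ONES = r"0*1(0*10*1)*0*"    # odd number of 1s
COMBINED = re.compile(f"(?:{EVEN_ZEROS})|(?:{ODD_ONES})")


def in_language(w: str) -> bool:
    """Reference check: even number of 0s or odd number of 1s."""
    return w.count("0") % 2 == 0 or w.count("1") % 2 == 1


def regex_accepts(w: str) -> bool:
    """True if the whole string w is matched by the combined regex."""
    return COMBINED.fullmatch(w) is not None


# Collect every binary string of length 0..8 on which the two checks disagree.
mismatches = [
    w
    for n in range(9)
    for w in map("".join, product("01", repeat=n))
    if regex_accepts(w) != in_language(w)
]
print(mismatches)
```

An empty list means the regex and the parity count agree on all strings tested.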
Think about what possible states you can ever be in.
A string contains either an even number of 0's or an odd number of 0's. (2 possible states)
A string contains either an even number of 1's or an odd number of 1's. (2 possible states)
Now let's look at what combinations are accepted by your language:
even 0's, even 1's: accept
even 0's, odd 1's: accept
odd 0's, even 1's: reject
odd 0's, odd 1's: accept
As a result, your DFA will need 4 states, of which 3 are accept states and 1 is a reject state. Every state will have 2 transitions, each leading to a different state. Since the empty string has an even number of 0's and an even number of 1's, the "even 0's, even 1's" state will be the initial state.
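The four-state DFA can be written down as a transition table and simulated directly. This is a minimal sketch: the state names ee, eo, oe and oo follow the other answer in this thread (parity of 0s first, parity of 1s second), while TRANSITIONS, START, ACCEPTING and dfa_accepts are names chosen here for illustration.

```python
# States are named by (parity of 0s, parity of 1s): "e" = even, "o" = odd.
TRANSITIONS = {
    ("ee", "0"): "oe", ("ee", "1"): "eo",
    ("eo", "0"): "oo", ("eo", "1"): "ee",
    ("oe", "0"): "ee", ("oe", "1"): "oo",
    ("oo", "0"): "eo", ("oo", "1"): "oe",
}
START = "ee"                     # the empty string has zero 0s and zero 1s
ACCEPTING = {"ee", "eo", "oo"}   # every state except "odd 0s, even 1s"


def dfa_accepts(w: str) -> bool:
    """Run the DFA on w; raises KeyError for symbols outside {0, 1}."""
    state = START
    for symbol in w:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


for word in ["", "0", "01", "00", "011"]:
    print(repr(word), dfa_accepts(word))
```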
For making this into a regular expression: think about how you'd match an even number of 0's, then how you'd match an odd number of 1's. The language is just the union of these two.
Alternatively, as suggested by Willem, you can use an algorithm to convert any NFA to a regular expression. It has the advantage of being very general, but it's also more technical. Either way, it should lead to an equivalent regular expression.
What does a string with an even number of 0's look like? It might start with any number of 1's, but when we do find a 0 we had better find another one! There can be any number of 1's in between, but we only care about the 0's. Thus, we come up with the following regular expression:
1*(01*01*)*
You should be able to apply a similar logic to match an odd number of 1's. Finally, OR the two expressions to get the requested regular expression.

How do I convert language set notation to regular expressions?

I have the following question about regular expressions, and I just can't get my head around this kind of problem.
L1 = { 0^n 1^m | n ≥ 3 ∧ m is odd }
How would I write a regular expression for this sort of problem when the alphabet is {0,1}.
What's the answer?
The regular expression for your example is (see note 1 at the end):
000+1(11)*
So what does this do?
The first two characters, 00, are literal zeros. This is going to be important for the next point
The second two characters, 0+, mean "at least one zero, no upper bound". These first four characters satisfy the first condition, which is that we have at least three zeros.
The next character, 1, is a literal one. Since we need to have an odd number of ones, this is the smallest number we're allowed to have
The next characters, (11), form a group of two literal ones, and the trailing * says to match this group zero or more times. Since the group only ever adds ones in pairs on top of the single literal 1, the total number of ones is always odd. So we're done.
How'd I get that?
The key is knowing regular expression syntax. I happen to have quite a bit of experience in it, but this website helped me to verify.
Once you know the basic building blocks of regex, you need to break down your problem into what you can represent.
For example, practical regex syntax lets us specify a lower and an upper bound for matching (the {x,y} syntax), and {x} matches exactly x times. Many engines also accept {x,} for a lower bound alone, so 0{3,} would work there, but that is not one of the basic operators. Sticking to the basic operators, I knew I would have to use either + or * to specify the zeros, as those are the ones that permit an unbounded number of matches. I also knew that it didn't make sense to apply those modifiers to a group; the restriction that we must have at least 3 zeroes doesn't imply that we must have a multiple of three, for example, so (000)+ was out. I had to apply the modifier to only one character, which meant I had to match a few literals first. 000 matches exactly three 0s, and adding 0* (giving 0000*) allows any number of further zeros, which is exactly what I want; I then condensed that to the equivalent 000+.
For the second condition, I had to think about what an odd number is. By definition, an odd number can be expressed by 2*k + 1, where k is an integer. So I had to match one 1 (Hence the literal 1), and some number of the substring 11. That led me to the group, and then the *. On a slightly different problem, you could write 1(11)+ to match any odd number of ones, and at least 3.
Note 1: A colleague of mine pointed out to me that the + operator isn't technically part of the formal definition of regular expressions. If this is an academic question rather than a programming one, you might find the 0000* version more helpful. In that case, the final string would be 0000*1(11)*
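Both spellings of the expression can be tried out with Python's re module. This is just an illustrative sketch; the names FORMAL, SHORTHAND and in_l1 are not from the answer, and fullmatch is used so that the whole string has to belong to L1.

```python
import re

FORMAL = re.compile(r"0000*1(11)*")    # only literals, concatenation and *
SHORTHAND = re.compile(r"000+1(11)*")  # same language, written with +


def in_l1(word: str, pattern: "re.Pattern[str]" = FORMAL) -> bool:
    """True if the whole word has at least three 0s followed by an odd number of 1s."""
    return pattern.fullmatch(word) is not None


for word in ["0001", "00001", "000111", "001", "00011", "0001110"]:
    print(word, in_l1(word), in_l1(word, SHORTHAND))
```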

Compiler matching string to regex, using NFA

I was reading about compilers, the chapter about lexical analyzers(scanners) and I'm puzzled by the following statement:
For an input string X and a regular expression R, the complexity for finding a match using non-deterministic finite automata is:
O(len R * len X)
How can the complexity be polynomial in len R?
I'm under the impression that it depends exponentially on len R, because whenever we have a character which may appear a variable number of times (i.e. followed by the * symbol), we must test all possible numbers of occurrences. If we have multiple characters which appear a variable number of times, we must check all possibilities (by backtracking).
Where am I wrong?
we must check all possibilities(by backtracking).
Not necessarily by backtracking. There are many ways to implement an NFA. By moving through the input linearly and tracking all states the NFA could be in at the same time (storing the set of active states in an O(1)-lookup structure), you get exactly the mentioned runtime, since the number of states in the NFA is linear in the length of the regex.
See also the popular article Regular Expression Matching Can Be Simple And Fast.
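The idea of tracking a set of active states, instead of backtracking, can be sketched as follows. The NFA below is a small hand-built example for (a|b)*abb, chosen here purely for illustration; a real scanner would build the automaton from the regex, which this sketch does not attempt.

```python
from typing import Dict, Set, Tuple

# Hypothetical hand-built NFA for (a|b)*abb; state 3 is the accepting state.
NFA: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},  # nondeterministic: stay in 0 or guess that "abb" starts here
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text: str) -> bool:
    """Simulate the NFA by carrying the set of all currently possible states."""
    active = {START}
    for ch in text:
        # One pass over the active states per input character; the set can
        # never hold more states than the NFA has, so there is no blow-up.
        active = {nxt for state in active for nxt in NFA.get((state, ch), ())}
        if not active:
            return False  # no state can continue, so the input is rejected
    return bool(active & ACCEPTING)


for word in ["abb", "aabb", "babb", "ab", "abba"]:
    print(word, nfa_accepts(word))
```

Each input character is processed once, and the work per character is bounded by the size of the NFA, which is where the O(len R * len X) bound comes from.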

Is there a way to negate a regular expression?

Given a regular expression R that describes a regular language (no fancy backreferences). Is there an algorithmic way to construct a regular expression R* that describes the language of all words except those described by R? It should be possible as Wikipedia says:
The regular languages are closed under the various operations, that is, if the languages K and L are regular, so is the result of the following operations: […] the complement ¬L
For example, given the alphabet {a,b,c}, the inverse of the language (abc*)+ is (a|(ac|b|c).*)?
As DPenner has already pointed out in the comments, the inverse of a regular expression can be exponentially larger than the original expression. This makes inverting regular expressions unsuitable for implementing negative partial expression syntax for searching purposes. Is there an algorithm that preserves the O(n*m) runtime characteristic (where n is the size of the regex and m is the length of the input) of regular expression matching and allows for negated subexpressions?
Unfortunately, the answer given by nhahtdh in the comments is as good as we can do (so far). Deciding whether a given regular expression generates all strings (the universality problem) is PSPACE-complete. Since every problem in NP is also in PSPACE, and so reduces to any PSPACE-complete problem, an efficient solution to the universality problem would imply that P=NP.
If there were an efficient solution to your problem, would you be able to resolve the universality problem? Sure you would.
Use your efficient algorithm to generate a regular expression for the negation;
Determine whether the resulting regular expression generates the empty set.
Note that the problem "given a regular expression, does it generate the empty set" is fairly straightforward:
The regular expression {} generates the empty set.
(r + s) generates the empty set iff both r and s generate the empty set.
(rs) generates the empty set iff either r or s generates the empty set.
Nothing else generates the empty set.
Basically, it's pretty easy to tell whether a regular expression generates the empty set: just start evaluating the regular expression.
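The emptiness rules listed above translate into a short recursive function. This is a sketch under assumptions: the small syntax-tree classes (EmptySet, Symbol, Alt, Concat, Star) are invented here for illustration and are not tied to any particular regex library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmptySet:
    """The expression {}: generates no words at all."""


@dataclass(frozen=True)
class Symbol:
    char: str  # a single letter of the alphabet


@dataclass(frozen=True)
class Alt:  # (r + s)
    left: object
    right: object


@dataclass(frozen=True)
class Concat:  # (rs)
    left: object
    right: object


@dataclass(frozen=True)
class Star:  # r*
    inner: object


def generates_empty_set(r) -> bool:
    """Apply the rules above by structural recursion on the expression."""
    if isinstance(r, EmptySet):
        return True
    if isinstance(r, Alt):
        return generates_empty_set(r.left) and generates_empty_set(r.right)
    if isinstance(r, Concat):
        return generates_empty_set(r.left) or generates_empty_set(r.right)
    # Symbols generate one word, and r* always contains the empty word.
    return False


print(generates_empty_set(Concat(Symbol("a"), EmptySet())))
print(generates_empty_set(Alt(Symbol("a"), EmptySet())))
```

Each node of the expression is visited once, so the check is linear in the size of the expression.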
(Note that while the above procedure is efficient in terms of the output length, it might not be efficient in terms of the input length, if the output length grows faster than any polynomial in the input length. However, if that were the case, we'd have the same result anyway, i.e., that your algorithm isn't really efficient, since it would take exponentially many steps to generate an exponentially longer output from a given input).
Wikipedia says: ... if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. We can deduce from this statement that there is an infinite number of expressions that describe the language of all words except those described by R.
Again (as @nhahtdh also tried to explain), the simplest algorithm to address this question is to extend the scope of evaluation outside the context of the regular expression language itself. That is: match the strings you want to exclude (which represent a finite subset to work with) by using the original regular expression and then treat any failure to match as an actual match (out of an infinite set of other possibilities). So, if the result of the match is negative, your candidate strings are a subset of the valid solutions.
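In code, "treat any failure to match as a match" amounts to negating the result of a whole-string match. The sketch below uses Python's re module and the (abc*)+ example from the question; the name matches_complement is made up, and it assumes the candidate words are already known to be over the alphabet {a,b,c}.

```python
import re

R = re.compile(r"(abc*)+")  # the original expression from the question


def matches_complement(word: str) -> bool:
    """True if word is NOT in the language of R (whole-string match)."""
    return R.fullmatch(word) is None


for word in ["ab", "abcc", "abab", "", "a", "ba", "aa"]:
    print(repr(word), matches_complement(word))
```

This negates only the top-level result; it does not produce a regular expression for the complement, and it does not by itself give negated subexpressions inside a larger pattern.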

Algorithm to match list of regular expressions

I have two algorithmic questions for a project I am working on. I have thought about these, and have some suspicions, but I would love to hear the community's input as well.
1. Suppose I have a string, and a list of N regular expressions (actually they are wildcard patterns representing a subset of full regex functionality). I want to know whether the string matches at least one of the regular expressions in the list. Is there a data structure that can allow me to match the string against the list of regular expressions in sublinear (presumably logarithmic) time?
2. This is an extension of the previous problem. Suppose I have the same situation: a string and a list of N regular expressions, only now each of the regular expressions is paired with an offset within the string at which the match must begin (or, if you prefer, each of the regular expressions must match a substring of the given string beginning at the given offset).
To give an example, suppose I had the string:
This is a test string
and the regex patterns and offsets:
(a) his.* at offset 0
(b) his.* at offset 1
The algorithm should return true. Although regex (a) does not match the string beginning at offset 0, regex (b) does match the substring beginning at offset 1 ("his is a test string").
Is there a data structure that can allow me to solve this problem in sublinear time?
One possibly useful piece of information is that often, many of the offsets in the list of regular expressions are the same (i.e. often we are matching the substring at offset X many times). This may be useful to leverage the solution to problem #1 above.
Thank you very much in advance for any suggestions you may have!
I will assume that you really have the power of regular expressions.
To determine whether a string is matched by one of expressions e_1, e_2, ..., e_n, just match against the expression e_1 + e_2 + ... + e_n (sometimes the + operator is written as |).
Given expression-offset pairs (e_1, o_1), ..., (e_n, o_n) and a string, you can check whether there is i such that the string is matched by expression e_i at offset o_i by matching against the expression .{o_1}e_1 + ... + .{o_n}e_n.
Depending on the form of the individual expressions, you can get sublinear performance (not in general though).
If your expressions are sufficiently simple (wildcard patterns are), AND your set of expressions are predetermined, i.e. only the input to be matched changes, THEN you may construct a finite state machine that matches the union of your expressions, i.e., the expression "(r1)|(r2)|...".
Constructing that machine takes time and space at least O(N) (but I guess it is not exponential, which is the worst case for regular expressions in general). Matching is then O(length(input)), independent of N.
OTOH, if your set of expressions is to be part of the program's input, then there is no sublinear algorithm, simply because each expression must be considered.
(1) Combine all the regular expressions as a big union: (r1)|(r2)|(r3)|...
(2) For each regex with offset n add n dots to the beginning plus an anchor. So his.* at offset 6 becomes ^......his.*. Or if your regex syntax supports it, ^.{6}his.*.
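The two steps above can be combined into one pattern, sketched here with Python's re module. The function name build_matcher is an assumption for illustration. Python's re is a backtracking engine, so this shows only how the combined expression is built, not the O(length(input)) matching that a compiled finite state machine would give.

```python
import re
from typing import Iterable, Tuple


def build_matcher(pairs: Iterable[Tuple[str, int]]) -> "re.Pattern[str]":
    """Combine (expression, offset) pairs into one union of .{offset}expression."""
    branches = [f"(?:.{{{offset}}}(?:{expr}))" for expr, offset in pairs]
    # re.DOTALL lets "." also skip over newlines when consuming the offset.
    return re.compile("|".join(branches), re.DOTALL)


matcher = build_matcher([("his.*", 0), ("his.*", 1)])
# Pattern.match anchors at the start of the string, playing the role of ^.
print(matcher.match("This is a test string") is not None)
```

With the example from the question, the first branch fails at offset 0 and the second branch matches at offset 1, so the combined pattern matches.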