converting CFG to regular expression

converting CFG to regular expression - regex

Here's a CFG that generates strings of 0s, 1s, or 0s and 1s arranged like this (001, 011) where one of the characters must have a bigger count than the other like in 00011111 or 00000111 for example.
S → 0S1 | 0A | 0 | 1B | 1
A → 0A | 0
B → 1B | 1
I tried converting it to regular expression using this guide but I got stuck here since I have trouble converting 0S1 given that anything similar to it can't be found in that guide.
S → 0S1 | 0+ | 0 | 1+ | 1
A → 0A | 0 = 0+
B → 1B | 1 = 1+
One of my previous attempts is 0+0+1|0+1+1|1+|0+ but it doesn't accept strings I mentioned above like 00011111 and 00000111.

Plug and Play
^(?!01$)(?!0011$)(?!000111$)(?!00001111$)(?=[01]{1,8}$)0*1*$
You cannot perfectly translate this to a regular expression, but you can get close, by ensuring that the input does not have equal number of 0 and 1. This matches up to 8 digits.
How it works
^ first you start from the beginning of a line
(?!01$) ensure that the characters are not 01
(?!0011$) ensure that the characters are not 0011
the same for 000111 and 00001111
then ensure that there are from 1 to 8 zeroes and ones (this is needed, to ensure that the input is not made of more digits like 000000111111, because their symmetry is not verified)
then match these zeroes and ones till the end of the line
for longer inputs you need to add more text, for up to 10 digits it is this: ^(?!01$)(?!0011$)(?!000111$)(?!00001111$)(?!0000011111$)(?=[01]{1,10}$)0*1*$ (you jump by 2 by adding one more symmetry validation)
it is not possible by other means with regular expressions alone, see the explanation.
Explanation
The A and B are easy, as you saw 0+ and 1+. The concatenations in S after the first also are easy: 00+, 0, 11+, 1, that all mixed into one lead to (0+|1+). The problem is with the first concatenation 0S1.
So the problem can be shorten to S = 0S1. This grammar is recursive. But neither left linear nor right linear. To recognize an input for this grammar you will need to "remember" how many 0 you found, to be able to match the same amount of 1, but the finite-state machines that are created from the regular grammars (often and from regular expressions) do not have a computation history. They are only states and transitions, and the machinery "jumps" from one state to the other and does not remember the "path" traveled over the transitions.
For this reason you need more powerful machinery (like the push-down automaton) that can be constructed from a context-free grammar (as yours).

Related

Regular expression not containing 101

I came across the regular expression not containing 101 as follows:
0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
I was unable to understand how the author come up with this regex. So I just thought of string which did not contain 101:
01000100
I seems that above string will not be matched by above regex. But I was unsure. So tried translating to equivalent pcre regex on regex101.com, but failed there too (as it can be seen my regex does not even matches string containing single 1.
Whats wrong with my translation? Is above regex indeed correct? If not what will be the correct regex?

Here is a bit shorter expression ^0*(1|00+)*0*$
https://www.regex101.com/r/gG3wP5/1
Explanation:
(1|00+)* we can mix zeroes and ones as long as zeroes occur in groups
^0*...0*$ we can have as many zeroes as we want in prefix/suffix
Direct translation of the original regexp would be like
^(0*1*0*|(1|00|000)*|(0+1+0+)*)$
Update
This seems like artificially complicated version of the above regexp:
(1|00|000)* is the same as (1|00+)*
it is almost the solution, but it does not match strings 0, 01.., and ..10
0*1*0* doesn't match strings with 101 inside, but matches 0 and some of 01.., and ..10
we still need to match those of 01.., and ..10 which have 0 & 1 mixed inside, e.g. 01001.. or ..10010
(0+1+0+)* matches some of the remaining cases but there are still some valid strings unmatched
e.g. 10010 is the shortest string that is not matched by all of the cases.
So, this solution is overly complicated and not complete.

read the explanation in the right side tab in regex101 it tells you what your regex does( I think you misunderstood what list operator does) , inside a list operator ( [ ) , the other characters such as ( won't be metacharacters anymore so the expression [(0*1*0*)[1(00)(000)] will be equivalent to [01()*[] which means it matches 0 or 1 or ( or ) or [
The correct translation of the regular expression 0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
will be as follows:
^((?:0*1*0*)|(?:1|00|000)*|(?:0+1+0+)*)$
regex101
Debuggex Demo
What your regex [(0*1*0*)[1(00)(000)]*(0+1+0+)*] does:
[(0*1*0*)[1(00)(000)]* -> matches any of characters 0,(,),*,[ zero or more times followed by
(0+1+0+)* --> matches the pattern 0+1+0+ 0 or more times followed by
] --> matches the character ]
so you expression is equivalent to
[([)01](0+1+0+)*] which is not a regular expression to match strings that do not contain 101

0* 1* ( (00+000)* 1*)* (ε+0)
i think this expression covers all cases because --
any number apart from 1 can be broken into constituent 2's and 3's i.e. any number n=2*i+3*j. So there can be any number of 0's between 2 consecutive 1's apart from one 0.Hence, 101 cannot be obtained.
ε+0 for expressions ending in one 0.

The RE for language not containing 101 as sub-string can also be written as (0*1*00)*.0*.1*.0*
This may me a smaller one then what you are using. Try to make use of this.

Regular Expression I got (0+10)1. (looks simple :P)
I just considered all cases to make this.
you consider two 1's we have to end up with continuous 1's
case 1: 11111111111111...
case 2: 0000000011111111111111...(once we take two 1's we cant accept 0's so one and only chance is to continue with 1's)
if you consider only one 1 which was followed by 0 So, no issue and after one 1 we can have any number of 0's.
case 3: 00000000 10100100010000100000100000 1111111111
=>(0*+10*)1
final answer (0+10)1.
Thanks for your patience.

Regular Expression (RegEx) For Hours with Increments

I need to only accept input that meets these rules...
0.25-24
Increments of .25 (.00, .25, .50, .75)
First digit doesn't have to be required.
Would like trailing zeros to be optional.
Examples of some valid entries:
0.25
.50
.5
1
1.0
5.50
23.75
24 (max allowed)
UPDATE: nothing at all, null/blank, should also be accepted as valid
Example of some invalid entries:
0
.0
.00
0.0
0.00
24.25
-1
I understand that RegEx is a pattern matching language therefore it's not great for ranges, less-than, and great-than checking. So to check if it's less than or equal to 24 means I'd have to find a pattern, right? So there are 24 possible patters which would make this a long RegEx, am I understanding this correctly? I could use ColdFusion to do the check to make sure it's in the 0-24 range. It's not the end of the world if I have use ColdFusion for this part, but it'd be nice to get it all into the RegEx if it doesn't cause it to be too long. This is what I have so far:
^\d{0,2}((\.(0|00|25|5|50|75))?)$
http://regex101.com/r/iS7zM3
This handles pretty much all of it except for the 0-24 range check or the check for just a zero. I'll keep plugging away at it but any help would be appreciated. Thanks!

Change \d{0,2} to (?:1[0-9]?|2[0-4]?|[3-9])? and it'll match from 1 to 24 (or nothing).
You can also simplify the second part to (?:\.(?:00?|25|50?|75))? - you could go further to (?:\.(?:[05]0?|[27]5))? but that might obfuscate the intent a bit too far.
To exclude 24.25 you could perhaps use a negative lookahead (?!24\.[^0]) to prevent anything other than 24.0 or 24.00, but it's probably simpler to just exclude 24 from the main pattern and include a specific check for 24/24.0/24.00 at the start:
(?x)
# checks for 24
^24$|^24\.00?$
|
# integer part
^
(?:1[0-9]?|2[0-3]?|[3-9]|0(?=\.[^0])|(?=\.[^0]))
# decimal part
(?:\.(?:00?|25|50?|75))?
$
That also includes a check for 0(?=\.[^0]) which uses a positive lookahead to only allow an initial 0 if the next char is a . followed by a non-zero (so 0.0 and 0.00 isn't allowed).
The (?x) flag allows whitespace to be ignored, allowing readable regex in your code - obviously preferable to squashing it all onto a single line - and also enables the use of # to start line comments to explain parts of a pattern. (Literal whitespaces and hashes can be escaped with backslash, or encoded via e.g. \x23 for hash.)
For comparison, here's a pure-CFML way of doing it:
IsNumeric(Num)
AND Num GT 0
AND Num LTE 24
AND NOT find('.',Num*4)
Now, are you really sure it's better as a regex...

You could try this regex (broken down):
^
(?:
(?:[1-9]|1\d|2[0-3])(?:\.(?:[05]0?|[27]5))? # Non-zeros with optional decimal
|
0?(?:\.(?:50?|[27]5)) # Decimals under 1
|
24(?:\.00?)? # The maximum
)
$
In one line:
^(?:(?:[1-9]|1\d|2[0-3])(?:\.(?:[05]0?|[27]5))?|0?(?:\.(?:50?|[27]5))|24(?:\.00?)?)$
regex101 demo

^([0-1]?[0-9]|2[0-4])((\.(0|00|25|5|50|75))?)$
This means the one's place can be 0-9 if the tens place is missing, a 0, or 1.
If the tens place is a 2, then the ones place can be 0-4.
The second part is great, it's simple and readable too. It has an extra set of parens though that can be removed, reducing it to this:
^([0-1]?[0-9]|2[0-4])(\.(0|00|25|5|50|75))?$

Sequential renaming "x+2" with Reg Exp

I have a batch of files with a pattern
AB 001 CD.txt
AB 002 FG.txt
AB 003 ID.txt ...
where the first 2 chars are constant, and the last 2 are all different.
I'd like to keep the first and last 2 chars intact, and just change the digits in the middle to (xxx + 2).
AB 001 CD.txt -> AB 003 DC.txt
AB 002 FG.txt -> AB 004 FG.txt
I am new to regular expressions, so the best I can do so far is find the digits with [0-9] but I need the replacement pattern.
Note that I'm only trying to test this with an application called "RegExr" to find a replacement match. Not sure this is doable with regex.

To match filenames you can use following regex:
^AB (\d{3}) .*$
However you cannot replace number part just with regex. You have to use some funcion specified for language you're using.
From now you have to get first matched group, convert it to an integer, increase with 2, add missing zeros to the left of value and replace with this string mentioned first group.

This isn’t really a good job for regular expressions – you cannot compute with regex. Numeric manipulation is therefore severely limited. Regex is the wrong technology for this job, I’m afraid. What you can do, though, is use a callback (depending on the language you use), something akin to the following (pseudocode, depends on the language you’re using):
result = regex_replace_callback('^(\w\w) (\d+) (\w\w).txt)$',
increment_match,
your_string)
function increment_match(match):
return match[1] + " " + (int(match[2]) + 2) + " " + match[3]

Constructing finite state automata corresponding to regular expressions. Are my solutions correct?

I have drawn my answers in paint, are they correct?
(4c) For the alphabet {0, 1} construct ﬁnite state automata corresponding to each of the following regular expressions:
(i) 0
(ii) 1 | 0
(iii) 0 * (1 | 0)

The first two are correct, although the first one might be able to be written as (depending on your convention)
(0) -- 0 --> ((1))
The last one is also correct, but can be simplified to (whenever you have ε appearing, there is likely to be a way to compress the edges and states together to remove it)
+- 0 -+
| |
v |
(0) ---+
/ \
1 0
\ /
v
((1))
(Excuse my ascii diagrams. I'm using (..) for each state, and ((..)) for final states.)
Notice that the 0* is basically a loop from a state to itself, since after reading a 0 the remaining regex to match is the same (as long as we aren't at the end of a string).

Regexes for integer constants and for binary numbers

I have tried 2 questions, could you tell me whether I am right or not?
Regular expression of nonnegative integer constants in C, where numbers beginning with 0 are octal constants and other numbers are decimal constants.
I tried 0([1-7][0-7]*)?|[1-9][0-9]*, is it right? And what string could I match? Do you think 034567 will match and 000083 match?
What is a regular expression for binary numbers x such that hx + ix = jx?
I tried (0|1){32}|1|(10)).. do you think a string like 10 will match and 11 won’t match?
Please tell me whether I am right or not.

You can always use http://www.spaweditor.com/scripts/regex/ for a quick test on whether a particular regex works as you intend it to. This along with google can help you nail the regex you want.

0([1-7][0-7])?|[1-9][0-9] is wrong because there's no repetition - it will only match 1 or 2-character strings. What you need is something like 0[0-7]*|[1-9][0-9]*, though that doesn't take hexadecimal into account (as per spec).
This one is not clear. Could you rephrase that or give some more examples?

Your regex for integer constants will not match base-10 numbers longer than two digits and octal numbers longer than three digits (2 if you don't count the leading zero). Since this is a homework, I leave it up to you to figure out what's wrong with it.
Hint: Google for "regular expression repetition quantifiers".

Question 1:
Octal numbers:
A string that start with a [0] , then can be followed by any digit 1, 2, .. 7 [1-7](assuming no leading zeroes) but can also contain zeroes after the first actual digit, so [0-7]* (* is for repetition, zero or more times).
So we get the following RegEx for this part: 0 [1-7][0-7]*
Decimal numbers:
Decimal numbers must not have a leading zero, hence start with all digits from 1 to 9 [1-9], but zeroes are allowed in all other positions as well hence we need to concatenate [0-9]*
So we get the following RegEx for this part: [1-9][0-9]*
Since we have two options (octal and decimal numbers) and either one is possible we can use the Alternation property '|' :
L = 0[1-7][0-7]* | [1-9][0-9]*
Question 2:
Quickly looking at Fermat's Last Theorem:
In number theory, Fermat's Last Theorem (sometimes called Fermat's conjecture, especially in older texts) states that no three positive integers a, b, and c can satisfy the equation an + bn = cn for any integer value of n greater than two.
(http://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem)
Hence the following sets where n<=2 satisfy the equation: {0,1,2}base10 = {0,1,10}base2
If any of those elements satisfy the equation, we use the Alternation | (or)
So the regular expression can be: L = 0 | 1 | 10 but can also be L = 00 | 01 | 10 or even be L = 0 | 1 | 10 | 00 | 01
Or can be generalized into:
{0} we can have infinite number of zeroes: 0*
{1} we can have infinite number of zeroes followed by a 1: 0*1
{10} we can have infinite number of zeroes followed by 10: 0*10
So L = 0* | 0*1 | 0*10

max answered the first question.
the second appears to be the unsolvable diophantine equation of fermat's last theorem. if h,i,j are non-zero integers, x can only be 1 or 2, so you're looking for
^0*10?$
does that help?

There are several tool available to test regular expressions, such as The Regulator.
If you search for "regular expression test" you will find numerous links to online testers.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

converting CFG to regular expression - regex

Related

Regular expression not containing 101

Regular Expression (RegEx) For Hours with Increments

Sequential renaming "x+2" with Reg Exp

Constructing finite state automata corresponding to regular expressions. Are my solutions correct?

Regexes for integer constants and for binary numbers

Categories

Resources