REGULAR LANGUAGES AND REGULAR EXPRESSIONS (theory of automata) - regex

I am going through the book "Introduction to Languages and the Theory of Computation" by John C. Martin, Chapter 3, Section 3.1. The following exercise, question 3.7 (i), puzzled me: "The language of all strings containing both bb and aba as substrings."
Here is the expression I came up with; I do not know whether it is right or wrong:
"(a+b)*((bb(a+b)*aba)+(bb(a+b)*aba))(a+b)*".
I am also confused by the "+" and "|" symbols. I think they are the same; are they not?

In a programming regex, + and | are very different: a+ is the same as writing a(a*), i.e. it tells you to write the string one or more times, while | is an operator that gives you a choice, so (a|b) tells you to choose a or b. In your textbook's notation, however, the infix + denotes union, which is exactly what | denotes in programming regex syntax.
Your expression looks correct, except that every + you are using as union should be converted to | before a regex engine will read it the way you intend.
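Here is a quick sanity check of that expression with Python's re module (a sketch of my own, not from the book or the answer above):

import re

# The textbook expression with the union '+' rewritten as '|'.
# It should accept exactly the strings over {a, b} that contain
# both "bb" and "aba" as substrings.
pattern = re.compile(r'(a|b)*((bb(a|b)*aba)|(aba(a|b)*bb))(a|b)*')

for s in ('bbaba', 'ababb', 'aababba', 'bb', 'aba', 'babab'):
    expected = 'bb' in s and 'aba' in s
    assert bool(pattern.fullmatch(s)) == expected, s

# Contrast with the postfix '+' of programming regexes:
assert re.fullmatch(r'a+', 'aaa')      # one or more a's
assert not re.fullmatch(r'a+', '')     # zero a's is rejected
assert re.fullmatch(r'a|b', 'b')       # '|' chooses between a and b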

Related

Regex and context-free grammar conversion

Can the following CFG be converted to a regex?
Someone said that this could be its regex: (ab* a + b)*
Is this true, and why? I can't seem to understand it.
It's not a regular language.
Consider the subset of the language with exactly one b. (In other words, the intersection of the language with a*ba*.) If the language were regular, that subset would also be regular, since it would be the intersection of two regular languages.
But it's not regular, since it consists of strings in which the number of a's following the b is at least as large as the number of a's preceding the b, and that is not a regular language ("regular languages can't count").
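To make "regular languages can't count" concrete, here is a short pumping-lemma sketch (a standard argument, written out by me, not part of the original answer) for the subset $L' = \{\, a^m b a^n : n \ge m \,\}$ described above:

Suppose $L'$ were regular with pumping length $p$, and take $w = a^p b a^p \in L'$.
Any split $w = xyz$ with $|xy| \le p$ and $|y| \ge 1$ forces $y = a^k$ for some $k \ge 1$,
taken from the leading block of $a$'s. Then $xy^2z = a^{p+k} b a^p$ has more $a$'s before
the $b$ than after it, so $xy^2z \notin L'$, contradicting the pumping lemma. Hence $L'$
is not regular, and by the closure argument above neither is the original language.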

How Can I demonstrate this grammar is not ambiguous?

I know I need to show that there is no string that can be derived by two different leftmost derivations, i.e. no string with two different parse trees. But how can I do it? I know there is no simple general method, but since this exercise is in the compilers Dragon book, I am pretty sure there is a way of showing it (it need not be a formal proof, just a justification of why).
The grammar is:
S-> SS* | SS+ | a
What this grammar represents is another way of writing simple arithmetic (I do not remember the name of this technique; if anyone knows, please tell me): normal sum arithmetic has the form a+a, and this just represents another way of writing sums and products. So aa+ also means a+a, aaa*+ is a*a+a, and so on.
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer).
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a terminal, you know precisely which production to reduce by; only one can apply, and there is no competing shift or reduce action to conflict with it.
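To make that concrete, here is a small Python sketch (mine, not the book's LR construction) of the forced shift-reduce behaviour; each terminal dictates the single possible action:

def parse_postfix(tokens):
    """Deterministic recognizer for S -> S S '*' | S S '+' | 'a'."""
    stack = []                     # holds only the nonterminal S
    for tok in tokens:
        if tok == 'a':
            stack.append('S')      # the only option: reduce S -> a
        elif tok in ('*', '+'):
            if len(stack) < 2:     # not enough operands to reduce
                return False
            stack.pop()            # the only option: reduce S -> S S * (or S S +)
            stack.pop()
            stack.append('S')
        else:
            return False           # symbol not in the grammar
    return stack == ['S']

# 'aa+' and 'aaa*+' are sentences of the grammar; 'aa' and 'a+' are not.
assert parse_postfix('aa+')
assert parse_postfix('aaa*+')
assert not parse_postfix('aa')
assert not parse_postfix('a+')

Since no step ever has more than one applicable action, every sentence gets exactly one parse.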
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to prove ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
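To spell that out, here are two distinct leftmost derivations of that sentence, one for each grouping (a + a) + a and a + (a + a):
S => S + S => S + S + S => a + S + S => a + a + S => a + a + a
S => S + S => a + S => a + S + S => a + a + S => a + a + a
Two leftmost derivations (equivalently, two parse trees) for one sentence is exactly what ambiguity means.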
I think the example string aa actually shows what you need. Can it not be parsed as:
S => SS* => aa OR S => SS+ => aa

Math: Giving a regular expression for a language

I am learning regular expressions and regular languages, and I was working through some questions about giving a regular expression to represent a specified language. The question I was a little stuck on is this:
Come up with a regular expression that expresses the following
language. The alphabet of the language is {a,b}.
The language of all strings with two consecutive a's, but no three
consecutive a's (i.e., "aa", "aabaa", "babaa" are in the language,
while "abab" and "aaaab" are not).
My answer for this so far is:
(b*(e+a+aa)bb*)* (aa) (bb*(e+a+aa)b*)*
where 'e' is the empty string and '+' functions essentially as an 'or'.
I guess what I am wondering is whether my answer is correct (I believe it is), and whether it can be simplified at all.
Thanks guys.
I believe that your regular expression is correct. It ensures that an aa exists in the string, and makes sure that aaa cannot exist. As for being simplest (simplest being subjective here), I would say the following is simpler:
(b + ab + aab)* aa (b + ba + baa)*
Note that you could actually derive the above from the regular expression that you have. Taking just the part before the aa in your regular expression, we have:
(b*(e+a+aa)bb*)*
= (b*bb* + b*abb* + b*aabb*)*
= (b + ab + aab)*
That last step is a bit of a jump, but it comes from noticing that all those b*'s are redundant, because the whole expression is starred and there is already a b inside the brackets.
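If you want a machine check as well, here is a brute-force comparison over short strings (a sketch of my own, translating the union '+' into '|' and e into an empty option for Python's re module):

import re
from itertools import product

# Your expression:     (b*(e+a+aa)bb*)* aa (bb*(e+a+aa)b*)*
# Simplified version:  (b + ab + aab)* aa (b + ba + baa)*
original = re.compile(r'(b*(?:a|aa)?bb*)*aa(bb*(?:a|aa)?b*)*')
simplified = re.compile(r'(b|ab|aab)*aa(b|ba|baa)*')

def in_language(s):
    # Reference definition: contains "aa" but never "aaa".
    return 'aa' in s and 'aaa' not in s

# Compare against every string over {a, b} up to length 10.
for n in range(11):
    for chars in product('ab', repeat=n):
        s = ''.join(chars)
        expected = in_language(s)
        assert bool(original.fullmatch(s)) == expected, s
        assert bool(simplified.fullmatch(s)) == expected, s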
I think this regex matches your language as well:
^((ab|b)*aa(ba|b)*)*$

Regular Expression: Mathematically vs. Programmatically

Consider the following regular expressions:
7+
(7)+
Does anyone who is familiar with regular expression theory in mathematics agree that the two regular expressions are semantically the same?
Programmatically (as in, evaluated by the regular expression engine of a language), they differ only in the resulting capturing groups.
Other than that, they are the same. It is like writing ((7) + (1)) as opposed to 7 + 1: they evaluate to the same thing. (Yes, mathematically speaking, regular expressions don't evaluate to anything.)
Yes, those two regular expressions are the same because they both recognize the same language. The fact that they are not written identically is just a notational issue.
Do they describe the same language? Yes. Do they mean the same thing to someone trying to interpret the language? No. The second one tells me that I should be more interested in the 7s.
The second reduces to the first. Do you agree that
ab+
and
a(b)+
describe the same language, while
(ab)+
describes a different one?
The only difference is that the parentheses assign the enclosed pattern to a capturing group, so you can reference that little piece after the match has been evaluated.
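A quick Python illustration of that difference (a sketch of my own, using Python's re module):

import re

# Both patterns accept exactly the strings 7, 77, 777, ...
for s in ('7', '777'):
    assert re.fullmatch(r'7+', s)
    assert re.fullmatch(r'(7)+', s)

# The parentheses only add a capturing group; for a repeated group,
# Python keeps the last repetition it captured.
m = re.fullmatch(r'(7)+', '777')
print(m.group(0))   # '777' -- the whole match
print(m.group(1))   # '7'   -- the captured group

# Grouping changes the language only when it changes what an operator applies to:
assert re.fullmatch(r'(ab)+', 'abab')      # (ab)+ repeats "ab"
assert not re.fullmatch(r'ab+', 'abab')    # ab+ is "a" followed by one or more "b"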

Why can't Regular Expressions use keywords instead of characters?

Okay, I barely understand RegEx basics, but why couldn't they design it to use keywords (like SQL) instead of some cryptic wildcard characters and symbols?
Is it for performance since the RegEx is interpreted/parsed at runtime? (not compiled)
Or maybe for speed of writing, considering that once you learn some "simple" character combinations it becomes easier to type one character instead of a keyword?
You really want this?
Pattern findGamesPattern = Pattern.With.Literal(@"<div")
.WhiteSpace.Repeat.ZeroOrMore
.Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
.NamedGroup("gameId", Pattern.With.Digit.Repeat.OneOrMore)
.Literal(@"-game""")
.NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
.Literal(@"<!--gameStatus")
.WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
.NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
.Literal("-->");
Ok, but it's your funeral, man.
Download the library that does this here:
http://flimflan.com/blog/ReadableRegularExpressions.aspx
Regular expressions have a mathematical (actually, language theory) background and are coded somewhat like a mathematical formula. You can define them by a set of rules, for example
every character is a regular expression, representing itself
if a and b are regular expressions, then a?, a|b and ab are regular expressions, too
...
Using a keyword-based language would be a great burden for simple regular expressions. Most of the time, you will just use a simple text string as search pattern:
grep -R 'main' *.c
Or maybe very simple patterns:
grep -c ':-[)(]' seidl.txt
Once you get used to regular expressions, this syntax is very clear and precise. In more complicated situations you will probably use something else since a large regular expression is obviously hard to read.
Perl 6 is taking a pretty revolutionary step forward in regex readability. Consider an address of the form:
100 E Main St Springfield MA 01234
Here's a moderately-readable Perl 5 compatible regex to parse that (many corner cases not handled):
m/
([1-9]\d*)\s+
((?:N|S|E|W)\s+)?
(\w+(?:\s+\w+)*)\s+
(ave|ln|st|rd)\s+
([[:alpha:]]+(?:\s+[[:alpha:]]+)*)\s+
([A-Z]{2})\s+
(\d{5}(?:-\d{4})?)
/ix;
This Perl 6 regex has the same behavior:
grammar USMailAddress {
rule TOP { <addr> <city> <state> <zip> }
rule addr { <[1..9]>\d* <direction>?
<streetname> <streettype> }
token direction { N | S | E | W }
token streetname { \w+ [ \s+ \w+ ]* }
token streettype {:i ave | ln | rd | st }
token city { <alpha> [ \s+ <alpha> ]* }
token state { <[A..Z]>**{2} }
token zip { \d**{5} [ - \d**{4} ]? }
}
A Perl 6 grammar is a class, and the tokens are all invokable methods. Use it like this:
if $addr ~~ m/^<USMailAddress::TOP>$/ {
say "$<city>, $<state>";
}
This example comes from a talk I presented at the Frozen Perl 2009 workshop. The Rakudo implementation of Perl 6 is complete enough that this example works today.
Well, if you had keywords, how would you easily differentiate them from actually matched text? How would you handle whitespace?
Source text
Company: A Dept.: B
Standard regex:
Company:\s+(.+)\s+Dept.:\s+(.+)
Or even:
Company: (.+) Dept. (.+)
Keyword regex (trying really hard not to build a strawman...)
"Company:" whitespace.oneplus group(any.oneplus) whitespace.oneplus "Dept.:" whitespace.oneplus group(any.oneplus)
Or simplified:
"Company:" space group(any.oneplus) space "Dept.:" space group(any.oneplus)
No, it's probably not better.
Because it corresponds to formal language theory and its mathematical notation.
It's Perl's fault...!
Actually, more specifically, Regular Expressions come from early Unix development, and concise syntax was a lot more highly valued then. Storage, processing time, physical terminals, etc were all very limited, rather unlike today.
The history of Regular Expressions on Wikipedia explains more.
There are alternatives to Regex, but I'm not sure any have really caught on.
EDIT: Corrected by John Saunders: Regular Expressions were popularised by Unix, but first implemented in the QED editor. The same design constraints applied, even more so, to earlier systems.
Actually, no, the world did not begin with Unix. If you read the Wikipedia article, you'll see that
In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions
This is much earlier than Perl. The Wikipedia entry on regular expressions attributes the first implementations to Ken Thompson of Unix fame, who built them into the QED editor and then the ed editor. I guess the commands had short names for the performance reasons of the day, long before anything ran client-side. Mastering Regular Expressions is a great book on the subject; among other things it covers the /x flag, which lets you annotate a regular expression to make it easier to read and understand.
Because the idea of regular expressions--like many things that originate from UNIX--is that they are terse, favouring brevity over readability. This is actually a good thing. I've ended up writing regular expressions (against my better judgement) that are 15 lines long. If that had a verbose syntax it wouldn't be a regex, it'd be a program.
It's actually pretty easy to implement a "wordier" form of regex -- please see my answer here. In a nutshell: write a handful of functions that return regex strings (and take parameters if necessary).
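For example, something along these lines (an illustrative Python sketch; the helper names are invented rather than taken from any particular library):

import re

# Each helper just returns an ordinary regex string.
def literal(text):
    return re.escape(text)

def group(pattern):
    return '(' + pattern + ')'

def one_or_more(pattern):
    return '(?:' + pattern + ')+'

def whitespace():
    return r'\s+'

pattern = (literal('Company:') + whitespace() + group(one_or_more(r'\S'))
           + whitespace() + literal('Dept.:') + whitespace() + group(one_or_more(r'\S')))

m = re.search(pattern, 'Company: A Dept.: B')
print(m.group(1), m.group(2))   # A B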
I don't think keywords would give any benefit. Regular expressions as such are complex but also very powerful.
What I think is more confusing is that every supporting library invents its own syntax instead of using (or extending) the classic Perl regex (e.g. \1, $1, {1}, ... for replacements and many more examples).
I know it's answering your question the wrong way around, but RegexBuddy has a feature that explains your regular expression in plain English. This might make it a bit easier to learn.
If the language you are using supports Posix regexes, you can use them.
An example:
\d
would be the same as
[[:digit:]]
The bracket notation is much clearer about what it is matching. I would still learn the "cryptic wildcard characters and symbols", since you will still see them in other people's code and need to understand them.
There are more examples in the table on regular-expressions.info's page.
For some reason, my previous answer got deleted. Anyway, I think Ruby Regexp Machine would fit the bill, at http://www.rubyregexp.sf.net. It is my own project, but I think it should work.