Can you use regular expressions to implement the shunting yard algorithm?

Can you implement the shunting yard algorithm in terms of regular expressions?

I do not think so. Regular expressions can only match regular languages (see Regular language), while infix expressions form a context-free language (see Context-free language). For example, you cannot match expressions made of properly nested parentheses with a regular expression.
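To see the difficulty concretely, here is a minimal Python sketch (my own illustration, not from the question) of a balanced-parentheses check. It needs an unbounded counter, which is exactly the kind of memory a finite automaton, and hence a regular expression, does not have:

    # Checking balanced parentheses requires unbounded memory (a counter,
    # or equivalently a stack of open parens); a DFA has only finitely
    # many states, so no regular expression can do this for all depths.
    def balanced(s):
        depth = 0
        for ch in s:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth < 0:      # a ')' with no matching '('
                    return False
        return depth == 0          # every '(' was closed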

I believe this has been answered here: Can the shunting yard algorithm parse POSIX regular expressions?
I will say that the answer to your question is "no, you cannot implement the shunting yard algorithm using a regular expression." This is for the same reason you cannot parse arbitrary HTML using regular expressions, which boils down to this: regular expressions do not have a stack. Because the shunting yard algorithm relies on a stack (to push and pop operators as you convert from infix to RPN), regular expressions do not have the computational "power" to perform this task.
This glosses over many details, but a "regular expression" is one way to define a regular language. When you "use" a regular expression, you are asking the computer to say: "Look at a body of text and tell me whether or not any of those strings are in my language, the language I defined using a regular expression." I'll point to this most excellent answer, which you and everyone reading this should upvote, for more on regular languages.
So now you need some mathematical concept to augment "regular languages" in order to create more powerful languages. If you were to characterize the shunting yard algorithm as a realization of a model of computational power, then you might say that the algorithm would be described by a context-free grammar (hey, what do you know, that link uses an expression parse tree as an example), i.e. by a push-down automaton. Something with a stack.
If you are less than familiar with automata theory and complexity classes, then those Wikipedia articles are probably not that helpful without someone explaining them from the ground up.
The point being, you may be able to use regexes to help write the shunting yard algorithm. But regexes are not very good at operations that have arbitrary depth, which this problem has. So I would not spend too much time going down the regex avenue for this problem.
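To make the stack dependence concrete, here is a minimal, hypothetical Python sketch of the shunting yard algorithm (binary, left-associative operators only, no functions); the operator stack is the very thing a regular expression cannot express:

    # A minimal shunting-yard sketch: converts infix tokens to RPN.
    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

    def to_rpn(tokens):
        output, stack = [], []               # `stack` is what regexes lack
        for tok in tokens:
            if tok in PREC:                  # operator
                while stack and stack[-1] in PREC and PREC[stack[-1]] >= PREC[tok]:
                    output.append(stack.pop())
                stack.append(tok)
            elif tok == '(':
                stack.append(tok)
            elif tok == ')':
                while stack[-1] != '(':      # unwind to the matching paren
                    output.append(stack.pop())
                stack.pop()                  # discard the '('
            else:                            # operand
                output.append(tok)
        while stack:                         # flush remaining operators
            output.append(stack.pop())
        return output

For example, to_rpn(['1', '+', '2', '*', '(', '3', '-', '4', ')']) yields ['1', '2', '3', '4', '-', '*', '+']. The arbitrarily nested parentheses are handled precisely by pushing and popping, i.e. by the unbounded memory discussed above.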

Related

How are DFA and NFA related to regular expressions?

As the title suggests, how are DFA and NFA related to regular expressions? Would learning DFA and NFA be useful in gaining a better understanding of regular expressions?
Finite automata (FA), regular expressions (RE), and regular grammars are all finite representations of regular languages. The purpose of all of them is to describe a regular set/language (and the same is true for other classes of languages, e.g. context-free or context-sensitive ones).
Automata are comparatively more useful for theoretical purposes, such as analyzing language properties and complexity classes.
In the case of finite automata, the deterministic (DFA) and non-deterministic (NFA) models recognize the same class of languages, the "regular languages" (this is not true for every kind of automaton; for example, non-deterministic and deterministic push-down automata are not equivalent).
A regular expression (RE) is another way to represent a regular language, in alphabetical form, which is very helpful for describing the sets of valid strings used in programming languages (there, automata can't be used directly, whereas a regular expression is not much help for analyzing language properties, e.g. for applying the pumping lemma).
How are DFA and NFA related to regular expressions?
Both represent the same class of languages: the regular languages.
It is not possible to construct an automaton or a regular expression algorithmically from an English description of a language directly. However, if we have either representation (FA or RE), then we can systematically derive the other; e.g. we can write the regular expression for a DFA/NFA in a step-by-step, systematic manner using Arden's theorem.
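As a quick illustration (my own worked example, not from the original answer): Arden's theorem states that if L = A·L + B and ε is not in A, then L = A*B. Take the two-state DFA over {a, b} that accepts strings ending in a (start state q0, accepting state q1, with every a leading to q1 and every b leading to q0). The state equations and their solution are:

    q0 = e + q0 b + q1 b
    q1 = q0 a + q1 a

    q1 = q0 a a*                      (Arden on the second equation)
    q0 = e + q0 b + q0 a a* b
       = e + q0 (b + a a* b)
       = (b + a a* b)*  =  (a* b)*    (Arden on the first equation)
    q1 = (a* b)* a a*                 (exactly the strings ending in a)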
Let's take an example language: L = "strings with an even number of a's and an even number of b's".
A regular expression for L is:
(
(a + b(aa)*ab)(bb)*(ba(aa)*ab(bb)*)*a +
(b + a(bb)*ba)(aa)*(ab(bb)*ba(aa)*)*b
)*
It is very tough to write a regular expression for this language directly (it is even hard to understand this RE quickly).
But starting from the DFA and using Arden's theorem, it is simple to write the regular expression for the language L.
Importantly, drawing the DFA for this language is comparatively simple (and easy to remember).
One more example: the language over the symbols 0 and 1 where the decimal equivalent of the binary string is divisible by 5. Writing an RE for this is very hard compared to writing the DFA.
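For instance, here is that divisibility-by-5 DFA written out as a short Python sketch (my own illustration): the five states are just the value of the prefix read so far, modulo 5:

    # DFA for: binary strings whose value is divisible by 5.
    # State = (value of bits read so far) mod 5; accepting state is 0.
    def divisible_by_5(bits):
        state = 0
        for b in bits:
            state = (state * 2 + int(b)) % 5   # the DFA transition
        return state == 0

    divisible_by_5('1010')   # True  (binary 1010 = 10)
    divisible_by_5('1100')   # False (binary 1100 = 12)

(As written this accepts the empty string; a formal definition of L would pin that detail down.)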
We can also construct a DFA from a regular expression algorithmically.
Would learning DFA and NFA be useful in gaining a better understanding of regular expressions?
Yes, for the following reasons:
Sometimes it is hard to write an RE directly.
A regular expression written straight from an English description can be buggy, and the chance of a buggy DFA is lower than the chance of a buggy regex. That is why, when writing a compiler, the preferred steps are to draw the DFA for each token first and then write the equivalent regex: the DFA serves as a proof of correctness, since DFAs are more descriptive and make the language constructs easier to grasp (if the DFA is correct, the RE will be correct).
If an RE is complex and you need to work out what language it describes, you can draw the DFA from the RE and then read off the language description.
Sometimes, to find a better RE, you can draw the DFA, minimize it, and then write the RE from the minimized DFA; this may give you a better solution. (It is not a general technique, but it can be helpful.)
If it is hard to compare two regular expressions, you can compare their corresponding DFAs to check for equivalence.
Note: sometimes writing the regular expression is much simpler than drawing the DFA.
A non-deterministic finite automaton (NFA) is a machine that can recognize a regular language.
A regular expression is a string that describes a regular language.
It is possible to algorithmically build an NFA that will recognize the language described by a given regular expression. Running the NFA on an input string will tell you if the regular expression matches the input string or not.
So NFAs can be used to implement regular expression engines, but knowledge of them is not required to use regular expressions to their full potential.
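As a small illustration of that last point, here is a hypothetical NFA, written directly as a transition table in Python, for the regular expression (a|b)*ab (strings over {a, b} ending in "ab"), simulated by tracking the set of reachable states:

    # NFA for (a|b)*ab: state 0 loops on everything and "guesses" where
    # the final "ab" starts; state 2 accepts. Missing entries are dead ends.
    NFA = {
        (0, 'a'): {0, 1},
        (0, 'b'): {0},
        (1, 'b'): {2},
    }
    START, ACCEPT = {0}, {2}

    def nfa_accepts(s):
        states = set(START)
        for ch in s:
            states = set().union(*(NFA.get((q, ch), set()) for q in states))
        return bool(states & ACCEPT)

    nfa_accepts('aab')   # True
    nfa_accepts('aba')   # False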

Simplify Regular Expression in Mathematica

I recently found out about Kleene algebra for manipulating and simplifying regular expressions.
I'm wondering if this has been built into any computational software programs like Mathematica? It would be great to have a computational tool for doing unions and concatenations of large expressions and having the computer simplify them.
If you are not aware of any programs with this algebra built in, do you know of any programs that allow extending their engines with new algebras?
On http://www.maplesoft.com/msw/program/MSW04FinalProgram.pdf, it states:
One of the basic results of the theory of finite automata is the famous Kleene theorem, which states that a language is acceptable by a finite automaton if and only if it can be represented by a regular expression.
and
The main difficulty of the algorithmic treatment of regular expressions is, however, their simplification. Although several identities are known concerning regular expressions, e.g., the rules of Kleene algebra, there does not exist an effective algorithm for solving the simplification problem of regular expressions.
and
Under the circumstances, the only way left is to develop heuristic algorithms for simplifying regular expressions. For the aut package, this paper outlines the Maple procedures Rsimplify, Rabsorb and Rexpand.
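To give a flavor of what such heuristics look like, here is a tiny, hypothetical Python sketch (not the Maple aut package) that applies a few Kleene-algebra identities as rewrite rules over a tuple-encoded regex AST:

    # Toy regex AST: ('e',) empty string; ('sym', c) literal;
    # ('cat', r, s) concatenation; ('alt', r, s) union; ('star', r) star.
    EPS = ('e',)

    def simplify(r):
        if r[0] == 'cat':
            a, b = simplify(r[1]), simplify(r[2])
            if a == EPS: return b              # e.r -> r
            if b == EPS: return a              # r.e -> r
            return ('cat', a, b)
        if r[0] == 'alt':
            a, b = simplify(r[1]), simplify(r[2])
            if a == b: return a                # r + r -> r  (idempotence)
            return ('alt', a, b)
        if r[0] == 'star':
            a = simplify(r[1])
            if a[0] == 'star': return a        # (r*)* -> r*
            if a == EPS: return EPS            # e* -> e
            return ('star', a)
        return r                               # 'e' and 'sym' are atomic

    simplify(('star', ('star', ('alt', ('sym', 'a'), ('sym', 'a')))))
    # -> ('star', ('sym', 'a'))

As the excerpt says, no rule set like this is complete; it only shrinks expressions that happen to match its patterns.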
I'm wondering if open-source implementations of Kleene algebra algorithms exist.

Mutually exclusive regular expressions

If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?
That is, the list is valid if and only if for all strings a maximum of one item in the list will match the entire string.
It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.
The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.
If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible.
What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine each pair into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, then some string is accepted by both input regexes.
The problem with this is that the first step of the usual regex-to-DFA algorithm is to convert the regex to an NFA and then convert the NFA to a DFA. That last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for fairly simple regular expressions.
If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.
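Here is a hypothetical Python sketch of that product construction: given two DFAs, search the reachable product states for one that is accepting in both, which happens exactly when some string matches both regexes:

    from collections import deque

    # A DFA is (start, accepting_states, transitions), with transitions as
    # a dict: state -> {symbol: next_state}. Missing entries are dead ends.
    def intersection_nonempty(d1, d2):
        (s1, acc1, t1), (s2, acc2, t2) = d1, d2
        seen, queue = {(s1, s2)}, deque([(s1, s2)])
        while queue:
            p, q = queue.popleft()
            if p in acc1 and q in acc2:
                return True                    # some string matches both
            for c in set(t1.get(p, {})) & set(t2.get(q, {})):
                nxt = (t1[p][c], t2[q][c])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    # Toy DFAs for 0+ and a+ (no string can match both):
    zeros   = ('s', {'f'}, {'s': {'0': 'f'}, 'f': {'0': 'f'}})
    letters = ('s', {'f'}, {'s': {'a': 'f'}, 'f': {'a': 'f'}})
    intersection_nonempty(zeros, letters)      # False

For the tokenizer in the question, you would run this check on every pair of token DFAs and reject the list if any pair has a non-empty intersection.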
The Wikipedia article on regular expressions does state:
It is possible to write an algorithm that, for two given regular expressions, decides whether the described languages are essentially equal; it reduces each expression to a minimal deterministic finite state machine and determines whether the two are isomorphic (equivalent).
but gives no further hints.
Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.
You can't do that by only looking at the regular expression.
Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1" they both produce the same result, while when applied to the string "11" they produce different results.
The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.
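The overlap itself is easy to observe once you fix the strings. A quick Python demonstration of the example above, using full-string matching as in the question:

    import re

    # Both patterns match the entire string "1" ...
    re.fullmatch(r'[0-9]',  '1')    # matches
    re.fullmatch(r'[0-9]+', '1')    # matches
    # ... but only one of them matches the entire string "11".
    re.fullmatch(r'[0-9]',  '11')   # None
    re.fullmatch(r'[0-9]+', '11')   # matches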

Formal language expressiveness of Perl patterns

Classical regular expressions are equivalent to finite automata. Most current implementations of "regular expressions" are not strictly speaking regular expressions but are more powerful. Some people have started using the term "pattern" rather than "regular expression" to be more accurate.
What is the formal language classification of what can be described with a modern "regular expression" such as the patterns supported in Perl 5?
Update: By "Perl 5" I mean that pattern matching functionality implemented in Perl 5 and adopted by numerous other languages (C#, JavaScript, etc) and not anything specific to Perl. I don't want to consider, for example, tricks for embedding Perl code in a pattern.
Perl regexps, like those of any pattern language where "backreferences" are allowed, are not actually "regular".
A backreference is a mechanism for matching the same string that a sub-pattern matched earlier. For example, /^(a*)\1$/ matches only strings with an even number of a's, because after some a's the same number of them must follow.
It's easy to prove that, for instance, the pattern /^((a|b)*)\1$/ matches words from a non-regular language (*), so it is more expressive than any finite automaton. Regular expressions can't "remember" a string of arbitrary length and then match it again (the length may be very large, while a finite-state machine can only simulate a finite amount of "memory").
A formal proof would use the pumping lemma. (By the way, this language can't be described by a context-free grammar either.)
That is to say nothing of the tricks that allow the use of Perl code inside Perl regexps (the non-regular language of balanced parentheses, for one).
(*) "Regular languages" are sets of words that are matched by finite automata. I already wrote an answer about that.
There was a recent discussion on this topic at PerlMonks: Turing completeness and regular expressions.
I've always heard Perl's regex implementation described as an NFA with backtracking. Wikipedia has a little section on this; it is possibly slightly too fuzzy, but it's informative nonetheless.
From Wikipedia:
There are at least three different algorithms that decide if and how a given regular expression matches a string.
The oldest and fastest two rely on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step, possibly with caching. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(nm). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. As both can be seen as different ways of executing the same DFA, they are also often called the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.[12][13]
The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. More complex implementations will often identify and speed up or abort common cases where they would otherwise run slowly. Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation.
Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.
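The exponential blowup in that quote is easy to reproduce with Python's backtracking engine (an illustrative timing sketch; absolute times will vary by machine):

    import re, time

    # (a|aa)*b on a string of a's with no final b: the engine must try
    # every way of splitting the a's into "a" and "aa" before failing.
    for n in (18, 22, 26, 30):
        t0 = time.perf_counter()
        re.match(r'(a|aa)*b', 'a' * n)          # never matches
        print(n, round(time.perf_counter() - t0, 4))

The number of splits grows like the Fibonacci numbers (roughly 1.6^n), so each few extra characters multiply the cost.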

Proofs about regular expressions

Does anyone know any examples of the following?
Proof developments about regular expressions (possibly extended with backreferences) in proof assistants (such as Coq).
Programs in dependently-typed languages (such as Agda) about regular expressions.
Certified Programming with Dependent Types has a section on creating a verified regular expression matcher. Coq Contribs has an automata contribution that might be useful. Jan-Oliver Kaiser formalized the equivalence between regular expressions, finite automata and the Myhill-Nerode characterization in Coq for his bachelor's thesis.
Moreira, Pereira & de Sousa, On the Mechanisation of Kleene Algebra in Coq, gives a nice verified construction of the Antimirov derivative of regexps in Coq. It's pretty easy to read off an NFA from this construction, and to compute the intersection of regexps.
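For a feel of what derivative-based matching looks like outside a proof assistant, here is an unverified Python sketch of the closely related Brzozowski derivative: deriv(r, c) denotes the language { w : c·w is in L(r) }, and r matches a string iff the derivative by all of its characters is nullable (accepts the empty string):

    # Regex AST: ('0',) empty language; ('e',) empty string; ('sym', c);
    # ('alt', r, s); ('cat', r, s); ('star', r).
    EMPTY, EPS = ('0',), ('e',)

    def nullable(r):
        tag = r[0]
        if tag in ('e', 'star'): return True
        if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
        if tag == 'cat':  return nullable(r[1]) and nullable(r[2])
        return False                       # '0' and 'sym'

    def deriv(r, c):
        tag = r[0]
        if tag in ('0', 'e'):
            return EMPTY
        if tag == 'sym':
            return EPS if r[1] == c else EMPTY
        if tag == 'alt':
            return ('alt', deriv(r[1], c), deriv(r[2], c))
        if tag == 'star':
            return ('cat', deriv(r[1], c), r)
        # 'cat': the derivative may skip r1 when r1 accepts the empty string
        left = ('cat', deriv(r[1], c), r[2])
        return ('alt', left, deriv(r[2], c)) if nullable(r[1]) else left

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    # (a|b)*ab:
    AB = ('cat', ('star', ('alt', ('sym', 'a'), ('sym', 'b'))),
                 ('cat', ('sym', 'a'), ('sym', 'b')))
    matches(AB, 'aab')   # True
    matches(AB, 'aba')   # False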
I'm not sure why you separate Coq from dependently typed programming: Coq essentially is programming in a polymorphic, dependently typed lambda calculus with inductive types (i.e., CIC, the calculus of inductive constructions).
I've never heard of a formalisation of regexps in a dependently typed language, nor have I heard of something such as an Antimirov-like derivative for regexps with backtracking, but Becchi & Crowley, Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions, provide a notion of finite-state automaton that matches Perl-like regexp languages. That might be attractive to formalisers in the near future.
See Perl Regular Expression Matching is NP-Hard
Regex matching is NP-hard when regexes are allowed to have backreferences.
Reduction of 3-CNF-SAT to Perl Regular Expression Matching
[...] 3-CNF-SAT is NP-complete. If there were an efficient (polynomial-time) algorithm for computing whether or not a regex matched a certain string, we could use it to quickly compute solutions to the 3-CNF-SAT problem, and, by extension, to the knapsack problem, the travelling salesman problem, etc. etc.
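To make the reduction concrete, here is a small, hedged Python sketch in the spirit of that construction (the encoding details here are my own simplification, not the paper's verbatim one): each variable becomes a group (x?) that captures "x" (true) or "" (false), and each clause becomes an alternation that must consume exactly one "x":

    import re

    # Formula: (v1 or v2 or not v3) and (not v1 or v2 or v3).
    # Positive literal vi  -> \i   (consumes "x" only when vi is true)
    # Negative literal vi  -> x\i  (consumes "x" only when vi is false)
    subject = 'xxx' + ';x' + ',x'   # one x per variable; one x per clause
    pattern = r'^(x?)(x?)(x?)x*;(?:\1|\2|x\3),(?:x\1|\2|\3)$'
    bool(re.match(pattern, subject))                        # True: satisfiable

    # An unsatisfiable formula, (v1) and (not v1), gives no match:
    bool(re.match(r'^(x?)x*;(?:\1),(?:x\1)$', 'x;x,x'))     # False

The engine's backtracking over the (x?) groups is exactly a brute-force search over all truth assignments, which is why a polynomial-time matcher would solve 3-CNF-SAT.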
I don't know of any development that treats regular expressions by themselves.
Finite automata, however, which are relevant since NFAs are the standard way to match regular expressions, have been studied in NuPRL. Have a look at: Robert L. Constable, Paul B. Jackson, Pavel Naumov, Juan Uribe, Constructively Formalizing Automata Theory.
Should you be interested in approaching these formal languages through algebra, especially by developing finite semigroup theory, there are a number of algebra libraries developed in various theorem provers that you could think of using, one of which is particularly efficient in a finite setting.
The proof assistant Isabelle/HOL ships with a number of formalized proofs regarding regular expressions (without backreferences):
http://afp.sourceforge.net/browser_info/devel/HOL/Regular-Sets/
(here is a paper by the authors describing exactly what they did).
Another approach is to characterize regular expressions via the Myhill-Nerode theorem:
http://www.dcs.kcl.ac.uk/staff/urbanc/Publications/itp-11.pdf