Regular Expression formal documentation - regex

Is there any formal documentation of how to implement your own regular expression library? What formal documentation, if any, did the makers of the existing regular expression libraries base their code on?

I have written (and abandoned) a JavaScript parser, including regular expression support. I based it on the ECMAScript definition, which, according to itself, uses Perl 5 regular expressions. This is ECMA-262; I used the 3rd edition, from December 1999. (There is a newer one by now; I don't know if its definition of regular expressions is as complete.)

Any good textbook on automata theory and/or compiler construction, e.g. Hopcroft and Ullman, covers regular expressions and their relation to finite-state automata, to which they can be compiled. So do several textbooks on natural language processing, where finite-state methods are commonly used, e.g. Jurafsky and Martin.
(There was even a course by Ullman himself on Coursera, but a new session is yet to be announced.)
As for the question of what documentation current RE libraries are based on: textbooks like the ones I cited, and existing implementations. The first RE implementation that I'm aware of is the one in Ken Thompson's version of QED, ca. 1967. Unfortunately, the tech report on the QED website cites very few references, and none related to RE/FA theory. I'm sure the ideas ultimately trace back to Kleene's theory of regular languages, which was developed in the 1950s.

Regular expressions are called regular because that's a property of the state machine they're a representation of. Simply put, a possible implementation might use state machines which are just tables. The regex parser would create a number of states and transitions for a regex, executing it goes through the states according to the transitions.
e.g. /ab+/ generates something like:
state \ next char |   a    |   b    |   $   |  *
------------------+--------+--------+-------+------
[initial state]   | goto 1 | fail   | fail  | fail
1                 | fail   | goto 2 | fail  | fail
2                 | fail   | goto 2 | match | fail
(where $ is the end of the string, * is any other character)
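A minimal sketch of that table-driven idea in Python (the state numbers follow the table above; the function name `match_ab_plus` is made up for illustration):

```python
# Table-driven matcher for /ab+/ (anchored full match).
# States: 0 = initial, 1 = seen 'a', 2 = seen 'ab...'; None = fail.
TRANSITIONS = {
    0: {'a': 1},
    1: {'b': 2},
    2: {'b': 2},
}
ACCEPTING = {2}  # states where end-of-string means a match

def match_ab_plus(s):
    state = 0
    for ch in s:
        state = TRANSITIONS[state].get(ch)  # any other character -> fail
        if state is None:
            return False
    return state in ACCEPTING

print(match_ab_plus("abbb"))  # True
print(match_ab_plus("a"))     # False
print(match_ab_plus("abc"))   # False
```

Executing the regex is then nothing more than a table lookup per input character.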

While searching for information on regular expressions, I found an interesting and, as far as I can see, related question: Why can regular expressions have an exponential running time?
The accepted answer, based on the linked articles, suggests that the backtracking RegExp implementations (also used in Perl) are a "bit" slower, and that there is a faster/simpler algorithm, used by many good old Unix tools like grep.
This link leads directly to the mentioned article: Regular Expression Matching Can Be Simple And Fast (part 2, part 3)
If it is still relevant, you should take it into consideration and use that algorithm rather than Perl's.
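To see the exponential behaviour the article describes, you can time a pathological pattern against a backtracking engine such as Python's `re` (a sketch; the exact timings depend on your machine):

```python
import re
import time

# On a backtracking engine, (a|a)*b tries roughly 2^n ways of
# matching n 'a' characters before concluding there is no 'b'.
for n in (10, 14, 18):
    text = 'a' * n
    start = time.perf_counter()
    result = re.match(r'(a|a)*b', text)  # always fails to match
    elapsed = time.perf_counter() - start
    print(n, result, round(elapsed, 4))
```

The running time roughly doubles with each additional 'a', while a Thompson-style NFA/DFA engine like grep's stays linear.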

Related

Special construct of expression use

Given the if-then-else conditional construct in regular expressions, I would like to know what happens when you try to combine many such constructs into a single expression for multiple matches.
Let's take this example below.
foo(bar)?(?(1)baz|quz)
Now we combine it with an expression that matches the previous conditions, and then add the following conditions to it:
foo(bar)?(?(1)baz|quz)|(?(?=.*baz)bar|foo)
Mainly I am asking should you construct a regular expression in this way, and what would ever be the purpose that you would need to use it in this way?
should you construct a regular expression in this way, and what would
ever be the purpose that you would need to use it in this way?
In this case, and probably most cases, I would say "no".
I often find that conditionals can be rewritten as lookarounds or simplified within alternations.
For instance, it seems to me that the regex you supplied,
foo(bar)?(?(1)baz|quz)|(?(?=.*baz)bar|foo)
could be replaced for greater clarity by
bar(?=.*baz)|foo(?:quz|barbaz)?
which gives us two simple match paths.
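For reference, Python's `re` module accepts the group-number form of the conditional (though not the lookahead-condition form, which needs PCRE or the third-party `regex` module), so the first alternative can be tested like this:

```python
import re

# foo(bar)?(?(1)baz|quz): if group 1 ("bar") participated in the
# match, require "baz" next; otherwise require "quz".
pat = re.compile(r'foo(bar)?(?(1)baz|quz)')

for s in ('foobarbaz', 'fooquz', 'foobarquz', 'foobaz'):
    print(s, bool(pat.fullmatch(s)))
# foobarbaz and fooquz match; foobarquz and foobaz do not
```

Trying a handful of inputs like this is a quick way to check whether a conditional and its proposed lookaround rewrite really accept the same strings.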
But it's been six months since you posted the question. In that time, while answering a great many questions on SO, have you ever felt the need for that kind of construction?
I believe the answer to this would ultimately be specific to the regex library being used or a language's implementation. ("Comparison of regular expression engines", Wikipedia.)
There isn't an official RFC or specification for regular expressions. And the diversity of implementations leads to frustration when writing even "simple" expressions--the nuances you're considering are probably implementation-specific rather than regex-specific.
And even beyond your specific question, I think most of the regex-related questions on StackOverflow would be improved if people were more specific about the language being used or the library employed by the application they're using. When troubleshooting my own regular expressions in various applications (text editors, for example), it took me a while to realize I needed to understand the specific library being used.

What is the Perl regular expression dialect/implementation called?

The engine for parsing strings which is called "regular expressions" in Perl is very different from what is known by the term "regular expressions" in books.
So, my question is: is there some document describing Perl's regexp implementation, how it works, and in what ways it really differs from the classic one (by classic I mean regular expressions that can actually be transformed into an ordinary DFA/NFA)?
Thank you.
Perl regular expressions are of course called Perl regular expressions, or regexes for short. They may also be called patterns or rules. But what they are, or at least can be, is recursive descent parsers. They’re implemented using a recursive backtracker, although you can swap in a DFA engine if you prefer to offload DFA‐solvable tasks to it.
Here are some relevant citations about these matters, with all emboldening — and some of the text :) — mine:
You specify a pattern by creating a regular expression (or regex),
and Perl’s regular expression engine (the “Engine”, for the rest of this
chapter) then takes that expression and determines whether (and how) the
pattern matches your data. While most of your data will probably be
text strings, there’s nothing stopping you from using regexes to search
and replace any byte sequence, even what you’d normally think of as
“binary” data. To Perl, bytes are just characters that happen to have
an ordinal value less than 256.
If you’re acquainted with regular expressions from some other venue, we
should warn you that regular expressions are a bit different in Perl.
First, they aren’t entirely “regular” in the theoretical sense of the
word, which means they can do much more than the traditional regular
expressions taught in computer science classes. Second, they are used
so often in Perl that they have their own special variables, operators,
and quoting conventions which are tightly integrated into the language,
not just loosely bolted on like any other library.
      — Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant
This is the Apocalypse on Pattern Matching, generally having to do with
what we call “regular expressions”, which are only marginally related to
real regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I’m not going to try to
fight linguistic necessity here. I will, however, generally call them
“regexes” (or “regexen”, when I’m in an Anglo‐Saxon mood).
      — Perl6 Apocalypse 5: Pattern Matching, by Larry Wall
There’s a lot of new syntax there, so let’s step through it slowly, starting with:
$file = rx/ ^ <$hunk>* $ /;
This statement creates a pattern object. Or, as it’s known in Perl 6, a
“rule”. People will probably still call them “regular expressions” or
“regexes” too (and the keyword rx reflects that), but Perl patterns long
ago ceased being anything like “regular”, so we’ll try and avoid those
terms.
[Update: We’ve resurrected the term “regex” to refer to these patterns in
general. When we say “rule” now, we’re specifically referring to the kind
of regex that you would use in a grammar. See S05.]
      — Perl6 Exegesis 5: Pattern Matching, by Damian Conway
This document summarizes Apocalypse 5, which is about the new regex syntax.
We now try to call them regex rather than “regular expressions” because
they haven’t been regular expressions for a long time, and we think the
popular term “regex” is in the process of becoming a technical term with a
precise meaning of: “something you do pattern matching with, kinda like a regular
expression”. On the other hand, one of the purposes of the redesign
is to make portions of our patterns more amenable to analysis under
traditional regular expression and parser semantics, and that involves
making careful distinctions between which parts of our patterns and
grammars are to be treated as declarative, and which parts as procedural.
In any case, when referring to recursive patterns within a grammar, the
terms rule and token are generally preferred over regex.
      — Perl6 Synopsis 5: Regexes and Rules,
by Damian Conway, Allison Randal, Patrick Michaud, Larry Wall, and Moritz Lenz
The O'Reilly book 'Mastering Regular Expressions' has a very good explanation of Perl's engine and others. For me, this is the reference book on the topic.
There is no formal mathematical name for the language accepted by PCREs.
The term "regular expressions with backtracking" or "regular expressions with backreferences" is about as close as you will get. Anybody familiar with the difference will know what you mean.
(There are only two common types of regexp implementations: DFA-based and backtracking-based. The former generally accept the "regular languages" in the traditional computer science sense. The latter generally accept... more, and exactly how much depends on the specific implementation, but backreferences are always one of the non-DFA features.)
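A backreference is the quickest way to see the difference, since the language it accepts (strings of the form ww) is provably not regular; for example, in Python:

```python
import re

# (\w+)\1 matches any "word" immediately repeated with itself --
# a classic example of a language no DFA can recognize.
doubled = re.compile(r'^(\w+)\1$')

print(bool(doubled.match('abcabc')))  # True: 'abc' followed by 'abc'
print(bool(doubled.match('abcabd')))  # False: second half differs
```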
I asked the same question on the theoretical CS Stack Exchange (Regular expressions aren't), and the answer that got the most upvotes was “regex.”
The dialect is called PCRE (Perl-compatible Regular Expressions).
It's documented in the Perl manual.
Or in "Programming Perl" by Wall, Orwant and Christiansen

How are regular expressions processed?

How are regular expressions processed?
A regular expression describes a ruleset for a state machine. It generally moves over a string one character at a time, making decisions based on what happened on previous characters and what is described in the regex.
Any regex can also be written as a loop over a string, one character at a time. Some of these could be fairly simple, but the power of regex shows when what appears to be a simple regex, with a few lookbehinds and subgroups, would take a thousand lines of code to reproduce in your own state machine.
Regular expressions can be modeled as a Deterministic Finite State Machine. That would probably be a good place to start if you wanted to "process" one.
This question is very broad. This is not a complete answer but Jeff Moser has an excellent write-up on his blog that walks through .NET's regex process: How .NET Regular Expressions Really Work
I suspect other answers will shed light on other areas of regular expressions unless your question is updated to be more specific.
This will depend on which regex implementation you're referring to.
There are two common but different techniques used in regex engines:
Nondeterministic Finite State Machine
Deterministic Finite State Machine
This MSDN article explains several techniques implemented in various engines and then goes on to explain .NET's implementation and why Microsoft chose what they chose for .NET.
They go even more in-depth in the various articles you see listed here.
http://www.moserware.com/2009/03/how-net-regular-expressions-really-work.html
Despite what everyone here says about state machines, you can write a remarkably simple regex recogniser using recursive techniques with very little state. There are examples of these in two of Brian Kernighan's books, Software Tools in Pascal and The Practice of Programming.
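In that spirit, here is a Python transliteration of the tiny recursive matcher from The Practice of Programming, supporting only the metacharacters `.`, `*`, `^`, and `$` (a sketch, not the book's exact code):

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp.startswith('^'):
        return match_here(regexp[1:], text)
    # Try every starting position (including the empty suffix).
    for i in range(len(text) + 1):
        if match_here(regexp, text[i:]):
            return True
    return False

def match_here(regexp, text):
    """Search for regexp at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == '*':
        return match_star(regexp[0], regexp[2:], text)
    if regexp == '$':
        return not text
    if text and (regexp[0] == '.' or regexp[0] == text[0]):
        return match_here(regexp[1:], text[1:])
    return False

def match_star(c, regexp, text):
    """Search for c* followed by regexp at the beginning of text."""
    while True:
        if match_here(regexp, text):
            return True
        if not text or (text[0] != c and c != '.'):
            return False
        text = text[1:]

print(match('ab*c', 'xabbbcy'))  # True
print(match('^a.c$', 'abc'))     # True
print(match('^a.c$', 'abcd'))    # False
```

The only "state" is the implicit call stack; the structure of the regex drives the recursion directly.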

library for converting regular expressions to NFAs?

Is there a good library for converting Regular Expressions into NFAs? I see lots of academic papers on the subject, which are helpful, but not much in the way of working code.
My question is due partially to curiosity, and partially to an actual need to speed up regular expression matching on a production system I'm working on. Although it might be fun to explore this subject for learning's sake, I'm not sure it's a "practical" solution to speeding up our pattern matching. We're a Java shop, but would happily take pointers to good code in any language.
Edit:
Interesting, I did not know that Java's regexes were already NFAs. The title of this paper led me to believe otherwise. Incidentally, we are currently doing our regexp matching in Postgres; if the simple solution is to move the matching into the Java code, that would be great.
Addressing your need to speed up your regexes:
Java's implementation of its regex engine is NFA based. As such, to tune your regexes, I would say that you would benefit from a deeper understanding of how the engine is implemented.
And as such I direct you to: Mastering Regular Expressions The book gives substantial treatment to the NFA engine and how it performs matches, including how to tune your regex specific to the NFA engine.
Additionally, look into Atomic Grouping for tuning your regex.
Disclaimer: I'm not an expert on java+regexes. But, if I understand correctly...
If Java's regular expression matcher is similar to most others, it does use NFAs - but not the way you might expect. Instead of the forward-only implementation you may have heard about, it uses a backtracking solution, which simplifies subexpression matching and is probably required for backreference support. However, it performs alternation poorly.
You want to see: http://swtch.com/~rsc/regexp/regexp1.html (concerning edge cases which perform poorly on this altered architecture).
I've also written a question which I suppose comes down to the same thing:
Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
But basically, it looks like for some very odd reason all common major-vendor regex implementations have terrible performance when used on certain regexes, even though this is unnecessary.
Disclaimer: I'm a googler, not an expert on regexes.
There is a bunch of faster-than-JDK regex libraries, one of which is dk.brics.automaton. According to the benchmark linked in the article, it is approximately 20x faster than the JDK implementation.
This library was written by Anders Møller and has also been mavenized.

Is it possible for a computer to "learn" a regular expression by user-provided examples?

Is it possible for a computer to "learn" a regular expression by user-provided examples?
To clarify:
I do not want to learn regular expressions.
I want to create a program which "learns" a regular expression from examples which are interactively provided by a user, perhaps by selecting parts from a text or selecting begin or end markers.
Is it possible? Are there algorithms, keywords, etc. which I can Google for?
EDIT: Thank you for the answers, but I'm not interested in tools which provide this feature. I'm looking for theoretical information, like papers, tutorials, source code, names of algorithms, so I can create something for myself.
Yes,
it is possible,
we can generate regexes from examples (text -> desired extractions).
This is a working online tool which does the job: http://regex.inginf.units.it/
Regex Generator++ online tool generates a regex from provided examples using a GP search algorithm.
The GP algorithm is driven by a multiobjective fitness which leads to higher performance and simpler solution structure (Occam's Razor).
This tool is a demonstrative application by the Machine Learning Lab, University of Trieste (Università degli studi di Trieste).
Please look at the video tutorial here.
This is a research project so you can read about used algorithms here.
Behold! :-)
Finding a meaningful regex/solution from examples is possible if and only if the provided examples describe the problem well.
Consider these examples that describe an extraction task, we are looking for particular item codes; the examples are text/extraction pairs:
"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken" -> "789-345B"
A human, looking at the examples, might say: "the item codes are things like \d++-345[AB]"
When the item code format is more permissive than that, but we have not provided other examples, we have no evidence with which to understand the problem well.
When applying the human generated solution \d++-345[AB] to the following text, it fails:
"On the back of the item there is a code: 966-347Z"
You have to provide other examples in order to better describe what is a match and what is not a desired match:
e.g.:
"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"
The phone number is not a product id; this may be an important piece of evidence.
The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. As every regular language is equivalent to a finite automaton, it is possible to learn some regular expressions by a program. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov Models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually stronger than regular languages, and therefore sometimes harder to learn.
No computer program will ever be able to generate a meaningful regular expression based solely on a list of valid matches. Let me show you why.
Suppose you provide the examples 111111 and 999999, should the computer generate:
A regex matching exactly those two examples: (111111|999999)
A regex matching 6 identical digits (\d)\1{5}
A regex matching 6 ones and nines [19]{6}
A regex matching any 6 digits \d{6}
Any of the above three, with word boundaries, e.g. \b\d{6}\b
Any of the first three, not preceded or followed by a digit, e.g.
(?<!\d)\d{6}(?!\d)
As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches.
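The ambiguity is easy to demonstrate: all of the candidate regexes above accept both examples, yet they disagree about new inputs (Python):

```python
import re

examples = ['111111', '999999']
candidates = [
    r'(111111|999999)',  # exactly the two examples
    r'(\d)\1{5}',        # six identical digits
    r'[19]{6}',          # six ones and nines
    r'\d{6}',            # any six digits
]

# Every candidate matches every example...
for pat in candidates:
    assert all(re.fullmatch(pat, ex) for ex in examples)

# ...but they disagree about an unseen input.
print([bool(re.fullmatch(p, '123456')) for p in candidates])
# [False, False, False, True]
```

Nothing in the two examples alone tells the computer which of these generalizations the user intended.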
If you don't want to list all possible matches, you need a higher-level description. That's exactly what regular expressions are designed to provide. Instead of providing a long list of 6-digit numbers, you simply tell the program to match "any six digits". In regular expression syntax, this becomes \d{6}.
Any method of providing a higher-level description that is as flexible as regular expressions will also be as complex as regular expressions. All tools like RegexBuddy can do is to make it easier to create and test the high-level description. Instead of using the terse regular expression syntax directly, RegexBuddy enables you to use plain English building blocks. But it can't create the high-level description for you, since it can't magically know when it should generalize your examples and when it should not.
It is certainly possible to create a tool that uses sample text along with guidelines provided by the user to generate a regular expression. The hard part in designing such a tool is how does it ask the user for the guiding information that it needs, without making the tool harder to learn than regular expressions themselves, and without restricting the tool to common regex jobs or to simple regular expressions.
Yes, it's certainly "possible"; Here's the pseudo-code:
string MakeRegexFromExamples(<listOfPosExamples>, <listOfNegExamples>)
{
    if HasIntersection(<listOfPosExamples>, <listOfNegExamples>)
        return <IntersectionError>

    string regex = "";
    foreach (string example in <listOfPosExamples>)
    {
        if (regex != "")
        {
            regex += "|";
        }
        regex += DoRegexEscaping(example);
    }
    regex = "^(" + regex + ")$";

    // Ignore <listOfNegExamples>; they're excluded by definition
    return regex;
}
The problem is that there are an infinite number of regexes that will match a list of examples. This code provides the simplest/stupidest regex in the set, basically matching anything in the list of positive examples (and nothing else, including any of the negative examples).
I suppose the real challenge would be to find the shortest regex that matches all of the examples, but even then, the user would have to provide very good inputs to make sure the resulting expression was "the right one".
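For what it's worth, that pseudo-code translates almost line for line into runnable Python (`re.escape` handles the escaping; the function name is my own):

```python
import re

def make_regex_from_examples(pos_examples, neg_examples):
    """Build the simplest regex matching every positive example exactly."""
    if set(pos_examples) & set(neg_examples):
        raise ValueError('positive and negative examples intersect')
    # Alternation of the escaped positives; the negatives are excluded
    # by definition, since nothing outside the list can match.
    return '^(' + '|'.join(re.escape(ex) for ex in pos_examples) + ')$'

pattern = make_regex_from_examples(['a.b', 'cd'], ['ab'])
print(pattern)                         # ^(a\.b|cd)$
print(bool(re.match(pattern, 'a.b')))  # True
print(bool(re.match(pattern, 'ab')))   # False
```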
I believe the term is "induction". You want to induce a regular grammar.
I don't think it is possible with a finite set of examples (positive or negative). But, if I recall correctly, it can be done if there is an Oracle which can be consulted. (Basically you'd have to let the program ask the user yes/no questions until it was content.)
You might want to play with this site a bit, it's quite cool and sounds like it does something similar to what you're talking about: http://txt2re.com
There's a language dedicated to problems like this, based on Prolog. It's called Progol.
As others have mentioned, the basic idea is inductive learning, often called ILP (inductive logic programming) in AI circles.
Second link is the wiki article on ILP, which contains a lot of useful source material if you're interested in learning more about the topic.
@Yuval is correct. You're looking at computational learning theory, or "inductive inference".
The question is more complicated than you think, as the definition of "learn" is non-trivial. One common definition is that the learner can spit out answers whenever it wants, but eventually it must either stop spitting out answers or always spit out the same answer. This assumes an infinite number of inputs and gives absolutely no guarantee on when the program will reach its decision. Also, you can't tell when it HAS reached its decision, because it might still output something different later.
By this definition, I'm pretty sure that regular languages are learnable. By other definitions, not so much...
I've done some research on Google and CiteSeer and found these techniques/papers:
Language identification in the limit
Probably approximately correct learning
Also, Dana Angluin's "Learning regular sets from queries and counterexamples" seems promising, but I wasn't able to find a PS or PDF version, only citations and seminar papers.
It seems that this is a tricky problem even on the theoretical level.
If it's possible for a person to learn a regular expression, then it is fundamentally possible for a program. However, that program will need to be correctly programmed to be able to learn. Luckily, this is a fairly finite space of logic, so it wouldn't be as complex as teaching a program to see objects or something like that.