Does lookaround affect which languages can be matched by regular expressions?

There are some features in modern regex engines that allow you to match languages that couldn't be matched without that feature. For example, the following regex using back references matches the language of all strings that consist of a word that repeats itself: (.+)\1. This language is not regular and can't be matched by a regex that does not use back references.
Does lookaround also affect which languages can be matched by a regular expression? I.e. are there any languages that can be matched using lookaround that couldn't be matched otherwise? If so, is this true for all flavors of lookaround (negative or positive lookahead or lookbehind) or just for some of them?

The answer to the question you ask, which is whether a larger class of languages than the regular languages can be recognised with regular expressions augmented by lookaround, is no.
A proof is relatively straightforward, but an algorithm to translate a regular expression containing lookarounds into one without is messy.
First: note that you can always negate a regular expression (over a finite alphabet). Given a finite state automaton that recognises the language generated by the expression, you can simply exchange all the accepting states for non-accepting states to get an FSA that recognises exactly the negation of that language, for which there is a family of equivalent regular expressions.
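In code, that accept-state flip is one line. A minimal sketch, assuming a complete DFA represented as a (delta, start, accepting, states) tuple (the representation is an assumption, purely for illustration):

def complement(dfa):
    delta, start, accepting, states = dfa
    # swap accepting and non-accepting states; this recognises the
    # complement only for a *complete* DFA (one transition per state and symbol)
    return (delta, start, states - accepting, states)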
Second: because regular languages (and hence regular expressions) are closed under negation, they are also closed under intersection, since A intersect B = neg(neg(A) union neg(B)) by de Morgan's laws. In other words, given two regular expressions, you can find another regular expression that matches exactly the strings matched by both.
This allows you to simulate lookaround expressions. For example, u(?=v)w matches only strings that will match uv and uw.
For negative lookahead you need the regular expression equivalent of the set theoretic A\B, which is just A intersect (neg B), or equivalently neg(neg(A) union B). Thus for any regular expressions r and s you can find a regular expression r-s which matches those strings that match r but do not match s. In negative lookahead terms: u(?!v)w matches only those strings which match uw - uv.
There are three reasons why lookaround is useful.
First, because the negation of a regular expression can result in something much less tidy. For example, q(?!u) = q($|[^u]).
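That particular identity can be sanity-checked empirically in Python by comparing the two patterns on every short string over a small alphabet (a brute-force check, not a proof):

import re
from itertools import product

lookahead = re.compile(r'q(?!u)')
plain = re.compile(r'q($|[^u])')
for n in range(5):
    for s in map(''.join, product('qux', repeat=n)):
        # both patterns find a match in exactly the same strings
        assert bool(lookahead.search(s)) == bool(plain.search(s))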
Second, regular expressions do more than match strings; they also consume characters from a string - or at least that's how we like to think about them. For example, in Python I care about the .start() and .end(), thus of course:
>>> re.search('q($|[^u])', 'Iraq!').end()
5
>>> re.search('q(?!u)', 'Iraq!').end()
4
Third, and I think this is a pretty important reason, negation of regular expressions does not lift nicely over concatenation. neg(a)neg(b) is not the same thing as neg(ab), which means that you cannot translate a lookaround out of the context in which you find it - you have to process the whole string. I guess that makes it unpleasant for people to work with and breaks people's intuitions about regular expressions.
I hope I have answered your theoretical question (it's late at night, so forgive me if I am unclear). I agree with a commenter who said that this does have practical applications. I met very much the same problem when trying to scrape some very complicated web pages.
EDIT
My apologies for not being clearer: I do not believe you can give a proof of regularity of regular expressions + lookarounds by structural induction, my u(?!v)w example was meant to be just that, an example, and an easy one at that. The reason a structural induction won't work is because lookarounds behave in a non-compositional way - the point I was trying to make about negations above. I suspect any direct formal proof is going to have lots of messy details. I have tried to think of an easy way to show it but cannot come up with one off the top of my head.
To illustrate, using Josh's first example of ^([^a]|(?=..b))*$: this is equivalent to a 7 state DFSA with all states accepting:

[ASCII state diagram omitted: states labelled A through H, connected by transitions on a, b, and not-a.]
The regular expression for state A alone looks like:
^(a([^a](ab)*[^a]|a(ab|[^a])*b)b)*$
In other words any regular expression you are going to get by eliminating lookarounds will in general be much longer and much messier.
To respond to Josh's comment - yes, I do think the most direct way to prove the equivalence is via the FSA. What makes this messier is that the usual way to construct an FSA is via a non-deterministic machine - it's much easier to express u|v as simply the machine constructed from machines for u and v with an epsilon transition to the two of them. Of course this is equivalent to a deterministic machine, but at the risk of exponential blow-up of states. Whereas negation is much easier to do via a deterministic machine.
The general proof will involve taking the cartesian product of two machines and selecting those states you wish to retain at each point you want to insert a lookaround. The example above illustrates what I mean to some extent.
My apologies for not supplying a construction.
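For the intersection half at least, the cartesian-product idea is easy to sketch in code. A minimal Python sketch, assuming two complete DFAs represented as (delta, start, accepting) tuples (the representation and all names are mine, purely for illustration):

def intersect(dfa1, dfa2, alphabet):
    delta1, start1, accepting1 = dfa1
    delta2, start2, accepting2 = dfa2
    start = (start1, start2)
    delta, seen, todo = {}, {start}, [start]
    while todo:
        p, q = todo.pop()
        for a in alphabet:
            nxt = (delta1[(p, a)], delta2[(q, a)])  # run both machines in lockstep
            delta[((p, q), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    # a product state accepts iff both component states accept
    accepting = {(p, q) for (p, q) in seen
                 if p in accepting1 and q in accepting2}
    return delta, start, accepting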
FURTHER EDIT:
I have found a blog post which describes an algorithm for generating a DFA out of a regular expression augmented with lookarounds. It's neat because the author extends the idea of an NFA-ε with "tagged epsilon transitions" in the obvious way, and then explains how to convert such an automaton into a DFA.
I thought something like that would be a way to do it, but I'm pleased that someone has written it up. It was beyond me to come up with something so neat.

As the other answers claim, lookarounds don't add any extra power to regular expressions.
I think we can show this using the following:
One-pebble 2-NFA (see the Introduction section of the paper, which refers to it).
The 1-pebble 2-NFA does not deal with nested lookaheads, but we can use a variant of multi-pebble 2-NFAs (see the section below).
Introduction
A 2-NFA is a nondeterministic finite automaton which has the ability to move either left or right on its input.
A one pebble machine is where the machine can place a pebble on the input tape (i.e. mark a specific input symbol with a pebble) and do possibly different transitions based on whether there is a pebble at the current input position or not.
It is known that the one-pebble 2-NFA has the same power as a regular DFA.
Non-nested Lookaheads
The basic idea is as follows:
The 2-NFA allows us to backtrack (or 'front track') by moving forward or backward in the input tape. So for a lookahead we can do the match for the lookahead regular expression and then backtrack over what we have consumed in matching the lookahead expression. In order to know exactly when to stop backtracking, we use the pebble! We drop the pebble before we enter the DFA for the lookahead to mark the spot where the backtracking needs to stop.
Thus at the end of running our string through the pebble 2-NFA, we know whether we matched the lookahead expression or not, and the input left (i.e. what is left to be consumed) is exactly what is required to match the remainder.
So, for a lookahead of the form u(?=v)w:
We have the DFAs for u, v and w.
From the accepting state (yes, we can assume there is only one) of the DFA for u, we make an ε-transition to the start state of v, marking the input with a pebble.
From an accepting state for v, we ε-transition to a state which keeps moving the input left until it finds the pebble, and then transitions to the start state of w.
From a rejecting state of v, we ε-transition to a state which keeps moving left until it finds the pebble, and then transitions to the accepting state of u (i.e. where we left off).
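The pebble's role can be mimicked in ordinary Python by remembering the split position explicitly. This toy matcher for u(?=v)w is only meant to illustrate the "drop a pebble, run v, backtrack to the pebble, run w" idea (the function and its shape are my own sketch, not the construction itself):

import re

def match_lookahead(u, v, w, s):
    for i in range(len(s) + 1):
        if re.fullmatch(u, s[:i]):        # machine for u accepts s[:i]
            pebble = i                    # drop the pebble here
            if re.match(v, s[pebble:]):   # run the machine for v
                # backtrack to the pebble, then run the machine for w
                if re.fullmatch(w, s[pebble:]):
                    return True
    return False

print(match_lookahead('a', '..b', '.*', 'abbb'))   # True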
The proofs used for regular NFAs to show r1|r2, r*, etc., carry over for these one-pebble 2-NFAs. See http://www.coli.uni-saarland.de/projects/milca/courses/coal/html/node41.html#regularlanguages.sec.regexptofsa for more info on how the component machines are put together to give the bigger machine for the r* expression etc.
The reason why the above proofs for r* etc. work is that the backtracking ensures that the input pointer is always at the right spot when we enter the component NFAs for repetition. Also, if a pebble is in use, then it is being processed by one of the lookahead component machines. Since there are no transitions from lookahead machine to lookahead machine without completely backtracking and getting back the pebble, a one-pebble machine is all that is needed.
For example, consider ([^a] | a(?=..b))*
and the string abbb.
We have abbb, which goes through the pebble 2-NFA for a(?=..b), at the end of which we are at the state (bbb, matched) (i.e. the input bbb is remaining, and it has matched 'a' followed by '..b'). Now because of the *, we go back to the beginning (see the construction in the link above) and enter the DFA for [^a]. Match b, go back to the beginning, enter [^a] twice more, and then accept.
Dealing with Nested Lookaheads
To handle nested lookaheads we can use a restricted version of k-pebble 2NFA as defined here: Complexity Results for Two-Way and Multi-Pebble Automata and their Logics (see Definition 4.1 and Theorem 4.2).
In general, 2 pebble automata can accept non-regular sets, but with the following restrictions, k-pebble automata can be shown to be regular (Theorem 4.2 in above paper).
If the pebbles are P_1, P_2, ..., P_K
P_{i+1} may not be placed unless P_i is already on the tape and P_{i} may not be picked up unless P_{i+1} is not on the tape. Basically the pebbles need to be used in a LIFO fashion.
Between the time P_{i+1} is placed and the time that either P_{i} is picked up or P_{i+2} is placed, the automaton can traverse only the subword located between the current location of P_{i} and the end of the input word that lies in the direction of P_{i+1}. Moreover, in this sub-word, the automaton can act only as a 1-pebble automaton with Pebble P_{i+1}. In particular it is not allowed to lift up, place or even sense the presence of another pebble.
So if v is a nested lookahead expression of depth k, then (?=v) is a nested lookahead expression of depth k+1. When we enter a nested lookahead machine, we know exactly how many pebbles have been placed so far and so can exactly determine which pebble to place, and when we exit that machine, we know which pebble to lift. All machines at depth t are entered by placing pebble t and exited (i.e. we return to processing of a depth t-1 machine) by removing pebble t. Any run of the complete machine looks like a recursive DFS traversal of a tree, and the above two restrictions of the multi-pebble machine can be catered to.
Now when you combine expressions: for r r1, since you concatenate, the pebble numbers of r1 must be incremented by the depth of r. For r* and r|r1 the pebble numbering remains the same.
Thus any expression with lookaheads can be converted to an equivalent multi-pebble machine with the above restrictions in pebble placement and so is regular.
Conclusion
This basically addresses the drawback in Francis's original proof: being able to prevent the lookahead expressions from consuming anything that is required for future matches.
Since lookbehinds are just finite strings (not really regexes, in most flavors), we can deal with them first, and then deal with the lookaheads.
Sorry for the incomplete writeup, but a complete proof would involve drawing a lot of figures.
It looks right to me, but I will be glad to know of any mistakes (which I seem to be fond of :-)).

I agree with the other posts that lookaround is regular (meaning that it does not add any fundamental capability to regular expressions), but I have an argument for it that is simpler IMO than the other ones I have seen.
I will show that lookaround is regular by providing a DFA construction. A language is regular if and only if it has a DFA that recognizes it. Note that Perl doesn't actually use DFAs internally (see this paper for details: http://swtch.com/~rsc/regexp/regexp1.html) but we construct a DFA for purposes of the proof.
The traditional way of constructing a DFA for a regular expression is to first build an NFA using Thompson's Algorithm. Given two regular expression fragments r1 and r2, Thompson's Algorithm provides constructions for concatenation (r1r2), alternation (r1|r2), and repetition (r1*) of regular expressions. This allows you to build an NFA bit by bit that recognizes the original regular expression. See the paper above for more details.
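For reference, here is a small sketch of those three Thompson constructions in Python. The fragment representation (one start state, one accept state, '' standing for an epsilon transition) is a simplification of my own, not the paper's exact presentation:

from itertools import count

_ids = count()

def literal(c):                       # NFA fragment for a single character
    s, t = next(_ids), next(_ids)
    return {'start': s, 'accept': t, 'edges': {(s, c): {t}}}

def _merged(f1, f2):                  # union of two fragments' edge maps
    edges = {}
    for f in (f1, f2):
        for k, v in f['edges'].items():
            edges.setdefault(k, set()).update(v)
    return edges

def concat(f1, f2):                   # r1 r2
    edges = _merged(f1, f2)
    edges.setdefault((f1['accept'], ''), set()).add(f2['start'])
    return {'start': f1['start'], 'accept': f2['accept'], 'edges': edges}

def alternate(f1, f2):                # r1 | r2
    s, t = next(_ids), next(_ids)
    edges = _merged(f1, f2)
    edges[(s, '')] = {f1['start'], f2['start']}
    for f in (f1, f2):
        edges.setdefault((f['accept'], ''), set()).add(t)
    return {'start': s, 'accept': t, 'edges': edges}

def star(f):                          # r1*
    s, t = next(_ids), next(_ids)
    edges = {k: set(v) for k, v in f['edges'].items()}
    edges[(s, '')] = {f['start'], t}          # skip or enter the loop
    edges.setdefault((f['accept'], ''), set()).update({f['start'], t})
    return {'start': s, 'accept': t, 'edges': edges}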
To show that positive and negative lookahead are regular, I will provide a construction for concatenation of a regular expression u with positive or negative lookahead: (?=v) or (?!v). Only concatenation requires special treatment; the usual alternation and repetition constructions work fine.
The construction for both u(?=v) and u(?!v) is the same: connect every final state of the existing NFA for u both to an accept state and to an NFA for v, but modified as follows. The function f(v) is defined as:
Let aa(v) be a function on an NFA v that changes every accept state into an "anti-accept state". An anti-accept state is defined to be a state that causes the match to fail if any path through the NFA ends in this state for a given string s, even if a different path through v for s ends in an accept state.
Let loop(v) be a function on an NFA v that adds a self-transition on any accept state. In other words, once a path leads to an accept state, that path can stay in the accept state forever no matter what input follows.
For negative lookahead, f(v) = aa(loop(v)).
For positive lookahead, f(v) = aa(neg(v)).
To provide an intuitive example for why this works, I will use the regex (b|a(?=.b))+, which is a slightly simplified version of the regex I proposed in the comments of Francis's proof. If we use my construction along with the traditional Thompson constructions, we end up with the following NFA:

[NFA diagram omitted.]
The es are epsilon transitions (transitions that can be taken without consuming any input) and the anti-accept states are labeled with an X. In the left half of the graph you see the representation of (a|b)+: any a or b puts the graph in an accept state, but also allows a transition back to the begin state so we can do it again. But note that every time we match an a we also enter the right half of the graph, where we are in anti-accept states until we match "any" followed by a b.
This is not a traditional NFA because traditional NFAs don't have anti-accept states. However we can use the traditional NFA->DFA algorithm to convert this into a traditional DFA. The algorithm works like usual, where we simulate multiple runs of the NFA by making our DFA states correspond to subsets of the NFA states we could possibly be in. The one twist is that we slightly augment the rule for deciding if a DFA state is an accept (final) state or not. In the traditional algorithm a DFA state is an accept state if any of the NFA states was an accept state. We modify this to say that a DFA state is an accept state if and only if:
at least one of the NFA states is an accept state, and
none of the NFA states is an anti-accept state.
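A sketch of this modified subset construction in Python (epsilon moves are assumed to be closed over already; the data layout and names are mine, for illustration only):

from collections import deque

def subset_construction(delta, start, accept, anti_accept, alphabet):
    start_set = frozenset([start])
    dfa, accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        S = queue.popleft()
        # the modified rule: accept iff some NFA state accepts
        # and no NFA state is an anti-accept state
        if (S & accept) and not (S & anti_accept):
            accepting.add(S)
        for a in alphabet:
            T = frozenset(t for s in S for t in delta.get((s, a), ()))
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return dfa, start_set, accepting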
This algorithm will give us a DFA that recognizes the regular expression with lookahead. Ergo, lookahead is regular. Note that lookbehind requires a separate proof.

I have a feeling that there are two distinct questions being asked here:
1. Are Regex engines that incorporate "lookaround" more powerful than Regex engines that don't?
2. Does "lookaround" empower a Regex engine with the ability to parse languages that are more complex than those generated from a Chomsky Type 3 (Regular) grammar?
The answer to the first question in a practical sense is yes. Lookaround will give a Regex engine that uses this feature fundamentally more power than one that doesn't. This is because it provides a richer set of "anchors" for the matching process. Lookaround lets you define an entire Regex as a possible anchor point (zero-width assertion). You can get a pretty good overview of the power of this feature here.
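For instance, in Python a lookbehind acts as a pure anchor, matching a position without consuming any text:

>>> import re
>>> re.findall(r'(?<=\$)\d+', 'items: $25, $7 and 3 bags')
['25', '7']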
Lookaround, although powerful, does not lift the Regex engine beyond the theoretical limits placed on it by a Type 3 Grammar. For example, you will never be able to reliably parse a language based on a Context Free (Type 2) Grammar using a Regex engine equipped with lookaround. Regex engines are limited to the power of a Finite State Automaton, and this fundamentally restricts the expressiveness of any language they can parse to the level of a Type 3 Grammar. No matter how many "tricks" are added to your Regex engine, languages generated via a Context Free Grammar will always remain beyond its capabilities. Parsing a Context Free (Type 2) grammar requires a pushdown automaton to "remember" where it is in a recursive language construct. Anything that requires a recursive evaluation of the grammar rules cannot be parsed using Regex engines.

To summarize: Lookaround provides some practical benefits to Regex engines but does not "alter the game" on a theoretical level.
EDIT
Is there some grammar with a complexity somewhere between Type 3 (Regular) and Type 2 (Context Free)?
I believe the answer is no. The reason is that there is no theoretical limit placed on the size of the NFA/DFA needed to describe a Regular language. It may become arbitrarily large and therefore impractical to use (or specify). This is where dodges such as "lookaround" are useful. They provide a short-hand mechanism to specify what would otherwise lead to very large/complex NFA/DFA specifications. They do not increase the expressiveness of Regular languages; they just make specifying them more practical. Once you get this point, it becomes clear that there are a lot of "features" that could be added to Regex engines to make them more useful in a practical sense - but nothing will make them capable of going beyond the limits of a Regular language.
The basic difference between a Regular and a Context Free language is that a Regular language does not contain recursive elements. In order to evaluate a recursive language you need a pushdown automaton to "remember" where you are in the recursion. An NFA/DFA does not stack state information, so it cannot handle the recursion. So given a non-recursive language definition there will be some NFA/DFA (but not necessarily a practical regex) to describe it.

Related

Check if a regex is ambiguous

I wonder if there is a way to check the ambiguity of a regular expression automatically. A regex is considered ambiguous if there is a string which can be matched in more than one way by the regex. For example, given a regex R = (ab)*(a|b)*, we can detect that R is an ambiguous regex since there are two ways to match the string ab from R.
UPDATE
The question is about how to check if the regex is ambiguous by definition. I know that in practical implementations of regex engines there is always one way to match a regex, but please read and think about this question in an academic way.
A regular expression is one-ambiguous if and only if the corresponding Glushkov automaton is not deterministic. This check can be done in linear time now. Here's a link. BTW, deterministic regular expressions have also been investigated under the name of one-unambiguity.
You are forgetting greed. Usually one section gets first dibs because it is a greedy match, and so there is no ambiguity.
If instead you are talking about a mythical pattern-matching engine without the practical details like greed, then the answer is yes, you can.
Take every element of the pattern and try every possible subset against every possible string. If more than one subset matches the same string, then there's an ambiguity. Optimizing this to take less than infinite time is left as an exercise for the reader.
I read a paper published around 1980 which showed that whether a regular expression is ambiguous can be determined in O(n^4) time. I wish I could give you a reference, but I no longer know the reference or even the journal.
A more expensive way to determine if a regular expression is ambiguous is to construct a finite state machine (exponential in time and space in the worst case) from the regular expression using subset construction. Now consider any state X of the FSM constructed from NFA states N. If, for any two NFA states n1, n2 of X, follow(n1) intersect follow(n2) is not empty, then the regular expression is ambiguous. If this is not true for any state of the FSM, then the regular expression is not ambiguous.
A possible solution:
Construct an NFA for the regexp. Then analyse the NFA, starting with a set of states consisting solely of the initial state. Then do a depth-first or breadth-first traversal where you keep track of whether you can be in multiple states. You also need to track the path taken in order to eliminate cycles. (A code sketch of this check follows the worked example below.)
For example your (ab)*(a|b)* can be modeled with three states.
  |  a    |  b
p | {q,r} | {r}
q | {}    | {p}
r | {r}   | {r}
where p is the starting state and p and r are accepting.
You then need to consider both letters and proceed with the sets {q,r} and {r}. The set {r} only leads to {r}, giving a cycle, and we can close that path. From {q,r}, an a takes us to {r}, which is an accepting state, but it is reached here by only a single path, so there is no ambiguity yet, and we can close this branch once we identify the cycle. Getting a b from {q,r} takes us to {p,r}. Since both of these accept, we have identified an ambiguous position and we can conclude that the regexp is ambiguous.
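Here is the traversal written out as a rough Python sketch. It follows the simplification above, so it only reports ambiguity when a reachable state set contains two accepting states (as {p,r} does here); the transition table is transcribed from the example:

from collections import deque

delta = {
    ('p', 'a'): {'q', 'r'}, ('p', 'b'): {'r'},
    ('q', 'a'): set(),      ('q', 'b'): {'p'},
    ('r', 'a'): {'r'},      ('r', 'b'): {'r'},
}
accepting = {'p', 'r'}

def is_ambiguous(start='p', alphabet='ab'):
    seen = set()
    queue = deque([frozenset([start])])
    while queue:
        S = queue.popleft()
        if S in seen:          # cycle elimination
            continue
        seen.add(S)
        # two accepting states reachable on the same input: ambiguous
        if len(S & accepting) >= 2:
            return True
        for a in alphabet:
            T = frozenset(t for s in S for t in delta.get((s, a), ()))
            if T:
                queue.append(T)
    return False

print(is_ambiguous())   # True: {q,r} --b--> {p,r}, and both p and r accept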

Can regexes containing ordered alternation be rewritten to use only unordered alternation?

Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, the greedy quantifiers ?, *, and +, and the nongreedy quantifiers ??, *?, and +?. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does replacing ordered alternation with unordered alternation decrease the expressive power of this formalism?
(Unordered alternation---also sometimes called "unordered choice"---is such that L(S|T) = L(S) + L(T), while ordered alternation is such that L(S|T) = L(S) + (L(T) - { a in L(T) : a extends some b in L(S) }). Concretely, the pattern a|aa would match the strings a and aa if the alternation is unordered, but only a if the alternation is ordered.)
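For what it's worth, Python's re exhibits the ordered (leftmost-preference) behaviour directly, though forcing a full match can still select the later branch:

>>> import re
>>> re.match('a|aa', 'aa').group()      # ordered: the first branch wins
'a'
>>> re.fullmatch('a|aa', 'aa').group()  # anchoring forces the second branch
'aa'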
Put another way, given a pattern S containing an ordered alternation, can that pattern be rewritten to an equivalent pattern T which contains no ordered alternations (but possibly unordered alternations instead)?
If this question has been considered in the literature, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).
In http://swtch.com/~rsc/regexp/regexp3.html [section "Does the regexp match a substring of the string? If so, where?"] it's necessary to introduce the idea of priorities within the "DFA" (you need to read the entire series to understand, I suspect, but the "DFA" in question is expanded from the NFA graph "on the fly") to handle ordered alternations. While this is only an appeal to authority, and not a proof, I think it's fair to say that if Russ Cox can't do it (express ordered alternations as a pure DFA), then no one knows how to.
I haven't checked any literature but I think you can construct a DFA for the ordered alternation and thus prove that it doesn't add any expressive power in the following way:
Let's say we have the regex x||y, where x and y are regexen and || means the ordered alternation. If so, we can construct DFAs accepting x and y. We will mark those DFA_x and DFA_y.
We will construct the DFA for x||y in phases by connecting DFA_x and DFA_y.
For every path in DFA_x corresponding to some string a (by path I mean a path in the graph sense, without traversing an edge twice, so a is a path in DFA_"a*" but aa is not)...
For every symbol s in the alphabet:
If DFA_y consumes as (that is, if run on as, DFA_y will not stop early, though it may not necessarily accept) and DFA_x does not, and DFA_x doesn't accept any prefix of as, create a transition from the state DFA_x ends in after consuming a to the state DFA_y ends in after consuming as.
The accepting states of the final DFA are all the accepting states of both the input DFAs. The starting state is the starting state of DFA_x.
Intuitively what this does is it creates two regions in the output DFA. One of them corresponds to the first argument of the alternation and the other to the second. As long as it's possible that the first argument of the alternation will match we stay in the first part. When a symbol is encountered which makes it certain that the first argument won't match we switch to the second part if possible at this point. Please comment if this approach is wrong.
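Not a proof either, but any proposed rewrite can at least be sanity-checked against the question's full-string definition of x||y by brute force. A sketch (ordered_alt and check_rewrite are my own illustrative names):

import re
from itertools import product

def ordered_alt(x, y, s):
    # the question's semantics: y's matches are discarded when s
    # extends (has a prefix matching) something in L(x)
    if re.fullmatch(x, s):
        return True
    if not re.fullmatch(y, s):
        return False
    return not any(re.fullmatch(x, s[:i]) for i in range(len(s) + 1))

def check_rewrite(x, y, candidate, alphabet='ab', max_len=6):
    for n in range(max_len + 1):
        for s in map(''.join, product(alphabet, repeat=n)):
            if ordered_alt(x, y, s) != bool(re.fullmatch(candidate, s)):
                return s              # counterexample found
    return None                       # agrees on all strings tested

print(check_rewrite('a', 'aa', 'a'))  # None: 'a' is a correct rewrite of a||aa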

DFA vs NFA engines: What is the difference in their capabilities and limitations?

I am looking for a non-technical explanation of the difference between DFA vs NFA engines, based on their capabilities and limitations.
Deterministic Finite Automata (DFAs) and Nondeterministic Finite Automata (NFAs) have exactly the same capabilities and limitations. The only difference is notational convenience.
A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.
A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.
Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.
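That state-combination trick is also how one simulates an NFA directly; a minimal sketch, where the number/identifier states and transitions are invented purely to mirror the example above:

def nfa_step(states, symbol, delta):
    # follow every possible transition from every current state at once
    return {t for s in states for t in delta.get((s, symbol), ())}

# toy transitions for a language where identifiers may begin with a digit
delta = {
    ('start', 'digit'): {'number', 'identifier'},
    ('number', 'digit'): {'number'},
    ('identifier', 'digit'): {'identifier'},
    ('identifier', 'letter'): {'identifier'},
}
states = nfa_step({'start'}, 'digit', delta)  # {'number', 'identifier'}
states = nfa_step(states, 'letter', delta)    # {'identifier'}: it was an identifier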
It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.
Here's a non-technical answer from Microsoft:
DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.
Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.
[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]
A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.
CAVEAT:
While this book is generally considered the "regex bible", there appears to be some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?
Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.
Original answer:
A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.
Imagine the string AAB and the regex A*AB. We now step through our string letter by letter.
A:
First branch: Can be matched by A*.
Second branch: Can be matched by ignoring the A* (zero repetitions are allowed) and using the second A in the regex.
A:
First branch: Can be matched by expanding A*.
Second branch: Can't be matched by B. Second branch fails. But:
Third branch: Can be matched by not expanding A* and using the second A instead.
B:
First branch: Can't be matched by expanding A* or by moving on in the regex to the next token A. First branch fails.
Third branch: Can be matched. Hooray!
A DFA engine never backtracks in the string.
An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.
Imagine the same string and the same regex as before. We now step through our regex token by token:
A*: Match AA. Remember the backtracking positions 0 (start of string) and 1.
A: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now A matches.
B: Matches. End of regex reached (with one backtracking position to spare). Hooray!
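Here is that regex-directed walk written out as a hand-rolled Python matcher for just this one pattern, A*AB, only to make the remembered backtracking positions visible (not how a real engine is structured):

def match_a_star_ab(s):
    i = 0
    while i < len(s) and s[i] == 'A':    # greedily expand A*
        i += 1
    for pos in range(i, -1, -1):         # remembered backtracking positions
        if s[pos:] == 'AB':              # rest of the pattern: A, then B
            return True
    return False

print(match_a_star_ab('AAB'))   # True: A* gives back one A, then AB matches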
Both NFAs and DFAs are finite automata, as their names say.
Both can be represented as a starting state, a success (or "accept") state (or set of success states), and a state table listing transitions.
In the state table of a DFA, each <state₀, input> key will transit to one and only one state₁.
In the state table of an NFA, each <state₀, input> will transit to a set of states.
When you take a DFA, reset it to its start state, and give it a sequence of input symbols, you will know exactly what end state it's in and whether it's a success state or not.
When you take an NFA, however, it will, for each input symbol, look up the set of possible result states, and (in theory) randomly, "nondeterministically," select one of them. If there exists a sequence of random selections which leads to one of the success states for that input string, then the NFA is said to succeed for that string. In other words, you are expected to pretend that it magically always selects the right one.
One early question in computing was whether NFAs were more powerful than DFAs, due to that magic, and the answer turned out to be no, since any NFA can be translated into an equivalent DFA. Their capabilities and limitations are exactly the same.
For those wondering how a real, non-magical NFA engine can "magically" select the right successor state for a given symbol, this page describes the two common approaches.
I find the explanation given in Regular Expressions, The Complete Tutorial by Jan Goyvaerts to be the most usable. See page 7 of this PDF:
https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
Among other points made on page 7: there are two kinds of regular expression engines, text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. Certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines.

Aren't modern regular expression dialects regular?

I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so?
What features of modern regular expressions are not regular? Examples would be helpful.
The first thing that comes to mind is backreferences:
(\w+)\s\1
(matches a group of word characters, followed by a space character and then the same group previously matched) e.g.: hello hello matches, hello world doesn't.
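A quick check in Python:

>>> import re
>>> bool(re.fullmatch(r'(\w+)\s\1', 'hello hello'))
True
>>> bool(re.fullmatch(r'(\w+)\s\1', 'hello world'))
False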
This construct is not regular (ie: can't be generated by a regular grammar).
Another feature supported by Perl Compatible Regular Expressions (PCRE) that is not regular is recursive patterns:
\((a*|(?R))*\)
This can be used to match any combination of balanced parentheses and "a"s (from Wikipedia).
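Python's built-in re module has no (?R), but the third-party regex module supports recursion, so the pattern can be tried out there:

>>> import regex   # third-party module; the stdlib re lacks (?R)
>>> bool(regex.fullmatch(r'\((a*|(?R))*\)', '(a(aa))'))
True
>>> bool(regex.fullmatch(r'\((a*|(?R))*\)', '(a(aa)'))
False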
A couple of examples:
Regular expressions support grouping. E.g. in Ruby: /my (group)/.match("my group")[1] will output "group". Storing something in a group requires external storage, which a finite automaton does not have.
Many languages, e.g. C#, support captures, i.e. that each match will be captured on a stack - for example the pattern (?<MYGROUP>.)* could perform multiple captures of "." in the same group.
Groups are used for backreferencing, as pointed out by the user NullUserException above. Backreferencing requires one or more external stacks with the power of a pushdown automaton (you have to be able to push something on a stack and peek or pop it afterwards).
Some engines have the possibility of separately pushing and popping external stacks and checking whether a stack is empty. In .NET, (?<MYGROUP>test) actually pushes onto a stack, while (?<-MYGROUP>) pops one.
Some engines like the .NET engine have a balanced grouping concept, where an external stack can be both popped and pushed at the same time. The balanced grouping syntax is (?<FIRSTGROUP-LASTGROUP>), which pops the LASTGROUP stack and pushes the capture since the LASTGROUP index onto the FIRSTGROUP stack. This can actually be used to match infinitely nested constructions, which is definitely beyond the power of a finite automaton.
Probably other good examples exist :-) If you are further interested in some of the implementation details of external stacks in combination with regexes and balanced grouping, and thus higher-order automata than finite automata, I once wrote two short articles on this (http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx and http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx).
Anyway - finiteness or not - I believe that the power this extra stuff brings to the regular languages is great :-)
Br. Morten
A deterministic or nondeterministic finite automaton recognizes just the regular languages, which are described by regular expressions. The definition of a regular expression is simple. Let S be an alphabet. Then the empty set, the empty string, and every element of S are regular expressions (over S). Let u and v be regular expressions. Then the union (u | v), concatenation (uv), and closure (u*) of u and v are regular expressions over S. This definition is easily extended to the regular languages. No other expression is a regular expression. As pointed out, some back-references are an example. The Wikipedia pages on regular languages and expressions are good references.
In essence, certain "regular expressions" are not regular because no automaton of a particular type can be constructed to recognize them. For example, the language

{ a^i b^i : i >= 0 }

is not regular. This is because the accepting automaton would require infinitely many states, but an automaton accepting regular languages must have a finite number of states.

Regular expressions Lexical Analysis

Why can't repeated strings such as
{ wcw | w is a string of a's and b's }
be denoted by regular expressions?
Please give me a detailed answer, as I'm new to lexical analysis.
Thanks...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures, as those languages can be described by a simple finite state machine. Simplified, you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is that no information whatsoever from previous states can be carried over to later states (a few characters further in the input). So if you have your symbol w, you can't specify that the input must have exactly the same string w later in the sequence. Similarly, you can't ensure that each opening parenthesis has a closing paren as well (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternation (|) and repetition (*); everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow using such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly, such use of backreferences can even yield languages that are not context-free.
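To make that concrete: the non-regular language wcw from the question above becomes matchable once a backreference is allowed, e.g. in Python:

>>> import re
>>> bool(re.fullmatch(r'([ab]*)c\1', 'abbcabb'))
True
>>> bool(re.fullmatch(r'([ab]*)c\1', 'abbcaba'))
False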
Additional pointers:
This StackOverflow question
Wikipedia
It can be; you just can't ensure that it's the same string of "a"s and "b"s, because there's no way to retain the information acquired in traversing the first half for use in traversing the second.