Same occurence of 1 and 2, but with length < 1.000.000 - regex

So I have the following task, where I have to explain whether or not it's possible to construct a regular expression following these constraints:
Numbers N < 1.000.000 which contain digit 1 exactly as often as digit 2
My task is not to actually write an regular expression, but just to reasoning about the fact if it's possible or not.
My conclusion is that it's perfectly possible, but I would have to write down every possible string in my regex, which would be extremely long.
So far I have the following regular expression:
(([3-9]*[1]{n}[3-9]*[2]{n}[3-9]*){0,1000})
Where n is iterating towards it's maximum length of 1.000.000
I know that iterators is not a "thing" in the regular language. But is there an easier way to "iterate" n, instead of having to write it all by hand?

My conclusion is that it's perfectly possible, but I would have to write down every possible string in my regex, which would be extremely long.
For questions about theory, the fact that it would be extremely long is irrelevant. You're correct. Any finite language is regular.
I know that iterators is not a "thing" in the regular language. But is there an easier way to "iterate" n, instead of having to write it all by hand?
No, there isn't. Perhaps it would help to think about how to construct a state machine for this.
One state, the initial state, is the one where the number of 1s seen and the number of 2s seen are equal.
A second state is one where one more 1 has been seen than 2s.
A third state is one where one more 2 has been seen than 1s.
A fourth state is one where two more 1s have been seen than 2s. Etc...
It should be fairly straightforward to convince yourself that you cannot combine any of those states, as there is no overlap in the valid remaining substrings for each state. Therefore, you would need >1000000 states, and as the translation of a regular expression to a state machine is fairly direct, the complexity of a regular expression cannot be less than the complexity of the state machine.
Of course, when programming, you can implement this far more efficiently. You can for instance store the count of 1s and 2s in additional variables. But variables are a feature state machines do not have.

Related

Function to count an expression

Is there any function in C++ (included in some library or header file), that would count an expression from string?
Let's say we have a string, which equals 2 + 3 * 8 - 5 (but it's taken from user's keyboard, so we don't know what expression exactly it will be while writing the code), and we want this function co count it, but of course in the correct order (1. power / root 2. times / divide 3. increase / decrease).
I've tried to take all the numbers to an array of ints and operators to an array of chars (okay, actually vectors, because I don't know how many numbers and operators it's going to contain), but I'm not sure what to do next.
Note, that I'm asking if there's any function already written for that, if not I'm going to just try it again.
By count, I'm taking that to mean "evaluate the expression".
You'll have to use a parser generator like boost::spirit to do this properly. If you try hand-writing this I guarantee pain and misery for all involved.
Try looking on here for calculator apps, there are several:
http://boost-spirit.com/repository/applications/show_contents.php
There are also some simple calculator-style grammars in the boost::spirit examples.
Quite surprisingly, others before me have not redirected you to this particular algorithm which is easy to implement and converts your string into this special thing called Reverse Polish Notation (RPN). Computing expressions in RPN is easy, the hard part is implementing Shunting Yard but it is something done many times before and you may find many tutorials on the subject.
Quick overview of the algorithms:
PRN - RPN is a way to write expressions that eliminates the need for parentheses, thus it allows easier computation. Practically speaking, to compute one such expression you walk the string from left to right while keeping a stack of operands. Whenever you meet an operand, you push it onto the stack. Whenever you meet an operation token, you calculate its result on the last 2 operands (if the operation is binary ofcourse, on the last only if it is unary) and push it onto the stack. Rinse and repeat until the end of string.
Shunting Yard really is much harder to simply overview and if I do try to accomplish this task, this answer will end up looking much like the wikipedia article I linked above so I'll save that trouble to both of us.
Tl;DR; Go read the links in the first sentence.
You will have to do this yourself as there is no standard library that handles infix equation resolution
If you do, please include the code for what you are trying and we can help you

find Reg. Expr. over {0,1,2} so last symbol of string is the sum of the symbols so far on the string mod 3.

I'm learning by myself formal languages (Aho's,Hopcroft) but I'm having a hard time with regular expressions.
I've been able to tackle simple tasks but this one has posed a challenge, at least for me. How to solve this if you can't count so far, I'm not used to this type of computation.
There must be some property or something that let me generalize the answer that much that i can put it as a regular expresion.
So far I've devised that is possible that there may be at least 2 o 3 cases:
sums mod3=0 if sum=3k
sums mod3=1 if sum=3k+1
sums mod3=2 if sum=3k+2.
But I've come to realize that there may be many combinations for a sum to happen so can't find the pattern the regular expression must follow.
The string for ex. {122211}0 (braces are for easy read sake) has the zero at the end as it holds that {sum=3k}0, if the sum is "10" from a string for ex. {1222111}1 the case may be {sum=3k+1} so the one has to be at the end, and so on.
This may or not be the right track to tackle the problem but I'm open to any suggestions please, any help is very appreciated.
Here's a hint: think of what distinct final states you can possibly be in. You certainly have at least 3 states, since the number of values can be three different things mod three. Also, you need to have a distinct start state, since the empty string cannot be accepted. Do you need more states?
Hint2: I think you can easily do this with a DFA using a start state and nine other states, of which exactly three will be accepting.
EDIT: Once you have a DFA, you can use Kleene's Theorem to construct an equivalent regular expression. If you'd rather go straight for a regular expression, here's another hint: if you're looking at any string of length 3k, you can append: 0; any string of length 1, followed by 1; any string of length 2, followed by 2. So if you can write regular expressions for strings of lengths 3k, 1, and 2, you're practically done.

efficient algorithm for searching one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
e.g., if the strings are abcx, bcy and cz, and I've read so far abc, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1. Then reading x next will move me to the string 1 matched state etc., and any char other than xyz will move to the initial state, and I will not need to retract back to b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
Take a look at Suffix Tree.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong suggestion that BOOST is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Rabin-Karp builds a DFSM out of the string (or strings) for which you are searching so that the test and the cursor motion are combined in a single operation. However, Rabin-Karp was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position that they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) what leftward offset from the cursor that the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then does some simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. You have four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern) and then this way despite thousands of patterns in the table they only test one or two per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem, the solutions exposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer Moore, for sure, but I would not apply them blindly (at least if you have some time on your hand), because they are tailored for a single needle, and I am pretty convinced we could capitalized on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead,
so if there are no real regular expressions involved,
C - memcmp() should do fine.
If you can tell the file sizes and give some
specific use cases, we could build a benchmark
(I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards
rbo
There is always Boyer Moore
Beside Rabin-Karp-Algorithm and Knuth-Morris-Pratt-Algorithm, my Algorithm-Book suggests a Finite State Machine for String Matching. For every Search String you need to build such a Finite State Machine.
You can do it with very popular Lex & Yacc tools, with the support of Flex and Bison tools.
You can use Lex for getting tokens of the string.
Compare your pre-defined strings with the tokens returned from Lexer.
When match is found, perform the desired action.
There are many sites which describe about Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/

Distance between regular expression

Can we compute a sort of distance between regular expressions ?
The idea is to mesure in which way two regular expression are similar.
You can build deterministic finite-state machines for both regular expressions and compare the transitions. The difference of both transitions can then be used to measure the distance of these regular expressions.
There are a few of metrics you could use:
The length of a valid match. Some regexs have a fixed size, some an upper limit and some a lower limit. Compare how similar their lengths or possible lengths are.
The characters that match. Any regex will have a set of characters a match can contain (maybe all characters). Compare the set of included characters.
Use a large document and see how many matches each regex makes and how many of those are identical.
Are you looking for strict equivalence?
I suppose you could compute a Levenshtein Distance between the actual Regular Experssion strings. That's certainly one way of measuring a "distance" between two different Regular Expression strings.
Of course, I think it's possible that regular expressions are not required here at all, and computing the Levenshtein Distance of the actual "value" strings that the Regular Expressions would otherwise be applied to, may yield a better result.
If you have two regular expressions and have a set of example inputs you could try matching every input against each regex. For each input:
If they both match or both don't match, score 0.
If one matches and the other doesn't, score 1.
Sum this score over all inputs, and this will give you a 'distance' between the regular expressions. This will give you an idea of how often two regular expressions will differ for typical input. It will be very slow to calculate if your sample input set is large. It won't work at all if both regexes fail to match for almost all random strings and your expected input is entirely random. For example the regex 'sgjlkwren' and the regex 'ueuenwbkaalf' would probably both never match anything if tested on random input, so this metric would say the distance between them is zero. That might or might not be what you want (probably not).
You might be able to analyze the structure of the regex and use biased random sampling to deliberately hit strings that match more frequently than in completely random input. For example, if both regex require that the string starts with 'foo', you could make sure that your test inputs also always start with foo, to avoid wasting time testing strings that you know will fail for both.
So in conclusion: unless you have a very specific situation with a restricted input set and/or restricted regular expression language, I'd say its not possible. If you do have some restrictions on your input and on the regular expression, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.
There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.
This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
I think first you need to understand for yourself how you see a "difference" between two expressions. Basically, define a distance metric.
In general case, it would be quite different to make. Depending on what you need to do, you may see allowing one different character in some place as a big difference. In the other case, allowing any number of consequent but same characters may not yield much difference.
I'd like to emphasize as well that normally when they talk about distance functions, they apply them to..., well, let's call them, tokens. In our case, character sequences. What you are willing to do, is to apply this method not to those tokens, but to the rules a multitude of tokens will match. I'm not quite sure it even makes sense.
Still, I believe we could think of something, but not in general, but for one particular and quite restricted case. Do you have some sort of example to show us?

DFA based regular expression matching - how to get all matches?

I have a given DFA that represent a regular expression.
I want to match the DFA against an input stream and get all possible matches back, not only the leastmost-longest match.
For example:
regex: a*ba|baa
input: aaaaabaaababbabbbaa
result:
aaaaaba
aaba
ba
baa
Assumptions
Based on your question and later comments you want a general method for splitting a sentence into non-overlapping, matching substrings, with non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. Also I assume you have an existing algorithm to transform a regular expression into DFA form already. I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to DFA, since I'm not aware of any other way of accomplishing this.
Before you go chasing after shadows, make sure your trying to apply the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be somewhat coded into a regular expression itself, then you can't benefit from the advantages of regular expression algorithms (fully)
An obvious example is that no amount of cleverness will allow a FSM, or any algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... where the ellipsis is the portion of the expression not yet scanned because the user has not typed them yet, cannot give you every possible right subgroup.
For a more detailed discussion of Regular expression implementations, and specifically Thompson NFA's please check this link, which describes a simple C implementation with some clever optimizations.
Limitations of Regular Languages
The O(n) and Space(O(1)) guarantees of regular expression algorithms is a fairly narrow claim. Specifically, a regular language is the set of all languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than regular. On top of that, if you can show that some enhancement requires greater than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are very careful to keep our algorithm within these narrow constraints.
Obviously that eliminates anything we might want to do with recursive backtracking. A stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be. A long enough sentence would overflow any integer pointer. We can't create new states for the automaton as we go to get around this. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary for the specific language we want to match, but by no other variable.
This still allows some room for adding additonal behavior. The usual way of getting more mileage is to add some extra annotations for where certain events in processing occur, such as when a subexpression started or stopped matching. Since we are only allowed to have constant space processing, that limits the number of subexpression matches we can process. This usually means the latest instance of that subexpression. This is why, when you ask for the subgrouped matched by (a|)*, you always get an empty string, because any sequence of a's is implicitly followed by infinitely many empty strings.
The other common enhancement is to do some clever thing between states. For example, in perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or visa versa. Many simple assertions fit this, including the common line anchor operators, ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.
When discussing the differences between various regular language recognizers, it's worth clarifying if we're talking about match recognition or search recognition, the former being an accept only if the entire sentence is in the language, and the latter accepts if any substring in the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted in the match method.
This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does. Either token will match the right side of the disjunction. It doesn't match, though, because you could never get that sentence by stepping through the expression and generating tokens from the transitions, at least in a single pass. For this reason, i'm only going to talk about match semantics. Obviously if you want search semantics, you can modify the expression with .*'s
Note: A language defined by expression E|.* is not really a very manageable language, regardless of the sublanguage of E because it matches all possible sentences. This represents a real challenge for regular expression recognizers because they are really only suited to recognizing a language or else confirming that a sentence is not in that same language, rather than doing any more specific work.
Implementation of Regular Language Recognizers
There are generally three ways to process a regular expression. All three start the same, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extemely simple. Here's some crude ascii art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expression. Epsilon(ε) is a state with inputs and outputs, but ignores the stream of characters, and doesn't consume any input either.
a ::= > a ->
E1 E2 ::= >- E1 ->- E2 ->
/---->
E1* ::= > --ε <-\
\ /
E1
/-E1 ->
E1|E2 ::= > ε
\-E2 ->
And that's it! Common uses such as E+, E?, [abc] are equivalent to EE*, (E|), (a|b|c) respectively. Also note that we add for each production rule a very small number of new states. In fact each rule adds zero or one state (in this presentation). characters, quantifiers and dysjunction all add just one state, and the concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to start pointers of other states or fragments.
The epsilon transition states are important, because they are ambiguous. When encountered, is the machine supposed to change state to once following state or another? should it change state at all or stay put? That's the reason why these automatons are called nondeterministic. The solution is to have the automaton transition to the right state, whichever allows it to match the best. Thus the tricky part is to figure out how to do that.
There are fundamentally two ways of doing this. The first way is to try each one. Follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, appears in a few (and notable) implementations. For well crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, recursive backtracking is very, very bad, O(2^n).
The other way of doing this is to instead try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, you can have the same state come up more than once, but you only need to track it once, either you are in that state or not. If you get to the point that there's no option for a particular state to follow, just ignore it, that path didn't match. If there are no more states, then the entire expression didn't match. as soon as any state reaches the final state, you are done.
Just from that explanation, the amount of work we have to do has gone up a little bit. We've gone from having to keep track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. Also the amount of storage we needed has gone up, since now it's no longer a single pointer to one possible state in the NFA, but a whole set of them.
However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that to is m.
This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.
This NFA, simultaneously in some subset of its m states, may be considered some other state machine who's state represents the set of states the NFA it models could be in. each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.
Using this alternate representation has an advantage that instead of updating m state pointers, you only have to update one. It also has a downside, since it models the powerset of m states, it actually has up to 2m states. That is an upper limit, because you don't model states that cannot happen, for instance the expression a|b has two possible states after reading the first character, either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFA's actually get SMALLER than the NFA they represent, but there is simply no way to guarantee that.
To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
From Theory to Practice
Many practical uses of regular expressions involve tracking the position of the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64 bit pointer, the input could possibly be 264+1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, and now your algorithm now requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably won't notice that it would fail because you'd terminate it long before then.
Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing braces in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two states pointers with disagreeing annotations collapse to the same state, the annotation of a later input position wins again.
If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively trying less, or trying as little as it can and recursively trying more, when it fails it first attempt. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/nongreedy hinting. This information would be used to determine if newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.
Another kind of practical enhancement is assertions, productions which do not consume input, but match or reject based on some aspect of the input position. For instance in perl regex, a \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character, but the next character must not be, or visa versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues, otherwise it is discarded.
Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes with the state pointer flowing into r2, but otherwise, it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
Lookahead also works by using an extra regex fragment for the assertion, but is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check to see that the state pointer it depends upon is still matching. If it doesn't find one, then it fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpressions state pointers than you do for the main part, if two state pointers in the assertion land on the same state, then the state pointers each of them depend upon will share the same fate, and can be reannotated to point to the single pointer you keep.
While were adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction to make the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it will wrap up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input, but if the user types something invalid, the simulator is empty, and can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every sub match in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and no submatch for the corresponding b's
Finally, there might be a way to modify this to work with backreferences. Even if it's elegent, it's sure to be inefficient, specifically, regular expressions plus backreferences are in NP-Complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.