I'm using sympy to generate expressions like this:
for crowd in itertools.combinations(symbs, max_true + 1):
exprs.append(functools.reduce(operator.and_, crowd))
unaltered = ~functools.reduce(operator.or_, exprs)
Later, I convert them to CNF:
altered = sympy.logic.boolalg.to_cnf(unaltered, simplify=True, force=True)
It takes a lot of computer time. I made a gist with more details:
https://gist.github.com/MatrixManAtYrService/501ea099826a5aeeacc9368710b059ec
Given that I'm generating expressions with for loops, they're in a reliable format. Sympy (understandably) is doing the exhaustive thing and solving them "by hand", because it doesn't know that they're so well behaved. A human who is looking at the unaltered/altered expressions can easily ascertain the pattern and just generate the CNF directly with a for loop.
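Concretely, the direct construction I have in mind is one application of De Morgan's laws per clause; a minimal sketch, with placeholder symbols standing in for the ones in the gist:

import itertools
import sympy

symbs = sympy.symbols("a b c d")  # placeholders for the gist's symbols
max_true = 2

# unaltered is ~(C1 | C2 | ...), where each Ci is a conjunction of
# max_true + 1 symbols. De Morgan turns this into a conjunction of
# clauses ~Ci == (~x | ~y | ~z), which is already CNF:
clauses = [sympy.Or(*(sympy.Not(s) for s in crowd))
           for crowd in itertools.combinations(symbs, max_true + 1)]
cnf = sympy.And(*clauses)
print(cnf)  # e.g. (~a | ~b | ~c) & (~a | ~b | ~d) & (~a | ~c | ~d) & (~b | ~c | ~d)

This builds, in one pass, an expression that should be logically equivalent to the altered one above.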
That's easy enough in this case, but I expect to have more constraints.
I want to know if I'm in uncharted territory, or just failing to ask for help correctly.
Does Sympy have anything to help with this kind of thing? Is there another library I should explore? Is there a name for the "look at it and extrapolate based on a pattern" strategy that I'm proposing? Is there a list of algorithms for the task somewhere?
Related
I want to know whether an equation set has a solution, and I use solve(f) != [] (Python sympy) to check. But I only need to know whether there is a solution; I do not need to find all the solutions, which consumes a lot of time. Is there any way to do this efficiently?
Be aware that sympy.solve giving [] doesn't necessarily mean the equation has no solution; it only means it could not find any. Some equations have solutions that cannot be expressed in closed form (like cos(x) = x). sympy.solveset will give you the full solution set, but in cases where it can't tell, it will just return a generic solution set (a ConditionSet).
As to the original question, I don't know if there's a way to do it in general. If you are only dealing with real continuous functions, you could check whether the expression is strictly positive or strictly negative on its domain. Sympy doesn't have the strongest tools for checking this without some user assistance.
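A minimal illustration of that solveset behavior:

import sympy

x = sympy.symbols("x")
# cos(x) = x has a real solution (near 0.739), but no closed form;
# solveset answers with a ConditionSet instead of claiming there is none.
print(sympy.solveset(sympy.cos(x) - x, x, sympy.S.Reals))
# -> something like ConditionSet(x, Eq(cos(x) - x, 0), Reals)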
I have an app that includes a three-operator (& | !) boolean expression evaluator, with variables and constants. Generally the expressions aren't too long (perhaps 50 terms at the most, but usually a lot less). There can be very many expressions: I'm expecting the upper limit to be around a million. Currently I have a hand-written parser with a very simple evaluator that recursively traverses the parse tree. One constraint is that this has to be callable from C++. I have no sharing between expressions. I'd like to investigate speeding this up.
I see two avenues of research.
1. Add sharing, and store state indicating whether an expression node has been evaluated or not.
2. Extract common subexpressions (both ideas are sketched below).
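A minimal Python sketch of both avenues, assuming hash-consing (so structurally identical subtrees become one shared object) plus a per-evaluation result cache; all class and method names are made up:

import operator

class Node:
    """Hash-consed expression node: structurally identical subtrees
    are represented by a single shared object."""
    _pool = {}

    def __new__(cls, op, *kids):
        key = (op, kids)
        node = cls._pool.get(key)
        if node is None:
            node = super().__new__(cls)
            node.op, node.kids = op, kids
            cls._pool[key] = node
        return node

    def eval(self, env, cache):
        # Because subtrees are shared, caching by node evaluates each
        # distinct subexpression at most once per environment.
        if self not in cache:
            if self.op == 'var':
                cache[self] = env[self.kids[0]]
            elif self.op == '!':
                cache[self] = not self.kids[0].eval(env, cache)
            else:  # '&' or '|'
                f = operator.and_ if self.op == '&' else operator.or_
                cache[self] = f(self.kids[0].eval(env, cache),
                                self.kids[1].eval(env, cache))
        return cache[self]

# (a & b) | !(a & b): the (a & b) subtree is built once and evaluated once
a, b = Node('var', 'a'), Node('var', 'b')
e = Node('|', Node('&', a, b), Node('!', Node('&', a, b)))
print(e.eval({'a': True, 'b': False}, {}))  # True

Hash-consing gives common-subexpression extraction for free: building the same subtree twice yields the same node, so the cache evaluates it only once per set of variable values.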
Also, I would expect a code-generation approach to be faster than an interpretive approach working on parse trees or similar structures. It would probably be fairly straightforward to generate some C++ code, but considering the length of the functions, I don't know whether a compiler like GCC will be able to optimize the CSEs.
I've seen that there are a few libraries available for expression evaluation, but in my work environment adding third-party libraries is not simple, plus they all seem very complicated compared to my needs.
Lastly I've been looking at Antlr4 a bit recently, so that might be appropriate for me. In the past I've worked on C code generation, but I have no experience of using something like LLVM for optimisation and code generation.
Any suggestions for which way to go?
As far as I understood, your question is more about faster expression evaluation than about faster expression parsing, so my answer will focus on the former. Parsing, after all, should not be the bottleneck, as your expression language looks simple enough to implement a manually tuned parser for it.
So, to accelerate your evaluations, you can consider JIT execution of your formulas using LLVM. That is, given your formula F you can (relatively) easily generate corresponding LLVM IR and directly evaluate it. This SMT solver does just that. IR code generation is implemented in a single C++ class here.
Note that the boolean expressions you mentioned are a subset of the SMT language supported by that solver. Additionally, you can easily adjust how aggressive the LLVM optimizer needs to be.
However, IR generation and optimization have their own overhead. Therefore, if a given formula is not evaluated often enough to amortize that initial overhead, I would recommend direct interpretation instead. In that case, you can look for opportunities to exploit structural similarities and common subexpressions.
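To make the amortization point concrete in Python terms (a toy stand-in for the LLVM pipeline, not a substitute for it): compile each formula once, then reuse the compiled object across many evaluations.

# A formula arrives as text; compile it once to a code object, then
# evaluate it against many variable environments.
source = "(a and b) or not c"          # hypothetical formula
code = compile(source, "<formula>", "eval")

def evaluate(env):
    # empty __builtins__ keeps the generated code from reaching outside env
    return eval(code, {"__builtins__": {}}, env)

print(evaluate({"a": True, "b": False, "c": False}))  # True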
As much as I'd like to suggest ANTLR4, I fear it won't meet your performance needs. There is a lot going on under the hood with its adaptive LL(*) algorithms, and though there are some common tricks to improve its performance, simply tracing an ANTLR4 interpreter at runtime suggests that, unless your current expression evaluator is very inefficient, it is likely faster than ANTLR4, which is an industrial-duty engine meant to support grammars far more complicated than yours. I use ANTLR when an LALR(1) DFA shift-reduce engine won't support my grammar, and take the performance hit in return for the extra parsing power of ANTLR4.
I am currently working on a database-related project in which I generate a lot of C++ code. This code is then compiled and loaded as a dynamic library. I use this technique to build efficient code for the database schema and queries.
Currently, I am using simple file writes to generate the code (which was okay for the proof-of-concept implementation). Now, I am searching for a more elegant but comparably flexible solution to generate C++ code.
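To make that concrete, my current generation step amounts to filling string templates and writing the result to a file; a simplified sketch with made-up schema names:

import string

# Hypothetical: a generated query kernel for one schema/table.
cpp_template = string.Template("""
extern "C" long count_matching(const $row_type* rows, long n) {
    long hits = 0;
    for (long i = 0; i < n; ++i)
        if (rows[i].$column == $value) ++hits;
    return hits;
}
""")

code = cpp_template.substitute(row_type="OrderRow", column="status", value="3")
with open("query.cpp", "w") as f:
    f.write(code)
# then: g++ -O2 -shared -fPIC query.cpp -o query.so, and dlopen("query.so")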
I searched quite a lot but all the solutions I found so far are rather complex/extensive, not efficient enough, or not flexible enough.
What libraries are you using in your C++ projects to generate code?
Best,
Moritz
You can use a program transformation system (PTS) to define and compose code templates in a reliable way.
Most PTSes enable one to define a grammar and then parse source code into ASTs using that grammar. More importantly, they accept patterns: source-code fragments (usually of a nonterminal or a list of nonterminals) with placeholders that correspond to well-formed sub-fragments (nonterminals representing subtrees). These patterns usually insist that a named placeholder match identically (see the example below). Such patterns can be used to match against a parsed AST as a way to find code fragments using the surface syntax.
So, one might use a pattern:
pattern x_squared(t: term): product
= " \t * \t ";
to hunt for subexpressions which consist of products of identical subtrees.
This will match
(p + q[17])*(p+q[17])
but not
2 * (x-3)
But just as interestingly, such patterns can be used as code generators, by instantiating the pattern with bound values (trees) for the variables. So,
"instantiate x_squared(2^x)" produces
(2^x)*(2^x)
By itself, this is just a fancy sort of macro scheme. It is a lot better, in that it can tell you "at compile time" (for the patterns) whether what you are composing makes sense or not. So you get type checking of the composition of the code fragments. For instance, you might accidentally code "instantiate x_squared(int q)", but a good PTS will object that "int q" is not a "term"; you find the bug when you build the code generator.
Where this gets really interesting is when one can build many different code fragments, from many different patterns, and compose those fragments with yet more patterns. This allows one to build very complex code. All of this is done in a syntax-safe way; the resulting trees are valid syntax. (You can still bollix the semantics; nothing is perfect.) As the complexity of the code you can generate goes up, it is good to have this additional checking to help you avoid generating bad code.
A PTS has an additional advantage: after composing code fragments, it can apply source-to-source transformations to optimize the resulting code. Thus you can produce optimized code according to your ability to write matching transformations, and harnessing knowledge you have during code generation.
Imagine you generate code for a matrix multiply:
... P * Q ...
and your code generator somehow or other knows that Q is an identity matrix.
Then the following optimization can remove an expensive matrix multiply:
rule optimize_matrix_times_unit(m: term, q: term): product -> product
= " \m * \q "
-> " \m "
if is_identity_matrix(q)
This transformation takes advantage of pattern matching (to find a matrix product) in the generated code, pattern instantiation (to generate a replacement for the matched product), and additional knowledge or analysis (is_identity_matrix) that the code generator can do.
You need a PTS capable of handling C++ parsing; those are a bit hard to find. The one I designed (DMS Software Reengineering Toolkit) happens to do this. The examples in this answer are DMS-style.
Here's a technical paper that describes a large-scale reengineering task done by DMS on C++ code. A number of examples in the paper are actually quite complex patterns used to instantiate code; the reengineering task had to generate a new set of APIs for an existing chunk of code.
I was reading the Java project idea described here:
The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.
This sounds like a really interesting idea to me. Does anyone have an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.
Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example when building a DFA. That's the place to start. This could be a really cool compiler project.
If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will all be predefined algorithms at first:
First pass: want to match Cat, Catches, Cans
Result: /Cat|Catches|Cans/
Second pass: look for similar starting conditions:
Result: /Ca(t|tches|ns)/
Third pass: look for similar ending conditions:
Result: /Ca(t|tche|n)s?/
Fourth pass: look for more refinements like repetitions and negative conditions (the first two passes are sketched below)
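A minimal Python sketch of the first two passes, using exact common-prefix factoring (a suffix pass would be analogous):

import os.path
import re

def first_pass(examples):
    """First pass: a plain alternation of the literal examples."""
    return "|".join(re.escape(e) for e in examples)

def second_pass(examples):
    """Second pass: factor out the longest common prefix."""
    prefix = os.path.commonprefix(examples)
    tails = "|".join(re.escape(e[len(prefix):]) for e in examples)
    return re.escape(prefix) + "(" + tails + ")"

print(first_pass(["Cat", "Catches", "Cans"]))   # Cat|Catches|Cans
print(second_pass(["Cat", "Catches", "Cans"]))  # Ca(t|tches|ns)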
There exist algorithms that do exactly this, for positive examples.
Regular expressions are equivalent to DFAs (Deterministic Finite Automata).
The strategy is almost always the same:
Look at Alergia (for the theory) and the MDI algorithm (for real usage) if generating a deterministic automaton is enough.
The Alergia algorithm and MDI are both described here:
http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf
If you want to generate smaller models you can use another algorithm. The article describing it is here:
http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz
His homepage is here:
http://www.grappa.univ-lille3.fr/~lemay
If you want to use negative examples, I suggest using a simple rule (coloring) that prevents two states of the DFA from being merged.
If you ask these people, I am sure they will share their source code.
I made the same kind of algorithm during my Ph.D., for probabilistic automata. That means you can associate a probability with each string, and I wrote a C++ program that "learns" weighted automata.
Mainly, these algorithms work like this:
from positive examples: {abb, aba, abbb}
create the simplest DFA that accepts exactly these examples (the prefix tree of the examples):

-> x -- a --> y -- b --> z -- b --> (t) -- b --> (v)
                          \
                           a --> (u)

x can go to state y by reading the letter 'a', for example.
The states are x, y, z, t, u and v. Parentheses, as in (t), mark a final (accepting) state.
then "merge" states of the DFA: (here for example the result after merging states y and t.
_
/ \
v | a,b ( <- this is a loop :-) )
x -- a -> (y,t) _/
the new state (y,t) is a terminal state obtaining by merging y and t. And you can read the letter a and b from it.
Now the DFA can accept: a(a|b)* and it is easy to construct the regular expression from the DFA.
Which states to merge is a choice that makes the main difference between algorithms.
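A minimal Python sketch of the starting point (building the prefix-tree acceptor); the merging heuristics that distinguish Alergia and MDI are omitted:

def prefix_tree(examples):
    """Build the prefix-tree acceptor: states are dicts of transitions,
    finals is the set of accepting state ids."""
    states = [{}]          # state 0 is the initial state x
    finals = set()
    for word in examples:
        cur = 0
        for ch in word:
            if ch not in states[cur]:
                states[cur][ch] = len(states)
                states.append({})
            cur = states[cur][ch]
        finals.add(cur)
    return states, finals

states, finals = prefix_tree(["abb", "aba", "abbb"])
print(len(states), sorted(finals))  # 6 [3, 4, 5]: the six-state acceptor drawn above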
"The program tries to deduce a regex that fits the examples"
I don't think it's a useful question to ask. You have to know semantically what you need to represent in order to deduce it. When you write a regex, you have a purpose: accepting URLs, accepting emails, extracting tokens from code, etc. I would restate the question like so: given a knowledge base and a semantics for regular expressions, compute the smallest regex. This goes a step further, because now natural language must explain a general expression, and we all know how ambiguous that gets! You have to have some semantic explanation. Without one, the best you can do from examples is to compute a regex that covers all the strings in the OK list.
Algorithm for coverage:
Populate the OK list
Populate the not-OK list
Check for repetitions
Check for contradictions (the same string cannot be in both lists)
Create a Deterministic Finite Automaton (DFA) from the OK list, where each string from the list ends in a final state
Simplify the DFA by eliminating equivalent states ([1] 4.4)
Convert the DFA to a regular expression ([1] 3.2.2)
Test against the OK list and the not-OK list (sketched below)
[1] Introduction to Automata Theory, Languages, and Computation. J. Hopcroft, R. Motwani, J. D. Ullman. 2nd Edition, Pearson Education.
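The final test step is mechanical; a minimal sketch, with Python's re standing in for whichever engine you target:

import re

def covers(pattern, ok, not_ok):
    """The candidate regex must accept every OK string and reject
    every not-OK string."""
    rx = re.compile(pattern)
    return (all(rx.fullmatch(s) for s in ok)
            and not any(rx.fullmatch(s) for s in not_ok))

print(covers(r"a(a|b)*", ["abb", "aba", "abbb"], ["ba", "bb"]))  # True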
P.S.
I have some experience with genetic programming, and I think it would really be overkill for your problem. Since the objective function is really light, it's better to evaluate with a single processor, and GP can still take a lot of time. To get shorter expressions you just need to minimize the DFA. But a GA may possibly produce interesting results.
Maybe I'm a bit late, but there is a way to solve this problem by means of Genetic Programming.
Genetic Programming (GP) is an evolutionary machine-learning technique in which a candidate solution for a given problem is represented as an abstract syntax tree.
Several studies have been published on how to use GP in order to find a regular expression that matches a given set of examples.
You can find the articles and the details here
A webapp that does this is hosted at regex.inginf.units.it.
The source code behind the application has been publicly released on GitHub.
You may try to use a basic inference algorithm that has been used in other applications. I have implemented a very basic one, based on building a state machine. However, it only accounts for positive samples. The source code is at http://github.com/mvaled/inferdtd
You could be interested in AutomataInferrer.py, which is very simple.
RegexBuilder seems to have many of the features you're looking for.
I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (since that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA graph, you'd just find a path from the start state to an end state. You'd just have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
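A minimal sketch of that walk in Python, assuming the regex has already been converted into a transition table (cycles are handled with a stop probability rather than a fixed traversal count):

import random

def sample(dfa, finals, start=0, stop_p=0.5):
    """Random walk over a DFA given as {state: {char: state}}.
    Assumes every state can keep moving toward a final state."""
    state, out = start, []
    while state not in finals or random.random() > stop_p:
        ch, state = random.choice(sorted(dfa[state].items()))
        out.append(ch)
    return "".join(out)

# hypothetical DFA for a(a|b)*: 0 --a--> 1, and 1 loops on a and b
dfa = {0: {"a": 1}, 1: {"a": 1, "b": 1}}
print(sample(dfa, finals={1}))  # e.g. "aba"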
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no possible strings, and I believe it is also possible to write a regular expression for which calculating a matching string requires an exhaustive search of possible strings of all lengths, you'll probably need an upper bound on requesting an answer.
The easiest approach to implement, but definitely the most CPU-intensive, would be simply to brute-force it.
Set up a character table with the characters that your string should contain, then sequentially generate strings and run Regex.IsMatch on them.
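A minimal sketch of that loop, with Python's re.fullmatch standing in for Regex.IsMatch and a made-up two-letter alphabet:

import itertools
import re

def brute_force(pattern, alphabet="ab", max_len=4):
    """Enumerate candidate strings shortest-first and keep the matches."""
    rx = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                yield s

print(list(brute_force(r"a(a|b)*", max_len=2)))  # ['a', 'aa', 'ab']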
I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Reg-ex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, I wish kudos to that developer.
To look at this from a different perspective, this is almost (not quite) like giving a computer its output and having it, based on that, write a program for you. This is a little overboard, but it kind of illustrates my point.