Writing INTEGRATOR similar to http://integrals.wolfram.com/index.jsp using parsing in java - regex

Hi as part of my project I have been working on evaluation of arithmetic expression using reg-ex in java.
expressions are like this : 2+3/4(5+7)
first I will modify it to this : 2+3/4*(5+7)
and convert it to postfix
postfix: 23457+*/+
The procedure I have adopted is parsing all of the tokens (i.e. Integers,Operators,Open Paren,Close Paren) using Reg-Ex and then sorting by the i-th position of their occurrence. After that I converted that array of token into post-fix expression and then solved that expression, Until this point every thing is working fine.
Now I would like to extend it to solving differentiation,integration or solving quadratic equation.
foe ex: differentiate or integrate expression x^2+2*x+2
similar to http://integrals.wolfram.com/index.jsp
Is it possible? because right now I dont have any clue how to proceed with that?

Most calculators I am aware of - calculating equations/differentions and integrals using numerical analysis. This allows one to find a close solution to the exact (analytical one) solution, sometimes even for unsolveable equations (This is how we got the standard normal table, for the unsolveable normal density function)
For example, solving integrals - gaussian quardature is very common and efficeint way.
For solving equations - regula-falsi method is a simple and intuitive one

I think that most integral computation engines use the numerical approach just as amit said, however there's a way to compute it symbolically, but it's heuristic more than algorithmic, it's done by pattern matching. I think that Mathematica follows this approach.

Related

Function to count an expression

Is there any function in C++ (included in some library or header file), that would count an expression from string?
Let's say we have a string, which equals 2 + 3 * 8 - 5 (but it's taken from user's keyboard, so we don't know what expression exactly it will be while writing the code), and we want this function co count it, but of course in the correct order (1. power / root 2. times / divide 3. increase / decrease).
I've tried to take all the numbers to an array of ints and operators to an array of chars (okay, actually vectors, because I don't know how many numbers and operators it's going to contain), but I'm not sure what to do next.
Note, that I'm asking if there's any function already written for that, if not I'm going to just try it again.
By count, I'm taking that to mean "evaluate the expression".
You'll have to use a parser generator like boost::spirit to do this properly. If you try hand-writing this I guarantee pain and misery for all involved.
Try looking on here for calculator apps, there are several:
http://boost-spirit.com/repository/applications/show_contents.php
There are also some simple calculator-style grammars in the boost::spirit examples.
Quite surprisingly, others before me have not redirected you to this particular algorithm which is easy to implement and converts your string into this special thing called Reverse Polish Notation (RPN). Computing expressions in RPN is easy, the hard part is implementing Shunting Yard but it is something done many times before and you may find many tutorials on the subject.
Quick overview of the algorithms:
PRN - RPN is a way to write expressions that eliminates the need for parentheses, thus it allows easier computation. Practically speaking, to compute one such expression you walk the string from left to right while keeping a stack of operands. Whenever you meet an operand, you push it onto the stack. Whenever you meet an operation token, you calculate its result on the last 2 operands (if the operation is binary ofcourse, on the last only if it is unary) and push it onto the stack. Rinse and repeat until the end of string.
Shunting Yard really is much harder to simply overview and if I do try to accomplish this task, this answer will end up looking much like the wikipedia article I linked above so I'll save that trouble to both of us.
Tl;DR; Go read the links in the first sentence.
You will have to do this yourself as there is no standard library that handles infix equation resolution
If you do, please include the code for what you are trying and we can help you

Create regex from samples algorithm

AFAIK no one have implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I'd give my algorithm this two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back and you might find 1000 reasons why such algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill a new feature of MS Excel 2013 would do exactly the task you want, but it does not give you the regular expression. It's a NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, Go Flash Fill official website and read a few papers. They have pseudo-code and demo. movies as well.
There are many such algorithm in fact. This is a research area called "Grammatical inference".
I know RPNI, for example. (you could also look on the probabilistic branch, alergia, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact you absolutely don't need to enter the strings in your example. Only substrings.
There are also some algorithms to generate directly Non deterministic automata.
Of course, get the regular expression from an Non Deterministic Automata is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automata) from your sample.
Then, you have to try to "merge" some states. From these merge, will emerge loops (i.e. * in the regular expression). All the difficulty being to choose the right rule to merge.
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* #return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
* #return true if the two regexes will never both match a given string
*/
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
Automaton automaton1 = new RegExp(regex1).toAutomaton();
Automaton automaton2 = new RegExp(regex2).toAutomaton();
return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct you're using Orthogonal to mean Complement.
Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
bool accept;
Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
CombinedNode(FSANode left, FSANode right)
{
this.left = left;
this.right = right;
}
Map<char, CombinedNode> links;
bool valid { get { return !left.accept || !right.accept; } }
public FSANode left;
public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.

Are there any good / interesting analogs to regular expressions in 2d?

Are there any good (or at least interesting but flawed) analogs to regular expressions in two dimensions?
In one dimension I can write something like /aaac?(bc)*b?aaa/ to quickly pull out a region of alternating bs and cs with a border of at least three as. Perhaps as importantly, I can come back a month later and see at a glance what it's looking for.
I'm finding myself writing custom code for analogous problems in 2d (some much more complicated / constrained) and it would be nice to have a more concise and standardized notation, even if I have to write the engine behind it myself.
A second example might be called "find the +". The goal is to locate a column of 3 or more as, a b bracketed by 3 or more as with three or more as below. It should match:
..7...hkj.k f
7...a h o j
----a--------
j .a,g- 8 9
.aaabaaaaa7 j
k .a,,g- h j
hh a----? j
a hjg
and might be written as [b^(a{3})v(a{3})>(a{3})<(a{3})] or...
Suggestions?
Not being a regex expert, but finding the problem interesting, I looked around and found this interesting blog entry. Especially the syntax used there for defining the 2D regex looks appealing. The paper linked there might tell you more than me.
Update from comment: Here is the link to the primary author's page where you can download the linked paper "Two-dimensional languages": http://www.mat.uniroma2.it/~giammarr/Research/pubbl.html
Nice problem.
First, ask yourself if you can constrain the pattern to a "+" pattern, or if you would it need to test/match rectangles also. For instance, a rectangle of [bc] with a border of a would match the center rectangle below, but would also match a "+" shape of [c([bc]*a})v([bc]*a)>([bc]*a)<([bc]*a)] (in your syntax)
xxxaaaaaxxx
yzyabcba12d
defabcbass3
oreabcba3s3
s33aaaaas33
k388x.egiee
If you can constrain it to a "+" then your task is much easier. As ShuggyCoUk said, parsing a RE is usually equivalent to a DFSM -- but for a single, serial input which simplifies things greatly.
In your "RE+" engine, you'll have to debug not only the engine, but also each place that the expressions are entered. I'd like the compiler to catch any errors it could. Given that, you could also use something that was explicitly four RE's, like:
REPlus engine = new REPlus('b').North("a{3}")
.East("a{3}").South("a{3}").West("a{3}");
However, depending on your implementation this may be cumbersome.
And with regard to the traversal engine -- would the North/West patterns match RtL or LtR? It could matter if the patterns match differently with or w/o greedy sub-matches.
Incidentally, I think the '^' in your example is supposed to be one character to the left, outside the parenthesis.
Regular expressions are designed to model patterns in one dimension. As I understand it, you want to match patterns in a two dimensional array of characters.
Can you use regular expressions? That depends on whether the property that you are searching for is decomposable into components which can be matched in one dimension or not. If so, you can run your regular expressions on columns and rows, and look for intersections in the solution sets from those. Depending on the particular problem you have, this may either solve the problem, or it may find areas in your 2d array which are likely to be solutions.
Whether your problem is decomposable or not, I think writing some custom code will be inevitable. But at least it sounds like an interesting problem, so the work should be pleasant.
You're essentially talking about a spatial query language. There are plenty out there if you look up spatial query, geographic query and graphic querying. The spatial part generally comes down to points, lines and objects in a region that have other given attributes. Regions can be specified as polygons, distance from a point (e.g. circles), distance from a linear feature such as a road, all points on one side of a linear feature, etc... You can then get into more complex queries like set of all nearest neighbours, shortest path, travelling salesman, and tesselations like Delaunay TINs and Voronoi diagrams.

calculating user defined formulas (with c++)

We would like to have user defined formulas in our c++ program.
e.g. The value v = x + ( y - (z - 2)) / 2. Later in the program the user would define x,y and z -> the program should return the result of the calculation. Somewhen later the formula may get changed, so the next time the program should parse the formula and add the new values. Any ideas / hints how to do something like this ? So far I just came to the solution to write a parser to calculate these formulas - maybe any ideas about that ?
If it will be used frequently and if it will be extended in the future, I would almost recommend adding either Python or Lua into your code. Lua is a very lightweight scripting language which you can hook into and provide new functions, operators etc. If you want to do more robust and complicated things, use Python instead.
You can represent your formula as a tree of operations and sub-expressions. You may want to define types or constants for Operation types and Variables.
You can then easily enough write a method that recurses through the tree, applying the appropriate operations to whatever values you pass in.
Building your own parser for this should be a straight-forward operation:
) convert the equation from infix to postfix notation (a typical compsci assignment) (I'd use a stack)
) wait to get the values you want
) pop the stack of infix items, dropping the value for the variable in where needed
) display results
Using Spirit (for example) to parse (and the 'semantic actions' it provides to construct an expression tree that you can then manipulate, e.g., evaluate) seems like quite a simple solution. You can find a grammar for arithmetic expressions there for example, if needed... (it's quite simple to come up with your own).
Note: Spirit is very simple to learn, and quite adapted for such tasks.
There's generally two ways of doing it, with three possible implementations:
as you've touched on yourself, a library to evaluate formulas
compiling the formula into code
The second option here is usually done either by compiling something that can be loaded in as a kind of plugin, or it can be compiled into a separate program that is then invoked and produces the necessary output.
For C++ I would guess that a library for evaluation would probably exist somewhere so that's where I would start.
If you want to write your own, search for "formal automata" and/or "finite state machine grammar"
In general what you will do is parse the string, pushing characters on a stack as you go. Then start popping the characters off and perform tasks based on what is popped. It's easier to code if you force equations to reverse-polish notation.
To make your life easier, I think getting this kind of input is best done through a GUI where users are restricted in what they can type in.
If you plan on doing it from the command line (that is the impression I get from your post), then you should probably define a strict set of allowable inputs (e.g. only single letter variables, no whitespace, and only certain mathematical symbols: ()+-*/ etc.).
Then, you will need to:
Read in the input char array
Parse it in order to build up a list of variables and actions
Carry out those actions - in BOMDAS order
With ANTLR you can create a parser/compiler that will interpret the user input, then execute the calculations using the Visitor pattern. A good example is here, but it is in C#. You should be able to adapt it quickly to your needs and remain using C++ as your development platform.