Quadratic equation parser using QRegExp - c++

I want to implement a parser for a quadratic equation using regular expressions, and I want to keep it as a console app. I wrote the regex and tested it in Debuggex. Currently I have two problems: I can't get a, b and c out of ax^2+bx+c, and I want to add bash-like history with the up and down arrows. Thanks in advance. My code:
#include <QCoreApplication>
#include <QRegExp>
#include <QString>
#include <QTextStream>
#include <QStringList>
#include <QDebug>
#include <cstdio>

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    Q_UNUSED(a);
    QTextStream cin(stdin, QIODevice::ReadOnly | QIODevice::Text);
    QTextStream cout(stdout, QIODevice::WriteOnly | QIODevice::Text);
    const QString regexText = R"(^[-]?\d*x\^2\s*[+,-]\s*\d*x\s*[+,-]\s*\d*$)";
    while(true)
    {
        QRegExp regex(regexText);
        cout << "Enter an equation to solve or press EOF(Ctrl+D/Z) to exit." << endl;
        cout << "--> " << flush;
        QString equation;
        equation = cin.readLine();
        if( equation.isNull() )
        {
            cout << endl;
            cout << "Thanks for using quadratic equation solver! Exiting..." << endl;
            return 0;
        }
        int pos = regex.indexIn(equation);
        QStringList captures = regex.capturedTexts();
        qDebug() << captures;
    }
}

I think you're looking to learn how to properly use capturing groups, which Debuggex isn't great at showing you the result of. I'd shoot for a regular expression more along these lines:
^(-?\d*)x\^2\s*([+-]\s*\d*)x\s*([+-]\s*\d+)?$
You can see it in action at RegExr, my preferred RegEx tool. Mouse over the highlighted matches to see what the groups have captured.
You can see that the parentheses essentially delineate sub-expressions that can be extracted separately and parsed for meaning. I've chosen to include the operation (+/-) in each group so you can use it to determine the sign of the coefficient. You'll see in the example data that it doesn't cover decimal coefficients, but neither did your original expression, and I think this answers the most pressing issue.
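To make the capturing groups concrete, here is a minimal sketch of reading a, b and c out of the three groups. It uses std::regex as a stand-in for QRegExp so that it has no Qt dependency; with QRegExp the same idea would be regex.indexIn(equation) followed by regex.cap(1) to regex.cap(3). The names Quadratic, to_coefficient and parse_quadratic are made up for this illustration, and treating an empty coefficient as 1 is an assumption about what the input means.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <regex>
#include <string>

struct Quadratic { double a, b, c; };

// Turn captured text such as "", "-", "+ 3" or "- 12" into a number.
// A sign with no digits (or nothing at all) stands for an implied 1.
static double to_coefficient(std::string text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char ch) { return std::isspace(ch) != 0; }),
               text.end());
    if (text.empty() || text == "+") return 1.0;
    if (text == "-") return -1.0;
    return std::stod(text);
}

// Returns false when the line does not look like ax^2+bx+c.
bool parse_quadratic(const std::string& line, Quadratic& out) {
    static const std::regex re(R"re(^(-?\d*)x\^2\s*([+-]\s*\d*)x\s*([+-]\s*\d+)?$)re");
    std::smatch m;
    if (!std::regex_match(line, m, re)) return false;
    out.a = to_coefficient(m[1].str());
    out.b = to_coefficient(m[2].str());
    // The constant term is optional; a missing group means c = 0.
    out.c = m[3].matched ? to_coefficient(m[3].str()) : 0.0;
    return true;
}
```

Group 0 is the whole match, which is why the coefficients start at index 1.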
Decimals
Capturing a decimal is as easy as adding a snippet after every set of digits that you capture:
(?:\.\d+)?
Which optionally matches (without capturing) a literal period followed by some other digits. This turns your greater Regular Expression into:
^(-?\d*(?:\.\d+)?)x\^2\s*([+-]\s*\d*(?:\.\d+)?)x\s*([+-]\s*\d+(?:\.\d+)?)?$
Which, as you can see, allows the capture of decimal expressions. They still have to be in order (a shortcoming of regular expressions, but only when you're trying to do everything at once), but you've increased the number of problems you can solve.
Reordered
The next step is to deal with out of order expressions. You could do this in a single regular expression, but I recommend against it for a few reasons:
It's awful to read, and thus maintain
Doing it in a single RegEx makes it hard to exclude extraneous information.
Doing it in pieces solves the problem of multiple terms automagically (like x^2+x+x+2)
Doing it in pieces sets you up to capture higher order polynomials more easily.
1. Validation
The first basic step is to decide what a term looks like. For me, a term is an operator, followed by optional whitespace, followed by a variable expression or a constant. Or, as a regular expression:
[+-]\s*(?:\d+(?:\.\d+)?|\d*(?:\.\d+)?x(?:\^\d+(?:\.\d+)?)?)
It's a doozy, so I'll include the Debuggex Visualization.
Wrap your head around the way that expression works, because it's the basic unit for the next one:
^-?\s*(?:\d+(?:\.\d+)?|\d*(?:\.\d+)?x(?:\^\d+(?:\.\d+)?)?)(?:\s*[+-]\s*(?:\d+(?:\.\d+)?|\d*(?:\.\d+)?x(?:\^\d+(?:\.\d+)?)?))+$
When you see that one in Debuggex, it becomes clear that it's basically just the former expression repeated one or more times. I added some whitespace and gave the first one an optional negative instead of an operator, but it's essentially the same.
Now, there is some room for improvement here: the expression does not let you add a negative or subtract a positive number (think 3x+ -4x^2). That is a minor change to the regular expression, though, so I'll move on. Match that regular expression against your line (trimmed, of course), and you know whether you have a valid equation.
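A minimal sketch of that validation step, assuming std::regex (ECMAScript syntax) rather than QRegExp and assuming the caller has already trimmed the line. The function name is_valid_polynomial is made up for this illustration. The term expression is written once and reused, which makes it easier to see that the full pattern is just "first term, then one or more operator-plus-term pairs".

```cpp
#include <cassert>
#include <regex>
#include <string>

bool is_valid_polynomial(const std::string& line) {
    // One term without its operator: a constant, or an optional coefficient
    // followed by x and an optional exponent.
    static const std::string term =
        R"re((?:\d+(?:\.\d+)?|\d*(?:\.\d+)?x(?:\^\d+(?:\.\d+)?)?))re";
    // First term with an optional leading minus, then one or more further terms.
    static const std::regex re("^-?\\s*" + term + "(?:\\s*[+-]\\s*" + term + ")+$");
    return std::regex_match(line, re);
}
```

Because the pattern ends in `+`, a line consisting of a single term is rejected, just as with the expression above.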
2. Extraction
Extraction is based on a single regular expression, modified to capture specific terms. It does require lookahead, which I must admit some regular expression engines don't support. Debuggex supports it, and the QRegExp documentation lists both positive (?=...) and negative (?!...) lookahead as supported (lookbehind is not), so I'm going to include it.
((?:^-?|[+-])\s*\d*(?:\.\d+)?)
This is your basic regular expression. Used by itself, it will capture a number, with no regard as to whether it's a coefficient or a constant. To capture a constant, add a negative lookahead to ensure it's not followed by a variable:
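((?:^-?|[+-])\s*\d*(?:\.\d+)?)(?![\d.]*\s*x)
The lookahead skips over any remaining digits before it checks for x; otherwise the engine could backtrack until \d* matches only the 3 of 34x and report that as a constant.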
To capture a specific exponent, just match it, followed by a space or another sign:
((?:^-?|[+-])\s*\d*(?:\.\d+)?)\s*x\^2(?=[\s+-])
To capture without an exponent, use a negative lookahead to ensure it's missing:
((?:^-?|[+-])\s*\d*(?:\.\d+)?)\s*x(?!\^)
Although, personally, I'd prefer to capture all variable terms at once with this:
((?:^-?|[+-])\s*\d*(?:\.\d+)?)\s*x(?:\^(\d+(?:\.\d+)?))?
Which has exactly two capturing groups: one for the coefficient, and one for the exponent (empty for a bare x).
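A sketch of the "doing it in pieces" idea, with some liberties taken: it walks over every term with std::sregex_iterator and sums the coefficients per exponent, so repeated terms such as x^2+x+x+2 collapse on their own. It is a variant of the expressions above, not the same thing. It uses std::regex instead of QRegExp, needs no lookahead, handles only integer exponents, and assumes the line has already passed validation. The name collect_terms and the group layout are made up for this illustration.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Sum the coefficients of every term, keyed by exponent (0 = constant term).
std::map<int, double> collect_terms(const std::string& line) {
    // Groups: 1 = sign, 2 = digits, 3 = variable part, 4 = exponent.
    static const std::regex term(R"re(([+-]?)\s*(\d*(?:\.\d+)?)(x(?:\^(\d+))?)?)re");
    std::map<int, double> coeffs;
    for (std::sregex_iterator it(line.begin(), line.end(), term), end; it != end; ++it) {
        const std::smatch& m = *it;
        const bool has_number = m[2].length() > 0;
        const bool has_x = m[3].matched;
        // Every part of the pattern is optional, so it also matches bare
        // whitespace and the empty string; those carry no term.
        if (!has_number && !has_x) continue;
        double value = has_number ? std::stod(m[2].str()) : 1.0;
        if (m[1].str() == "-") value = -value;
        int exponent = 0;
        if (has_x) exponent = m[4].matched ? std::stoi(m[4].str()) : 1;
        coeffs[exponent] += value;
    }
    return coeffs;
}
```

For a quadratic, a, b and c are then the entries for exponents 2, 1 and 0, and any other key in the map means the input was of a different order.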

Related

Single repetition for some of the symbols in a set of characters

I need a regular expression that matches a string containing letters and the symbols: #, . or space. But all the symbols must appear only once in the whole string.
^[#][.][a-z]+$ - This matches, for example, #.asdf, but I need something that matches one # and one dot anywhere in the string.
^[a-z]+[#][a-z]+[.][a-z]+$ - This is the best result for now.
I was just wondering if I can use something like that - ^[a-z[#]{1}]$.
You probably realize that you could get away with it by writing all 6 permutations of the 3 symbols. Unfortunately, that's the best you can do with plain regular expressions: by definition, the only way a regexp (seen as a finite state automaton) can remember what happened before is through its current state, so you can't keep a set of what you've seen so far.
So you'll have to do something like this.
Say \a stands for letters (you can expand it to [a-zA-Z] yourself; the expression is bad enough as it is). You can use
\a*([#]?\a*[.]?\a*[ ]?|[#]?\a*[ ]?\a*[.]?|[.]?\a*[#]?\a*[ ]?|[.]?\a*[ ]?\a*[#]?|[ ]?\a*[.]?\a*[#]?|[ ]?\a*[#]?\a*[.]?)\a*
I threw in question marks after the characters, so each can appear at most once; if you want exactly once, just drop all the question marks.
Evidently, if you were dealing with 10 symbols and you had to write 10! permutations you'd be better off writing a function like
fun check_string_contains_symbols_once_at_most(str) {
    for c in str {
        if symbol_list.contains(c) {
            if symbol_seen_set.contains(c) then return false
            else symbol_seen_set.add(c)
        }
    }
    return true
}
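The pseudocode above could look like this in C++. This is a minimal sketch: the function name and the default symbol list "#. " are assumptions taken from the question, and it only checks the at-most-once rule, not that every other character is a letter.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// True when each character from `symbols` occurs at most once in `text`.
bool symbols_at_most_once(const std::string& text,
                          const std::string& symbols = "#. ") {
    std::unordered_set<char> seen;
    for (char c : text) {
        if (symbols.find(c) == std::string::npos) continue;  // not a tracked symbol
        if (!seen.insert(c).second) return false;            // already seen once
    }
    return true;
}
```

The cost is linear in the length of the string no matter how many symbols are tracked, which is the point of leaving the regex behind here.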

Why does regex_match throw "complexity exception"?

I am trying to test (using boost::regex) whether a line in a file contains only numeric entries separated by spaces. I encountered an exception which I do not understand (see below). It would be great if someone could explain why it is thrown. Maybe I am doing something stupid in the way I define the patterns? Here is the code:
// regex_test.cpp
#include <string>
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;

int main(){
    // My basic pattern to test for a single numeric expression
    const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+\\.?[[:d:]]*";
    // pattern for the full line
    const string numeric_sequence_pattern = "([[:s:]]*"+numeric_value_pattern+"[[:s:]]*)+";
    regex r(numeric_sequence_pattern);
    string line= "1 2 3 4.444444444444";
    bool match = regex_match(line, r);
    cout<<match<<endl;
    //...
}
I compile that successfully with
g++ -std=c++11 -L/usr/lib64/ -lboost_regex regex_test.cpp
The resulting program worked fine so far and match == true as I wanted. But then I test an input line like
string line= "1 2 3 4.44444444e-16";
Of course, my pattern isn't built to recognise the format 4.44444444e-16 and I would expect that match == false. However, instead I get the following runtime error:
terminate called after throwing an instance of
'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): The complexity of matching the regular expression exceeded predefined bounds.
Try refactoring the regular expression to make each choice made by the state machine unambiguous.
This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
Why is that?
Note: the example I gave is extremal in the sense that putting one digit less after the dot works ok. That means
string line= "1 2 3 4.4444444e-16";
just results in match == false as expected. So, I'm baffled. What is happening here?
Thanks already!
Update:
Problem seems to be solved. Given the hint of alejrb I refactored the pattern to
const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+(?:\\.[[:d:]]*)?";
That seems to work as it should. Somehow, the isolated optional \\. inside the original pattern [[:d:]]+\\.?[[:d:]]* left too many possibilities to match a long sequence of digits in different ways.
I hope the pattern is safe now. However, if someone finds a way to make it blow up in the new form, let me know! It's not so obvious to me whether that might still be possible...
I'd say that your regex is probably exponentially backtracking. To protect you from a loop that would become entirely unworkable if the input were any longer, the regex engine just aborts the attempt.
One of the patterns that often causes this problem is anything of the form (x+x+)+ - which you build up here when you place the first pattern inside the second.
There's a good discussion at http://www.regular-expressions.info/catastrophic.html
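One way to avoid the (x+x+)+ shape altogether is to make the separator mandatory between values, so that a run of digits can only be split one way. The sketch below uses std::regex rather than boost::regex and its own pattern, not the one from the question; the function name is_numeric_line is made up for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the line consists only of numbers separated by whitespace.
bool is_numeric_line(const std::string& line) {
    // value (whitespace value)*: the \s+ between values is not optional,
    // so adjacent digits always belong to the same value.
    static const std::regex re(
        R"re(\s*[-+]?\d+(?:\.\d*)?(?:\s+[-+]?\d+(?:\.\d*)?)*\s*)re");
    return std::regex_match(line, re);
}
```

With this shape the failing input from the question is simply rejected, because after the engine stops at the `e` there is no alternative way to regroup the digits before it.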

custom regular expression parser

I would like to do regular expression matching on custom alphabets, using custom commands. The purpose is to investigate equations and expressions that appear in meteorology.
So, for example, my alphabet can be [p, rho, u, v, w, x, y, z, g, f, phi, t, T, +, -, /]. NOTE: rho and phi consist of multiple characters but should each be treated as a single character.
I would also like to use custom commands, such as \v for a variable, i.e. anything that is not one of the arithmetic operators.
I would like to use other commands such as (\v). Note the dot should match dx/dt, where x is a variable. Similarly, given p=p(x,y,z), p' would match dp/dx, dp/dy, and dp/dz, but not dp/df (somewhere it would be stated that p = p(x,y,z)).
I would also like to be able to backtrack.
Now, I have investigated PCRE and Ragel with D. I see that the first two problems are solvable, with multiple-character objects defined as fixed objects and not as a character class.
However, how do I address the third?
I don't see either PCRE or Ragel admitting a way to use custom commands.
Moreover, since I would like to backtrack, I am not sure Ragel is the correct option, as this would need a stack, which means I would be using a CFG.
Is there perhaps a domain-specific language for building such regex/CFG machines (for 64-bit Linux, if that matters)?
Nothing is impossible. Just write a new class in your programming language that wraps a regex engine and defines the new syntax. It will be your own regular expression syntax. For example:
result = latex_string.match("p'(x,y,z)", "full"); // match dp/dx, dp/dy, dp/dz
result = latex_string_array.match("p'(x,y,z)", "partial"); // match ∂p/∂x, ∂p/∂y, ∂p/∂z
. . .
The match method translates the pseudo-regular expression inside your class and returns the result in whatever form you want. The input definition can be a string, an array, or both. If a function should be matched by all of its derivatives, you can simplify the search notation to .match("p'").
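As a rough sketch of what such a translation layer might do, the function below expands a custom command into an ordinary regex before handing it to the engine. Everything here is hypothetical: the name expand_pattern, the choice of std::regex, and the rule that \v becomes an alternation of the declared variable names. It also assumes the variable names contain no regex metacharacters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Replace every \v in `custom` with an alternation of the variable names.
std::string expand_pattern(const std::string& custom,
                           std::vector<std::string> variables) {
    // Longest names first, so a multi-character symbol such as "rho" is
    // tried before any shorter name that happens to be its prefix.
    std::sort(variables.begin(), variables.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });
    std::string alternation = "(?:";
    for (std::size_t i = 0; i < variables.size(); ++i) {
        if (i > 0) alternation += '|';
        alternation += variables[i];
    }
    alternation += ')';

    std::string out;
    for (std::size_t i = 0; i < custom.size(); ++i) {
        if (custom[i] == '\\' && i + 1 < custom.size() && custom[i + 1] == 'v') {
            out += alternation;
            ++i;  // skip the 'v'
        } else {
            out += custom[i];
        }
    }
    return out;
}
```

With the variables p, rho and x, the custom pattern d\v/dt expands to a regex that accepts drho/dt and dx/dt but not df/dt. A command like p' would need the declared dependency list of p as its variable set, which is the part a real implementation would have to store somewhere.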
One simple note: the same rendered formula can have the source
\mathrm{d}y=\frac{\mathrm{d}y}{\mathrm{d}t}\mathrm{d}t
or
dy=\frac{dy}{dt}dt
or, finally, plain
dy=(dy/dt)dt
The problem with generalizing the meaning of LaTeX equations through regular expressions is the human input factor. LaTeX is just a notation, and an author can write the same thing in various ways.
The best and most precise way is to analyze the content of the formula and build a computation tree. In that case you search not just for the notation of differentials or derivatives, but for the instructions to calculate them. Either way, it requires a detailed analysis of the formula string that handles the many ways of writing it.
One more thing, and good news for you! There is no need to define a magic regex-latex multibyte Greek alphabet. UTF-8 has ρ (GREEK SMALL LETTER RHO), which you can use in the UI; in the search method, treat it as \rho and simply use the regex /\\frac{d\\rho}{dx}/.
One more example:
// search string
equation = "dU= \left(\frac{\partial U}{\partial S}\right)_{V,\{N_i\}}dS+ \left(\frac{\partial U}{\partial V}\right)_{S,\{N_i\}}dV+ \sum_i\left(\frac{\partial U}{\partial N_i}\right)_{S,V,\{N_{j \ne i}\}}dN_i";
. . .
// user input by UI
. . .
// call method
equation.equation_match("U'");// example notation for all types of derivatives for all variables
. . .
// inside the 'equation_match' method you will use native regex methods
matches1 = equation.match(/dU/); // dU
matches2 = equation.match(/\\partial U/); // ∂U
etc.
return(matches);// combination of matches

C++ Simple use of regex

I'm just trying to mess around and get familiar with using regex in c++.
Let's say I want the user to input the following: ###-$$-###, make #=any number between 0-9 and $=any number between 0-5. This is my idea for accomplishing this:
regex rx("[0-9][0-9][0-9]""\\-""[0-5][0-5]")
That's not the exact code; however, that's the general idea for checking whether the user's input is a valid string of numbers. Now, let's say I won't allow numbers starting with a 0, so 099-55-999 is not acceptable. How can I check something like that and output invalid? Thanks
[0-9]{3}-[0-5]{2}-[0-9]{3}
matches a string that starts with three digits between 0 and 9, followed by a dash, followed by two digits between 0 and 5, followed by a dash, followed by three digits between 0 and 9.
Is that what you're looking for? This is very basic regex stuff. I suggest you look at a good tutorial.
EDIT: (after you changed your question):
[1-9][0-9]{2}-[0-5]{2}-[0-9]{3}
would match the same as above except for not allowing a 0 as the first character.
std::tr1::regex rx("[0-9]{3}-[0-5]{2}-[0-9]{3}");
You're talking about using tr1 regex in C++, right, and not managed C++? If so, go here where it explains this stuff.
Also, you should know that if you're using VS2010, you don't need the Boost library anymore for regex.
Try this:
#include <regex>
#include <iostream>
#include <string>

int main()
{
    std::tr1::regex rx("\\d{3}-[0-5]{2}-\\d{3}");
    std::string s;
    std::getline(std::cin,s);
    if(regex_match(s.begin(),s.end(),rx))
    {
        std::cout << "Matched!" << std::endl;
    }
}
For an explanation, check @Tim's answer. Do note the double \ for the digit metacharacter.
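With a C++11 compiler the same check can be written against std::regex instead of std::tr1::regex. A minimal sketch, using the pattern that rejects a leading 0; the function name is_valid_code is made up for this illustration, and the raw string literal avoids the doubled backslashes.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True for input of the form ###-$$-### whose first digit is not 0.
bool is_valid_code(const std::string& input) {
    static const std::regex rx(R"([1-9][0-9]{2}-[0-5]{2}-[0-9]{3})");
    // regex_match requires the whole input to match, not just a part of it.
    return std::regex_match(input, rx);
}
```

Printing "invalid" is then a matter of testing the return value after std::getline.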

How to find a formatted number in a string?

If I have a string, and I want to find if it contains a number of the form XXX-XX-XXX, and return its position in the string, is there an easy way to do that?
XXX-XX-XXX can be any number, such as 259-40-092.
This is usually a job for a regular expression. Have a look at the Boost.Regex library for example.
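A minimal sketch of the regular expression route, using std::regex from C++11 rather than Boost.Regex (the search call has the same shape in both). The function name find_number is made up for this illustration, and the pattern assumes any three digits, dash, two digits, dash, three digits counts, even inside a longer run of digits.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the offset of the first ddd-dd-ddd run, or -1 when there is none.
long find_number(const std::string& text) {
    static const std::regex re(R"(\d{3}-\d{2}-\d{3})");
    std::smatch m;
    // regex_search looks for the pattern anywhere in the string.
    if (!std::regex_search(text, m, re)) return -1;
    return static_cast<long>(m.position(0));
}
```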
I did this before....
Regular Expression is your superhero, become his friend....
//Javascript
var numRegExp = /^[+]?([0-9- ]+){10,}$/;
if (numRegExp.test("259-40-092")) {
    alert("True - Number found....");
} else {
    alert("False - Not a Number");
}
To give you a position in the string, that will be your homework. :-)
The regular expression in C++ will be...
const char* regExp = "[+]?([0-9- ]+){10,}";
Use Boost.Regex for this instance.
If you don't want regexes, here's an algorithm:
Find the first -
LOOP {
Find the next -
If not found, break.
Check if the distance is 2
Check if the 8 characters surrounding the two minuses are digits
If so, found the number.
}
Not optimal, but the scan speed will already be dominated by the cache/memory speed. It can be optimized by considering on what part the match failed, and how. For instance, if you've got "123-4X-X............", when you find the X you know that you can skip ahead quickly: the second - preceding the X cannot be the first - of a proper number. Similarly, in "123--" you know that the second - can't be the first - of a number either.
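The unoptimized version of that algorithm might be sketched as follows. The function name find_formatted_number is made up for this illustration, and it checks every dash as a candidate first dash rather than applying the skip-ahead ideas described above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the index where the first XXX-XX-XXX number starts,
// or std::string::npos when the string contains none.
std::size_t find_formatted_number(const std::string& s) {
    std::size_t dash = s.find('-');
    while (dash != std::string::npos) {
        // Layout relative to the first dash: ddd - dd - ddd, so we need
        // 3 characters before it, a second dash at +3, and 6 after it.
        if (dash >= 3 && dash + 6 < s.size() && s[dash + 3] == '-') {
            bool ok = true;
            for (std::size_t i = dash - 3; i <= dash + 6 && ok; ++i) {
                if (i == dash || i == dash + 3) continue;  // the two dashes
                ok = std::isdigit(static_cast<unsigned char>(s[i])) != 0;
            }
            if (ok) return dash - 3;
        }
        dash = s.find('-', dash + 1);
    }
    return std::string::npos;
}
```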