C++11 regex to tokenize Mathematical Expression - regex

I have the following code to tokenize a string of the format: (1+2)/((8))-(100*34):
I'd like to throw an error if the user enters an operator or character that isn't part of my regex, e.g. 3^4 or x-6.
Is there a way to negate my regex, search for it, and throw the error if it matches?
Can the regex itself be improved?
//Using c++11 regex to tokenize input string
//[0-9]+ = 1 or many digits
//Or [\\-\\+\\\\\(\\)\\/\\*] = "-" or "+" or "/" or "*" or "(" or ")"
std::regex e ( "[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
std::sregex_iterator rend;
std::sregex_iterator a( infixExpression.begin(), infixExpression.end(), e );
queue<string> infixQueue;
while (a!=rend) {
infixQueue.push(a->str());
++a;
}
return infixQueue;
-Thanks

You can search the string with the expression [^0-9()+\-*/], written as the C++ string "[^0-9()+\\-*/]". It finds any character that is NOT a digit, a round bracket, a plus sign, a minus sign (really a hyphen), an asterisk or a slash.
A search with this expression should find nothing; if it does find something, the string contains an unsupported character such as ^ or x.
[...] is a positive character class: it matches one character that is among those listed in the square brackets.
[^...] is a negative character class: it matches one character that is NOT among those listed in the square brackets.
The only characters that must be escaped within square brackets to be taken literally are ], \ and - (plus ^ when it is the first character). The - need not be escaped if it is the first or last character in the list. It is nevertheless better to always escape - within square brackets, because that makes it unambiguous that a literal hyphen is meant and not a range "FROM x TO z".
Of course, this expression does not check for missing closing round brackets. Unlike a compiler or script interpreter, formula parsers often do not require a closing parenthesis for every opening one, because it is not needed to calculate the value of the entered formula.
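The check described above could be sketched as follows. This is only a minimal illustration: the function names has_unsupported_char and require_supported_chars are hypothetical, and it assumes std::regex with its default ECMAScript grammar.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Returns true if the expression contains any character outside the
// supported set: digits, round brackets, +, -, * and /.
bool has_unsupported_char(const std::string& expression) {
    // Negated class from the answer; the hyphen is escaped so it cannot be
    // read as a range.
    static const std::regex unsupported("[^0-9()+\\-*/]");
    return std::regex_search(expression, unsupported);
}

// Hypothetical wrapper: throws before any tokenizing is attempted.
void require_supported_chars(const std::string& expression) {
    if (has_unsupported_char(expression)) {
        throw std::invalid_argument("unsupported character in: " + expression);
    }
}
```

Note that whitespace is not in the class either, so an input such as 1 + 2 is rejected too; add a space inside the brackets if that is not wanted.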

An answer has already been given, but perhaps someone might need this:
[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]
This regex separates floats, integers and arithmetic operators.
Here's the trick:
[0-9]?([0-9]*[.])?[0-9]+ -> if the number has a decimal point, grab the digits before the point, the point, and the digits that follow it; if not, just grab the digits.
Sorry if my answer isn't clear; I just learned regex and found this solution on my own by trial and error.
Here's the code (it takes a mathematical expression and splits all numbers and operators into a vector).
NOTE: I don't know if it accepts whitespace; the mathematical expressions I worked with had none. Example: 4+2*(3+1) is separated nicely, but I haven't tried it with whitespace.
/* Separate every int or float or operator into a single string using regular expression and store it in untokenize vector */
string infix; //The string to be parse (the arithmetic operation if you will)
vector<string> untokenize;
std::regex words_regex("[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
auto words_begin = std::sregex_iterator(infix.begin(), infix.end(), words_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
cout << (*i).str() << endl;
untokenize.push_back((*i).str());
}
Output:
(
1
+
2
)
/
(
(
8
)
)
-
(
100
*
34
)
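On the whitespace question, here is a small sketch of the same idea wrapped in a function. The function name tokenize and the simplified pattern are assumptions for illustration, not the answer's exact regex (which also contains a backslash in its operator class). Because std::sregex_iterator searches for the next match, characters that match nothing, including spaces, are skipped silently.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits an infix expression into number and operator tokens.
// Characters that match neither alternative (e.g. spaces) are skipped.
std::vector<std::string> tokenize(const std::string& infix) {
    // Number with optional fractional part, or one operator/bracket.
    static const std::regex token("[0-9]*\\.?[0-9]+|[()+\\-*/]");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(infix.begin(), infix.end(), token), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Silent skipping also means that unsupported characters such as ^ vanish without an error, so a separate validation step is still needed if they should be reported.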

Related

regex_replace is returning empty string

I am trying to remove all characters that are not digit, dot (.), plus/minus sign (+/-) with empty character/string for float conversion.
When I pass my string through regex_replace function I am returned an empty string.
I believe something is wrong with my regex: std::regex reg_exp("\\D|[^+-.]")
Code
#include <iostream>
#include <regex>
int main()
{
std::string temporary_recieve_data = " S S +456.789 tg\r\n";
std::string::size_type sz;
const std::regex reg_exp("\\D|[^+-.]"); // matches not digit, decimal point (.), plus sign, minus sign
std::string numeric_string = std::regex_replace(temporary_recieve_data, reg_exp, ""); //replace the character that are not digit, dot (.), plus-minus sign (+,-) with empty character/string for float conversion
std::cout << "Numeric String : " << numeric_string << std::endl;
if (numeric_string.empty())
{
return 0;
}
float data_value = std::stof(numeric_string, &sz);
std::cout << "Float Value : " << data_value << std::endl;
return 0;
}
I have been trying to evaluate my regex on regex101.com for the past 2 days, but I am unable to figure out what is wrong with it. When I use just \D, the editor substitutes non-digit characters properly, but as soon as I add the alternation | for not dot ., plus + or minus -, the editor returns an empty string.
The string is empty because your regex matches each character.
\D already matches every character that is not a digit.
So the plus, the hyphen and the period are already consumed by it.
The digits in turn get consumed by the negated class [^+-.].
Furthermore, a hyphen between two characters indicates a range inside a character class.
Either escape it or put it at the start or end of the character class.
(Funnily enough, the range +-. used here, code points 43-46, even contains the hyphen.)
Remove the alternation with \D and put \d into the negated class:
[^\d.+-]+
See this demo at regex101 (appending + to match one or more characters at a time is more efficient).
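Applied to the code from the question, the corrected pattern might be used like this. It is a minimal sketch: the helper name keep_numeric_chars is hypothetical, and std::regex with its default ECMAScript grammar is assumed.

```cpp
#include <regex>
#include <string>

// Removes every run of characters that are not a digit, dot, plus or minus.
std::string keep_numeric_chars(const std::string& input) {
    // The hyphen is last in the class, so it is a literal and not a range.
    static const std::regex not_numeric("[^\\d.+-]+");
    return std::regex_replace(input, not_numeric, "");
}
```

For the input from the question, " S S +456.789 tg\r\n", this leaves "+456.789", which std::stof can then convert.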

regex_search trying to match a string containing '['

Being relatively new to regular expressions, I am having trouble figuring out the correct syntax. I am trying to match a string with the following pattern: String[string]!='string'. I want to divide it into three matches as follows:
String[string] // can contain numbers
!= // operator that can also be: =,> and < or a combination of these
'string' // can contain _ and numbers
So far I have managed to match a string with the following pattern: string='string'. Using this code:
const string strExpression = "TypeValue!='Set_Site'";
regex regex("([a-zA-Z0-9]+)([=><!]+)(['a-zA-Z0-9_]+)");
smatch match;
if (regex_search(strExpression.begin(), strExpression.end(), match, regex))
{
string indicator(match[1]);
string op(match[2]);
string value(match[3]);
}
However when I try to add '[' and ']' to the regex syntax I don't get any matches. I have modified the code like this:
const string strExpression = "Type[Value]!='Set_Site'";
regex regex("([]a-zA-Z0-9[]+)([=><!]+)(['a-zA-Z0-9_]+)");
smatch match;
if (regex_search(strExpression.begin(), strExpression.end(), match, regex))
{
string indicator(match[1]);
string op(match[2]);
string value(match[3]);
}
According to the documentation I am reading, the right square bracket ( ']' ) loses its special meaning (terminating the bracket expression) and represents itself in a bracket expression if it occurs first in the list. For the left square bracket ( '[' ) it says that it loses its special meaning within a bracket expression. Following these rules and definitions, I cannot see why I am not getting any matches.
Can someone give me some guidelines on what I am doing wrong?
Thank you.
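One likely explanation, assuming std::regex is used with its default ECMAScript grammar: the rule quoted above describes POSIX bracket expressions. In the ECMAScript grammar a ] directly after [ closes the class, so [] is an empty class and a literal ] has to be escaped. A minimal sketch with both brackets escaped (the struct Comparison and the function split_comparison are hypothetical names used only for illustration):

```cpp
#include <regex>
#include <string>

// Holds the three parts of an expression such as Type[Value]!='Set_Site'.
struct Comparison {
    std::string indicator;
    std::string op;
    std::string value;
    bool matched;
};

Comparison split_comparison(const std::string& expression) {
    // \\[ and \\] in the C++ string become \[ and \] for the regex engine,
    // i.e. literal square brackets inside the character class.
    static const std::regex pattern(
        "([a-zA-Z0-9\\[\\]]+)([=><!]+)(['a-zA-Z0-9_]+)");
    std::smatch match;
    Comparison result{};  // matched is false until a match is found
    if (std::regex_search(expression, match, pattern)) {
        result = {match[1].str(), match[2].str(), match[3].str(), true};
    }
    return result;
}
```

Alternatively, the original pattern may work unchanged if the regex is constructed with a POSIX grammar flag such as std::regex::extended, though that has not been verified here.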

Finding number between [/ and ] using regex in C++

I want to find the number between [/ and ] (12345 in this case).
I have written such code:
float num;
string line = "A111[/12345]";
boost::regex e ("[/([0-9]{5})]");
boost::smatch match;
if (boost::regex_search(line, match, e))
{
std::string s1(match[1].first, match[1].second);
num = boost::lexical_cast<float>(s1); //convert to float
cout << num << endl;
}
However, I get this error: The error occurred while parsing the regular expression fragment: '/([0-9]{5}>>>HERE>>>)]'.
You need to double escape the [ and ], which are special characters in regex that denote character classes. The correct regex declaration is
boost::regex e ("\\[/([0-9]{5})\\]");
This is necessary because C++ compiler also uses a backslash to escape entities like \n, and regex engine uses the backslash to escape special characters so that they are treated like literals. Thus, backslash gets doubled. When you need to match a literal backslash, you will have to use 4 of them (i.e. \\\\).
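The doubling can also be avoided with a C++11 raw string literal. The sketch below uses std::regex instead of Boost (an assumption made to keep it dependency-free), and the function name number_between_brackets is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the five digits between "[/" and "]", or an empty string if the
// line contains no such group.
std::string number_between_brackets(const std::string& line) {
    // Raw string literal: the regex engine sees \[ and \] exactly as written,
    // so no backslash has to be doubled for the compiler.
    static const std::regex pattern(R"(\[/([0-9]{5})\])");
    std::smatch match;
    if (std::regex_search(line, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

For the line from the question, "A111[/12345]", the function returns "12345".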
Use the following (escape [ and ] because they are special characters in regex meaning a character class):
\\[/([0-9]{5})\\]
^^            ^^

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: any string that contains at least one digit and one letter, plus optionally any of these special characters: # - _ [] () {} ç \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function?
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson: ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumerical strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uù]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag means it will match upper and lower case (so you can reduce your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: there's a line break here for readability between (?! ) and [ ]. Just add whatever special characters you want to match to the final [...].
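In C++ the lookahead approach might look like the sketch below. It makes several assumptions: std::regex is used instead of Boost, the i flag is expressed as std::regex::icase, the month list is reduced to ASCII spellings (accented letters in a narrow-string character class are not handled reliably), and the function name is hypothetical.

```cpp
#include <regex>
#include <string>

// True if s has 3 to 90 characters, contains at least one digit and one
// letter, and contains none of the listed month names (case-insensitive).
bool is_alphanumeric_without_month(const std::string& s) {
    static const std::regex pattern(
        "^(?=.*\\d)(?=.*[a-zA-Z])"
        "(?!.*(?:janvier|fevrier|mars|avril|mai|juin|juillet|aout|septembre|"
        "octobre|novembre|decembre|jan|feb|mar|apr|may|jun|jul|aug|sep|oct|"
        "nov|dec))"
        ".{3,90}$",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(s, pattern);
}
```

Because the month names are matched as plain substrings, a string like remark1 is rejected as well (it contains mar); word boundaries would be needed to avoid that.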

Regex for numbers on scientific notation?

I'm loading a .obj file that has lines like
vn 8.67548e-017 1 -1.55211e-016
for the vertex normals. How can I detect them and bring them to double notation?
A regex that would work pretty well would be:
-?[\d.]+(?:e-?\d+)?
Converting to a number can be done like this: String in scientific notation C++ to double conversion, I guess.
The regex is
-? # an optional -
[\d.]+ # a series of digits or dots (see *1)
(?: # start non capturing group
e # "e"
-? # an optional -
\d+ # digits
)? # end non-capturing group, make optional
*1) This is not 100% correct: technically there can be only one dot, with only one (or no) digit before it. But in practice such input should not occur, so the regex is a good approximation and false positives should be very unlikely. Feel free to make the regex more specific.
You can identify the scientific values using: -?\d*\.?\d+e[+-]?\d+ regex.
I tried a number of the other solutions to no avail, so I came up with this.
^(-?\d+)\.?\d+(e-|e\+|e|\d+)\d+$
Anything that matches is considered to be valid Scientific Notation.
Please note: This accepts e+, e- and e; if you don't want to accept e, use this: ^(-?\d+)\.?\d+(e-|e\+|\d+)\d+$
I'm not sure if it works for C++, but in C# you can add (?i) between the ^ and (- in the regex to toggle inline case-insensitivity. Without it, exponents written like 1.05E+10 will not be recognised.
Edit: My previous regex was a little buggy, so I've replaced it with the one above.
The standard library function strtod handles the exponential component just fine (so does atof, but strtod allows you to differentiate between a failed parse and parsing the value zero).
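A minimal sketch of that distinction, with a hypothetical helper parse_double that returns a success flag together with the value: std::strtod sets its end pointer to the start of the input when nothing could be converted, which is what separates a failed parse from a parsed zero.

```cpp
#include <cstdlib>
#include <string>
#include <utility>

// Returns {true, value} if the text starts with a number, {false, 0.0}
// otherwise.
std::pair<bool, double> parse_double(const std::string& text) {
    const char* begin = text.c_str();
    char* end = nullptr;
    const double value = std::strtod(begin, &end);
    // end == begin means no characters were consumed: a failed parse.
    if (end == begin) {
        return std::make_pair(false, 0.0);
    }
    return std::make_pair(true, value);
}
```

With this, "0" yields a successful parse of zero, while "abc" is reported as a failure even though strtod returns 0 in both cases.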
If you can be sure that the format of the double is scientific, you can try something like the following:
string inp("8.67548e-017");
istringstream str(inp);
double v;
str >> scientific >> v;
cout << "v: " << v << endl;
If you want to detect whether there is a floating point number of that format, then the regexes above will do the trick.
EDIT: the scientific manipulator is actually not needed, when you stream in a double, it will automatically do the handling for you (whether it's fixed or scientific)
Well this is not exactly what you asked for since it isn't Perl (gak) and it is a regular definition not a regular expression, but it's what I use to recognize an extension of C floating point literals (the extension is permitting "_" in digit strings), I'm sure you can convert it to an unreadable regexp if you want:
/* floats: Follows ISO C89, except that we allow underscores */
let decimal_string = digit (underscore? digit) *
let hexadecimal_string = hexdigit (underscore? hexdigit) *
let decimal_fractional_constant =
decimal_string '.' decimal_string?
| '.' decimal_string
let hexadecimal_fractional_constant =
("0x" |"0X")
(hexadecimal_string '.' hexadecimal_string?
| '.' hexadecimal_string)
let decimal_exponent = ('E'|'e') ('+'|'-')? decimal_string
let binary_exponent = ('P'|'p') ('+'|'-')? decimal_string
let floating_suffix = 'L' | 'l' | 'F' | 'f' | 'D' | 'd'
let floating_literal =
(
decimal_fractional_constant decimal_exponent? |
hexadecimal_fractional_constant binary_exponent?
)
floating_suffix?
C format is designed for programming languages not data, so it may support things your input does not require.
For extracting numbers in scientific notation in C++ with std::regex I normally use
((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?((e|E)((\\+|-)?)[[:digit:]]+)?
which corresponds to
((\+|-)?\d+)(\.((\d+)?))?((e|E)((\+|-)?)\d+)?
This will match any number of the form +12.3456e-78 where
the sign can be either + or - and is optional
the decimal point as well as the digits after it are optional
the exponent is optional and can be written with a lower- or upper-case letter
Corresponding code for parsing might look like this:
std::regex const scientific_regex {"((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?((e|E)((\\+|-)?)[[:digit:]]+)?"};
std::string const str {"8.67548e-017 1 -1.55211e-016"};
for (auto it = std::sregex_iterator(str.begin(), str.end(), scientific_regex); it != std::sregex_iterator(); ++it) {
std::string const match {it->str()};
std::cout << match << std::endl;
}
If you want to convert the substrings found to double values, std::stod should handle the conversion correctly, as already pointed out by Ben Voigt.
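Combining the two steps could look like this sketch. The function name extract_doubles is hypothetical; the regex is the one from this answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Finds every number in the line and converts it to double with std::stod.
std::vector<double> extract_doubles(const std::string& line) {
    static const std::regex scientific_regex(
        "((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?"
        "((e|E)((\\+|-)?)[[:digit:]]+)?");
    std::vector<double> values;
    for (std::sregex_iterator it(line.begin(), line.end(), scientific_regex),
         end;
         it != end; ++it) {
        // Every match starts with an optional sign followed by a digit,
        // so std::stod always finds a number to convert.
        values.push_back(std::stod(it->str()));
    }
    return values;
}
```

For the line from the question, "vn 8.67548e-017 1 -1.55211e-016", this yields three values; the leading vn is skipped because it matches nothing.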