Regex to match mathematical expressions - c++

I have a regex I intend to use to "tokenize" a mathematical expression like:
a + b + 1 + 2
int main() {
string rxstrIdentifier = "\\b[a-zA-Z]\\w*\\b";
string rxstrConstant = "\\b\\d+\\b";
string rxstrRef = "(" + rxstrIdentifier + ")|(" + rxstrConstant + ")"; // identifier or constant
const regex rxExpr = regex("^(" + rxstrRef + ")(.*)$"); // {x} [{+} {y}]*
//const regex rxSubExpr = regex("^\\s*([+])\\s*(" + rxstrRef + ")(.*)$"); // {+} {x} [...]
string test = "b + a + 1";
cmatch res;
regex_search(test.c_str(), res, rxExpr);
cout << "operand: " << res[1] << endl;
cout << "res: " << res[2] << endl;
system("pause");
return 0;
}
The problem is that operand and res both print just b in this example. I expected:
operand: b
res: + a + 1
This used to work with another, similar regex:
const regex Parser::rxExpr = regex("^(\\w+)((\\s*([+])\\s*(\\w+))*)$"); // {x} [{+} {y}]*
const regex Parser::rxSubExpr = regex("^\\s*([+])\\s*(\\w+)(.*)$"); // {+} {x} [...]

Your regexes don't appear to allow for the whitespace in the string. \b matches word boundaries, but boundaries have zero width so nothing's consuming the spaces between the tokens.

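Use non-capturing (?:pattern) groups:
string rxstrRef = "(?:" + rxstrIdentifier + ")|(?:" + rxstrConstant + ")"; // identifier or constant
Non-capturing groups do not create submatches, so the alternatives no longer shift the group numbering: res[1] is the operand and res[2] is the (.*) remainder.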

Related

Problem with special characters with RegEx in C++

I am having trouble handling special characters in a string (from IIS SharePoint log files). The string contains a domain name followed by a backslash and a user name that starts with t, n, or r, which gets confused with an escape sequence. My code is as follows:
std::setlocale(LC_ALL, ".ACP"); //Sets the locale to the ANSI code page obtained from the operating system. FR characters
std::string subject("2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984");
std::string result;
std::string g1, g2, g5, g9, g10; //str groups in regex
try {
std::regex re("(\\d{4}-\\d{2}-\\d{2})( \\d{2}:\\d{2}:\\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \\d+.\\d+.\\d+.\\d+)");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
std::cout << "-------------------------------------------" << "\n";
g1 = match.str(1);
g2 = match.str(2);
g5 = match.str(5);
g9 = match.str(9);
g10 = match.str(10);
next++;
}
std::cout << "Date: " + g1 << "\n";
std::cout << "Time: " + g2 << "\n";
std::replace(g5.begin(), g5.end(), '+', ' ');
std::cout << "Link Document : " + g5 << "\n";
std::cout << "User: " + g9 << "\n";
std::cout << "IP: " + g10 << "\n";
}
catch (std::regex_error& e) {
std::cout << "Syntax error in the regular expression" << "\n";
}
My output for the domain name is: domainname onzaro
Can anyone help with this problem with \, \t, \n, or \r?
I'd urge you to use raw string literals. This is a solution designed for cases where the literal should not be processed in any way, such as yours.
The syntax is R "delimiter( raw_characters )delimiter", so in your case it could be:
std::string subject(R"raw(2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984)raw");
std::regex re( R"raw((\d{4}-\d{2}-\d{2})( \d{2}:\d{2}:\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \d+.\d+.\d+.\d+))raw");
(I might have missed some superfluous \ above.)
Those special sequences are called escape sequences, and they are processed in string literals at compile time (in translation phase 5, to be precise). For raw string literals this transformation is suppressed.
You don't need any special character handling. You just need to make sure that ")delimiter" doesn't appear in your literal, which I imagine could happen in a regex.
'\t' is one character, a horizontal tab. If you want the characters \ and t, you need to escape the backslash: "\\t".

Why is my c++ regex matching correctly, but not returning the correct value?

I have a regex for matching terms of a polynomial, which I will use to implement a function for turning a string into a polynomial class. You can see the regex demoed here with the correct matches being generated. However, when I try to implement it, my program finds the matches properly, but prints them bizarrely to the screen. For instance:
-21323x^5+1233x4+123x^2-1232
Trying to match: -21323x^5+1233x4+123x^2-1232
-21323x^5
Trying to match: +1233x4+123x^2-1232
1233x4
Trying to match: +123x^2-1232
12xx^2
Trying to match: -1232
-1232
In this case, for some reason it prints 12xx^2 rather than 123x^2
And another:
-1234x^5+789x4+6x^2-567+123x
Trying to match: -1234x^5+789x4+6x^2-567+123x
-1234x^5
Trying to match: +789x4+6x^2-567+123x
789x4
Trying to match: +6x^2-567+123x
x^22
Trying to match: -567+123x
-567
Trying to match: +123x
23xx
In this case it shows x^22 instead of 6x^2 and 23xx instead of 123x.
This is my code:
Poly* Poly::fromString(std::string str) {
Poly* re = new Poly;
bool returnNull = true;
std::regex r_term("((-?[0-9]*)x(\\^?([0-9]+))?|-?[0-9]+)");
std::smatch sm;
while(std::regex_search(str, sm, r_term)) {
returnNull = false;
std::cout << "Trying to match: " << str << std::endl;
str = sm.suffix().str();
std::cout << sm.str() << std::endl;
}
if(returnNull) {
delete re;
return nullptr;
} else return re;
}
While Igor correctly notices the issue with your current code, I think all you need is to get full pattern matches, and for this purpose, I'd rather suggest using a regex iterator:
std::regex r("(-?[0-9]*)x(\\^?([0-9]+))?|-?[0-9]+");
std::string s = "-21323x^5+1233x4+123x^2-1232";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << '\n';
std::cout << "Group 1 value: " << m.str(1) << '\n';
std::cout << "Group 2 value: " << m.str(2) << '\n';
std::cout << "Group 3 value: " << m.str(3) << '\n';
}
See the C++ demo online.
The pattern details:
-? - 1 or 0 hyphens
[0-9]* - 0 or more digits
x - a literal char x
(\\^?([0-9]+))? - 1 or 0 sequences of:
\\^? - an optional (1 or 0) ^ symbol
[0-9]+ - 1 or more digits
| - or
-? - an optional hyphen/minus
[0-9]+ - 1 or more digits.

C++ Boost:regex_search expression - Issue combining expressions to catch all sequences

I'm trying to write a template parser and need to pick up three distinct sets of sequences for string replacement.
// Each of These Expressions Work Perfect Separately!
// All Sequences start with | pipe. Followed by
boost::regex expr {"(\\|[0-9]{2})"}; // 2 Digits only.
boost::regex expr {"(\\|[A-Z]{1,2}+[0-9]{1,2})"}; // 1 or 2 Uppercase Chars and 1 or 2 Digits.
boost::regex expr {"(\\|[A-Z]{2})(?!\\d)"}; // 2 Uppercase Chars with no following digits.
However, once I try to combine them into a single statement, I can't get them to work properly to catch all sequences. Can anyone shed some light on what I'm missing?
Here is what I have so far:
// Each sequence is separated with a | for or between parenthesis.
boost::regex expr {"(\\|[0-9]{2})|(\\|[A-Z]{1,2}+[0-9]{1,2})|(\\|[A-Z]{2})(?!\\d)"};
I'm using the following string for testing. It is probably a little more than needed, but here is the code as well.
#include <boost/regex.hpp>
#include <string>
#include <iostream>
std::string str = "|MC01 |U1 |s |A22 |12 |04 |2 |EW |SSAADASD |15";
boost::regex expr {"(\\|[0-9]{2})|(\\|[A-Z]{1,2}+[0-9]{1,2})|(\\|[A-Z]{2})(?!\\d)"};
boost::smatch matches;
std::string::const_iterator start = str.begin(), end = str.end();
while(boost::regex_search(start, end, matches, expr))
{
std::cout << "Matched Sub '" << matches.str()
<< "' following ' " << matches.prefix().str()
<< "' preceeding ' " << matches.suffix().str()
<< std::endl;
start = matches[0].second;
for(size_t s = 1; s < matches.size(); ++s)
{
std::cout << "+ Matched Sub " << matches[s].str()
<< " at offset " << matches[s].first - str.begin()
<< " of length " << matches[s].length()
<< std::endl;
}
}
I believe this is what you want:
const boost::regex expr {"(\\|[0-9]{2})|(\\|[A-Z]{1,2}+[0-9]{1,2})|(\\|[A-Z]{2})"}; // basically, remove the constraint on the last sub
I also suggest being explicit about the flags you use when constructing expr and when calling regex_search.
I also found that adding an extra check on matched removes the half-matched patterns that were throwing me off.
for(size_t s = 1; s < matches.size(); ++s)
{
if (matches[s].matched) // Check for bool True/False
{
std::cout << "+ Matched Sub " << matches[s].str()
<< " at offset " << matches[s].first - str.begin()
<< " of length " << matches[s].length()
<< std::endl;
}
}
Without it, unmatched groups were showing up with an offset at the end of the string and a length of 0. I hope this helps anyone else who runs into this.
Another tip: inside the loop, the value of s tells you which alternative matched. Since I have three alternatives, s is 1 when matched is true for the first one, and 2 or 3 for the others. Pretty nice!

'+' cannot add two pointers, but just printing an int and an explicit string?

I am trying to use an array to keep track of the totals of different types of items (up to 50 types). When I want to print the totals out, I get an error saying "'+' cannot add two pointers." I'm thinking the problem is with my totals array somehow, but I can't figure it out. Below is a sample of my code:
string printSolution()
{
int totals[50];
string printableSolution = "";
for (int k = 0; k < itemTypeCount; k++)
{
totals[k] = 0;
}
for (int i = 0; i < itemCount; i++)
{
totals[items[i].typeCode]++;
}
for (int a = 0; a < itemTypeCount; a++)
{
printableSolution.append("There are " + totals[a] + " of Item type " + (a + 1) + ". \n");
}
}
A string literal such as "Foo" is an array of const char; in an expression like this it decays to const char*, i.e. a pointer type.
To understand what happens with:
"There are " + totals[a] + " of Item type " + (a + 1) + ". \n"
Let's look at an expression:
"0123456789" + 5
This actually just offsets 5 bytes from the start, so becomes:
"56789"
So an expression:
"0123456789" + 5 + "foo"
becomes:
"56789" + "foo"
as pointers, and this is not defined.
What you really want is string concatenation; this can be achieved using std::string.
We can write:
std::string("56789") + "foo"
and this generates a std::string with value: "56789foo" as you desire.
But:
std::string("0123456789") + 5
is also not defined. You need to use:
std::string("0123456789") + std::to_string(5)
So, finally you want:
std::string("There are ") + std::to_string(totals[a]) + " of Item type " + std::to_string(a + 1) + ". \n"
Note that you do not need to explicitly convert every "" literal to std::string: once one operand is a std::string, implicit type conversion takes care of the other operand in operator+. However, adding the conversions would do no harm:
std::string("There are ") + std::to_string(totals[a]) + std::string(" of Item type ") + std::to_string(a + 1) + std::string(". \n")
The problem is here:
"There are " + totals[a] + " of Item type " + (a + 1) + ". \n"
It means const char* + int + const char* + int + const char*. You need to print them out separately or convert each int to a std::string.
Use C++-style formatting instead:
std::ostringstream oss;
oss << "There are " << totals[a] << " of Item type " << (a + 1) << ". \n";
printableSolution += oss.str();

Anything like substr but instead of stopping at the byte you specified, it stops at a specific string [duplicate]

I have a client for a pre-existing server. Let's say I get some packets "MC123, 456!##".
I store these packets in a char called message. To print out a specific part of them, in this case the numbers part of them, I would do something like "cout << message.substr(3, 7) << endl;".
But what if I receive another message "MC123, 456, 789!##"? "cout << message.substr(3,7)" would only print out "123, 456", whereas I want "123, 456, 789". How would I do this, assuming I know that every message ends with "!##"?
First - Sketch out the indexing.
std::string packet1 = "MC123, 456!##";
//                     0123456789012345678
//                       ^------^    desired text
std::string packet2 = "MC123, 456, 789!##";
//                     0123456789012345678
//                       ^-----------^    desired text
The other answers are OK. If you wish to use std::string's find functions,
consider rfind and find_first_not_of, as in the following code:
#include <iostream>
#include <string>

// forward declaration
void messageShow(std::string packet,
size_t startIndx = 2);
// /////////////////////////////////////////////////////////////////////////////
int main (int, char** )
{
// 012345678901234567
// |
messageShow("MC123, 456!##");
messageShow("MC123, 456, 789!##");
messageShow("MC123, 456, 789, 987, 654!##");
// error test cases
messageShow("MC123, 456, 789##!"); // missing !##
messageShow("MC123x 456, 789!##"); // extraneous char in packet
return(0);
}
void messageShow(std::string packet,
size_t startIndx) // default value 2
{
static size_t seq = 0;
seq += 1;
std::cout << packet.size() << " packet" << seq << ": '"
<< packet << "'" << std::endl;
do
{
size_t bangAtPound_Indx = packet.rfind("!##");
if(bangAtPound_Indx == std::string::npos){ // not found, can't do anything more
std::cerr << " '!##' not found in packet " << seq << std::endl;
break;
}
size_t printLength = bangAtPound_Indx - startIndx;
const std::string DIGIT_SPACE = "0123456789, ";
size_t allDigitSpace = packet.find_first_not_of(DIGIT_SPACE, startIndx);
if(allDigitSpace != bangAtPound_Indx) {
std::cerr << " extraneous char found in packet " << seq << std::endl;
break; // something extraneous in string
}
std::cout << bangAtPound_Indx << " message" << seq << ": '"
<< packet.substr(startIndx, printLength) << "'" << std::endl;
}while(0);
std::cout << std::endl;
}
This outputs
13 packet1: 'MC123, 456!##'
10 message1: '123, 456'
18 packet2: 'MC123, 456, 789!##'
15 message2: '123, 456, 789'
28 packet3: 'MC123, 456, 789, 987, 654!##'
25 message3: '123, 456, 789, 987, 654'
18 packet4: 'MC123, 456, 789##!'
'!##' not found in packet 4
18 packet5: 'MC123x 456, 789!##'
extraneous char found in packet 5
Note: String indexes start at 0. The index of the digit '1' is 2.
The correct approach is to look for existence / location of the "known termination" string, then take the substring up to (but not including) that substring.
Something like
std::string termination = "!#$";
std::size_t position = message.find(termination); // std::string::npos if not found
std::string importantBit = message.substr(0, position);
You could check the front of the string separately as well. Combining these, you could use regular expressions to make your code more robust, using a regex like
MC([0-9,]+)!#\$
This will return the bit between MC and !#$ but only if it consists entirely of numbers and commas. Obviously you can adapt this as needed.
UPDATE: you asked in your comment how to use the regular expression. Here is a very simple program. Note that this uses C++11, so you need to make sure your compiler supports it.
#include <iostream>
#include <regex>
int main(void) {
std::string s ("ABC123,456,789!#$");
std::smatch m;
std::regex e ("ABC([0-9,]+)!#\\$"); // matches the kind of pattern you are looking for
if (std::regex_search (s,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
}
}
On my Mac, I can compile the above program with
clang++ -std=c++0x -stdlib=libc++ match.cpp -o match
If instead of just digits and commas you want "anything" in your expression (but it's still got fixed characters in front and behind) you can simply do
std::regex e ("ABC(.*)!#\\$");
Here, .* means "zero or more of 'anything'", but followed by !#$. The double backslash has to be there to "escape" the dollar sign, which has special meaning in regular expressions (it means "the end of the string").
The more accurately your regular expression reflects exactly what you expect, the better you will be able to trap any errors. This is usually a very good thing in programming. "Always check your inputs".
One more thing - I just noticed you mentioned that you might have "more stuff" in your string. This is where using regular expressions quickly becomes the best. You mentioned a string
MC123, 456!##*USRChester.
and wanted to extract 123, 456 and Chester. That is - stuff between MC and !#$, and more stuff after USR (if that is even there). Here is the code that shows how that is done:
#include <iostream>
#include <regex>
int main(void) {
std::string s1 ("MC123, 456!#$");
std::string s2 ("MC123, 456!#$USRChester");
std::smatch m;
std::regex e ("MC([0-9, ]+)!#\\$(?:USR)?(.*)$"); // matches the kind of pattern you are looking for
if (std::regex_search (s1,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
std::cout << "match[2] = " << m[2] << std::endl;
}
if (std::regex_search (s2,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
std::cout << "match[2] = " << m[2] << std::endl;
if (m[2].length() > 0) {
std::cout << m[2] << ": " << m[1] << std::endl;
}
}
}
Output:
match[0] = MC123, 456!#$
match[1] = 123, 456
match[2] =
match[0] = MC123, 456!#$USRChester
match[1] = 123, 456
match[2] = Chester
Chester: 123, 456
The matches are:
match[0] : "everything in the input string that was consumed by the Regex"
match[1] : "the thing in the first set of parentheses"
match[2] : "The thing in the second set of parentheses"
Note the use of the slightly tricky (?:USR)? expression. This says "This might (that's the ()? ) be followed by the characters USR. If it is, skip them (that's the ?: part) and match what follows."
As you can see, simply testing whether m[2] is empty will tell you whether you have just numbers, or number plus "the thing after the USR". I hope this gives you an inkling of the power of regular expressions for chomping through strings like yours.
If you are sure about the ending of the message, message.substr(3, message.size()-6) will do the trick.
However, it is good practice to check everything, just to avoid surprises.
Something like this:
if (message.size() < 6)
throw error;
if (message.substr(0,3) != "MCX") //the exact numbers do not match in your example, but you get the point...
throw error;
if (message.substr(message.size()-3) != "!##")
throw error;
string data = message.substr(3, message.size()-6);
Just calculate the offset first.
string str = ...;
size_t start = 3;
size_t end = str.find("!##");
assert(end != string::npos);
return str.substr(start, end - start);
You can get the index of "!##" by using:
message.find("!##")
Then use that answer instead of 7. You should also check for it equalling std::string::npos which indicates that the substring was not found, and take some different action.
string msg = "MC4,512,541,3123!##";
// Skip the 2-character "MC" prefix and stop before the 3-character "!##" suffix.
for (size_t i = 2; i + 3 < msg.length(); i++) {
cout << msg[i];
}
or use char[]
char msg[] = "MC4,123,54!##";
sizeof(msg) - 1; // instead of msg.length()
// -1 for the null byte at the end (each char takes 1 byte so the size -1 == number of chars)