Why doesn't Boost.Regex find multiple matches in one string? - c++

I'm writing a small command-line program that asks the user for polynomials in the form ax^2+bx^1+cx^0. I'm going to parse the data later, but for now I'm just trying to see if I can match the polynomial with the regular expression (\+|-|^)(\d*)x\^([0-9*]*). My problem is that it doesn't match multiple terms in the user-entered polynomial unless I change it to ((\+|-|^)(\d*)x\^([0-9*]*))* (the difference is that the entire expression is grouped and has an asterisk at the end). The first expression works if I type something such as "4x^2" but not "4x^2+3x^1+2x^0", since it doesn't check multiple times.
My question is, why won't Boost.Regex's regex_match() find multiple matches within the same string? It does in the regular expression editor I used (Expresso) but not in the actual C++ code. Is it supposed to be like that?
Let me know if something doesn't make sense and I'll try to clarify. Thanks for the help.
Edit1: Here's my code (I'm following the tutorial here: http://onlamp.com/pub/a/onlamp/2006/04/06/boostregex.html?page=3)
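#include <boost/regex.hpp>
#include <cstdlib> // system
#include <iostream>
#include <string>
using namespace std;
using boost::cmatch; using boost::regex; using boost::regex_match;
int main()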
{
string polynomial;
cmatch matches; // matches
regex re("((\\+|-|^)(\\d*)x\\^([0-9*]*))*");
cout << "Please enter your polynomials in the form ax^2+bx^1+cx^0." << endl;
cout << "Polynomial:";
getline(cin, polynomial);
if(regex_match(polynomial.c_str(), matches, re))
{
for(int i = 0; i < matches.size(); i++)
{
string match(matches[i].first, matches[i].second);
cout << "\tmatches[" << i << "] = " << match << endl;
}
}
system("PAUSE");
return 0;
}

You're using the wrong thing -- regex_match is intended to check whether a (single) regex matches the entirety of a sequence of characters. As such, you need to either specify a regex that matches the whole input, or use something else. For your situation, it probably makes the most sense to just modify the regex as you've already done (group it and add a Kleene star). If you wanted to iterate over the individual terms of the polynomial, you'd probably want to use something like a regex_token_iterator.
Edit: Of course, since you're embedding this into C++, you also have to double all your backslashes. Looking at it, I'm also a little confused about the regex you're using -- it doesn't look to me like it should really work quite right. Just for example, it seems to require a "+", "-" or "^" at the beginning of a term, but the first term won't normally have that. I'm also somewhat uncertain why there would be a "^" at the beginning of a term. Since the exponent is normally omitted when it's one, it's probably better to allow it to be omitted. Taking those into account, I get something like: "[-+]?(\d*)x(\^[0-9])*".
Incorporating that into some code, we can get something like this:
#include <algorithm> // std::copy
#include <iterator>
#include <regex>
#include <string>
#include <iostream>
int main() {
std::string poly = "4x^2+3x^1+2x";
std::tr1::regex term("[-+]?(\\d*)x(\\^[0-9])*");
std::copy(std::tr1::sregex_token_iterator(poly.begin(), poly.end(), term),
std::tr1::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
At least for me, that prints out each term individually:
4x^2
+3x^1
+2x
Note that for the moment, I've just printed out each complete term, and modified your input to show off the ability to recognize a term that doesn't include a power (explicitly, anyway).
Edit: to collect the results into a vector instead of sending them to std::cout, you'd do something like this:
#include <algorithm> // std::copy
#include <iterator>
#include <regex>
#include <string>
#include <vector>
#include <iostream>
int main() {
std::string poly = "4x^2+3x^1+2x";
std::tr1::regex term("[-+]?(\\d*)x(\\^[0-9])*");
std::vector<std::string> terms;
std::copy(std::tr1::sregex_token_iterator(poly.begin(), poly.end(), term),
std::tr1::sregex_token_iterator(),
std::back_inserter(terms));
// Now terms[0] is the first term, terms[1] the second, and so on.
return 0;
}

Related

How to read double digits and single digits in C++

I have an issue where I cannot get my C++ program to read double digit integers.
My idea is to read it as string and then somehow parse it into separate integers and insert them into an array, but I am stuck on getting the code to read digits properly.
Sample Output:
i: 0 codeColumn 0
i: 1 codeColumn 1
i: 2 codeColumn 0 0
i: 3 codeColumn 0
i: 4 codeColumn 31 0
i: 5 codeColumn 1
i: 6 codeColumn 43 0
i: 7 codeColumn 3
i: 8 codeColumn 9 0
So the file is basically a line of triplets delimited by a comma:
0,1,0 0,0,31 0,0,18 0,0,8 0,11,0
My question is how do you get the trailing zeroes (see above) to move to a new line? I tried using "char" and a bunch of if statements to concatenate the single digits into double digits, but I feel like that's not really efficient or ideal. Any ideas?
My code:
#include <iostream> // Basic I/O
#include <string> // string classes
#include <fstream> // file stream classes
#include <sstream>
#include <vector>
using namespace std;
int main()
{
ifstream fCode;
fCode.open("code.txt");
vector<string> codeColumn;
string codeLine; // holds each comma-delimited token
while (getline(fCode, codeLine, ',')) {
codeColumn.push_back(codeLine);
}
for (size_t i = 0; i < codeColumn.size(); ++i) {
cout << " i: " << i << " codeColumn " << codeColumn[i] << endl;
}
fCode.close();
}
getline(fCode, codeLine, ',')
is going to read between commas, so 0,1,0 0,0,31 will split up exactly as you have seen.
0,1,0 0,0,31
^ ^ ^ ^
The tokens collected are everything between the ^s
You have two delimiters to take into account: comma and space. The easiest way to handle the space is with plain old >>, which stops at whitespace.
std::string triplet;
while (fCode >> triplet)
{
// do stuff with triplet. Maybe something like
std::istringstream strm(triplet); // make a stream out of the triplet
int a;
int b;
int c;
char sep1;
char sep2;
if (strm >> a >> sep1 >> b >> sep2 >> c // read all the tokens we want from triplet
&& sep1 == ',' && sep2 == ',') // and both separators are commas. Triplet is valid
{
// do something with a, b, and c
}
}
Documentation for std::istringstream.
I will show you three solutions: first easy-to-understand C-style code, then more modern C++ code using the algorithm library and iterators, and finally an object-oriented C++ solution.
I will also explain why std::getline can be, but should not be, used for splitting strings into tokens.
I saw from your question that you had difficulty with that, and I understand your concern.
But let's start with an easy solution. I'll show the code and then explain it:
#include <iostream>
#include <fstream>
#include <string>
int main() {
// Open the source text file, and check, if there was no failure
if (std::ifstream fCode{ "r:\\code.txt" }; fCode) {
size_t tripletCounter{ 0 };
// Now, read all triplets from the file in a simple for loop
for (std::string triplet{}; fCode >> triplet; ) {
// Prepare output
std::cout << "\ni:\t" << tripletCounter++ << "\tcodeColumn:\t";
// Go through the triplet, search for comma, then output the parts
for (size_t i{ 0U }, startpos{ 0U }; i <= triplet.size(); ++i) {
// So, if there is a comma or the end of the string
if ((triplet[i] == ',') || (i == (triplet.size()))) {
// Print substring
std::cout << (triplet.substr(startpos, i - startpos)) << ' ';
startpos = i + 1;
}
}
}
}
else {
std::cerr << "\n*** Error, Could not open source file\n";
}
return 0;
}
You see, we need just a few lines of easy-to-understand code that fulfill your requirements and produce the desired output.
Some features that may be new to you:
The if statement with initializer. This has been available since C++17. In addition to the condition, you can define a variable and initialize it. So, in
if (std::ifstream fCode{ "r:\\code.txt" }; fCode) {
we first define a variable named "fCode" of type std::ifstream. We use the uniform initializer "{}" to initialize it with the input file name.
This calls the constructor for "fCode", which opens the file (that is what this constructor does). After the closing "}" of the if statement, "fCode" goes out of scope and the destructor of std::ifstream is called. This closes the file automatically.
This type of if statement was introduced to help prevent namespace pollution. A variable should only be visible in the scope where it is used. Without it, you would have to define the std::ifstream outside (before) the if, it would be visible in the outer context, and the file would be closed much later. So please get acquainted with it.
Next we define a "tripletCounter". It is needed only for the output; it has no other use.
Then comes a for loop whose init-statement works the same way. We first define an empty std::string "triplet" and then use the extraction operator (>>) to read text up to the next whitespace. We use the whole expression as the loop condition to check whether the extraction worked or whether we hit the end of the file (or some other error). This works because the extraction operator returns the stream it was working on, that is, a reference to "fCode", and the stream has a conversion to bool (as well as an overloaded operator!) that reports the state of the stream. Please see here.
You should check every I/O operation to see whether it worked.
Next we split the triplet (e.g. "0,1,0") into its substrings with a very simple for loop. We go through all characters in the string and check whether the current character is a comma or whether we have reached the end of the string. In that case, we output the characters before the delimiter.
Very simple and easy to understand. std::getline is not needed here.
So, next solution, more advanced:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
std::regex re(",");
int main() {
// Open the source text file, and check, if there was no failure
if (std::ifstream fCode{ "r:\\code.txt" }; fCode) {
size_t tripletCounter{ 0 };
// Now, read all triplets from the file into a vector
std::vector triplets(std::istream_iterator<std::string>(fCode), {});
// Next, go through all triplets
for (const std::string &triplet : triplets) {
// Prepare output
std::cout << "\ni:\t" << tripletCounter++ << "\tcodeColumn:\t";
// Split triplet into code column. All codes are in vector codeColums
std::vector codeColumns(std::sregex_token_iterator(triplet.begin(), triplet.end(), re, -1), {});
//Show codes
for (const std::string& code : codeColumns) std::cout << code << ' ';
}
}
else {
std::cerr << "\n*** Error, Could not open source file\n";
}
return 0;
}
The beginning is the same. But then:
// Now, read all triplets from the file into a vector
std::vector triplets(std::istream_iterator<std::string>(fCode), {});
Uh-oh, what's that? Let's start with the std::istream_iterator. If you read the linked description, you will find that it basically calls the extraction operator >> for the specified type. And since it is an iterator, it calls it again each time the iterator is incremented. OK, understandable, but then:
We define the variable triplets as a std::vector and call its constructor with two arguments. That constructor is the so-called range constructor of std::vector. Please see the description for constructor 5. It takes a "begin" iterator and an "end" iterator. But what is this strange {} in place of the "end" iterator? It is the default initializer (please see here and here). And if we look at the description of std::istream_iterator, we can see that a default-constructed one is the end-of-stream iterator. OK, understood.
I assume that you know the range-based for loop, which comes next. Good. Now we come to the most difficult point: splitting a string with delimiters. People use std::getline for that. But why? Why are people doing such strange stuff?
What do people expect from the function, when they read
getline ?
Most people would say: "Hm, I guess it reads a complete line from somewhere." And guess what, that was the basic intention for this function: read a line from a stream and put it into a string.
As you can see here, std::getline has some additional functionality (an optional delimiter parameter).
And this has led to a major misuse of this function for splitting up std::strings into tokens.
Splitting strings into tokens is a very old task. In very early C there was the function strtok, which still exists, even in C++. Please see std::strtok.
But because of the additional functionality of std::getline, it has been heavily misused for tokenizing strings. If you look at the top question/answer regarding how to parse a CSV file (please see here), then you will see what I mean.
People use std::getline to read a text line, a string, from the original stream, then stuff it into a std::istringstream and use std::getline with a delimiter again to parse the string into tokens.
Weird.
Because for many years now (since C++11) we have had a dedicated facility for tokenizing strings, explicitly designed for that purpose. It is the
std::sregex_token_iterator
And since we have such a dedicated facility, we should simply use it.
This thing is an iterator for iterating over a std::string, hence the name starts with an s. The first two arguments define the range of input to operate on (begin(), end()); then comes a std::regex describing what should be matched, or what should not be matched, in the input string. The matching strategy is given by the last parameter:
0 --> give me the stuff that I defined in the regex and
-1 --> give me that what is NOT matched based on the regex.
We can use this iterator to store the tokens in a std::vector. The std::vector has a range constructor, which takes two iterators as parameters and copies the data between the first and the second iterator into the std::vector. The statement
std::vector tokens(std::sregex_token_iterator(s.begin(), s.end(), re, -1), {});
defines a variable “tokens” as a std::vector and uses again the range-constructor of the std::vector. Please note: I am using C++17 and can define the std::vector without template argument. The compiler can deduce the argument from the given function parameters. This feature is called CTAD ("class template argument deduction"). I also used that for the vector above.
Additionally, you can see that I do not use the "end()"-iterator explicitly.
This iterator will be constructed from the empty brace-enclosed default initializer with the correct type, because it will be deduced to be the same as the type of the first argument due to the std::vector constructor requiring that, as already described.
You can read any number of tokens in a line and put them into the std::vector.
But you can do even more. You can validate your input. If you use 0 as the last parameter, you define a std::regex that describes a valid token, and you get only valid tokens.
Overall, using dedicated functionality is superior to the misused std::getline, and people should simply use it.
Some people may complain about the overhead, but how many of them are working with big data? And even then, the approach would probably be to use std::string::find and std::string::substr, or std::string_view, or whatever.
So, somewhat advanced, but you will eventually learn it.
And now we will use an object-oriented approach. As you know, C++ is an object-oriented language.
We can put data, and the methods working on that data, in a class (struct). The functionality is encapsulated; only the class should know how to operate on its data. So we will define a class "Code". It contains a std::array of 3 std::strings and the associated functions. For the array we define a type alias for easier writing. The functions that we need are input and output, so we will overload the extraction and insertion operators.
In these operators, we use the techniques described above.
And as a result of all this work, we get an elegant main function, where all the work is done in 3 lines of code.
Please see:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <array>
#include <algorithm>
using Triplet = std::array<std::string, 3>;
std::regex re(",");
struct Code {
// Our Data
Triplet triplet{};
// Overload the extraction operator for easier input
friend std::istream& operator >> (std::istream& is, Code& c) {
// Read a triplet with commas
if (std::string s{}; is >> s) {
// Copy the single columns of the triplet into our internal data structure
std::copy(std::sregex_token_iterator(s.begin(), s.end(), re, -1), {}, c.triplet.begin());
}
return is;
}
// Overload the insertion operator for easier output
friend std::ostream& operator << (std::ostream& os, const Code& c) {
return os << c.triplet[0] << ' ' << c.triplet[1] << ' ' << c.triplet[2];
}
};
int main() {
// Open the source text file, and check, if there was no failure
if (std::ifstream fCode{ "r:\\code.txt" }; fCode) {
// Now, read all triplets from the file, split it and put the Codes into a vector
std::vector code(std::istream_iterator<Code>(fCode), {});
// Show output
for (size_t tripletCounter{ 0U }; tripletCounter < code.size(); tripletCounter++)
std::cout << "\ni:\t" << tripletCounter << "\tcodeColumn:\t" << code[tripletCounter];
}
else {
std::cerr << "\n*** Error, Could not open source file\n";
}
return 0;
}

Replace single backslash with double in a string c++

I am trying to replace one backslash with two. To do that I tried using the following code
str = "d:\test\text.txt"
str.replace("\\","\\\\");
The code does not work. The whole idea is to pass str to the deletefile function, which requires double backslashes.
Since C++11, you may try using regex:
#include <regex>
#include <string>
#include <iostream>
int main() {
auto s = std::string(R"(\tmp\)");
s = std::regex_replace(s, std::regex(R"(\\)"), R"(\\)");
std::cout << s << std::endl;
}
A bit overkill, but it does the trick if you want a "quick" solution.
There are two errors in your code.
First line: you forgot to double the \ in the literal string.
It happens that \t is a valid escape representing the tab character, so you get no compiler error, but your string doesn't contain what you expect.
Second line: according to the reference of string::replace,
you can replace a substring by another substring based on the substring position.
However, there is no version that makes a substitution, i.e. replaces all occurrences of a given substring by another one.
This doesn't exist in the standard library. It exists for example in the boost library, see boost string algorithms. The algorithm you are looking for is called replace_all.

have a programming project for an intro c++ class one of the function we need to create is a split function

I was hoping to get some feedback on whether I am doing this the "smart way" or whether I could be doing it faster. If I were splitting on whitespace,
I would probably use getline(stringstream, word, delimiter),
but I didn't know how to adapt the delimiter to all the good characters, so I just looped through the whole string and built up a new word until I reached a bad character. As I am fairly new to programming, I'm not sure if it's the best way to do it.
Thanks for any feedback.
#include <iostream>
#include <string>
using std::string;
#include <vector>
using std::vector;
#include <sstream>
#include <algorithm>
#include <cctype> // ::tolower
#include <iterator> //delete l8r
using std::cout; using std::cin; using std::endl;
/*
void split(string line, vector<string>&words, string good_chars)
- Find words in the line that consist of good_chars.
  Any other character is considered a separator.
- Once you have a word, convert all the characters to lower case.
  You then push each word onto the reference vector words.
Important: split goes in its own file. This is both for your own benefit (you can reuse
split) and for grading purposes. We will provide a split.h for you.
*/
void split(string line, vector<string> & words, string good_chars){
string good_word;
for(auto c : line){
if(good_chars.find(c)!=string::npos){
good_word.push_back(c);
}
else{
if(good_word.size()){
std::transform(good_word.begin(), good_word.end(), good_word.begin(), ::tolower);
words.push_back(good_word);
}
good_word = "";
}
}
}
int main(){
vector<string> words;
string good_chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'";
// TEST split
split("This isn't a TEST.", words, good_chars);
// words should have: {"this", "isn't", "a", "test"}, no period in test
std::copy(words.begin(), words.end(), std::ostream_iterator<string>(cout, ","));
cout << endl;
return 0;
}
I'd say that this is a reasonable approach given the context of an intro to C++ class. I'd even say that it's fairly likely that this is the approach your instructor expects to see.
There are, of course, a few optimization tweaks that can be done, like instantiating a 256-element bool array, using good_chars to set the corresponding values to true with all others defaulting to false, and then replacing the find() call with a quick array lookup.
But I'd predict that if you were to hand in such a thing, you'd be suspected of copying stuff you found on the intertubes, so leave that alone.
One thing you might consider doing instead is calling tolower when you push_back each character, which removes the extra std::transform pass over the word.

regular expression Can't find sequence of numbers

Let
exp = ^[0-9!##$%^&*()_+-=[]{};':"\|,.<>/?\s]*$
be a regular expression that allows me to find all sequences of numbers with or without special characters.
By using exp I manage to extract all sequences of numbers that are greater than 5, but the number 98200 cannot be extracted. I am not using any limit on how long the sequence of numbers should be.
Source code:
#include <boost/regex.hpp>
#include <iostream>
#include <string>
using namespace std;
int main()
{
string s = "16000";
string exp = "^[0-9!##$%^&*()_+-=[]{};':\"\\|,.<>\\/?\\s]*$";
const boost::regex e(exp);
bool isSequence = boost::regex_match(s,e);
//isSequence is boolean and should be equal to 1
cout << isSequence << endl;
return 0;
}
In C#, you need to escape the ]. You don't need to escape [ {} () when they are inside a character class. Also, if you want to include the dash as an included character in the character class, it should be at the beginning or end of the list. The sequence that you have of +-= translates to [+,-./0123456789:;<=] which makes your regex redundant. Finally, because of the terminal quantifier, you are allowing matching of zero length strings. This may be what you want, but if not, consider the '+' quantifier.
What about simply
[^A-Za-z]+
with or without the ^ $ anchors at the beginning/end
Indiscriminately escaping everything works for me.. :)
string exp = "^[0-9\\!##\\$\\%\\^&*\\(\\)_\\+\\-=\\[\\]\\{\\};\\\':\\\"\\\\|,\\.<>\\/?\\s]*$";
Note the double backslashes. I'm sure you can work out which of the characters in your list have a special meaning and escape only those. As I don't have the time to look up what has special meaning in this context, I escaped everything, and this works fine for the few cases I tested:
16000 => returns 1
16A000 => returns 0
16#000 => returns 1
Which I'm guessing is what you want...
I have shifted the brackets to the front of the character class and therewith I get the output 1 for 98200 using the following code:
#include <string>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
std::cout << "main()\n";
string s = "98200";
string exp = "^[][0-9!##$%^&*()_+-={};':\"\\|,.<>\\/?\\s]*$";
const boost::regex e(exp);
bool isSequence = boost::regex_match(s,e);
//isSequence is boolean and should be equal to 1
cout << isSequence << endl;
return 0;
}
/**
Local Variables:
compile-command: "g++ -g test.cc -o test.exe -lboost_regex-mt; ./test.exe"
End:
*/
EDIT: Note, that I used my experience with emacs regular
expressions. The info pages of emacs explain: "To include a ] in a
character set, you must make it the first character." I tried this
with boost::regex and it worked. Later on when I had more time I read
in the boost manual
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.character_sets
that this is not specified for the perl regular expression syntax.
The perl syntax is the standard setting for boost::regex. According to the
specification the comment by
https://stackoverflow.com/users/2872922/ron-rosenfeld is the best
answer.
In the following program I eliminate the character range which was incidentally encoded into your regular expression.
Testing shows that the bracket at the beginning of the character set is included into the character set. So it turns out that my statement was right even if it is not specified in the official manual of boost::regex.
Nevertheless, I suggest that https://stackoverflow.com/users/2872922/ron-rosenfeld inserts his comment as an answer and you mark it as the solution. This will help others reading this thread.
#include <string>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
std::cout << "main()\n";
string s = "98-[2]00";
string exp = "^[][0-9!##$%^&*()_+={};':\"|,.<>/?\\s-]*$";
const boost::regex e(exp);
bool isSequence = boost::regex_match(s,e);
//isSequence is boolean and should be equal to 1
cout << isSequence << endl;
return 0;
}
/**
Local Variables:
compile-command: "g++ -g test.cc -o test.exe -lboost_regex-mt; ./test.exe"
End:
*/
I asked at http://lists.boost.org/boost-users/2013/12/80707.php
The answer of John Maddock (the author of the boost::regex library) is:
>I discovered that if one uses an closing bracket as the first character of
>a
>character class the character class includes this bracket.
>This works with the standard setting of boost::regex (i.e., perl-regular
>expressions) but it is not documented in the
>manual page
>
>http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/
>perl_syntax.html#boost_regex.syntax.perl_syntax.character_sets
>
>Is this an undocumented feature, a bug or did I misinterpret something in
>the manual?
It's a feature, both Perl and POSIX extended regular expression behave the
same way.
John.

How can I match the \0 character in a regex in C++?

I need to match the text '\0' with the same regex that I would match 'a' or 'b'. (a regex for a character constant in C++). I've tried a bunch of different regexes, but haven't gotten a successful one yet. My latest attempt:
^['].|\\0[']
Most of the other things I've tried have given seg faults, so this is really the closest I've gotten.
This works pretty nicely with what I've tested ('a','b','\0').
If you don't have std::regex or boost::regex I guess what you can get out of it is the fact that the regex I used is ('.'|'\\0').
#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <vector>
int main() {
std::vector<std::string> strings;
strings.push_back(R"('a')");
strings.push_back(R"('b')");
strings.push_back(R"('\0')");
boost::regex rgx(R"(('.'|'\\0'))");
boost::smatch match;
for(auto& i : strings) {
if(boost::regex_match(i,match, rgx)) {
boost::ssub_match submatch = match[1];
std::cout << submatch.str() << '\n';
}
}
}
There's nothing magic about '\0'; it's just a character like any other, and there's (almost) nothing special you have to do to use it in a regular expression. The only problem you might run into is if you use it in the middle of a string literal that you pass to a function that treats it as the end of the string. To avoid that, construct a std::string with an explicit length:
const char s[] = "a\0b";
std::string not_my_str(s); // not_my_str holds "a"
std::string str(s, 3); // str holds "a\0b"
Once you've constructed the string object, the embedded '\0' gets no special treatment. Except, of course, if you copy the contents with a function that treats it specially.
The regex that works (in this instance, using the C header ) is:
^('(.|([\\]0))')
Thanks to @WhozCraig for the help!