Different behavior in C regex VS C++11 regex - c++

I need code that splits a permutation written in mathematical cycle notation into its elements.
The permutation string will be, for example:
"(1,2,5)(3,4)" or "(3,4)(1,2,5)" or "(3,4)(5,1,2)"
The patterns I've tried are these:
([0-9]+[ ]*,[ ]*)*[0-9]+ for each permutation cycle. This would split the "(1,2,5)(3,4)" string in two strings "1,2,5" and "3,4".
([0-9]+) for each element in cycle. This would split each cycle in individual numbers.
When I tried these patterns on this page they worked well. I've also used them with the C++11 regex library, with good results:
#include <iostream>
#include <string>
#include <regex>
void elements(const std::string &input)
{
const std::regex ElementRegEx("[0-9]+");
for (std::sregex_iterator Element(input.begin(), input.end(), ElementRegEx); Element != std::sregex_iterator(); ++Element)
{
const std::string CurrentElement(*Element->begin());
std::cout << '\t' << CurrentElement << '\n';
}
}
void cycles(const std::string &input)
{
const std::regex CycleRegEx("([0-9]+[ ]*,[ ]*)*[0-9]+");
for (std::sregex_iterator Cycle(input.begin(), input.end(), CycleRegEx); Cycle != std::sregex_iterator(); ++Cycle)
{
const std::string CurrentCycle(*Cycle->begin());
std::cout << CurrentCycle << '\n';
elements(CurrentCycle);
}
}
int main(int argc, char **argv)
{
std::string input("(1,2,5)(3,4)");
std::cout << "input: " << input << "\n\n";
cycles(input);
return 0;
}
The output when compiled with Visual Studio 2010 (10.0):
input: (1,2,5)(3,4)
1,2,5
1
2
5
3,4
3
4
But unfortunately, I cannot use the C++11 tools in my project: it will run on a Linux platform and must be compiled with gcc 4.2.3, so I'm forced to use the C regex library from the regex.h header. Using the same patterns with this different library, I get different results.
Here is the test code:
void elements(const std::string &input)
{
regex_t ElementRegEx;
regcomp(&ElementRegEx, "([0-9]+)", REG_EXTENDED);
regmatch_t ElementMatches[MAX_MATCHES];
if (!regexec(&ElementRegEx, input.c_str(), MAX_MATCHES, ElementMatches, 0))
{
int Element = 0;
while ((ElementMatches[Element].rm_so != -1) && (ElementMatches[Element].rm_eo != -1))
{
regmatch_t &ElementMatch = ElementMatches[Element];
const std::string CurrentElement(input.substr(ElementMatch.rm_so, ElementMatch.rm_eo - ElementMatch.rm_so));
std::cout << '\t' << CurrentElement << '\n';
++Element;
}
}
regfree(&ElementRegEx);
}
void cycles(const std::string &input)
{
regex_t CycleRegEx;
regcomp(&CycleRegEx, "([0-9]+[ ]*,[ ]*)*[0-9]+", REG_EXTENDED);
regmatch_t CycleMatches[MAX_MATCHES];
if (!regexec(&CycleRegEx, input.c_str(), MAX_MATCHES, CycleMatches, 0))
{
int Cycle = 0;
while ((CycleMatches[Cycle].rm_so != -1) && (CycleMatches[Cycle].rm_eo != -1))
{
regmatch_t &CycleMatch = CycleMatches[Cycle];
const std::string CurrentCycle(input.substr(CycleMatch.rm_so, CycleMatch.rm_eo - CycleMatch.rm_so));
std::cout << CurrentCycle << '\n';
elements(CurrentCycle);
++Cycle;
}
}
regfree(&CycleRegEx);
}
int main(int argc, char **argv)
{
cycles("(1,2,5)(3,4)");
return 0;
}
The expected output is the same as with the C++11 regex, but the real output was:
input: (1,2,5)(3,4)
1,2,5
1
1
2,
2
2
Finally, the questions are:
Could someone give me a hint about where I'm misunderstanding the C regex engine?
Why is the behavior different between the C regex and the C++ regex?

You're misunderstanding the output of regexec. The pmatch buffer (after pmatch[0]) is filled with sub-matches of the regex, not with consecutive matches in the string.
For example, if your regex is [a-z]([+ ])([0-9]) matched against x+5, then pmatch[0] will reference x+5 (the whole match), and pmatch[1] and pmatch[2] will reference + and 5 respectively.
You need to repeat the regexec in a loop, starting from the end of the previous match:
int start = 0;
while (!regexec(&ElementRegEx, input.c_str() + start, MAX_MATCHES, ElementMatches, 0))
{
regmatch_t &ElementMatch = ElementMatches[0];
std::string CurrentElement(input.substr(start + ElementMatch.rm_so, ElementMatch.rm_eo - ElementMatch.rm_so));
std::cout << '\t' << CurrentElement << '\n';
start += ElementMatch.rm_eo;
}
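For reference, the loop above can be wrapped into a small helper that works for both patterns from the question. This is only a sketch: the name find_all is made up for illustration, it keeps just pmatch[0] (the whole match), and it assumes the pattern does not use ^, which would need REG_NOTBOL on the follow-up calls because each call sees a shifted start of string.

```cpp
#include <regex.h>   // POSIX regcomp, regexec, regfree

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping match of `pattern`
// (POSIX extended syntax) in `input`. regexec is called repeatedly, each
// time starting just past the previous match.
std::vector<std::string> find_all(const std::string &input, const char *pattern)
{
    std::vector<std::string> result;
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return result;  // the pattern failed to compile

    std::string::size_type start = 0;
    regmatch_t match[1];  // only the whole match is needed
    while (start <= input.size() &&
           regexec(&re, input.c_str() + start, 1, match, 0) == 0)
    {
        const std::string::size_type len = match[0].rm_eo - match[0].rm_so;
        result.push_back(input.substr(start + match[0].rm_so, len));
        // Offsets are relative to the pointer passed to regexec, so add
        // them to `start`. Move at least one character forward so that an
        // empty match cannot stall the loop.
        start += (len > 0) ? match[0].rm_eo : match[0].rm_eo + 1;
    }
    regfree(&re);
    return result;
}
```

Calling find_all with the cycle pattern on "(1,2,5)(3,4)" and then find_all with "[0-9]+" on each returned cycle should reproduce the structure of the C++11 version.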

Related

How can I trim empty/whitespace lines?

I have to process badly mismanaged text with creative indentation. I want to remove the empty (or whitespace) lines at the beginning and end of my text without touching anything else; meaning that if the first or last actual lines respectively begin or end with whitespace, these will stay.
For example, this:
<lines, empty or with whitespaces ...>
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
<lines, empty or with whitespaces ...>
turns to
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
preserving the spaces at the beginning and the end of the actual text lines (the text might also be entirely whitespace)
A regex replacing (\A\s*(\r\n|\Z)|\r\n\s*\Z) by emptiness does exactly what I want, but regex is kind of overkill, and I fear it might cost me some time when processing texts with a lot of lines but not much to trim.
On the other hand, an explicit algorithm is easy to make (just read until a non-whitespace/the end while remembering the last line feed, then truncate, and do the same backwards) but it feels like I'm missing something obvious.
How can I do this?
As you can see from this discussion, trimming whitespace requires a lot of work in C++. This should definitely be included in the standard library.
Anyway, I've checked how to do it as simply as possible, but nothing comes near the compactness of RegEx. For speed, it's a different story.
Below are three versions of a program that does the required task: one with regex, one with std functions, and one with just a couple of indexes. The last one could be made faster still, because copying can be avoided altogether, but I kept the copy for a fair comparison:
#include <string>
#include <sstream>
#include <chrono>
#include <iostream>
#include <regex>
#include <exception>
struct perf {
std::chrono::steady_clock::time_point start_;
perf() : start_(std::chrono::steady_clock::now()) {}
double elapsed() const {
auto stop = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = stop - start_;
return elapsed_seconds.count();
}
};
std::string Generate(size_t line_len, size_t empty, size_t nonempty) {
std::string es(line_len, ' ');
es += '\n';
for (size_t i = 0; i < empty; ++i) {
es += es;
}
std::string nes(line_len - 1, ' ');
nes += "a\n";
for (size_t i = 0; i < nonempty; ++i) {
nes += nes;
}
return es + nes + es;
}
int main()
{
std::string test;
//test = " \n\t\n \n \tTEST\n\tTEST\n\t\t\n TEST\t\n \t\n \n ";
std::cout << "Generating...";
std::cout.flush();
test = Generate(1000, 8, 10);
std::cout << " done." << std::endl;
std::cout << "Test 1...";
std::cout.flush();
perf p1;
std::string out1;
std::regex re(R"(^\s*\n|\n\s*$)");
try {
out1 = std::regex_replace(test, re, "");
}
catch (std::exception& e) {
std::cout << e.what() << std::endl;
}
std::cout << " done. Elapsed time: " << p1.elapsed() << "s" << std::endl;
std::cout << "Test 2...";
std::cout.flush();
perf p2;
std::stringstream is(test);
std::string line;
while (std::getline(is, line) && line.find_first_not_of(" \t\n\v\f\r") == std::string::npos);
std::string out2 = line;
size_t end = out2.size();
while (std::getline(is, line)) {
out2 += '\n';
out2 += line;
if (line.find_first_not_of(" \t\n\v\f\r") != std::string::npos) {
end = out2.size();
}
}
out2.resize(end);
std::cout << " done. Elapsed time: " << p2.elapsed() << "s" << std::endl;
if (out1 == out2) {
std::cout << "out1 == out2\n";
}
else {
std::cout << "out1 != out2\n";
}
std::cout << "Test 3...";
std::cout.flush();
perf p3;
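// Despite its name, the table holds 0 for whitespace bytes (\t, \n, \v, \f, \r, space) and 1 for all others.
static bool whitespace_table[] = {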
1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
};
size_t sfl = 0; // Start of first line
for (size_t i = 0, end = test.size(); i < end; ++i) {
if (test[i] == '\n') {
sfl = i + 1;
}
else if (whitespace_table[(unsigned char)test[i]]) {
break;
}
}
size_t ell = test.size(); // End of last line
for (size_t i = test.size(); i-- > 0;) {
if (test[i] == '\n') {
ell = i;
}
else if (whitespace_table[(unsigned char)test[i]]) {
break;
}
}
std::string out3 = test.substr(sfl, ell - sfl);
std::cout << " done. Elapsed time: " << p3.elapsed() << "s" << std::endl;
if (out1 == out3) {
std::cout << "out1 == out3\n";
}
else {
std::cout << "out1 != out3\n";
}
return 0;
}
Running it on C++ Shell you get these timings:
Generating... done.
Test 1... done. Elapsed time: 4.2288s
Test 2... done. Elapsed time: 0.0077323s
out1 == out2
Test 3... done. Elapsed time: 0.000695783s
out1 == out3
If performance is important, it's better to really test it with the real files.
As a side note, this regex doesn't work on MSVC, because I couldn't find a way to stop ^ and $ from matching at the start and end of every line, that is, to disable multiline mode. If you run it there, it throws an exception saying regex_error(error_complexity): The complexity of an attempted match against a regular expression exceeded a pre-set level.
I think I'll ask how to cope with this!
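The remark that the index-based version can avoid copying altogether could look roughly like the sketch below. It assumes C++17 (std::string_view) and '\n' line endings. The names is_space and trim_blank_lines are made up for illustration. The returned view is only valid while the original buffer is alive.

```cpp
#include <cstddef>
#include <string_view>

// True for the six whitespace characters of the "C" locale.
inline bool is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Returns a view of `text` without the leading and trailing lines that are
// empty or whitespace-only. No characters are copied.
std::string_view trim_blank_lines(std::string_view text)
{
    // Start of the first line that contains a non-whitespace character.
    std::size_t first = 0;
    std::size_t i = 0;
    for (; i < text.size(); ++i) {
        if (text[i] == '\n')
            first = i + 1;
        else if (!is_space(text[i]))
            break;
    }
    if (i == text.size())
        return std::string_view();  // nothing but whitespace

    // Position of the line break that ends the last non-blank line.
    std::size_t last = text.size();
    for (std::size_t j = text.size(); j-- > first;) {
        if (text[j] == '\n')
            last = j;
        else if (!is_space(text[j]))
            break;
    }
    return text.substr(first, last - first);
}
```

Unlike Test 3 above, this variant also returns an empty result when the input is entirely whitespace.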
If whitespace in front of the first line or after the last non-whitespace-only line can be removed then this answer https://stackoverflow.com/a/217605/14258355 will suffice.
However, due to this constraint and if you do not want to use regex, I would propose to convert the string into lines and then build the string back up again from the first to the last non-whitespace-only line.
Here is a working example: https://godbolt.org/z/rozxj6saj
Convert the string to lines:
std::vector<std::string> StringToLines(const std::string &s) {
// Create vector with lines (not using input stream to keep line break
// characters)
std::vector<std::string> result;
std::string line;
for (auto c : s) {
line.push_back(c);
// Check for line break
if (c == '\n' || c == '\r') {
result.push_back(line);
line.clear();
}
}
// add last bit
result.push_back(line);
return result;
}
Build the string from the first to the last non-whitespace-only line:
bool IsNonWhiteSpaceString(const std::string &s) {
return s.end() != std::find_if(s.begin(), s.end(), [](unsigned char uc) {
return !std::isspace(uc);
});
}
std::string TrimVectorEmptyEndsIntoString(const std::vector<std::string> &v) {
std::string result;
// Find first non-whitespace line
auto it_begin = std::find_if(v.begin(), v.end(), [](const std::string &s) {
return IsNonWhiteSpaceString(s);
});
// Find last non-whitespace line
auto it_end = std::find_if(v.rbegin(), v.rend(), [](const std::string &s) {
return IsNonWhiteSpaceString(s);
});
// Build the string
for (auto it = it_begin; it != it_end.base(); std::advance(it, 1)) {
result.append(*it);
}
return result;
}
Usage example:
// Create a test string
std::string test_string(
" \n\t\n \n TEST\n\tTEST\n\t\tTEST\n TEST\t\n \t");
// Output result
std::cout << TrimVectorEmptyEndsIntoString(StringToLines(test_string));

How to run a string search algorithm through whole body of text

I am using the brute-force string search algorithm to search through a short sentence. However, I want the algorithm to report every position at which it finds the given string, instead of finding it once and then stopping.
//Declare and initialise variables
string pat, text;
text = "This is a test sentence, find test within this string";
cout << text << endl;
//User input for pat
cout << "Please enter the string you want to search for" << endl;
cin >> pat;
//Set the length of the pat and text
int patLength = pat.size();
int textLength = text.size();
//Algorithm
for (int i = 0; i < textLength - patLength; ++i)
{
//Do while loop to run through the whole text
do
{
int j;
for (j = 0; j < patLength; j++)
{
if (text[i + j] != pat[j])
break; // Doesn't match here.
}
if (j == patLength)
{
finds.push(i); // Matched here.
}
} while (i < textLength);
}
//Print output
cout << "String: " << pat << " was found at positions: " << finds.top();
The program stores each find in a queue. When I run this program, it asks for the 'pat', then does nothing. I have done a bit of debugging and found that the problem is probably the do-while loop, but I can't find a fix.
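For what it's worth, the hang comes from the inner do-while: its condition is i < textLength, but i never changes inside it, so the loop never ends. The outer for loop already visits every start position, so the do-while can simply go. Two smaller points: the bound textLength - patLength skips the last possible position, and std::queue has no top(). Here is a minimal sketch of the same brute-force idea with those points addressed. The function name find_all_brute_force and the use of a std::vector for the results are my own choices, not from the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the start index of every occurrence of `pat` in `text`
// (overlapping occurrences included), using brute-force comparison.
std::vector<std::size_t> find_all_brute_force(const std::string &text,
                                              const std::string &pat)
{
    std::vector<std::size_t> finds;
    if (pat.empty() || pat.size() > text.size())
        return finds;

    // i + pat.size() <= text.size() also tests the last possible position.
    for (std::size_t i = 0; i + pat.size() <= text.size(); ++i)
    {
        std::size_t j = 0;
        while (j < pat.size() && text[i + j] == pat[j])
            ++j;
        if (j == pat.size())
            finds.push_back(i);  // matched here; keep scanning
    }
    return finds;
}
```

Printing the result is then a plain loop over the returned vector.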
You could use the std::string::find function combined with a function that you call for each find.
#include <iostream>
#include <functional>
#include <vector>
#include <sstream>
void Algorithm(
const std::string& text, const std::string& pat,
std::function<void(const std::string&,size_t)> f, std::vector<size_t>& positions)
{
size_t pos=0;
while((pos=text.find(pat, pos)) != std::string::npos) {
// store the position
positions.push_back(pos);
// call the supplied function
f(text, pos++);
}
}
// function to call for each position in which the pattern is found
void gotit(const std::string& found_in, size_t pos) {
std::cout << "Found in \"" << found_in << "\" # " << pos << "\n";
}
int main(int argc, char* argv[]) {
std::vector<std::string> args(argv+1, argv+argc);
if(args.size()==0)
args.push_back("This is a test sentence, find test within this string");
for(const auto& text : args) {
std::vector<size_t> found_at;
std::cout << "Please enter the string you want to search for: ";
std::string pat;
std::cin >> pat;
Algorithm(text, pat, gotit, found_at);
std::cout << "collected positions:\n";
for(size_t pos : found_at) {
std::cout << pos << "\n";
}
}
}
My first bit of advice would be to structure your code into separate functions.
Let's say you have a function that returns the position of the pattern's first occurrence in a sequence of characters:
using position = typename std::string::const_iterator;
position first_occurrence(position text_begin, position text_end, const std::string& pattern);
If there is no more occurrence of the pattern, it returns text_end.
You can now write a very simple loop:
auto occurrence = first_occurrence(text_begin, text_end, pattern);
while (occurrence != text_end) {
occurrences.push_back(occurrence);
occurrence = first_occurrence(occurrence + 1, text_end, pattern);
}
to accumulate all the occurrences of the pattern.
The first_occurrence function already exists in the standard library under the name of std::search. Since C++17, you can customize this function with pattern-searching specialized searchers, such as std::boyer_moore_searcher: it pre-processes the pattern to make it faster to look for in the string. Here's an example application to your problem:
#include <algorithm>
#include <string>
#include <vector>
#include <functional>
using occurrence = typename std::string::const_iterator;
std::vector<occurrence> find_occurrences(const std::string& input, const std::string& pattern) {
auto engine = std::boyer_moore_searcher(pattern.begin(), pattern.end());
std::vector<occurrence> occurrences;
auto it = std::search(input.begin(), input.end(), engine);
while (it != input.end()) {
occurrences.push_back(it);
it = std::search(std::next(it), input.end(), engine);
}
return occurrences;
}
#include <iostream>
int main() {
std::string text = "This is a test sentence, find test within this string";
std::string pattern = "st";
auto occs = find_occurrences(text, pattern);
for (auto occ: occs) std::cout << std::string(occ, std::next(occ, pattern.size())) << std::endl;
}

Difficulties with string declaration/reference parameters (c++)

Last week I got a homework assignment to write a function: it gets a string and a char value and should divide the string into two parts, before and after the first occurrence of that char.
The code worked, but my teacher told me to do it again because it is not well-written code. I don't understand how to make it better. I understand so far that defining two strings filled with white space is not good, but I get out-of-bounds exceptions otherwise. Since the input string changes, the string size changes every time.
#include <iostream>
#include <string>
using namespace std;
void divide(char search, string text, string& first_part, string& sec_part)
{
bool firstc = true;
int counter = 0;
for (int i = 0; i < text.size(); i++) {
if (text.at(i) != search && firstc) {
first_part.at(i) = text.at(i);
}
else if (text.at(i) == search&& firstc == true) {
firstc = false;
sec_part.at(counter) = text.at(i);
}
else {
sec_part.at(counter) = text.at(i);
counter++;
}
}
}
int main() {
string text;
string part1=" ";
string part2=" ";
char search_char;
cout << "Please enter text? ";
getline(cin, text);
cout << "Please enter a char: ? ";
cin >> search_char;
divide(search_char,text,part1,part2);
cout << "First string: " << part1 <<endl;
cout << "Second string: " << part2 << endl;
system("PAUSE");
return 0;
}
I would suggest learning to use the C++ standard library functions. There are plenty of utility functions that can help you.
#include <algorithm>
#include <string>
void divide(const std::string& text, char search, std::string& first_part, std::string& sec_part)
{
std::string::const_iterator pos = std::find(text.begin(), text.end(), search);
first_part.append(text, 0, pos - text.begin());
sec_part.append(text, pos - text.begin());
}
int main()
{
std::string text = "thisisfirst";
char search = 'f';
std::string first;
std::string second;
divide(text, search, first, second);
}
Here I used std::find, which you can read about here, along with iterators.
You have some other mistakes. You are passing text by value, which makes a copy every time you call the function. Pass it by reference instead, and qualify it with const to indicate that it is an input parameter, not an output.
Why is your teacher right ?
The fact that you need to initialize your destination strings with empty space is terrible:
If the input string is longer, you'll get out-of-bounds errors.
If it's shorter, you get the wrong answer, because in IT and programming, "It works " is not the same as "It works".
In addition, your code does not fit the specifications. It should work all the time, independently of the current value which is stored in your output strings.
Alternative 1: your code but working
Just clear the destination strings at the beginning. Then iterate as you did, but use += or push_back() to add chars at the end of the string.
void divide(char search, string text, string& first_part, string& sec_part)
{
bool firstc = true;
first_part.clear(); // make destinations strings empty
sec_part.clear();
for (int i = 0; i < text.size(); i++) {
char c = text.at(i);
if (firstc && c != search) {
first_part += c;
}
else if (firstc && c == search) {
firstc = false;
sec_part += c;
}
else {
sec_part += c;
}
}
}
I used a temporary c instead of text.at(i) or text[i] in order to avoid multiple indexing. But this is not really required: nowadays, optimizing compilers should produce equivalent code whichever variant you use here.
Alternative 2: use string member functions
This alternative uses the find() function, and then constructs a string from the start until that position, and another from that position. There is a special case when the character was not found.
void divide(char search, string text, string& first_part, string& sec_part)
{
auto pos = text.find(search);
first_part = string(text, 0, pos);
if (pos== string::npos)
sec_part.clear();
else sec_part = string(text, pos, string::npos);
}
As you understand yourself these declarations
string part1=" ";
string part2=" ";
do not make sense, because the string entered into text can be longer than both initialized strings. In that case the member function at can throw an exception; if the entered string is shorter, the results will have trailing spaces.
From the description of the assignment it is not clear whether the searched character should be included in one of the strings. Your code assumes that it should be included in the second string.
Take into account that the parameter text should be declared as a constant reference.
Also, instead of writing loops it is better to use member functions of the class std::string, such as find.
The function can look like this:
#include <iostream>
#include <string>
void divide(const std::string &text, char search, std::string &first_part, std::string &sec_part)
{
std::string::size_type pos = text.find(search);
first_part = text.substr(0, pos);
if (pos == std::string::npos)
{
sec_part.clear();
}
else
{
sec_part = text.substr(pos);
}
}
int main()
{
std::string text("Hello World");
std::string first_part;
std::string sec_part;
divide(text, ' ', first_part, sec_part);
std::cout << "\"" << text << "\"\n";
std::cout << "\"" << first_part << "\"\n";
std::cout << "\"" << sec_part << "\"\n";
}
The program output is
"Hello World"
"Hello"
" World"
As you can see, the separating character is included in the second string, though I think it might be better to exclude it from both strings.
An alternative and, in my opinion, clearer approach can look like this:
#include <iostream>
#include <string>
#include <utility>
std::pair<std::string, std::string> divide(const std::string &s, char c)
{
std::string::size_type pos = s.find(c);
return { s.substr(0, pos), pos == std::string::npos ? "" : s.substr(pos) };
}
int main()
{
std::string text("Hello World");
auto p = divide(text, ' ');
std::cout << "\"" << text << "\"\n";
std::cout << "\"" << p.first << "\"\n";
std::cout << "\"" << p.second << "\"\n";
}
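If the separator should be excluded from both strings, as suggested above, the pair-returning version needs only a small change. A possible sketch follows. The name divide_excluding is made up here, and sending the whole input to the first part when the character is missing is an assumption carried over from the code above.

```cpp
#include <string>
#include <utility>

// Splits `s` at the first occurrence of `c`; the separator itself is dropped.
// If `c` does not occur, the whole string goes to .first and .second is empty.
std::pair<std::string, std::string> divide_excluding(const std::string &s, char c)
{
    const std::string::size_type pos = s.find(c);
    if (pos == std::string::npos)
        return std::make_pair(s, std::string());
    // pos + 1 may equal s.size(); substr then returns an empty string.
    return std::make_pair(s.substr(0, pos), s.substr(pos + 1));
}
```

With "Hello World" and ' ' this gives "Hello" and "World" without the leading space.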
Your code will only work as long as the character is found within part1.length(). You need something similar to this:
void string_split_once(const char s, const string & text, string & first, string & second) {
first.clear();
second.clear();
std::size_t pos = text.find(s);
if (pos != string::npos) {
first = text.substr(0, pos);
second = text.substr(pos);
}
}
The biggest problem I see is that you are using at where you should be using push_back. See std::basic_string::push_back. at is designed to access an existing character to read or modify it. push_back appends a new character to the string.
divide could look like this :
void divide(char search, string text, string& first_part,
string& sec_part)
{
bool firstc = true;
for (int i = 0; i < text.size(); i++) {
if (text.at(i) != search && firstc) {
first_part.push_back(text.at(i));
}
else if (text.at(i) == search&& firstc == true) {
firstc = false;
sec_part.push_back(text.at(i));
}
else {
sec_part.push_back(text.at(i));
}
}
}
Since you aren't handling exceptions, consider using text[i] rather than text.at(i).

C++ split string using a list of words as separators

I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one below
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[is][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration
Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}
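Building on the {-1, 0} idea above, here is a sketch that assembles the alternation from the word list and collects the pieces into a vector. It assumes the words contain no regex metacharacters, since no escaping is done, and the name split_keep_delimiters is made up. On precedence: ECMAScript alternation should try the alternatives left to right at a given position, so array order decides between words that could start at the same place. The search itself still runs left to right through the string, which may or may not be what the question means by precedence.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Splits `s` on the given words and keeps the words themselves in the result.
// Assumption: the words are plain text without regex metacharacters.
std::vector<std::string> split_keep_delimiters(const std::string &s,
                                               const std::vector<std::string> &words)
{
    std::vector<std::string> out;
    if (words.empty()) {
        if (!s.empty())
            out.push_back(s);
        return out;
    }

    // Join the words into "w0|w1|w2|..." in array order.
    std::string pattern;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0)
            pattern += '|';
        pattern += words[i];
    }
    const std::regex re(pattern);

    // -1 selects the text between matches, 0 selects the match itself.
    std::sregex_token_iterator it(s.begin(), s.end(), re, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() != 0)  // skip empty pieces, e.g. before a leading word
            out.push_back(it->str());
    }
    return out;
}
```

The empty-piece check matters for the sample input: the string starts with a delimiter, so the first "non-match" token is empty.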
I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
vector<string> strs;
boost::split(strs,line,boost::is_space());

How do I check if a C++ std::string starts with a certain string, and convert a substring to an int?

How do I implement the following (Python pseudocode) in C++?
if argv[1].startswith('--foo='):
foo_value = int(argv[1][len('--foo='):])
(For example, if argv[1] is --foo=98, then foo_value is 98.)
Update: I'm hesitant to look into Boost, since I'm just looking at making a very small change to a simple little command-line tool (I'd rather not have to learn how to link in and use Boost for a minor change).
Use rfind overload that takes the search position pos parameter, and pass zero for it:
std::string s = "tititoto";
if (s.rfind("titi", 0) == 0) { // pos=0 limits the search to the prefix
// s starts with prefix
}
Who needs anything else? Pure STL!
Many have misread this to mean "search backwards through the whole string looking for the prefix". That would give the wrong result (e.g. string("tititito").rfind("titi") returns 2 so when compared against == 0 would return false) and it would be inefficient (looking through the whole string instead of just the start). But it does not do that because it passes the pos parameter as 0, which limits the search to only match at that position or earlier. For example:
std::string test = "0123123";
size_t match1 = test.rfind("123"); // returns 4 (rightmost match)
size_t match2 = test.rfind("123", 2); // returns 1 (skipped over later match)
size_t match3 = test.rfind("123", 0); // returns std::string::npos (i.e. not found)
You would do it like this:
std::string prefix("--foo=");
if (!arg.compare(0, prefix.size(), prefix))
foo_value = std::stoi(arg.substr(prefix.size()));
Looking for a lib such as Boost.ProgramOptions that does this for you is also a good idea.
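The snippet above throws if the text after the prefix is not a number. A slightly more defensive sketch, assuming C++11, is shown below. The function name parse_foo and the decision to reject trailing characters are my own choices for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true and stores the number in `value` when `arg` has the form
// "--foo=<integer>"; returns false otherwise and leaves `value` untouched.
bool parse_foo(const std::string &arg, int &value)
{
    static const std::string prefix("--foo=");
    if (arg.compare(0, prefix.size(), prefix) != 0)
        return false;  // does not start with the prefix
    try {
        const std::string digits = arg.substr(prefix.size());
        std::size_t used = 0;
        const int parsed = std::stoi(digits, &used);
        if (used != digits.size())
            return false;  // trailing characters, as in "--foo=98x"
        value = parsed;
        return true;
    } catch (const std::invalid_argument &) {
        return false;  // nothing numeric after the prefix
    } catch (const std::out_of_range &) {
        return false;  // the number does not fit in an int
    }
}
```

In main this would be called as parse_foo(argv[1], foo_value) after checking argc.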
Just for completeness, I will mention the C way to do it:
If str is your original string, substr is the substring you want to check, then
strncmp(str, substr, strlen(substr))
will return 0 if str starts with substr. The functions strncmp and strlen are in the C header file <string.h>
(originally posted by Yaseen Rauf here, markup added)
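For a case-insensitive comparison, use strnicmp (Windows; spelled _strnicmp in current Microsoft runtimes) or strncasecmp (POSIX, declared in <strings.h>) instead of strncmp. Neither is part of standard C.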
This is the C way to do it, for C++ strings you can use the same function like this:
strncmp(str.c_str(), substr.c_str(), substr.size())
If you're already using Boost, you can do it with boost string algorithms + boost lexical cast:
#include <boost/algorithm/string/predicate.hpp>
#include <boost/lexical_cast.hpp>
try {
if (boost::starts_with(argv[1], "--foo="))
foo_value = boost::lexical_cast<int>(argv[1]+6);
} catch (boost::bad_lexical_cast) {
// bad parameter
}
This kind of approach, like many of the other answers provided here is ok for very simple tasks, but in the long run you are usually better off using a command line parsing library. Boost has one (Boost.Program_options), which may make sense if you happen to be using Boost already.
Otherwise a search for "c++ command line parser" will yield a number of options.
Code I use myself:
std::string prefix = "-param=";
std::string argument = argv[1];
if(argument.substr(0, prefix.size()) == prefix) {
std::string argumentValue = argument.substr(prefix.size());
}
Nobody used the STL algorithm/mismatch function yet. If this returns true, prefix is a prefix of 'toCheck':
std::mismatch(prefix.begin(), prefix.end(), toCheck.begin()).first == prefix.end()
Full example prog:
#include <algorithm>
#include <string>
#include <iostream>
int main(int argc, char** argv) {
if (argc != 3) {
std::cerr << "Usage: " << argv[0] << " prefix string" << std::endl
<< "Will print true if 'prefix' is a prefix of string" << std::endl;
return -1;
}
std::string prefix(argv[1]);
std::string toCheck(argv[2]);
if (prefix.length() > toCheck.length()) {
std::cerr << "Usage: " << argv[0] << " prefix string" << std::endl
<< "'prefix' is longer than 'string'" << std::endl;
return 2;
}
if (std::mismatch(prefix.begin(), prefix.end(), toCheck.begin()).first == prefix.end()) {
std::cout << '"' << prefix << '"' << " is a prefix of " << '"' << toCheck << '"' << std::endl;
return 0;
} else {
std::cout << '"' << prefix << '"' << " is NOT a prefix of " << '"' << toCheck << '"' << std::endl;
return 1;
}
}
Edit:
As @James T. Huggett suggests, std::equal is a better fit for the question "Is A a prefix of B?" and is slightly shorter code:
std::equal(prefix.begin(), prefix.end(), toCheck.begin())
Full example prog:
#include <algorithm>
#include <string>
#include <iostream>
int main(int argc, char **argv) {
if (argc != 3) {
std::cerr << "Usage: " << argv[0] << " prefix string" << std::endl
<< "Will print true if 'prefix' is a prefix of string"
<< std::endl;
return -1;
}
std::string prefix(argv[1]);
std::string toCheck(argv[2]);
if (prefix.length() > toCheck.length()) {
std::cerr << "Usage: " << argv[0] << " prefix string" << std::endl
<< "'prefix' is longer than 'string'" << std::endl;
return 2;
}
if (std::equal(prefix.begin(), prefix.end(), toCheck.begin())) {
std::cout << '"' << prefix << '"' << " is a prefix of " << '"' << toCheck
<< '"' << std::endl;
return 0;
} else {
std::cout << '"' << prefix << '"' << " is NOT a prefix of " << '"'
<< toCheck << '"' << std::endl;
return 1;
}
}
With C++17 you can use std::basic_string_view, and with C++20 std::basic_string::starts_with or std::basic_string_view::starts_with.
The benefit of std::string_view in comparison to std::string - regarding memory management - is that it only holds a pointer to a "string" (contiguous sequence of char-like objects) and knows its size. Example without moving/copying the source strings just to get the integer value:
#include <exception>
#include <iostream>
#include <string>
#include <string_view>
int main()
{
constexpr auto argument = "--foo=42"; // Emulating command argument.
constexpr auto prefix = "--foo=";
auto inputValue = 0;
constexpr auto argumentView = std::string_view(argument);
if (argumentView.starts_with(prefix))
{
constexpr auto prefixSize = std::string_view(prefix).size();
try
{
// The underlying data of argumentView is nul-terminated, therefore we can use data().
inputValue = std::stoi(argumentView.substr(prefixSize).data());
}
catch (std::exception & e)
{
std::cerr << e.what();
}
}
std::cout << inputValue; // 42
}
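The example above relies on the view's data being nul-terminated. If that cannot be guaranteed, std::from_chars (C++17, header <charconv>) parses straight from a character range without needing a terminator or an allocation. A sketch follows, with the hypothetical name parse_foo_view. It uses a substr comparison in place of starts_with so that it also builds as C++17.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parses "--foo=<integer>" without allocating. Returns std::nullopt when the
// prefix is missing or the rest is not entirely a decimal integer.
std::optional<int> parse_foo_view(std::string_view arg)
{
    constexpr std::string_view prefix = "--foo=";
    if (arg.substr(0, prefix.size()) != prefix)  // C++17 stand-in for starts_with
        return std::nullopt;
    arg.remove_prefix(prefix.size());

    int value = 0;
    const char *first = arg.data();
    const char *last = arg.data() + arg.size();
    const std::from_chars_result res = std::from_chars(first, last, value);
    // Reject parse errors as well as leftover characters after the number.
    if (res.ec != std::errc() || res.ptr != last)
        return std::nullopt;
    return value;
}
```

Note that std::from_chars accepts neither leading whitespace nor a leading '+', so it is stricter than std::stoi.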
Given that both strings, argv[1] and "--foo", are C strings, @FelixDombek's answer is hands-down the best solution.
Seeing the other answers, however, I thought it worth noting that, if your text is already available as a std::string, then a simple, zero-copy, maximally efficient solution exists that hasn't been mentioned so far:
const char * foo = "--foo";
if (text.rfind(foo, 0) == 0)
foo_value = text.substr(strlen(foo));
And if foo is already a string:
std::string foo("--foo");
if (text.rfind(foo, 0) == 0)
foo_value = text.substr(foo.length());
Starting with C++20, you can use the starts_with method.
std::string s = "abcd";
if (s.starts_with("abc")) {
...
}
text.substr(0, start.length()) == start
Using STL this could look like:
std::string prefix = "--foo=";
std::string arg = argv[1];
if (prefix.size()<=arg.size() && std::equal(prefix.begin(), prefix.end(), arg.begin())) {
std::istringstream iss(arg.substr(prefix.size()));
iss >> foo_value;
}
At the risk of being flamed for using C constructs, I do think this sscanf example is more elegant than most Boost solutions. And you don't have to worry about linkage if you're running anywhere that has a Python interpreter!
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
for (int i = 1; i != argc; ++i) {
int number = 0;
int size = 0;
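// sscanf returns 1 only if %d matched; %n then stores the number of characters consumed.
if (sscanf(argv[i], "--foo=%d%n", &number, &size) == 1 && (size_t)size == strlen(argv[i])) {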
printf("number: %d\n", number);
}
else {
printf("not-a-number\n");
}
}
return 0;
}
Here's some example output that demonstrates the solution handles leading/trailing garbage as correctly as the equivalent Python code, and more correctly than anything using atoi (which will erroneously ignore a non-numeric suffix).
$ ./scan --foo=2 --foo=2d --foo='2 ' ' --foo=2'
number: 2
not-a-number
not-a-number
not-a-number
I use std::string::compare wrapped in a utility function like the one below:
static bool startsWith(const string& s, const string& prefix) {
return s.size() >= prefix.size() && s.compare(0, prefix.size(), prefix) == 0;
}
C++20 update :
Use std::string::starts_with
https://en.cppreference.com/w/cpp/string/basic_string/starts_with
std::string str_value = /* something */;
const auto starts_with_foo = str_value.starts_with(std::string_view{"foo"});
C++20 adds starts_with as a member function of std::string, declared as:
constexpr bool starts_with(string_view sv) const noexcept;
constexpr bool starts_with(CharT c) const noexcept;
constexpr bool starts_with(const CharT* s) const;
So your code could be something like this:
std::string s{argv[1]};
if (s.starts_with("--foo="))
In case you need C++11 compatibility and cannot use boost, here is a boost-compatible drop-in with an example of usage:
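#include <algorithm> // std::equal
#include <exception>
#include <iostream>
#include <string>
// Take the arguments by const reference to avoid copying both strings on every call.
static bool starts_with(const std::string& str, const std::string& prefix)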
{
return ((prefix.size() <= str.size()) && std::equal(prefix.begin(), prefix.end(), str.begin()));
}
int main(int argc, char* argv[])
{
bool usage = false;
unsigned int foos = 0; // default number of foos if no parameter was supplied
if (argc > 1)
{
const std::string fParamPrefix = "-f="; // shorthand for foo
const std::string fooParamPrefix = "--foo=";
for (int i = 1; i < argc; ++i) // int, to match the type of argc
{
const std::string arg = argv[i];
try
{
if ((arg == "-h") || (arg == "--help"))
{
usage = true;
} else if (starts_with(arg, fParamPrefix)) {
foos = std::stoul(arg.substr(fParamPrefix.size()));
} else if (starts_with(arg, fooParamPrefix)) {
foos = std::stoul(arg.substr(fooParamPrefix.size()));
}
} catch (std::exception& e) {
std::cerr << "Invalid parameter: " << argv[i] << std::endl << std::endl;
usage = true;
}
}
}
if (usage)
{
std::cerr << "Usage: " << argv[0] << " [OPTION]..." << std::endl;
std::cerr << "Example program for parameter parsing." << std::endl << std::endl;
std::cerr << " -f, --foo=N use N foos (optional)" << std::endl;
return 1;
}
std::cerr << "number of foos given: " << foos << std::endl;
}
Why not use GNU getopt_long? Here's a basic example (without safety checks):
#include <getopt.h>
#include <stdio.h>
int main(int argc, char** argv)
{
option long_options[] = {
{"foo", required_argument, 0, 0},
{0,0,0,0}
};
getopt_long(argc, argv, "f:", long_options, 0);
printf("%s\n", optarg);
}
For the following command:
$ ./a.out --foo=33
You will get
33
OK, why the complicated use of libraries and stuff? C++ std::string objects overload the [] operator, so you can just compare chars. That is what I just did, because I want to list all files in a directory and ignore invisible files and the .. and . pseudofiles.
while ((ep = readdir(dp)))
{
string s(ep->d_name);
if (!(s[0] == '.')) // Omit invisible files and .. or .
files.push_back(s);
}
It's that simple.
You can also use strstr:
if (strstr(str, substr) == substr) {
// 'str' starts with 'substr'
}
but I think it's good only for short strings because it has to loop through the whole string when the string doesn't actually start with 'substr'.
You can use std::string::find() and find_first_of() (both are available in every C++ standard, not only C++11 or higher).
Example using find to find a single char:
#include <string>
std::string name = "Aaah";
size_t found_index = name.find('a');
if (found_index != std::string::npos) {
// Found string containing 'a'
}
Example using find to search for a single char, starting from position 3:
std::string name = "Aaah";
size_t found_index = name.find('h', 3);
if (found_index != std::string::npos) {
// Found string containing 'h'
}
Example using find_first_of() and checking that the first occurrence is at index 0 (note that the whole string is still scanned when the char is absent):
std::string name = ".hidden._di.r";
size_t found_index = name.find_first_of('.');
if (found_index == 0) {
// Found '.' at first position in string
}
Good luck!
std::string text = "--foo=98";
std::string start = "--foo=";
if (text.find(start) == 0)
{
int n = stoi(text.substr(start.length()));
std::cout << n << std::endl;
}
Since C++11, std::regex_search can also be used to match even more complex expressions. The following example also handles floating-point numbers through std::stof and a subsequent cast to int.
However, the parseInt function shown below throws a std::invalid_argument exception if the prefix is not matched; this can easily be adapted to the application at hand:
#include <iostream>
#include <regex>
int parseInt(const std::string &str, const std::string &prefix) {
std::smatch match;
std::regex_search(str, match, std::regex("^" + prefix + "([+-]?(?=\\.?\\d)\\d*(?:\\.\\d*)?(?:[Ee][+-]?\\d+)?)$"));
return std::stof(match[1]);
}
int main() {
std::cout << parseInt("foo=13.3", "foo=") << std::endl;
std::cout << parseInt("foo=-.9", "foo=") << std::endl;
std::cout << parseInt("foo=+13.3", "foo=") << std::endl;
std::cout << parseInt("foo=-0.133", "foo=") << std::endl;
std::cout << parseInt("foo=+00123456", "foo=") << std::endl;
std::cout << parseInt("foo=-06.12e+3", "foo=") << std::endl;
// throw std::invalid_argument
// std::cout << parseInt("foo=1", "bar=") << std::endl;
return 0;
}
The regex pattern is explained in detail in the following answer.
EDIT: the previous answer did not perform the conversion to integer.
if(boost::starts_with(string_to_search, string_to_look_for))
intval = boost::lexical_cast<int>(string_to_search.substr(string_to_look_for.length()));
This is completely untested. The principle is the same as the Python one. Requires Boost.StringAlgo and Boost.LexicalCast.
Check if the string starts with the other string, and then get the substring ('slice') of the first string and convert it using lexical cast.