C++ regex segfault on long sequences - c++

I was parsing a Stack Overflow dump and came across a seemingly innocent question with one small, almost invisible detail: it has 22311 spaces at the end of the text.
I'm using std::regex (somehow it works better for me than boost::regex) to replace every run of whitespace with a single space, like this:
std::regex space_regex("\\s+", std::regex::optimize);
...
std::regex_replace(out, in, in + strlen(in), space_regex, " ");
This raises SIGSEGV, so I began to investigate.
Test code:
#include <regex>
...
std::regex r("\\s+", std::regex::optimize);
const char* bomb2 = "Small text\n\nwith several\n\nlines.";
std::string test(bomb2);
for (auto i = 0; i < N; ++i) test += " ";
std::string out = std::regex_replace(test.c_str(), r, " ");
std::cout << out << std::endl;
for (gcc 5.3.0)
$ g++ -O3 -std=c++14 regex-test.cpp -o regex-test.out
maximum N before SIGSEGV shows up is 21818 (for this particular string), and for
$ g++ -O0 -std=c++14 regex-test.cpp -o regex-test.out
it's 12180.
'OK, let's try clang, it's trending and aims to replace gcc.' Never have I been so wrong. With -O0, clang (v. 3.7.1) crashes at 9696 spaces, fewer than gcc but not by much; with -O3, and even with -O2, it crashes at ZERO spaces.
The crash dump shows huge stack traces (35k frames) of recursive calls to
std::__detail::_Executor<char*, std::allocator<std::__cxx11::sub_match<char*> >, std::__cxx11::regex_traits<char>, true>::_M_dfs
Question 1: Is this a bug? If so, should I report it?
Question 2: Is there a smart way to overcome the problem (other than increasing the system stack size, trying other regex libraries, or writing my own function to replace whitespace)?
Amendment: bug report created for libstdc++

Is this a bug? If so, should I report it?
Yes, this is a bug.
cout << '"' << regex_replace("Small text\n\nwith several\n\nlines." + string(22311, ' '), regex("\\s+", regex::optimize), " ") << '"' << endl;
Runs fine with libc++: http://coliru.stacked-crooked.com/a/f9ee5438745a5b22
Runs fine with Visual Studio 2015, you can test by copying and running the code at: http://webcompiler.cloudapp.net/
Fails with libstdc++: http://coliru.stacked-crooked.com/a/3f4bbe5c46b6b627
A bug has been filed against libstdc++ here.
Is there a smart way to overcome the problem?
If you're asking for a new regex that works: I've tried a handful of different versions, and all of them fail on libstdc++. So if you want to solve this with a regex, you'll need to compile against libc++.
But honestly, if you're using a regex to strip duplicate whitespace, "Now you have two problems".
A better solution could use adjacent_find which runs fine with libstdc++ as well:
// isspace has undefined behavior for negative char values, so take unsigned char
const auto func = [](const unsigned char a, const unsigned char b) { return isspace(a) && isspace(b); };
for (auto it = adjacent_find(begin(test), end(test), func); it != end(test); it = adjacent_find(it, end(test), func)) {
    *it = ' ';
    it = test.erase(next(it), find_if_not(next(it), end(test), [](const unsigned char c) { return isspace(c); }));
}
This will return the same thing your regex would:
"Small text with several lines. "
But if you're going for simplicity, you could also use unique:
test.resize(distance(test.begin(), unique(test.begin(), test.end(), [](const unsigned char a, const unsigned char b) { return isspace(a) && isspace(b); })));
Which will return:
"Small text
with several
lines. "
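If the goal is exactly what the regex does (every run of whitespace, including a lone newline or tab, becomes one space), a regex-free helper could look roughly like the sketch below. The names collapse_whitespace and collapse_whitespace_demo are made up for illustration and are not from the posts above; the sketch assumes single-byte characters. Unlike the adjacent_find loop, it also rewrites a single newline between two words, as \s+ would.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: replace every run of whitespace with a single ' '.
// It works iteratively, so the input length does not affect stack depth.
std::string collapse_whitespace(std::string s) {
    // isspace must not receive a negative char value, hence unsigned char.
    const auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };

    // Step 1: normalise '\n', '\t', ... to ' '.
    std::replace_if(s.begin(), s.end(), is_space, ' ');

    // Step 2: keep only the first space of each run.
    const auto both_spaces = [](char a, char b) { return a == ' ' && b == ' '; };
    s.erase(std::unique(s.begin(), s.end(), both_spaces), s.end());
    return s;
}

// Small check using the string from the question.
void collapse_whitespace_demo() {
    const std::string in =
        "Small text\n\nwith several\n\nlines." + std::string(22311, ' ');
    assert(collapse_whitespace(in) == "Small text with several lines. ");
}
```

Calling collapse_whitespace_demo() from main() exercises the 22311-space input from the question.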

Question 2 (smart way to overcome the problem)
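Not really smart, but you can apply a bounded replacement repeatedly: cap the repetition count so that each match stays short, and loop until the string stops shrinking.
An example: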
#include <regex>
#include <string>
#include <iostream>
int main()
{
constexpr int N = 22311;
//std::regex r("\\s+");
std::regex r("\\s{2,100}");
const char* bomb2 = "Small text\n\nwith several\n\nlines.";
std::string test(bomb2);
for (auto i = 0; i < N; ++i)
test += " ";
std::string out = test;
std::size_t preSize;
do
{
preSize = out.size();
out = std::regex_replace(out, r, " ");
}
while ( out.size() < preSize );
std::cout << '\"' << out << '\"' << std::endl;
return 0;
}

Related

c++ regex search pattern not found

Following the example here, I wrote the following code:
using namespace std::regex_constants;
std::string str("{trol,asdfsad},{safsa, aaaaa,aaaaadfs}");
std::smatch m;
std::regex r("\\{(.*)\\}"); // matches anything between {}
std::cout << "Initiating search..." << std::endl;
while (std::regex_search(str, m, r)) {
for (auto x : m) {
std::cout << x << " ";
}
std::cout << std::endl;
str = m.suffix().str();
}
But to my surprise, it doesn't find anything at all which I fail to understand. I would understand if the regex matches whole string since .* is greedy but nothing at all? What am I doing wrong here?
To be clear - I know that regexes are not suitable for Parsing BUT I won't deal with more levels of bracket nesting and therefore I find usage of regexes good enough.
If you want to use basic posix syntax, your regex should be
{\\(.*\\)}
If you want to use default ECMAScript, your regex should be
\\{(.*)\\}
With clang and libc++, or with gcc 4.9+ (the first gcc release that fully supports regex), your code gives:
Initiating search...
{trol,asdfsad},{safsa, aaaaa,aaaaadfs} trol,asdfsad},{safsa, aaaaa,aaaaadfs
Live example on coliru
Eventually it turned out to really be a problem with the gcc version, so I finally got it working using the boost::regex library and the following code:
std::string str("{trol,asdfsad},{safsa,aaaaa,aaaaadfs}");
boost::regex rex("\\{(.*?)\\}", boost::regex_constants::perl);
boost::smatch result;
while (boost::regex_search(str, result, rex)) {
for (std::size_t i = 0; i < result.size(); ++i) { // uint is not a standard type
std::cout << result[i] << " ";
}
std::cout << std::endl;
str = result.suffix().str();
}
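On a toolchain whose std::regex works (gcc 4.9+ or libc++, per the answer above), the same lazy pattern can be used without Boost. The following is only a sketch under that assumption; the helper name braced_groups and its demo function are hypothetical and not part of the original posts.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the text inside each {...} group.
std::vector<std::string> braced_groups(const std::string& text) {
    // The lazy .*? stops at the first closing brace
    // (ECMAScript syntax, which is the std::regex default).
    static const std::regex rex("\\{(.*?)\\}");

    std::vector<std::string> groups;
    for (std::sregex_iterator it(text.begin(), text.end(), rex), end; it != end; ++it) {
        groups.push_back((*it)[1].str());  // capture group 1 = text between the braces
    }
    return groups;
}

// Small check using the string from the Boost version above.
void braced_groups_demo() {
    const auto g = braced_groups("{trol,asdfsad},{safsa,aaaaa,aaaaadfs}");
    assert(g.size() == 2);
    assert(g[0] == "trol,asdfsad");
    assert(g[1] == "safsa,aaaaa,aaaaadfs");
}
```

Iterating with std::sregex_iterator avoids copying the suffix into str on every match, which the while loop over regex_search does.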

C++11 - bad_alloc on a constexpr

Arrays of bitmasks are really popular, but they are often tedious to write and make the code less readable. I would like to generate them with a constexpr function; here is my attempt:
#include <iostream>
#include <cstdint>
#include <vector>
#include <utility>
typedef uint32_t myT;
template <typename T>
constexpr std::vector<T> vecFarm(T &&lower, T &&upper, T &&step) {
// std::cout << lower << " " << upper << " " << step << "\n";
std::vector<T> v;
if (lower < upper) {
for (T count = lower; count < upper; count += step) {
v.push_back(count);
};
}
return (v);
}
int main() {
std::vector<myT> k(std::move(vecFarm(myT(0), ~(myT(0)), myT(256)))); //why
// this doesn't work ?
// std::vector<myT> k(std::move(vecFarm(myT(0), ((~(myT(0))) >> 16), myT(256))));
// but this one works
// let's see what we got
for (const auto &j : k) {
std::cout << j << " ";
}
std::cout << "\n";
return (0);
}
I have used std::move, unnamed objects and a constexpr function. This code compiles fine with
g++-4.8 -O3 -std=c++11 -pthread -Werror -Wall -Wextra
but it fails at run time with a bad_alloc, and I can see my "small" application allocating a lot of memory.
Maybe the error is obvious and I can't see it, but why doesn't this work?
Why does my application do the allocation at run time? Isn't it supposed to compute everything at compile time? I expected this to fail at compile time, if at all, not at run time.
std::bad_alloc usually means it cannot allocate any more memory. Changing your code to the following will show you why:
for (T count = lower; count < upper; count += step) {
std::cout << "count:" << count << "\n";
std::cout << "upper:" << upper << "\n";
};
This prints the following on the first loop when I tested it:
count:0
upper:4294967295
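In other words, you have a long way to go before count < upper fails and the for loop stops, especially since you are adding only 256 each time. In fact, with a 32-bit unsigned count the condition never fails: the last value below upper is 4294967040, and adding 256 to it wraps around to 0, so the vector keeps growing until an allocation fails.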
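One way to keep such a loop from wrapping is to test the remaining distance to upper before incrementing, instead of comparing after the increment. The sketch below is an assumption about how that could be written, not the asker's code: vec_farm and vec_farm_demo are hypothetical names, and the function is restricted to unsigned types. The demo uses an 8-bit type, where the original loop would wrap from 192 + 64 back to 0.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical variant of vecFarm that cannot wrap around.
template <typename T>
std::vector<T> vec_farm(T lower, T upper, T step) {
    static_assert(std::is_unsigned<T>::value, "sketch assumes an unsigned type");
    assert(step > 0);  // a zero step would never make progress

    std::vector<T> v;
    if (!(lower < upper)) return v;

    for (T count = lower;; count += step) {
        v.push_back(count);  // invariant: count < upper here
        // Stop if the next value would reach or pass upper (or wrap around).
        if (upper - count <= step) break;
    }
    return v;
}

// Small check with an 8-bit type.
void vec_farm_demo() {
    const auto v = vec_farm<std::uint8_t>(0, 255, 64);
    assert((v == std::vector<std::uint8_t>{0, 64, 128, 192}));
}
```

With myT(0), ~myT(0) and myT(256) this variant stops after 16777216 elements (about 64 MB of uint32_t values) instead of growing without bound.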
Also, for a constexpr function to be evaluated at compile time, it has to fulfill certain conditions. For example, its return type must be a LiteralType, and your function returns std::vector, which is not one. In C++11 the body must also consist of exactly one return statement that contains only literal values, constexpr variables and functions, whereas yours is a compound statement with a loop. Therefore, your function cannot be evaluated at compile time.
Also note that if you do not fulfill these conditions, the constexpr qualifier is ignored, although turning on -pedantic should give you better diagnostics.
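If a fixed-size table is acceptable, the values can be produced at compile time by returning a literal type instead of std::vector. The sketch below assumes a C++14 compiler (loops are not allowed in C++11 constexpr functions, so it would not build with the -std=c++11 flag used in the question). Table, make_table and masks are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// A plain aggregate is a literal type, so it can be built in a constant expression.
template <typename T, std::size_t N>
struct Table {
    T data[N];
};

// Fill N entries starting at lower, spaced by step (C++14 relaxed constexpr).
template <typename T, std::size_t N>
constexpr Table<T, N> make_table(T lower, T step) {
    Table<T, N> t{};
    T value = lower;
    for (std::size_t i = 0; i < N; ++i) {
        t.data[i] = value;
        value += step;
    }
    return t;
}

// Initialising a constexpr variable forces compile-time evaluation.
constexpr auto masks = make_table<std::uint32_t, 4>(0, 256);
static_assert(masks.data[0] == 0, "first entry");
static_assert(masks.data[3] == 768, "last entry");
```

The static_assert lines show that the values exist before the program runs; no allocation happens at run time.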

Getting C-string from local copy of returned std::string

I am trying to debug a problem related to the scope of the character array contained within a std::string. I have posted the relevant code sample below,
#include <cstdlib>   // getenv
#include <cstring>   // strcmp
#include <iostream>
#include <string>
const char* objtype;
namespace A
{
std::string get_objtype()
{
std::string result;
std::string envstr( ::getenv("CONFIG_STR") );
std::size_t pos1 = 0, pos2 = 0, pos3 = 0;
pos1 = envstr.find_first_of("objtype");
if (pos1 != std::string::npos)
pos2 = envstr.find_first_of("=", pos1+7);
if (pos2 != std::string::npos)
{
pos3 = envstr.find_first_of(";", pos2+1);
if (pos3 != std::string::npos)
result = envstr.substr(pos2+1, pos3 - pos2 - 1);
}
const char* result_cstr = result.c_str();
std::cerr << "get_objtype()" << reinterpret_cast<long>((void*)result_cstr) << std::endl;
return result;
}
void set_objtype()
{
objtype = get_objtype().c_str();
std::cerr << "Objtype " << objtype << std::endl;
std::cerr << "main()" << reinterpret_cast<long>((void*)objtype) << std::endl;
}
}
int main()
{
using namespace A;
std::cerr << "main()" << reinterpret_cast<long>((void*)objtype) << std::endl;
set_objtype();
if (::strcmp(objtype, "AAAA") == 0)
std::cerr << "Do work for objtype == AAAA " << std::endl;
else
std::cerr << "Do work for objtype != AAAA" << std::endl;
}
This was compiled and executed on MacOS 12.3 with g++ 4.2.1. The output from running this is as follows,
$ g++ -g -DNDEBUG -o A.exe A.cpp
$ CONFIG_STR="objtype=AAAA;objid=21" ./A.exe
main()0
get_objtype()140210713147944
Objtype AAAA
main()140210713147944
Do work for objtype == AAAA
$
My questions are these:
The pointer value printed from main() and get_objtype() are the same. Is this due to RVO?
The last line of output shows that the global pointer to C-string is ok even when the enclosing std::string is out of scope. So, when does the returned value go out of scope and the string array deleted? Any help from the community is appreciated. Thanks.
The pointer value won't change, but the memory it points to may no longer be part of a string.
objtype is invalid on the line right after you set it in set_objtype() because the result of get_objtype() isn't saved anywhere, so the compiler is free to kill it there and then.
It may work, but it's accessing invalid memory, so it is invalid code and if you rely on things like this, you will eventually run into big problems.
You should look at the disassembly using objdump to check whether it's RVO.
But, from experiments I did (making result global and making copies of it), it looks like c_str is reference counted.
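A common way to avoid the dangling pointer is to keep the std::string itself alive for as long as the C string is needed, and only then call c_str(). The following is a minimal sketch of that idea, not the asker's code: parse_objtype, objtype_storage and the demo function are hypothetical names, and the parser uses find (which matches the whole key) rather than find_first_of (which matches any single character of it).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical parser: value after "objtype=" up to the next ';' (or the end).
std::string parse_objtype(const std::string& config) {
    const std::string key = "objtype=";
    const std::size_t start = config.find(key);
    if (start == std::string::npos) return std::string();

    const std::size_t value_begin = start + key.size();
    const std::size_t value_end = config.find(';', value_begin);
    const std::size_t count =
        (value_end == std::string::npos) ? std::string::npos : value_end - value_begin;
    return config.substr(value_begin, count);
}

// The owning string lives as long as the pointer is used.
std::string objtype_storage;
const char* objtype = nullptr;

void set_objtype(const std::string& config) {
    objtype_storage = parse_objtype(config);
    // Valid until objtype_storage is modified or destroyed.
    objtype = objtype_storage.c_str();
}

// Small check using the CONFIG_STR value from the question.
void set_objtype_demo() {
    set_objtype("objtype=AAAA;objid=21");
    assert(std::strcmp(objtype, "AAAA") == 0);
}
```

In the original program the config string would come from ::getenv("CONFIG_STR"), which can return a null pointer and should be checked before constructing a std::string from it.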

Split string by regex in VC++

I am using VC++ 10 in a project. Being new to C/C++, I just Googled and it appears that standard C++ doesn't have regex? VC++ 10 seems to have regex. However, how do I do a regex split? Do I need Boost just for that?
Searching the web, I found that many recommend Boost for many things: tokenizing/splitting strings, parsing (PEG), and now even regex (though this should be built in ...). Can I conclude Boost is a must-have? It's 180MB for just trivial things that are supported natively in many languages.
The C++11 standard has std::regex. It is also included in TR1 for Visual Studio 2010. Actually, TR1 has been available since VS2008, under the std::tr1 namespace. So you don't need Boost.Regex for VS2008 or later.
Splitting can be performed using regex_token_iterator:
#include <iostream>
#include <string>
#include <regex>
const std::string s("The-meaning-of-life-and-everything");
const std::tr1::regex separator("-");
const std::tr1::sregex_token_iterator endOfSequence;
std::tr1::sregex_token_iterator token(s.begin(), s.end(), separator, -1);
while(token != endOfSequence)
{
std::cout << *token++ << std::endl;
}
If you also need the separator itself, you can obtain it from the sub_match object that token points to: it is a pair containing the start and end iterators of the token.
while(token != endOfSequence)
{
const std::tr1::sregex_token_iterator::value_type& subMatch = *token;
if(subMatch.first != s.begin())
{
const char sep = *(subMatch.first - 1);
std::cout << "Separator: " << sep << std::endl;
}
std::cout << *token++ << std::endl;
}
This sample covers the case of a single-character separator. If the separator can be an arbitrary substring, you need some more complex iterator work and possibly have to store the previous token's sub_match object.
Or you can use regex groups, placing the separators in the first group and the real token in the second:
const std::string s("The-meaning-of-life-and-everything");
const std::tr1::regex separatorAndStr("(-*)([^-]*)");
const std::tr1::sregex_token_iterator endOfSequence;
// Separators will be the 0th, 2nd, 4th... tokens
// Real tokens will be the 1st, 3rd, 5th... tokens
int subMatches[] = { 1, 2 };
std::tr1::sregex_token_iterator token(s.begin(), s.end(), separatorAndStr, subMatches);
while(token != endOfSequence)
{
std::cout << *token++ << std::endl;
}
Not sure it is 100% correct, but just to illustrate the idea.
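On a compiler with full C++11 support, the -1 form of the token iterator shown above can be wrapped in a small helper that uses std::regex instead of std::tr1::regex. This is only a sketch; the function names split and split_demo are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split s on every match of separator.
std::vector<std::string> split(const std::string& s, const std::regex& separator) {
    // Submatch index -1 selects the text between matches instead of the matches.
    std::sregex_token_iterator first(s.begin(), s.end(), separator, -1);
    std::sregex_token_iterator last;  // default-constructed = end of sequence
    return std::vector<std::string>(first, last);
}

// Small check using the string from the answer above.
void split_demo() {
    const std::regex dash("-");
    const auto parts = split("The-meaning-of-life-and-everything", dash);
    assert(parts.size() == 6);
    assert(parts.front() == "The");
    assert(parts.back() == "everything");
}
```

Because the separator is a regex, a pattern such as ",\\s*" splits on a comma followed by optional whitespace without any extra iterator work.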
Here is an example from this blog.
You'll have all your matches in res:
std::tr1::cmatch res;
std::string str = "<h2>Egg prices</h2>";
std::tr1::regex rx("<h(.)>([^<]+)");
std::tr1::regex_search(str.c_str(), res, rx);
std::cout << res[1] << ". " << res[2] << "\n";

very weird hang at a for loop initialization

I have a very weird bug that I can't seem to figure out. I have narrowed it down to a small section of code (unless the compiler is reordering my statements, which I don't believe is true).
...
std::cout << "here"<< std::endl;
std::vector<int>::iterator n_iter;
std::vector<int>::iterator l_iter;
std::cout << "here?" << std::endl;
for(n_iter = n.begin(), std::cout << "not here" ; std::cout << "or here" && n_iter < n.end(); n_iter++)
{
std::cout << "do i get to the n loop?";
...
}
When I run this, I see the first "here", the second "here?", but I don't get the "not here" or the "or here" output. And I definitely don't get the "do i get to the n loop?".
The weird thing is that my program is working (it is almost using up an entire cpu core... ), but it doesn't finish, it just hangs.
I've tried using clang++ and g++, and I'm not using any optimizations. I have the boost library installed (and am using the boost_program_options part of it), along with armadillo. But I don't think the compiler should be reordering things...
It happens with or without the cout calls inside the for loop declaration, and it doesn't just skip the loop.
The vector "n" has a length of at least 1, and is given by a boost_program_options call.
Any ideas?
The first thing you should try is to output std::endl after each string. This flushes the buffer for the output.
The following program (which has some extra newlines that yours didn't):
#include <string>
#include <iostream>
#include <vector>
int main() {
std::vector<int> n;
n.push_back(3);
n.push_back(3);
n.push_back(3);
std::cout << "here"<< std::endl;
std::vector<int>::iterator n_iter;
std::vector<int>::iterator l_iter;
std::cout << "here?" << std::endl;
for(n_iter = n.begin(), std::cout << "not here\n" ; std::cout << "or here\n" && n_iter < n.end(); n_iter++)
{
std::cout << "do i get to the n loop?\n";
}
}
Has the following output:
[5:02pm][wlynch#orange /tmp] make foo
g++ foo.cc -o foo
[5:02pm][wlynch#orange /tmp] ./foo
here
here?
not here
or here
do i get to the n loop?
or here
do i get to the n loop?
or here
do i get to the n loop?
or here
This appears to be what you expect, so I'm not sure where you are having issues on your end, but it may be in skipped code.
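To apply the flushing advice consistently while hunting a hang, the debug output can go through one small function that always flushes. The sketch below is an assumption about how that might look; trace and trace_demo are hypothetical names. The demo writes to a std::ostringstream only so that the result can be checked; in the real program the stream would be std::cout or std::cerr.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: write one debug line and flush it immediately,
// so the message is visible even if the program hangs right afterwards.
void trace(std::ostream& os, const std::string& msg) {
    os << msg << '\n' << std::flush;
}

// Small check that captures the output in a string stream.
void trace_demo() {
    std::ostringstream captured;
    trace(captured, "not here");
    trace(captured, "or here");
    assert(captured.str() == "not here\nor here\n");
}
```

In the loop from the question this would read, for example, trace(std::cout, "not here") in the initialization clause; if that line appears and the next one does not, the hang is between the two calls rather than before the loop.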