Combining regex and ranges causes memory issues - c++

I wanted to construct a view over all the sub-matches of regex in text. Here are two ways to define such a view:
char const text[] = "The IP addresses are: 192.168.0.25 and 127.0.0.1";
std::regex regex{R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))"};
auto sub_matches_view =
std::ranges::subrange(
std::cregex_iterator{std::ranges::begin(text), std::ranges::end(text), regex},
std::cregex_iterator{}
) |
std::views::join;
auto sub_matches_sv_view =
std::ranges::subrange(
std::cregex_iterator{std::ranges::begin(text), std::ranges::end(text), regex},
std::cregex_iterator{}
) |
std::views::join |
std::views::transform([](std::csub_match const& sub_match) -> std::string_view { return {sub_match.first, sub_match.second}; });
sub_matches_view's value type is std::csub_match. It is created by first constructing a view of std::cmatch objects (via the regex iterator), and since each std::cmatch is a range of std::csub_match objects, it is flattened with std::views::join.
sub_matches_sv_view's value type is std::string_view. It is identical to sub_matches_view, except it also wraps each element of sub_matches_view in a std::string_view.
Here's a usage example of the above ranges:
for(auto const& sub_match : sub_matches_view) {
std::cout << std::string_view{sub_match.first, sub_match.second} << std::endl; // #1
}
for(auto const& sv : sub_matches_sv_view) {
std::cout << sv << std::endl; // #2
}
Loop #1 works without problems - the printed results are correct. However, loop #2 causes heap-use-after-free issues according to the Address Sanitizer. In fact, just looping over sub_matches_sv_view without accessing the elements at all causes this problem too. Here is the code on Compiler Explorer as well as the output of the Address Sanitizer.
I am out of ideas as to where my mistake is. text and regex never go out of scope, I don't see any iterators that might be accessed outside of their lifetimes. The std::csub_match object holds iterators (.first, .second) into text, so I don't think it needs to remain alive itself after constructing the std::string_view in std::views::transform.
I know there are many other ways to iterate over regex matches, but I am specifically interested in what's causing the memory bugs in my program. I don't need work-arounds for this issue.

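The problem is std::regex_iterator and the fact that it is a stashing iterator: it stores the match results inside the iterator object itself and hands out references to them.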
That type basically looks like this:
class regex_iterator {
vector<match> matches;
public:
auto operator*() const -> vector<match> const& { return matches; }
};
What this means, for instance, is that even though this iterator's reference type is T const&, if you have two copies of the same iterator, they'll actually give you references into different objects.
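That claim can be checked directly. The following is a minimal sketch, not part of the original answer; the helper name copies_share_match_object is hypothetical. It copies a regex iterator and compares the addresses of the objects the two copies dereference to.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: reports whether two copies of one regex iterator
// return references to the same match_results object.
bool copies_share_match_object(const std::string& text, const std::regex& re) {
    std::sregex_iterator original{text.begin(), text.end(), re};
    // Dereferencing an end iterator is undefined, so require a match.
    assert(original != std::sregex_iterator{} && "text must contain a match");

    std::sregex_iterator copy = original;
    const std::smatch& from_original = *original;
    const std::smatch& from_copy = *copy;

    // Each iterator stores its own match results, so the addresses differ.
    return &from_original == &from_copy;
}
```

For any text that contains a match, this should return false: the two iterators compare equal, yet each one refers to its own stored match.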
Now, join_view<R>::iterator basically looks like this:
class iterator {
// the iterator into the range we're joining
iterator_t<R> outer;
// an iterator into *outer that we're iterating over
iterator_t<range_reference_t<R>> inner;
};
Which, for regex_iterator, roughly looks like this:
class iterator {
// the regex matches
vector<match> outer;
// the current match
match* inner;
};
Now, what happens when you copy this iterator? The copy's inner still refers to the original's outer! These aren't actually independent in the way that you'd expect. Which means that if the original goes out of scope, we have a dangling iterator!
This is what you're seeing here: transform_view ends up copying the iterator (as it is certainly allowed to do), and now you have a dangling iterator (libc++'s implementation moves instead, which is why it happens to work in this case as 康桓瑋 pointed out). But we can reproduce the same issue without transform as long as we copy the iterator and destroy the original. For instance:
#include <iostream>
#include <optional>
#include <ranges>
#include <regex>
#include <string_view>
int main() {
std::string_view text = "The IP addresses are: 192.168.0.25 and 127.0.0.1";
std::regex regex{R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))"};
auto a = std::ranges::subrange(
std::cregex_iterator(std::ranges::begin(text), std::ranges::end(text), regex),
std::cregex_iterator{}
);
auto b = a | std::views::join;
std::optional i = b.begin();
std::cout << std::string_view((*i)->first, (*i)->second) << '\n'; // fine
auto j = *i;
i.reset();
std::cout << std::string_view(j->first, j->second) << '\n'; // boom
}
I'm not sure what a solution to this problem would look like, but the cause is the std::regex_iterator and not the views::join or the views::transform.
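This does not fix the stashing behavior, but a sketch that avoids copying join iterators altogether may help illustrate where the data really lives. It assumes a hypothetical helper, collect_sub_matches, that walks the regex iterator with plain loops and stores a std::string_view for each sub-match. The views point into the text, not into the iterator, so they remain usable after the iterator is gone.

```cpp
#include <cstddef>
#include <regex>
#include <string_view>
#include <vector>

// Hypothetical helper: eagerly turns every sub-match into a string_view.
// The views point into `text`, so `text` must outlive the returned vector.
std::vector<std::string_view> collect_sub_matches(std::string_view text,
                                                  const std::regex& re) {
    std::vector<std::string_view> result;
    const char* first = text.data();
    const char* last = first + text.size();

    // `it` stays alive for the whole loop, so the reference returned by
    // *it never outlives the iterator that owns the match results.
    for (std::cregex_iterator it{first, last, re}, end{}; it != end; ++it) {
        for (const std::csub_match& sub : *it) {
            result.emplace_back(sub.first,
                                static_cast<std::size_t>(sub.length()));
        }
    }
    return result;
}
```

With the text and regex from the question, the result should hold ten views: each full IP address followed by its four octets.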

Related

C++: is it possible to use "universal" pointer to vector?

Good day, SO community!
I am new to C++ and I've run into a situation in my project where I have 2 vectors of similar paired data types:
std::vector<std::pair<int, std::string>> firstDataVector;
std::vector<std::pair<int, std::string>> secondDataVector;
and in one part of the code I need to select and process the vector depending on the external string value. So my question is - is it possible to create a pointer to vector outside of the conditions
if (stringValue.find("firstStringCondition") != std::string::npos)
{
//use firstDataVector
}
if (stringValue.find("secondStringCondition") != std::string::npos)
{
//use secondDataVector
}
I mean some kind of pDataVector pointer, to which the existing vectors could be assigned (the project has only two of them now, but the number of vectors might increase).
I've tried to create a std::vector<std::string> &pDataVector reference, but that does not work because a reference variable must be initialized. So, summarizing the question: is it possible to have a universal pointer to a vector?
You are trying to create a reference to one of the vectors - and that's certainly possible, but it must be initialized to reference it. You can't defer it.
It's unclear what you want to happen if no match is found in stringValue so I've chosen to throw an exception.
Regarding "now project has only two of them, but the vectors count might be increased":
Create a vector with a mapping between strings that you would like to try to find in stringValue and then the vector you'd like to create a reference to.
When initializing pDataVector, you can call a functor, like a lambda, that returns the reference.
In the functor, loop over the vector holding the strings you'd like to try to find, and return the referenced vector on the first match you get.
It could look like this:
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
int main() {
using vpstype = std::vector<std::pair<int, std::string>>;
vpstype firstDataVector{{1, "Hello"}};
vpstype secondDataVector{{2, "World"}};
// A vector of the condition strings you want to check keeping the order
// in which you want to check them.
std::vector<std::pair<std::string, std::reference_wrapper<vpstype>>>
conditions{
{"firstStringCondition", firstDataVector},
{"secondStringCondition", secondDataVector},
// add more mappings here
};
// an example stringValue
std::string stringValue = "ssdfdfsdfsecondStringConditionsdfsfsdf";
// initialize the vpstype reference:
auto& pDataVector = [&]() -> vpstype& {
// loop over all the strings and referenced vpstypes:
for (auto& [cond, vps] : conditions) {
if (stringValue.find(cond) != std::string::npos) return vps;
}
throw std::runtime_error("stringValue doesn't match any condition string");
}();
// and use the result:
for (auto [i, s] : pDataVector) {
std::cout << i << ' ' << s << '\n'; // prints "2 World"
}
}
You can indeed initialize references conditionally. Either use a function or lambda that returns the vector you want to reference, or hard-code it as below (thirdDataVector is assumed here to be a fallback vector of the same type).
std::vector<std::pair<int, std::string>> &pDataVector =
(stringValue.find("firstStringCondition") != std::string::npos) ?
firstDataVector : ((stringValue.find("secondStringCondition") != std::string::npos) ?
secondDataVector : thirdDataVector);
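If the choice really has to be made after the declaration, a plain pointer can be declared first and assigned later, which a reference cannot. The following is a minimal sketch under that assumption; the alias DataVector and the helper select_vector are hypothetical names, and returning nullptr for "no match" is just one possible policy.

```cpp
#include <string>
#include <utility>
#include <vector>

using DataVector = std::vector<std::pair<int, std::string>>;

// Hypothetical helper: returns a pointer to the vector selected by
// `stringValue`, or nullptr when no condition string is found.
DataVector* select_vector(const std::string& stringValue,
                          DataVector& first, DataVector& second) {
    DataVector* selected = nullptr;  // declared first, assigned later
    if (stringValue.find("firstStringCondition") != std::string::npos) {
        selected = &first;
    } else if (stringValue.find("secondStringCondition") != std::string::npos) {
        selected = &second;
    }
    return selected;
}
```

The caller has to check the result against nullptr before dereferencing it, which is the price of deferring the choice.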

Can std::span iterators outlive the span object they are created from?

Put the other way around: are std::span iterators invalidated after the span instance is destroyed?
I have a vector I need to iterate over with different layouts. I'm trying to make use of std::span to avoid writing a lot of iterator boilerplate or bringing in an external library dependency. Simplified example:
#include <iostream>
#include <span>
#include <vector>
template <size_t N>
struct my_view {
std::vector<int> vec;
auto as_span() {
return std::span<int[N]>((int(*)[N])vec.data(), vec.size() / N);
}
auto begin() {
return as_span().begin();
}
auto end() {
return as_span().end();
}
};
int main() {
std::vector vec {1, 2, 3, 4, 5, 6};
my_view<2> pairs {std::move(vec)};
for (auto pair : pairs) {
std::cout << pair[0] << " " << pair[1] << std::endl;
}
my_view<3> triplets {std::move(pairs.vec)};
for (auto triplet : triplets) {
std::cout << triplet[0] << " " << triplet[1] << " " << triplet[2] << std::endl;
}
return 0;
}
https://godbolt.org/z/n1djETane
I appreciate that this question asks both the positive and negative version of the same question consecutively. Thus...
Can std::span iterators outlive the span object they are created from?
Yes.
Are std::span iterators invalidated after the span instance is destroyed?
No.
This is why, in [span.syn], you see:
template<class ElementType, size_t Extent>
inline constexpr bool ranges::enable_borrowed_range<span<ElementType, Extent>> = true;
Where, from [range.range]/5:
Given an expression E such that decltype((E)) is T, T models borrowed_range only if the validity of iterators obtained from the object denoted by E is not tied to the lifetime of that object.
[Note 2: Since the validity of iterators is not tied to the lifetime of an object whose type models borrowed_range, a function can accept arguments of such a type by value and return iterators obtained from it without danger of dangling. — end note]
Were a span's iterators to be tied to the lifetime of the span, it would be invalid to do this -- span would have to be non-borrowed (as, e.g., vector<T> clearly is not borrowed).
gsl::span's iterators used to keep a pointer to the span (which would obviously cause the iterators to be invalidated), but that was changed in Feb 2020 (I have not looked through the comments to find the discussion there, but that one matches the standard behavior).

A list iterator reference - Program output

Given the following algorithm:
#include <iostream>
#include <list>
int main ()
{
std::list<int> mylist = {5,10,15,20};
std::cout << "mylist contains:"<<"\n";
//for (auto it = mylist.begin(); it != mylist.end(); ++it)
auto it = mylist.begin();
while ( it != mylist.end())
{
std::cout << *it;
std::cout << " "<< &it<<std::endl;
it++;
}
std::cout << '\n';
return 0;
}
The output of this program is:
mylist contains:
5 0x7dc9445e5b50
10 0x7dc9445e5b50
15 0x7dc9445e5b50
20 0x7dc9445e5b50
Since we are moving our iterator, why the address doesn't change?
I was expecting a 4-byte offset of the reference.
Try this instead:
std::cout << " "<< &*it<<std::endl;
Read this for a description of legacy bidirectional iterators.
https://en.cppreference.com/w/cpp/named_req/BidirectionalIterator
It's not incredibly obvious from this documentation, but for this iterator, and for iterators in general throughout the STL, operator*() will return either:
T & operator *();
or
const T & operator *() const; // for const iterators (e.g. std::list<int>::const_iterator)
I hope this is a bit more explanatory than my previous rather terse answer.
Your problem is pretty simple: it is a variable. Right now, you're printing out the address of that variable. You assign different values to that variable, but the variable is stored at the same location in memory for the entirety of its existence, so when you print out its address, you see the same value every time.
What you want to do is print out the value stored in that variable--the address of the object to which the iterator refers. In the case of an iterator, it can be somewhat tricky to do that, because an iterator is an abstraction--by design, you don't really know what it contains.
We can, however, take a fairly solid guess about how an iterator for an std::list is probably implemented. In particular, it's probably something like this:
template <class T>
class iterator {
T *data;
public:
T &operator*() { return *data; }
// ... and many more functions we don't care about for now
};
The important point here is that although it may store different data to produce its result, operator* needs to return a reference to the object to which the iterator refers. So, to get the address that's (at least effectively) stored in the iterator, we can use operator* to get a reference to the object, then & to take the address of that object.
The downside: while this tells you the addresses of the objects the iterator refers to at different times, it doesn't necessarily tell you what the iterator itself really contains. As long as the iterator produces the correct results from the specified operations, the iterator is free to produce them in essentially any way it sees fit (but that said, with an iterator for an std::list, yes, it's probably going to use a pointer).
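The difference between &it and &*it can be sketched like this. The struct AddressReport and the function inspect are hypothetical names used only for illustration.

```cpp
#include <list>
#include <vector>

// Hypothetical report type: element addresses seen through &*it, plus a
// flag saying whether the iterator variable itself ever changed address.
struct AddressReport {
    std::vector<const int*> element_addresses;
    bool iterator_variable_moved = false;
};

AddressReport inspect(const std::list<int>& values) {
    AddressReport report;
    auto it = values.begin();
    const void* where_it_lives = &it;  // address of the variable `it`
    while (it != values.end()) {
        report.element_addresses.push_back(&*it);  // address of the element
        if (static_cast<const void*>(&it) != where_it_lives) {
            report.iterator_variable_moved = true;
        }
        ++it;
    }
    return report;
}
```

The element addresses differ from one step to the next, while the iterator variable stays where it is. Since std::list is node-based, the element addresses are not guaranteed to be a fixed 4 bytes apart.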

C++ undefined behavior on unordered_map as an rvalue in a range-based for loop

I hate to add to the myriad of questions about undefined behavior on stack overflow, but this problem has mystified me.
When using a range-based for loop on an unordered_map that is created inline, I get an unexpected result. When using the same loop on an unordered_map that was first assigned to a variable, I get the expected result.
I would expect both loops to print 1, but that is not what I observe.
Any help understanding what is happening would be appreciated. Thanks!
I'm running on Debian 10 with g++ 8.3.0
#include <algorithm>
#include <unordered_map>
#include <iostream>
#include <vector>
int main() {
for (const int i : std::unordered_map<int, std::vector<int>> {{0, std::vector<int> {1}}}.at(0)) {
std::cout << i << std::endl; //prints 0
}
std::unordered_map<int, std::vector<int>> map {
{0, std::vector<int> {1}}
};
for (const int i : map.at(0)) {
std::cout << i << std::endl; //prints 1
}
}
The problem lies in the fact that you create a temporary std::unordered_map and return a reference to one of its contents. Let's inspect two behaviours that occur here:
1. Range-based for expansion:
From the question you linked in the comments we can see that the following syntax:
for ( for-range-declaration : expression ) statement
translates directly to the following:
{
auto && __range = range-init;
for ( auto __begin = begin-expr,
__end = end-expr;
__begin != __end;
++__begin ) {
for-range-declaration = *__begin;
statement
}
}
The important part is to understand that whether you create a temporary or provide an lvalue, a reference to it will be created (auto&& __range). If we're dealing with lvalues, we encounter the results we expect. However, when range-init returns a temporary object, things get a little more interesting. We encounter lifetime extension.
2. Lifetime extension:
This is way simpler than it may seem. If you return a temporary object to initialize (bind) a reference to it, the lifetime of said object is extended to match the lifetime of said reference. That means that if range-init returns a temporary, saving a reference to it (__range) extends the lifetime of that temporary up to the last curly bracket of the code I copy-pasted above. That's the reason for those outermost brackets.
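A small sketch can make the rule visible. It assumes a hypothetical type Holder whose destructor writes to a global event log (events); neither name comes from the question.

```cpp
#include <string>
#include <vector>

std::vector<std::string> events;  // hypothetical event log

struct Holder {
    std::vector<int> data{1};
    ~Holder() { events.push_back("~Holder"); }
    const std::vector<int>& get() const { return data; }
};

void bind_directly() {
    // The temporary is bound directly to the reference: lifetime extended.
    [[maybe_unused]] const Holder& h = Holder{};
    events.push_back("using h");
}   // ~Holder runs here, when `h` goes out of scope

void bind_through_member() {
    // The reference binds to what get() returns, not to the temporary,
    // so the temporary Holder dies at the ';' and `range` dangles.
    [[maybe_unused]] auto&& range = Holder{}.get();
    events.push_back("after init");
}
```

After bind_directly() the log reads "using h", then "~Holder". After bind_through_member() the order is reversed: the destructor has already run before the next statement executes.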
Your case:
In your case, we have quite a tricky situation. Inspecting your loop:
for (const int i : std::unordered_map<int, std::vector<int>> {{0, std::vector<int> {1}}}.at(0)) {
std::cout << i << std::endl;
}
We must acknowledge two things:
You create a temporary std::unordered_map.
range-init is not referring to your std::unordered_map. It refers to what .at(0) returns - a value from that map.
This results in the following consequence - the lifetime of your map is not extended. That means that its destructor will be called at the end of the full expression (at the ; from the auto && __range = range-init;). When std::unordered_map::~unordered_map is called, it calls destructors of everything it manages - e.g. its keys and values. In other words, that destructor call will call the destructor of the vector you obtained a reference to via the at(0) call. Your __range is now a dangling reference.
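Two ways to avoid the dangling reference are sketched below, assuming hypothetical helpers make_map, values_at and sum_at: either copy the vector out while the temporary map is still alive, or give the map a name so it lives for the whole loop.

```cpp
#include <unordered_map>
#include <vector>

using Map = std::unordered_map<int, std::vector<int>>;

// Hypothetical helper standing in for the inline map construction.
Map make_map() { return Map{{0, std::vector<int>{1}}}; }

// Copies the vector out while the temporary map is still alive; the map
// is destroyed only after the return value has been initialized.
std::vector<int> values_at(int key) {
    return make_map().at(key);
}

// Names the map so that it outlives the loop over one of its values.
int sum_at(int key) {
    int sum = 0;
    const Map map = make_map();
    for (const int i : map.at(key)) {
        sum += i;
    }
    return sum;
}
```

Writing for (const int i : values_at(0)) is safe, because __range then binds directly to the returned temporary vector and lifetime extension applies to it.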

Is it possible to iterate over an iterator?

I have a working program that capitalizes strings in a vector, using iterators:
vector<string> v7{ 10, "apples" };
for (auto vIterator= v7.begin(); vIterator!= v7.end(); ++vIterator){
auto word = *vIterator; //here
auto charIterator = word.begin();
*charIterator = toupper(*charIterator);
*vIterator = word; //also here, i guess i could just print `word` instead?
cout << *vIterator << endl;
}
My question is;
On the 2nd line inside the loop (at the comment), I had to copy the string the iterator points to into another string variable before I was able to iterate over it.
Iterating over the pointer like so
*vIterator.begin();
didn't seem to work.
Is this the correct practice, or am i missing something?
I'm new to the C languages, the concept behind pointer-like tools is quite hard to understand even if i can use them, and in this case it just feels like I'm doing it wrong.
Edit: It was a syntax error (*vIterator).begin();
It just didn't make sense why i'd have to save it to another variable before iterating over it, cheers.
Since you are using C++11, look how much simpler your code becomes with a range-based for loop, as in the example below:
std::vector<std::string> v(10, "apples");
for(auto &&word : v) {
word[0] = toupper(word[0]);
}
Now, as for why *vIterator.begin(); didn't work:
The dot operator (i.e., .) has a higher precedence than the dereference operator (i.e., *). Thus, *vIterator.begin() is interpreted as *(vIterator.begin()). The compiler rightfully complains because vIterator hasn't got a member begin().
Think of iterators as if they were pointers. The correct way to access the members of an object via a pointer/iterator pointing to it is either using the arrow operator (i.e., vIterator->begin()) or first dereference the pointer/iterator and then use the dot operator (i.e., (*vIterator).begin()).
So your code via the use of iterators would become:
std::vector<std::string> v(10, "apples");
for(auto it(v.begin()), ite(v.end()); it != ite; ++it) {
*(it->begin()) = toupper(*(it->begin()));
}
The correct way to write *vIterator.begin(); is (*vIterator).begin(); or, more often, vIterator->begin();. Also note that you can also access the first character of a string directly (without having to iterate over it) as word[0].
A simple STL-ish way of doing it:
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main()
{
vector<string> v7{ 10, "apples" };
for_each(v7.begin(), v7.end(), [](string& word){word[0] = toupper(word[0]);});
}
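One detail the snippets above gloss over: passing a plain char with a negative value to toupper is undefined behavior, so the usual advice is to convert to unsigned char first. A minimal sketch, assuming a hypothetical helper named capitalize_first:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: upper-cases the first character of each word in place.
void capitalize_first(std::vector<std::string>& words) {
    for (std::string& word : words) {
        if (word.empty()) {
            continue;  // nothing to capitalize
        }
        // Convert to unsigned char before calling std::toupper, then back.
        word[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[0])));
    }
}
```

For plain ASCII input such as "apples" this behaves the same as the versions above; the casts only matter for characters outside the basic range.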