My question is pretty simple. I have a vector of values (threads here, irrelevant) and I want to iterate through them. However, there are two versions of the code that look the same to me, but only the second one works. I want to know why.
Version 1 (Does not compile)
int main(){
int someValue = 5;
vector<std::thread *> threadVector;
threadVector.resize(20);
for (int i = 0; i < 20; i++) {
threadVector[i] = new std::thread(foo, std::ref(someValue));
}
for (std::vector<std::thread *>::iterator it = threadVector.begin(); it != threadVector.end(); ++it) {
*it->join(); // *********Notice this Line*********
}
system("pause"); // I know I shouldn't be using this
}
Version 2 (Does work)
int main(){
int someValue = 5;
vector<std::thread *> threadVector;
threadVector.resize(20);
for (int i = 0; i < 20; i++) {
threadVector[i] = new std::thread(foo, std::ref(someValue));
}
for (std::vector<std::thread *>::iterator it = threadVector.begin(); it != threadVector.end(); ++it) {
(*it)->join(); // *********Notice this Line*********
}
system("pause"); // I know I shouldn't be using this
}
This is an issue with operator precedence: -> binds more tightly than the unary dereference operator *, so
*it->join();
is parsed as:
*(it->join());
Since std::thread::join() returns void, there is nothing to dereference, hence the compile error. Parenthesizing as (*it)->join() dereferences the iterator first, as intended.
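As a side note, a range-based for loop (C++11) sidesteps the precedence pitfall entirely; a minimal sketch against the question's vector:
for (std::thread *t : threadVector) {
    t->join();  // no iterator to dereference, so no parentheses needed
    delete t;   // the original code leaks these; freeing here is optional cleanup
}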
Taking it as a challenge, I just dipped my toes into C++11 for the first time. I found out that you can achieve the same thing without any dynamic allocation of std::thread objects:
#include <iostream>
#include <thread>
#include <vector>
void function()
{
std::cout << "thread function\n";
}
int main()
{
std::vector<std::thread> ths;
ths.push_back(std::thread(&function)); // a temporary is already an rvalue, so std::move is redundant
ths.push_back(std::thread(&function));
ths.push_back(std::thread(&function));
while (!ths.empty()) {
std::thread th = std::move(ths.back());
ths.pop_back();
th.join();
}
}
This works because std::thread has a constructor and an assignment operator taking an rvalue reference, making it movable. Further, all containers have gained support for storing movable objects, and they move instead of copy them on reallocation. Read some online articles about this new C++11 feature; it's too broad to explain here, and I also don't know it well enough.
About the concern you raised that threads have a memory cost: I don't think your approach is an optimization. Rather, the dynamic allocation itself has overhead, both in memory and performance. For small objects, the memory overhead of one or two pointers, plus possibly some padding, is enormous. I wouldn't be surprised if a std::thread object were only the size of a single pointer, giving you an overhead of more than 100%.
Note that this only concerns the std::thread object. The memory required for the actual thread, in particular its stack, is a different issue. However, std::thread objects and the actual threads don't have a 1:1 relation, and dynamic allocation of the std::thread object doesn't change anything there either.
If you're still afraid that the reallocations are too expensive, you could reserve a suitable size up front to avoid reallocations. However, if that really is an issue, then you are creating and terminating threads far too often, and that will by far dwarf the overhead of shifting a few small objects around. Consider using a thread pool.
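For illustration, a minimal sketch of reserving up front (N is a hypothetical, known upper bound):
#include <cstddef>
#include <thread>
#include <vector>

void function(); // the thread function defined earlier in this answer

void run_all(std::size_t N)
{
    std::vector<std::thread> ths;
    ths.reserve(N);                  // one allocation; no reallocations or moves later
    for (std::size_t i = 0; i < N; ++i)
        ths.emplace_back(&function); // constructs each std::thread in place
    for (std::thread &th : ths)
        th.join();
}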
Related
My application consists of calling dozens of functions millions of times. In each of those functions, one or a few temporary std::vector containers of POD (plain old data) types are initialized, used, and then destructed. By profiling my code, I find that the allocations and deallocations cause huge overhead.
A lazy solution is to rewrite all the functions as functors containing those temporary buffer containers as class members. However this would blow up the memory consumption as the functions are many and the buffer sizes are not trivial.
A better way is to analyze the code, gather all the buffers, premeditate how to maximally reuse them, and feed a minimal set of shared buffer containers to the functions as arguments. But this can be too much work.
I want to solve this problem once and for all for my future development, whenever temporary POD buffers become necessary, without much premeditation. My idea is to implement a container port, and take a reference to it as an argument for every function that may need temporary buffers. Inside those functions, one should be able to fetch containers of any POD type from the port, and the port should also auto-recall the containers before the functions return.
#include <algorithm> // std::min
#include <cstring>   // std::memcpy
#include <cstddef>   // std::size_t
#include <vector>
// Port of vectors of POD types.
struct PODvectorPort
{
std::size_t Nlent; // Number of dispatched containers.
std::vector<std::vector<std::size_t> > X; // Container pool.
PODvectorPort() { Nlent = 0; }
};
// Functor that manages the port.
struct PODvectorPortOffice
{
std::size_t initialNlent; // Number of already-dispatched containers
// when the office is set up.
PODvectorPort *p; // Pointer to the port.
PODvectorPortOffice(PODvectorPort &port)
{
p = &port;
initialNlent = p->Nlent;
}
template<typename X, typename Y>
std::vector<X> & repaint(std::vector<Y> &y) // Repaint the container.
{
// return *((std::vector<X>*)(&y)); // UB although works
std::vector<X> *rst = nullptr;
std::vector<Y> *yp = &y; // copy the pointer value itself, not the vector's bytes
std::memcpy(&rst, &yp, std::min(
sizeof(std::vector<X>*), sizeof(std::vector<Y>*)));
return *rst; // guess it makes no difference. Should still be UB.
}
template<typename T>
std::vector<T> & lend()
{
++p->Nlent;
// Ensure sufficient container pool size:
while (p->X.size() < p->Nlent) p->X.push_back( std::vector<size_t>(0) );
return repaint<T, std::size_t>( p->X[p->Nlent - 1] );
}
void recall() { p->Nlent = initialNlent; }
~PODvectorPortOffice() { recall(); }
};
struct ArbitraryPODstruct
{
char a[11]; short b[7]; int c[5]; float d[3]; double e[2];
};
// Example f1():
// f2(), f3(), ..., f50() are similarly defined.
// All functions are called a few million times in certain
// order in main().
// port is defined in main().
void f1(other arguments..., PODvectorPort &port)
{
PODvectorPortOffice portOffice(port);
// Oh, I need a buffer of chars:
std::vector<char> &tmpchar = portOffice.lend<char>();
tmpchar.resize(789); // Trivial if container already has sufficient capacity.
// ... do things
// Oh, I need a buffer of shorts:
std::vector<short> &tmpshort = portOffice.lend<short>();
tmpshort.resize(456); // Trivial if container already has sufficient capacity.
// ... do things.
// Oh, I need a buffer of ArbitraryPODstruct:
std::vector<ArbitraryPODstruct> &tmpArb = portOffice.lend<ArbitraryPODstruct>();
tmpArb.resize(123); // Trivial if container already has sufficient capacity.
// ... do things.
// Oh, I need a buffer of integers, but also tmpArb is no longer
// needed. Why waste it? Cache hot.
std::vector<int> &tmpint = portOffice.repaint<int>(tmpArb);
tmpint.resize(300); // Trivial.
// ... do things.
}
Although the code is compilable with both gcc-8.3 and MSVS 2019 at -O2 through -Ofast, and passes extensive tests for all those options, I expect criticism due to the hacky nature of PODvectorPortOffice::repaint(), which "casts" the vector type in place.
A set of sufficient but not necessary conditions for the correctness and efficiency of the above code is:
std::vector<T> stores exactly 3 pointers: to the underlying buffer's &[0], to &[0] + .size(), and to &[0] + .capacity().
std::vector<T>'s allocator calls malloc().
malloc() returns an 8-byte (or sizeof(std::size_t)) aligned address.
So, if this is unacceptable to you, what would be the modern, proper way of addressing my need? Is there a way of writing a manager that achieves what my code does, only without violating the Standard?
Thanks!
Edits: a little more context for my problem. Those functions mainly compute some simple statistics of the inputs. The inputs are data streams of financial parameters of different types and sizes. To compute the statistics, those data need to be altered and re-arranged first, hence the buffers for temporary copies. Computing the statistics is cheap, so the allocations and deallocations can become relatively expensive. Why do I want a manager for arbitrary POD types? Because 2 weeks from now I may start receiving a data stream of a different type, which could be a bunch of primitive types zipped into a struct, or a struct of the composite types encountered so far. I, of course, would like the upstream to just send separate flows of primitive types, but I have no control over that aspect.
More edits: after tons of reading and code experimenting regarding the strict aliasing rule, the answer should be: don't try anything I put up there --- it works, for now, but don't do it. Instead, I'll be diligent and stick to my previous code-as-you-go style: just add a vector<vector<myNewType> > to the port once a new type comes up, and manage it in a similar way. The accepted answer also offers a nice alternative.
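For concreteness, a hedged sketch of that code-as-you-go style (all names hypothetical); each POD type gets its own pool, so no repainting and no aliasing questions arise:
#include <cstddef>
#include <vector>

struct TypedPort {
    std::vector<std::vector<int>>  intPool;  // one pool per POD type
    std::vector<std::vector<char>> charPool;
    // std::vector<std::vector<myNewType>> myNewTypePool; // added when a new type shows up
    std::size_t intLent = 0;                 // dispatch counter, one per pool
};

std::vector<int> & lendInt(TypedPort &p)     // a lend() written by hand per type
{
    if (p.intPool.size() <= p.intLent)
        p.intPool.emplace_back();            // grow the pool; capacity survives recall
    return p.intPool[p.intLent++];
}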
Even more edits: conceived a stronger class that has a better chance of thwarting potential optimizations under the strict aliasing rule. DO NOT USE IT WITHOUT TESTING AND A THOROUGH UNDERSTANDING OF THE STRICT ALIASING RULE.
// -std=c++17
#include <cstring>
#include <cstddef>
#include <iostream>
#include <vector>
#include <chrono>
#include <type_traits> // std::is_same
// POD: plain old data.
// Idea: design a class that can let you maximally reuse temporary
// containers during a program.
// Port of vectors of POD types.
template <std::size_t portsize = 42>
class PODvectorPort
{
static constexpr std::size_t Xsize = portsize;
std::size_t signature;
std::size_t Nlent; // Number of dispatched containers.
std::vector<std::size_t> X[portsize]; // Container pool.
PODvectorPort(const PODvectorPort &);
PODvectorPort & operator=( const PODvectorPort& );
public:
std::size_t Ndispatched() { return Nlent; }
std::size_t showSignature() { return signature; }
PODvectorPort() // Permuted random number generator.
{
std::size_t state = std::chrono::high_resolution_clock::now().time_since_epoch().count();
state ^= (std::size_t)(&std::memmove);
signature = ((state >> 18) ^ state) >> 27;
std::size_t rot = state >> 59;
signature = (signature >> rot) | (state << ((-rot) & 31));
Nlent = 0;
}
template<typename podvecport>
friend class PODvectorPortOffice;
};
// Functor that manages the port.
template<typename podvecport>
class PODvectorPortOffice
{
// Number of already-dispatched containers when the office is set up.
std::size_t initialNlent;
podvecport *p; // Pointer to the port.
PODvectorPortOffice( const PODvectorPortOffice& ); // non construction-copyable
PODvectorPortOffice& operator=( const PODvectorPortOffice& ); // non copyable
constexpr void check()
{
while (__cplusplus < 201703)
{
std::cerr << "PODvectorPortOffice: C++ < 17, Stall." << std::endl;
}
// Check if allocation will be 8-byte (or more) aligned.
// Intend it not to work on machine < 64-bit.
constexpr std::size_t aln = alignof(std::max_align_t);
while (aln < 8)
{
std::cerr << "PODvectorPortOffice: Allocation is not at least 8-byte aligned, Stall." <<
std::endl;
}
while ((aln & (aln - 1)) != 0)
{
std::cerr << "PODvectorPortOffice: Alignment is not a power of 2 bytes. Stall." << std::endl;
}
// Random checks to see if sizeof(vector<S>) != sizeof(vector<T>).
if(true)
{
std::size_t vecHeadSize[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
vecHeadSize[0] = sizeof(std::vector<char>(0));
vecHeadSize[1] = sizeof(std::vector<short>(1));
vecHeadSize[2] = sizeof(std::vector<int>(2));
vecHeadSize[3] = sizeof(std::vector<long>(3));
vecHeadSize[4] = sizeof(std::vector<std::size_t>(5));
vecHeadSize[5] = sizeof(std::vector<float>(7));
vecHeadSize[6] = sizeof(std::vector<double>(11));
vecHeadSize[7] = sizeof(std::vector<std::vector<char> >(13));
vecHeadSize[8] = sizeof(std::vector<std::vector<int> >(17));
vecHeadSize[9] = sizeof(std::vector<std::vector<double> >(19));
struct tmpclass1 { char a; short b; };
struct tmpclass2 { char a; float b; };
struct tmpclass3 { char a; double b; };
struct tmpclass4 { int a; char b; };
struct tmpclass5 { double a; char b; };
struct tmpclass6 { double a[5]; char b[3]; short c[3]; };
vecHeadSize[10] = sizeof(std::vector<tmpclass1>(23));
vecHeadSize[11] = sizeof(std::vector<tmpclass2>(29));
vecHeadSize[12] = sizeof(std::vector<tmpclass3>(31));
vecHeadSize[13] = sizeof(std::vector<tmpclass4>(37));
vecHeadSize[14] = sizeof(std::vector<tmpclass5>(41));
vecHeadSize[15] = sizeof(std::vector<tmpclass6>(43));
std::size_t notSame = 0;
for(int i = 0; i < 16; ++i)
notSame += vecHeadSize[i] != sizeof(std::size_t) * 3;
while (notSame)
{
std::cerr << "sizeof(std::vector<S>) != sizeof(std::vector<T>), \
PODvectorPortOffice cannot handle. Stall." << std::endl;
}
}
}
void recall() { p->Nlent = initialNlent; }
public:
PODvectorPortOffice(podvecport &port)
{
check();
p = &port;
initialNlent = p->Nlent;
}
template<typename X, typename Y>
std::vector<X> & repaint(std::vector<Y> &y) // Repaint the container.
// AFTER A VECTOR IS REPAINTED, DO NOT USE THE OLD VECTOR AGAIN !!
{
while (std::is_same<bool, X>::value)
{
std::cerr << "PODvectorPortOffice: Cannot repaint the vector to \
std::vector<bool>. Stall." << std::endl;
}
std::vector<X> *x;
std::vector<Y> *yp = &y;
std::memcpy(&x, &yp, sizeof(x));
return *x; // Not compliant with strict aliasing rule.
}
template<typename T>
std::vector<T> & lend()
{
while (p->Nlent >= p->Xsize)
{
std::cerr << "PODvectorPortOffice: No more containers. Stall." << std::endl;
}
++p->Nlent;
return repaint<T, std::size_t>( p->X[p->Nlent - 1] );
}
~PODvectorPortOffice()
{
// Because p->signature can only be known at runtime, an aggressive,
// compliant compiler (ACC) will never remove this
// branch. Volatile might do, but trustworthiness?
if(p->signature == 0)
{
constexpr std::size_t sizeofvec = sizeof(std::vector<std::size_t>);
char dummy[sizeofvec * podvecport::Xsize];
std::memcpy(dummy, p->X, p->Nlent * sizeofvec);
std::size_t ticketNum = 0;
char *xp = (char*)(p->X);
for(int i = 0, iend = p->Nlent * sizeofvec; i < iend; ++i)
{
xp[i] &= xp[iend - i - 1] * 5;
ticketNum += xp[i] ^ ticketNum;
}
std::cerr << "Congratulations! After the port office was decommissioned, \
you found a winning lottery ticket. The odds is less than 2.33e-10. Your \
ticket number is " << ticketNum << std::endl;
std::memcpy(p->X, dummy, p->Nlent * sizeofvec);
// According to the strict aliasing rule, a char* can point to any memory
// block pointed by another pointer of any type T*. Thus given an ACC,
// the writes to that block via the char* must be fully acknowledged in
// time by T*, namely, for reading contents from T*, a reload instruction
// will be kept in the assembly code to achieve a sort of
// "register-cache-memory coherence" (RCMC).
// We also do not care about the renters' (who received the reference via
// .lend()) RCMC, because PODvectorPortOffice never accesses the contents
// of those containers.
}
recall();
}
};
Any adversarial test case to break it, especially on GCC >= 8.3 or MSVS >= 2019, is welcome!
Let me frame this by saying I don't think there's an "authoritative" answer to this question. That said, you've provided enough constraints that a suggested path is at least worthwhile. Let's review the requirements:
Solution must use std::vector. This is in my opinion the most unfortunate requirement for reasons I won't get into here.
Solution must be standards compliant and not resort to rule violations, like the strict aliasing rule.
Solution must either reduce the number of allocations performed, or reduce the overhead of allocations to the point of being negligible.
In my opinion this is definitely a job for a custom allocator. There are a couple of off-the-shelf options that come close to doing what you want, for example the Boost Pool Allocators. The one you're most interested in is boost::pool_allocator. This allocator will create a singleton "pool" for each distinct object size (note: not object type), which grows as needed, but never shrinks until you explicitly purge it.
The main difference between this and your solution is that you'll have distinct pools of memory for objects of different sizes, which means it will use more memory than your posted solution, but in my opinion this is a reasonable trade-off. To be maximally efficient, you could simply start a batch of operations by creating vectors of each needed type with an appropriate size. All subsequent vector operations which use these allocators will do trivial O(1) allocations and deallocations. Roughly in pseudo-code:
// be careful with this; you probably want [[nodiscard]]. This code
// is just rough guidance:
void force_pool_sizes(void)
{
std::vector<int, boost::pool_allocator<int>> size_int_vect;
std::vector<SomePodSize16, boost::pool_allocator<SomePodSize16>> size_16_vect;
...
size_int_vect.resize(100); // probably makes malloc calls
size_16_vect.resize(200); // probably makes malloc calls
...
// on return, objects go out of scope, but singleton pools
// with allocated blocks of memory remain for future use
// until explicitly purged.
}
void expensive_long_running(void)
{
force_pool_sizes();
std::vector<int, boost::pool_allocator<int>> data1;
... do stuff, malloc/free will never be called...
std::vector<SomePodSize16, boost::pool_allocator<SomePodSize16>> data2;
... do stuff, malloc/free will never be called...
// free everything:
boost::singleton_pool<boost::pool_allocator_tag, sizeof(int)>::release_memory();
}
If you want to take this a step further toward being memory efficient, and you know for a fact that certain pool sizes are mutually exclusive, you could modify the boost pool_allocator to use a slightly different singleton backing store which allows you to move a memory block from one block size to another. This is probably out of scope for now, but the boost code itself is straightforward enough that, if memory efficiency is critical, it's probably worthwhile.
It's worth pointing out that there's probably some confusion about the strict aliasing rule, especially when it comes to implementing your own memory allocators. There are lots and lots of SO questions about strict aliasing and what it does and doesn't mean. This one is a good place to start.
The key takeaway is that it's perfectly ordinary and acceptable in low level C++ code to take an array of memory and cast it to some object type. If this were not the case, std::allocator wouldn't exist. You also wouldn't have much use for things like std::aligned_storage. Look at the example use case for std::aligned_storage on cppreference. An STL-like static_vector class is created which keeps an array of aligned_storage objects that get recast to a concrete type. Nothing about this is "unacceptable" or "illegal", but it does require some additional knowledge and care in handling.
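To make that concrete, here's a hedged sketch of the pattern (not the OP's port, just the cppreference-style idea of recasting raw aligned storage; assumes C++17 for std::launder):
#include <cstddef>
#include <new>

template <class T, std::size_t N>
class static_vector_sketch {
    alignas(T) unsigned char storage[N * sizeof(T)]; // raw bytes, aligned for T
    std::size_t count = 0;
public:
    template <class... Args>
    T & emplace_back(Args&&... args) {
        // placement-new creates a T inside the raw storage -- the sanctioned
        // way to treat a chunk of memory as an object of another type
        T *p = ::new (static_cast<void*>(storage + count * sizeof(T)))
                   T(static_cast<Args&&>(args)...);
        ++count;
        return *p;
    }
    ~static_vector_sketch() {
        while (count > 0) {
            --count;
            std::launder(reinterpret_cast<T*>(storage + count * sizeof(T)))->~T();
        }
    }
};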
The reason your solution is especially going to enrage the code lawyers is that you're taking pointers of one non-char object type and casting them to different non-char object types. This is a particularly offensive violation of the strict aliasing rule, but also not really necessary given some of your other options.
Also keep in mind that it's not an error to alias memory, it's a warning. I'm not saying go crazy with aliasing, but I am saying that as with all things C and C++, there are justifiable cases to break rules, when you have very thorough knowledge and understanding of both your compiler and the machine you're running on. Just be prepared for some very long and painful debug sessions if it turns out you didn't in fact know those two things as well as you thought you did.
Can I access (without locking) a std::map entry while another thread inserts/erases entries?
example pseudo C++:
typedef struct {
int value;
int stuff;
}some_type_t;
std::map<char,some_type_t*> my_map;
//thread 1 does:
my_map.at('a')->value = 1;
//thread 2 does:
some_type_t* stuff = my_map.at('b');
//thread 3 does:
my_map.erase('c');
//I'm not modifying any elements; T is a pointer to a previously allocated "some_type_t"
Standard C++11 says that all const members should be thread safe (and erase() is not const).
No. And yes.
Two or more threads can perform const operations on a map, where a few non-const operations also count as const (operations that return non-const iterators, and hence are not const, like begin and similar).
You can also modify the non-key component of an element of a map or set while other operations run, so long as those operations do not read/write/destroy that element's non-key component (which is most of them, except stuff like erase or =).
You cannot erase or insert or perform other similar non-const map operations while doing anything with the map, even const and similar operations (like find or at). Note that [] counts as one of the modifying operations, because it inserts the element if it is absent.
The standard has an explicit list of non-const operations that count as const for the purposes of thread safety -- but they are basically lookups that return an iterator or a &.
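So if lookups must run while another thread erases or inserts, you need external synchronization; a minimal sketch (assuming C++17's std::shared_mutex):
#include <map>
#include <mutex>
#include <shared_mutex>

std::map<char, int> m;
std::shared_mutex mx;

int lookup(char k) {
    std::shared_lock<std::shared_mutex> lock(mx); // many readers can hold this at once
    return m.at(k);
}

void remove(char k) {
    std::unique_lock<std::shared_mutex> lock(mx); // writers get exclusive access
    m.erase(k);
}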
No no no no no!
map::erase modifies the links between the map entries, and that interferes with the search algorithm used by map::at. You can use the elements during an erase, but you cannot execute the searching algorithm itself!
I've written an illustration program. On my machine this program sometimes prints OK and sometimes throws a map exception -- meaning the search went wrong. Increasing the number of reading threads makes the exception appear more often.
#include <iostream>
#include <thread>
#include <map>
#include <cassert>
std::map<int, int> m;
// this thread uses map::at to access elements
void foo()
{
for (int i = 0; i < 1000000; ++i) {
int lindex = 10000 + i % 1000;
int l = m.at(lindex);
assert(l == lindex);
int hindex = 90000 + i % 1000;
int h = m.at(hindex);
assert(h == hindex);
}
std::cout << "OK" << std::endl;
}
// this thread uses map::erase to delete elements
void bar()
{
for (int i = 20000; i < 80000; ++i) {
m.erase(i);
}
}
int main()
{
for (int i = 0; i < 100000; ++i) {
m[i] = i;
}
std::thread read1(foo);
std::thread read2(foo);
std::thread erase(bar);
read1.join();
read2.join();
erase.join();
return 0;
}
No. std::map is not thread safe. Intel's Threading Building Blocks (TBB) library has some concurrent containers. Check out TBB.
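A rough sketch of the TBB route (assuming tbb::concurrent_hash_map; the accessor objects act as per-element locks):
#include <tbb/concurrent_hash_map.h>

typedef tbb::concurrent_hash_map<char, int> Table;
Table table;

void writer() {
    Table::accessor a;          // exclusive lock on the element while 'a' lives
    table.insert(a, 'a');
    a->second = 1;
}

void reader() {
    Table::const_accessor ca;   // shared lock for reading
    if (table.find(ca, 'a')) {
        int v = ca->second;
        (void)v;
    }
}

void eraser() {
    table.erase('c');           // safe to run concurrently with the above
}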
With the code below, the question is:
If you use the "returnIntVector()" function, is the vector copied from the local to the "outer" (global) scope? In other words is it a more time and memory consuming variation compared to the "getIntVector()"-function? (However providing the same functionality.)
#include <iostream>
#include <vector>
using namespace std;
vector<int> returnIntVector()
{
vector<int> vecInts(10);
for(unsigned int ui = 0; ui < vecInts.size(); ui++)
vecInts[ui] = ui;
return vecInts;
}
void getIntVector(vector<int> &vecInts)
{
for(unsigned int ui = 0; ui < vecInts.size(); ui++)
vecInts[ui] = ui;
}
int main()
{
vector<int> vecInts = returnIntVector();
for(unsigned int ui = 0; ui < vecInts.size(); ui++)
cout << vecInts[ui] << endl;
cout << endl;
vector<int> vecInts2(10);
getIntVector(vecInts2);
for(unsigned int ui = 0; ui < vecInts2.size(); ui++)
cout << vecInts2[ui] << endl;
return 0;
}
In theory, yes it's copied. In reality, no, most modern compilers take advantage of return value optimization.
So write code that is semantically correct. If you want a function that modifies or inspects a value, take it in by reference. Your code does not do that: it creates a new value that doesn't depend on anything else, so return by value.
Use the first form: the one that returns the vector. A good compiler will most likely optimize it; the optimization is popularly known as Return Value Optimization, or RVO for short.
Others have already pointed out that with a decent (not great, merely decent) compiler, the two will normally end up producing identical code, so the two give equivalent performance.
I think it's worth mentioning one or two other points, though. First, returning the object does officially copy the object; even if the compiler optimizes the code so that the copy never takes place, it still won't (or at least shouldn't) compile if the copy ctor for that class isn't accessible. std::vector certainly supports copying, but it's entirely possible to create a class that you'd be able to modify as in getIntVector, but not return as in returnIntVector.
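A hedged illustration of that first point (hypothetical type; note the rule changed with C++17's guaranteed copy elision for returned prvalues):
struct NoCopy {
    NoCopy() = default;
    NoCopy(const NoCopy &) = delete; // copying disabled (and no implicit move either)
};

NoCopy make() {
    return NoCopy{}; // pre-C++17: error, the copy ctor must be accessible even if elided
                     // C++17 and later: fine, the "copy" is guaranteed to be elided
}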
Second, and substantially more importantly, I'd generally advise against using either of these. Instead of passing or returning a (reference to) a vector, you should normally work with an iterator (or two). In this case, you have a couple of perfectly reasonable choices -- you could use either a special iterator, or create a small algorithm. The iterator version would look something like this:
#ifndef GEN_SEQ_INCLUDED_
#define GEN_SEQ_INCLUDED_
#include <iterator>
template <class T>
class sequence : public std::iterator<std::forward_iterator_tag, T>
{
T val;
public:
sequence(T init) : val(init) {}
T operator *() const { return val; }
sequence &operator++() { ++val; return *this; }
bool operator!=(sequence const &other) const { return val != other.val; }
};
template <class T>
sequence<T> gen_seq(T const &val) {
return sequence<T>(val);
}
#endif
You'd use this something like this:
#include "gen_seq"
std::vector<int> vecInts(gen_seq(0), gen_seq(10));
Although it's open to argument that this (sort of) abuses the concept of iterators a bit, I still find it preferable on practical grounds -- it lets you create an initialized vector instead of creating an empty vector and then filling it later.
The algorithm alternative would look something like this:
template <class T, class OutIt>
void fill_seq_n(OutIt result, T num, T start = 0) {
for (T i = start; i != start + num; ++i) {
*result = i;
++result;
}
}
...and you'd use it something like this:
std::vector<int> vecInts;
fill_seq_n(std::back_inserter(vecInts), 10);
You can also use a function object with std::generate_n, but at least IMO, this generally ends up more trouble than it's worth.
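For completeness, a hedged sketch of the std::generate_n variant (with a C++11 lambda as the function object, it is at least compact):
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> make_ints()
{
    std::vector<int> vecInts;
    int next = 0; // the lambda captures and advances the running value
    std::generate_n(std::back_inserter(vecInts), 10, [&next] { return next++; });
    return vecInts;
}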
As long as we're talking about things like that, I'd also replace this:
for(unsigned int ui = 0; ui < vecInts2.size(); ui++)
cout << vecInts2[ui] << endl;
...with something like this:
std::copy(vecInts2.begin(), vecInts2.end(),
std::ostream_iterator<int>(std::cout, "\n"));
In the C++03 days, getIntVector() was recommended for most cases. With returnIntVector(), some unnecessary temporaries might be created.
But by using return value optimization and swaptimization, most of them can be avoided. In the era of C++11, the latter can be meaningful thanks to move semantics.
In theory, the returnIntVector function returns the vector by value, so a copy will be made and it will be more time-consuming than the function that just populates an existing vector. More memory will also be used to store the copy, but only temporarily; since vecInts is locally scoped, the vector object itself lives on the stack and is destroyed as soon as returnIntVector returns. However, as others have pointed out, a modern compiler will optimize away these inefficiencies.
returnIntVector is more time consuming because it returns a copy of the vector, unless the vector implementation is realized with a single pointer, in which case the performance is the same.
In general you should not rely on the implementation, and use getIntVector instead.
I have small but frequently used function objects. Each thread gets its own copy. Everything is allocated statically. Copies don't share any global or static data. Do I need to protect this objects from false sharing?
Thank you.
EDIT: Here is a toy program which uses Boost.Threads. Can false sharing occur for the field data?
#include <boost/thread/thread.hpp>
struct Work {
void operator()() {
++data;
}
int data;
};
int main() {
boost::thread_group threads;
for (int i = 0; i < 10; ++i)
threads.create_thread(Work());
threads.join_all();
}
False sharing between threads occurs when 2 or more threads use data in the same cache line, and at least one of them writes to it.
E.g. :
struct Work {
Work( int& d) : data( d ) {}
void operator()() {
++data;
}
int& data;
};
int main() {
const int CACHELINE_SIZE_INTS = 64 / sizeof(int); // assuming a 64-byte cache line
int false_sharing[10] = { 0 };
boost::thread_group threads;
for (int i = 0; i < 10; ++i)
threads.create_thread(Work(false_sharing[i]));
threads.join_all();
int no_false_sharing[10 * CACHELINE_SIZE_INTS] = { 0 };
for (int i = 0; i < 10; ++i)
threads.create_thread(Work(no_false_sharing[i * CACHELINE_SIZE_INTS]));
threads.join_all();
}
The threads in the first block do suffer from false sharing. The threads in the second block do not (thanks to the CACHELINE_SIZE_INTS padding).
Data on the stack is always 'far' away from other threads (e.g. under Windows, at least a couple of pages).
With your definition of the function object, false sharing can occur, because the instances of Work are created on the heap and that heap space is used inside the threads.
This may lead to several Work instances being adjacent, and so they may share cache lines.
But ... your sample does not make sense, because data is never touched from outside, so false sharing is induced needlessly.
The easiest way to prevent problems like this is to copy your 'shared' data locally onto the stack and then work on the stack copy. When your work is finished, copy it back to the output variable.
E.g:
struct Work {
Work( int& d) : data( d ) {}
void operator()()
{
int tmp = data;
for( int i = 0; i < lengthy_op; ++i )
++tmp;
data = tmp;
}
int& data;
};
This prevents all problems with sharing.
I did a fair bit of research, and it seems there is no silver-bullet solution to false sharing. Here is what I came up with (thanks to Christopher):
1) Pad your data on both sides with unused or less frequently used stuff.
2) Copy your data onto the stack, and copy it back after all the hard work is done.
3) Use cache-aligned memory allocation (see the sketch below).
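As a hedged sketch of point 3 (assuming C++11 alignas and a 64-byte cache line, which is typical but not universal):
#include <cstddef>

constexpr std::size_t kCacheLine = 64; // assumption: adjust for your hardware

struct alignas(kCacheLine) PaddedWork {
    int data = 0;
    // alignas rounds sizeof(PaddedWork) up to 64, so adjacent array elements
    // land in different cache lines.
};

PaddedWork work_items[10]; // work_items[i] for thread i: no false sharing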
I don't feel entirely sure about the details, but here's my take:
(1) Your simplified example is broken, since boost create_thread expects a reference and you pass a temporary.
(2) If you used vector<Work> with one item for each thread, or otherwise had them sequential in memory, false sharing would occur.
I need to share a stack of strings between processes (and possibly more complex objects in the future). I've decided to use boost::interprocess, but I can't get it to work; I'm sure it's because I'm not understanding something. I followed their example, but I would really appreciate it if someone with experience using that library could have a look at my code and tell me what's wrong. The problem is that it seems to work, but after a few iterations I get all kinds of exceptions, both in the reader process and sometimes in the writer process. Here's a simplified version of my implementation:
using namespace boost::interprocess;
class SharedMemoryWrapper
{
public:
SharedMemoryWrapper(const std::string & name, bool server) :
m_name(name),
m_server(server)
{
if (server)
{
named_mutex::remove("named_mutex");
shared_memory_object::remove(m_name.c_str());
m_segment = new managed_shared_memory (create_only,name.c_str(),65536);
m_stackAllocator = new StringStackAllocator(m_segment->get_segment_manager());
m_stack = m_segment->construct<StringStack>("MyStack")(*m_stackAllocator);
}
else
{
m_segment = new managed_shared_memory(open_only ,name.c_str());
m_stack = m_segment->find<StringStack>("MyStack").first;
}
m_mutex = new named_mutex(open_or_create, "named_mutex");
}
~SharedMemoryWrapper()
{
if (m_server)
{
named_mutex::remove("named_mutex");
m_segment->destroy<StringStack>("MyStack");
delete m_stackAllocator;
shared_memory_object::remove(m_name.c_str());
}
delete m_mutex;
delete m_segment;
}
void push(const std::string & in)
{
scoped_lock<named_mutex> lock(*m_mutex);
boost::interprocess::string inStr(in.c_str());
m_stack->push_back(inStr);
}
std::string pop()
{
scoped_lock<named_mutex> lock(*m_mutex);
std::string result = "";
if (m_stack->size() > 0)
{
result = std::string(m_stack->begin()->c_str());
m_stack->erase(m_stack->begin());
}
return result;
}
private:
typedef boost::interprocess::allocator<boost::interprocess::string, boost::interprocess::managed_shared_memory::segment_manager> StringStackAllocator;
typedef boost::interprocess::vector<boost::interprocess::string, StringStackAllocator> StringStack;
bool m_server;
std::string m_name;
boost::interprocess::managed_shared_memory * m_segment;
StringStackAllocator * m_stackAllocator;
StringStack * m_stack;
boost::interprocess::named_mutex * m_mutex;
};
EDIT Edited to use named_mutex. Original code was using interprocess_mutex which is incorrect, but that wasn't the problem.
EDIT2 I should also note that things work up to a point. The writer process can push several small strings (or one very large string) before the reader breaks. The reader breaks in a way that the line m_stack->begin() does not refer to a valid string. It's garbage. And then further execution throws an exception.
EDIT3 I have modified the class to use boost::interprocess::string rather than std::string. Still, the reader fails with an invalid memory address. Here are the reader and writer:
//reader process
SharedMemoryWrapper mem("MyMemory", true);
std::string myString;
int x = 5;
do
{
myString = mem.pop();
if (myString != "")
{
std::cout << myString << std::endl;
}
} while (1); //while (myString != "");
//writer
SharedMemoryWrapper mem("MyMemory", false);
for (int i = 0; i < 1000000000; i++)
{
std::stringstream ss;
ss << i; //causes failure after few thousand iterations
//ss << "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" << i; //causes immediate failure
mem.push(ss.str());
}
return 0;
Several things leaped out at me about your implementation. One was the use of a pointer to the named mutex object, whereas the documentation for most boost libraries tends to bend over backwards not to use pointers. This leads me to ask for a reference to the program snippet you worked from when building your own test case; I have had similar misadventures, and sometimes the only way out was to go back to the exemplar and work forward one step at a time until I came across the breaking change.
The other thing that seems questionable is your allocation of a 64 KB (65536-byte) block of shared memory, while your test code loops to 1000000000, pushing a string onto your stack on each iteration.
With a modern PC able to execute 1000 instructions per microsecond and more, and operating systems like Windows still doling out execution quanta in 15-millisecond chunks, it won't take long to overflow that 64 KB segment. That would be my first guess as to why things go haywire.
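If it helps, here's a hedged sketch of guarding push() against that, assuming boost's managed_shared_memory::get_free_memory(); the 512-byte cushion for allocator bookkeeping is a guess (needs <stdexcept>):
void push(const std::string & in)
{
    scoped_lock<named_mutex> lock(*m_mutex);
    // Refuse to push when the segment is nearly exhausted.
    if (m_segment->get_free_memory() < in.size() + 512)
        throw std::runtime_error("shared memory segment nearly full");
    boost::interprocess::string inStr(in.c_str());
    m_stack->push_back(inStr);
}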
P.S.
I just returned from fixing my user name to something resembling my actual identity. Then the irony hit me: my answer to your question has been staring us both in the face from the upper left-hand corner of the browser page! (That is, of course, presuming I was correct, which is so often not the case in this biz.)
Well, maybe shared memory is not the right design for your problem to begin with. However, we can't know, because we don't know what you're trying to achieve in the first place.