It's often said that immutable data structures are more "friendly" for concurrent programming. The usual explanation is that if a data structure is mutable and one thread modifies it, another thread may "see" the previous state of the data structure.
Although an immutable data structure cannot be modified, when one needs to change it, one can create a new data structure that holds a reference to the "old" one.
In my opinion, this situation is also not thread safe, because one thread can access the old data structure while a second one accesses the new data structure. If so, why is an immutable data structure considered more thread-safe?
The idea here is that you can't change an object once it has been created. Every object that is part of some structure is itself immutable. When you create a new structure and reuse some components of the old structure, you still can't change any internal value of any component that makes up the new structure. You can easily identify each structure by its root component reference.
Of course, you still need to make sure you swap those root references in a thread-safe fashion; this is usually done using variants of CAS (compare-and-swap) instructions. Alternatively, you can use functional programming, whose idiom of side-effect-free functions (take immutable input, produce a new result) is ideal for thread-safe, multi-threaded programming.
There are many benefits to immutability over mutability, but that does not mean immutable is always better. Every approach has its benefits and applications. Take a look at this answer for more details on immutability uses. Also check this nicely written answer about mutability benefits in some circumstances.
Related
I have a class, used for data storage, of which there is only a single instance.
The caller is message driven and has become too large and is a prime candidate for refactoring, such that each message is handled by a separate thread. However, these could then compete to read/write the data.
If I were using mutexes (mutices?), I would only use them on write operations. I don't think that fits here, as it is the data that need to be atomic, not the functions that access the data.
Is there any easy way to make all of the data atomic? Currently it consists of simple types, vectors and objects of other classes. If I have to add std::atomic<> to every sub-field, I may as well use mutexes.
std::atomic requires the type to be trivially copyable. Since you say std::vector is involved, that rules it out, both for the whole structure and for the std::vector itself.
The purpose of std::atomic is to be able to atomically replace the whole value of the object. You cannot do something like access individual members or so on.
From the limited context you gave in your question, I think std::mutex is the correct approach. Each object that should be independently accessible should have its own mutex protecting it.
Also note that the mutex generally needs to protect both writes and reads, not only unsynchronized writes: a read happening unsynchronized with a write is a data race and causes undefined behavior.
I always hear people saying that it's easier to manage immutable objects when working with multiple threads since when one thread accessed an immutable object, it doesn't have to worry that another thread is changing it.
Well, what happens if I have an immutable list of all employees in a company and a new employee is hired? In this case the immutable list has to be duplicated, and the new copy of it has to include another employee object. Then the reference to the list of employees should be directed to the new list.
When this scenario happens, the list itself doesn't change, but the reference to this list changes, and therefore the code "sees" different data.
If so, I don't understand why immutable objects make our lives easier when working with multiple threads. What am I missing?
The main problem with concurrent updates of mutable data is that threads may perceive variable values stemming from different versions, i.e. a mixture of old and new values when speaking of a single update, forming an inconsistent state that violates the invariants of these variables.
See, for example, Java's ArrayList. It has an int field holding the current size and a reference to an array whose elements are references to the contained objects. The values of these variables have to fulfill certain invariants, e.g. if the size is non-zero, the array reference is never null and the array length is always greater than or equal to the size. When seeing values from different updates of these variables, these invariants do not hold anymore, so threads may see list contents which never existed in this form, or fail with spurious exceptions reporting an illegal state that should be impossible (like NullPointerException or ArrayIndexOutOfBoundsException).
Note that thread-safe or concurrent data structures only solve the problem regarding the internals of the data structure, so operations do not fail with spurious exceptions anymore (regarding the collection's state; we haven't talked about the contained elements' state yet). But operations iterating over these collections, or looking at more than one contained element in any form, are still subject to possibly observing an inconsistent state regarding the contained elements. This also applies to the check-then-act anti-pattern, where an application first checks for a condition (e.g. using contains) before acting upon it (like fetching, adding or removing an element), whereas the condition might change in between.
In contrast, a thread working on an immutable data structure may work on an outdated version of it, but all variables belonging to that structure are consistent to each other, reflecting the same version. When performing an update, you don’t need to think about exclusion of other threads, it’s simply not necessary, as the new data structures are not seen by other threads. The entire task of publishing a new version reduces to the task of publishing the root reference to the new version of your data structure. If you can’t stop the other threads processing the old version, the worst thing that can happen, is that you may have to repeat the operation using the new data afterwards, in other words, just a performance issue, in the worst case.
This works smoothly with garbage-collected programming languages, as these allow the new data structure to refer to old objects, replacing only the changed objects (and their parents), without needing to worry about which objects are still in use and which are not.
Here is an example: (a) we have an immutable list, (b) we have a writer thread that adds elements to the list, and (c) 1000 reader threads that read the list without changing it.
It will work without locks.
If we have more than one writer thread, we will still need a write lock. If we have to remove entries from the list, we will need a read-write lock.
Is it worthwhile? I don't know.
I have been writing a program that has a rather large structure that is passed by reference to a few functions. However, there are a few other functions that need access to small pieces of information within the large structure. It's not being edited, just read.
I was thinking of creating a second structure that just copies the specific pieces of information needed and passing that by reference instead, rather than passing the entire structure by reference.
What I am wondering is two things:
Since I am passing the large structure by reference, there really is no performance impact. Correct?
Even if 1) is correct, is it bad practice to be passing around a structure that shouldn't be edited (even though it wouldn't be edited, but still I'm talking about the principle here).
More specifically:
I have a configuration structure that sets up the program's configuration by calling a function and passing the structure by reference. There is some information (process name, command line arguments) that I want to use for informative purposes only. I'm asking if it's bad practice to pass around a structure that wasn't meant for the purpose I want to use it for.
1) Since I am passing the large structure by reference, there really is no performance impact. Correct?
Correct.
2) Even if 1) is correct, is it bad practice to be passing around a structure that shouldn't be edited (even though it wouldn't be edited, but still I'm talking about the principle here)?
You could let your function accept a reference to const to make sure the function won't alter the state of the corresponding argument.
I'm asking if it's bad practice to pass around a structure that wasn't meant for the purpose of what I want to use it for.
I'm not sure what you mean by this. The way you write it, this definitely seems to be a bad practice: you shouldn't use something for doing what it wasn't meant for. That means distorting the semantics of an object. However, the rest of your question doesn't seem to imply this.
Rather, it seems like you are concerned with passing a reference to a function because that may allow the function to alter the argument's state; but provided the function takes a reference to const, it won't be able to alter the state of its argument. In that case, no it's not a bad practice.
If you are referring to the fact that the function only need to work with some of the data members or member functions of your structure, then again that is not necessarily a bad design. It would be silly to require that each function access every member of a data structure.
Of course, this is the best I can write without knowing anything concrete about the semantics of the function and the particular data structure.
Correct.
Pass it by const reference; you'll get the performance gains of pass-by-reference without allowing editing.
By the way, if only a fraction of the "big structure" is required to that function it may be an indicator that such fields store some information "on their own" - i.e. the rest of the "big struct" is not needed to interpret them correctly. In this case, you may consider moving them to a separate struct, that will itself be a member of the first "big struct".
As one step further, you can keep such configuration objects in a shared pointer and pass it anywhere you want and so you dont have to worry about ownership of the structure. In this way you ensure that a single original configuration object is shared by the all program components
Like others have said, use const.
If you are doing C++, access those small pieces of information with accessor functions. Then functions that don't need to change the state of your struct will not have to touch any member fields, only member functions.
As others have mentioned, const& if you aren't modifying the data.
However, your point about "should I copy the data to a smaller struct" has mostly been glossed over. The answer is "maybe".
A good reason not to do it is that it is a waste of time -- literally, it costs time to copy stuff around.
A good reason to do it is that it reduces the effective state of your subprocedure. A subprocedure that doesn't access global variables (and hence global state), and isn't passed any pointers, has a very limited state. Procedures with limited state are easier to test, often easier to understand, and usually easier to debug.
Often you want to call each function with the absolute least amount of data required for that function to solve the problem it has. If you avoid passing in a "pointer to everything" (and references are pointers) to every function, you can maintain this rule, and it can often result in code that is easier to maintain.
On the other hand, stripping the data out of the big monolithic state into small local structs can itself introduce bugs and errors.
One way to avoid this problem entirely is to avoid the big monolithic state object with parameters all mixed together, and if there are some parameters that are bundled together to answer some questions, they should be in their own sub-struct to start with. Now calling the subprocedure is easy -- you pass in the sub-struct which already has the parameters bundled.
I’m looking for a read-only dictionary to be accessed from multiple threads. While ConcurrentDictionary exposes such capabilities, I don’t want the overhead and the strange API.
While .NET 4.5 provides such a class (ReadOnlyDictionary), the documentation states that only static members are guaranteed to be thread-safe.
I wonder why?
ReadOnlyDictionary is just a wrapper around any other dictionary. As such, it's only as thread-safe as the underlying dictionary.
In particular, if there's a thread modifying the underlying dictionary while another thread reads from the wrapper, there's no guarantee of safety.
If you want a ReadOnlyDictionary which is effectively immutable from all angles, you can create a clone of the original dictionary, create a ReadOnlyDictionary wrapper around that, and then not keep a reference to the clone anywhere. With only read operations going on, it should then be thread-safe. Of course, if the key or value types are mutable, that opens up a second degree of "thread-unsafety" to worry about.
My application problem is the following -
I have a large structure foo. Because these structures are large, and for memory-management reasons, we do not wish to delete them when processing of the data is complete.
We are storing them in std::vector<boost::shared_ptr<foo>>.
My question is related to knowing when all processing is complete. The first decision is that we do not want any of the other application code to mark a complete flag in the structure, because there are multiple execution paths in the program and we cannot predict which one will be the last.
So in our implementation, once processing is complete, we delete all copies of the boost::shared_ptr<foo> except for the one in the vector. This drops the reference count in the shared_ptr to 1. Is it practical to check whether shared_ptr::use_count() equals 1 to know when all other parts of my app are done with the data?
One additional reason I'm asking is that the Boost documentation on shared_ptr recommends against using use_count() in production code.
Edit -
What I did not say is that when we need a new foo, we scan the vector of foo pointers looking for one that is not currently in use, and use that foo for the next round of processing. This is why I was thinking that a reference count of 1 would be a safe way to ensure that a particular foo object is no longer in use.
My immediate reaction (and I'll admit, it's no more than that) is that it sounds like you're trying to get the effect of a pool allocator of some sort. You might be better off overloading operator new and operator delete to get the effect you want a bit more directly. With something like that, you can probably just use a shared_ptr like normal, and the other work you want delayed, will be handled in operator delete for that class.
That leaves a more basic question: what are you really trying to accomplish with this? From a memory management viewpoint, one common wish is to allocate memory for a large number of objects at once, and after the entire block is empty, release the whole block at once. If you're trying to do something on that order, it's almost certainly easier to accomplish by overloading new and delete than by playing games with shared_ptr's use_count.
Edit: based on your comment, overloading new and delete for the class sounds like the right thing to do. If anything, integration into your existing code will probably be easier; in fact, you can often do it completely transparently.
The general idea for the allocator is pretty much the same as you've outlined in your edited question: have a structure (bitmaps and linked lists are both common) to keep track of your free objects. When new needs to allocate an object, it can scan the bit vector or look at the head of the linked list of free objects, and return its address.
This is one case that linked lists can work out quite well -- you (usually) don't have to worry about memory usage, because you store your links right in the free object, and you (virtually) never have to walk the list, because when you need to allocate an object, you just grab the first item on the list.
This sort of thing is particularly common with small objects, so you might want to look at the Modern C++ Design chapter on its small object allocator (and an article or two since then by Andrei Alexandrescu about his newer ideas of how to do that sort of thing). There's also the Boost::pool allocator, which is generally at least somewhat similar.
If you want to know whether or not the use count is 1, use the unique() member function.
I would say your application should have some method that eliminates all references to the Foo from other parts of the app, and that method should be used instead of checking use_count(). Besides, if use_count() is greater than 1, what would your program do? You shouldn't be relying on shared_ptr's features to eliminate all references, your application architecture should be able to eliminate references. As a final check before removing it from the vector, you could assert(unique()) to verify it really is being released.
I think you can use shared_ptr's custom deleter functionality to call a particular function when the last copy has been released. That way, you're not using use_count at all.
You would need to hold something other than a copy of the shared_ptr in your vector so that the shared_ptr is only tracking the outstanding processing.
Boost has several examples of custom deleters in the shared_ptr docs.
I would suggest that instead of trying to use the shared_ptr's use_count() to keep track, it might be better to implement your own usage counter. This way you will have full control, rather than relying on the shared_ptr's count which, as you rightly note, is not recommended for production code. You can also pre-set your own counter to allow for the number of threads you know will need to act on the data, rather than relying on them all being initialized at the beginning to get their copies of the structure.