Is a method with no linearization points always not linearizable?

If you can definitely prove that a method has no linearization points, does it necessarily mean that that method is not linearizable? Also, as a sub-question, how can you prove that a method has no linearization points?

To build upon the answers described above, a method can be described as linearizable. As referenced in the book that djoker mentioned: http://www.amazon.com/dp/0123705916/?tag=stackoverfl08-20
on page 69, exercise 32. It should be noted that enq() is indeed a method, one that is described as possibly being linearizable or not linearizable.
Proving that there are linearization points comes down to checking whether there are examples that break linearizability. If you assume that a particular read/write memory operation in a method is a linearization point, and then prove by contradiction that a non-linearizable situation results from that assumption, you can declare that the read/write operation in question is not a valid linearization point.
Take, for example, the following enq()/deq() methods, assuming they are part of a standard queue implementation with head/tail pointers and a backing array "arr":
public terribleQueue(){
  arr = new T[10];
  tail = 0;
  head = 0;
}

void enq(T x){
  int slot = tail;
  arr[slot] = x;
  tail = tail + 1;
}

T deq(){
  if( head == tail ) throw new EmptyQueueException();
  T temp = arr[head];
  head = head + 1;
  return temp;
}
In this terrible implementation, we can easily prove, for example, that the first line of enq() is not a valid linearization point: assume that it is a linearization point, then find an example displaying otherwise, as seen here.
Take, for example, two threads, A and B, and the following history:
A: enq( 1 )
A: slot = 0
B: enq( 2 )
B: slot = 0
(A and B are now past their linearization points, therefore we are not allowed to re-order them to fit our history)
A: arr[0] = 1
B: arr[0] = 2
A: tail = 1
B: tail = 2
C: deq()
C: temp = arr[0] = 2
C: head = 1
C: return 2
Now we see that, because of our choice of linearization point (which fixes the order of A and B), this execution is impossible to make linearizable: we cannot make C's deq() return 1, no matter where we put it.
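As a contrast, here is a minimal sketch (my own, not from the book) of how this particular race could be repaired: if the two separate operations on tail are replaced by a single atomic getAndIncrement(), the history above where A and B both get slot 0 can no longer occur. The class name LessTerribleQueue and the capacity of 10 are made up for illustration:

import java.util.concurrent.atomic.AtomicInteger;

class LessTerribleQueue<T> {
  // tail is reserved atomically, so no two enq() calls can claim the same slot
  final AtomicInteger tail = new AtomicInteger(0);
  final Object[] arr = new Object[10];

  void enq(T x) {
    int slot = tail.getAndIncrement(); // atomically reserve a unique slot
    arr[slot] = x;                     // each thread writes to its own slot
  }
}

Note that this only rules out the particular bad history shown above; whether the whole method then has a single fixed linearization point is exactly the kind of question the book's exercise is probing.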
Kind of a long-winded answer, but I hope this helps.

If you can definitely prove that a method has no linearization points, does it necessarily mean that that method is not linearizable?
Firstly, linearizability is not a property of a method; it is a property of an execution sequence.
how can you prove that a method has no linearization points?
It depends on the execution sequence whether we are able to find a linearization point for the method or not.
For example, we have the below sequence, for thread A on a FIFO queue; t1, t2, t3 are time intervals.

A.enq(1)   A.enq(2)   A.deq(1)
   t1         t2         t3
We can choose the linearization points (lp) of the first two enq() calls as any points in time intervals t1 and t2 respectively, and for deq() any point in t3. The points that we choose are the lp for these methods.
Now, consider a faulty implementation
A.enq(1)   A.enq(2)   A.deq(2)
   t1         t2         t3
Linearizability requires lp to respect the real-time ordering. Therefore, the lp of the methods should follow the time ordering, i.e. t1 < t2 < t3. However, since our implementation is incorrect, we clearly cannot do this: we cannot find a linearization point for the method A.deq(2), and in turn our sequence is not linearizable.
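For a single thread this is just FIFO order; a tiny sketch (my own, using java.util.ArrayDeque as a stand-in for any correct FIFO queue) of the correct history above:

import java.util.ArrayDeque;
import java.util.Queue;

public class FifoCheck {
  public static void main(String[] args) {
    Queue<Integer> q = new ArrayDeque<>(); // any correct FIFO queue
    q.add(1);                          // t1: A.enq(1)
    q.add(2);                          // t2: A.enq(2)
    System.out.println(q.remove());    // t3: prints 1; returning 2 here is the faulty case
  }
}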
Hope this helps. If you need to know more, you can read this book:
http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916

This answer is based on me reading about linearizability on Wikipedia for the first time, and trying to map it to my existing understanding of memory consistency through happens-before relationships. So I may be misunderstanding the concept.
If you can definitely prove that a method has no linearization points, does it necessarily mean that that method is not linearizable?
It is possible to have a scenario where shared, mutable state is concurrently operated on by multiple threads without any synchronization or visibility aids, and still maintain all invariants without risk of corruption.
However, those cases are very rare.
how can you prove that a method has no linearization points?
As I understand linearization points, and I may be wrong here, they are where happens-before relationships are established between threads. If a method (recursively, through every method it calls in turn) establishes no such relationships, then I would assert that it has no linearization points.
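As a concrete illustration (my own sketch, not from the question): in Java, one common way such a happens-before relationship is established is a volatile write in one thread paired with a volatile read in another. The class and field names here are made up:

class Publication {
  int data;                // plain field, made visible via the volatile flag
  volatile boolean ready;  // the volatile write/read pair establishes happens-before

  void publish() {         // called by the writing thread
    data = 42;             // ordinary write
    ready = true;          // volatile write: publishes everything before it
  }

  void consume() {         // called by the reading thread
    if (ready) {                // a volatile read that observes the write above...
      System.out.println(data); // ...guarantees this prints 42
    }
  }
}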

Writing a custom, highly-specialized, special-purpose standard-compliant C++ allocator

Brief Preface
I recognize that there are many nuances and requirements for a standard-compatible allocator. There are a number of questions here covering a range of topics associated with allocators. I realize that the requirements set out by the standard are critical to ensuring that the allocator functions correctly in all cases, doesn't leak memory, doesn't cause undefined behaviour, etc. This is particularly true where the allocator is meant to be used (or at least, can be used) in a wide range of use cases, with a variety of underlying types and different standard containers, object sizes, etc.
In contrast, I have a very specific use case where I personally strictly control all of the conditions associated with its use, as I describe in detail below. Consequently, I believe that what I've done is perfectly acceptable given the highly-specific nature of what I'm trying to implement.
I'm hoping someone with far more experience and understanding than me can either confirm that the description below is acceptable or point out the problems (and, ideally, how to fix them too).
Overview / Specific Requirements
In a nutshell, I'm trying to write an allocator that is to be used within my own code and for a single, specific purpose:
I need "a few" std::vector (probably uint16_t), with a fixed (at runtime) number of elements. I'm benchmarking to determine the best tradeoff of performance/space for the exact integer type[1]
As noted, the number of elements is always the same, but it depends on some runtime configuration data passed to the application
The number of vectors is also either fixed or at least bounded. The exact number is handled by a library providing an implementation of parallel::for(execution::par_unseq, ...)
The vectors are constructed by me (i.e. so I know with certainty that they will always be constructed with N elements)
[1] The values of the vectors are used to conditionally copy a float from one of 2 vectors to a target: c[i] = rand_vec[i] < threshold ? a[i] : b[i], where a, b, c are contiguous arrays of float, rand_vec is the std::vector I'm trying to figure out here, and threshold is a single variable of type integer_tbd. The code compiles to SSE SIMD operations. I do not remember the details of this, but I believe that this requires additional shifting instructions if the ints are smaller than the floats.
On this basis, I've written a very simple allocator, with a single static boost::lockfree::queue as the free-list. Given that I will construct the vectors myself and they will go out of scope when I'm finished with them, I know with certainty that all calls to alloc::deallocate(T*, size_t) will always return vectors of the same size, so I believe that I can simply push them back onto the queue without worrying about a pointer to a differently-sized allocation being pushed onto the free-list.
As noted in the code below, I've added in runtime tests for both the allocate and deallocate functions for now, while I've been confirming for myself that these situations cannot and will not occur. Again, I believe it is unquestionably safe to delete these runtime tests. Although some advice would be appreciated here too -- considering the surrounding code, I think they should be handled adequately by the branch predictor so they don't have a significant runtime cost (although without instrumenting, hard to say for 100% certain).
In a nutshell - as far as I can tell, everything here is completely within my control, completely deterministic in behaviour, and, thus, completely safe. This is also suggested when running the code under typical conditions -- there are no segfaults, etc. I haven't tried running with sanitizers yet -- I was hoping to get some feedback and guidance before doing so.
I should point out that my code runs 2x faster compared to using std::allocator, which is at least qualitatively to be expected.
CR_Vector_Allocator.hpp
#include <boost/lockfree/queue.hpp>

class CR_Vector_Allocator {
  using T = CR_Range_t; // probably uint16_t or uint32_t, set elsewhere.
private:
  using free_list_type = boost::lockfree::queue<T*>; // free list of returned allocations
  static free_list_type free_list;
public:
  T* allocate(size_t);
  void deallocate(T* p, size_t) noexcept;
  using value_type = T;
  using pointer = T*;
  using reference = T&;
  template <typename U> struct rebind { using other = CR_Vector_Allocator; };
};
CR_Vector_Allocator.cc
#include "CR_Vector_Allocator.hpp"

#include <cstdlib>
#include <stdexcept>
#include <string>

CR_Vector_Allocator::T* CR_Vector_Allocator::allocate(size_t n) {
  if (n <= 1)
    throw std::runtime_error("Unexpected number of elements to initialize: " +
                             std::to_string(n));
  T* addr_;
  if (free_list.pop(addr_)) return addr_;
  addr_ = reinterpret_cast<T*>(std::malloc(n * sizeof(T)));
  return addr_;
}

void CR_Vector_Allocator::deallocate(T* p, size_t n) noexcept {
  if (n <= 1) // should never happen, but just in case, I don't want to leak
    std::free(p);
  else
    free_list.push(p);
}

CR_Vector_Allocator::free_list_type CR_Vector_Allocator::free_list;
It is used in the following manner:
using CR_Vector_t = std::vector<uint16_t, CR_Vector_Allocator>;

CR_Vector_t Generate_CR_Vector() {
  /* total_parameters is a member of the same class
     as this member function and is defined elsewhere */
  CR_Vector_t cr_vec(total_parameters);
  std::uniform_int_distribution<uint16_t> dist_;
  /* urng_ is a member variable of type std::mt19937_64 in the class */
  std::generate(cr_vec.begin(), cr_vec.end(),
                [this, &dist_]() { return dist_(this->urng_); });
  return cr_vec;
}
void Prepare_Next_Generation(...) {
  /*
  ...
  */
  using hpx::parallel::execution::par_unseq;
  hpx::parallel::for_loop_n(par_unseq, 0l, pop_size, [this](int64_t idx) {
    auto crossovers = Generate_CR_Vector();
    auto new_parameters = Generate_New_Parameters(/* ... */, std::move(crossovers));
  });
}
Any feedback, guidance or rebukes would be greatly appreciated.
Thank you!!

Is there any elegant way of iterating through a list whose elements' positions can change?

I am currently running into a disgusting problem. Suppose there is a list aList of objects (whose type we call Object), and I want to iterate through it. Basically, the code would be like this:
for(int i = 0; i < aList.Size(); ++i)
{
  aList[i].DoSth();
}
The difficult part here is, the DoSth() method could change the caller's position in the list! So two consequences could occur: first, the iteration might never be able to come to an end; second, some elements might be skipped (the iteration is not necessarily like above, since it might be a linked list). Of course, the first one is the major concern.
The problem must be solved with these constraints:
1) The possibility of doing position-exchanging operations cannot be excluded;
2) The position-exchanging operations can be delayed until the iteration finishes, if necessary and doable;
3) Since it happens quite often, the iteration can be modified only minimally (so actions like creating a copy of the list are not recommended).
The language I'm using is C++, but I think there are similar problems in Java and C#, etc.
The following are what I've tried:
a) Try forbidding the position-exchanging operations during the iteration. However, that involves too many client code files and it's just not practical to find and modify all of them.
b) Modify every single method(e.g., Method()) of Object that can change the position of itself and will be called by DoSth() directly or indirectly, in this way: first we can know that aList is doing the iteration, and we'll treat Method() accordingly. If the iteration is in progress, then we delay what Method() wants to do; otherwise, it does what it wants to right now. The question here is: what is the best (easy-to-use, yet efficient enough) way of delaying a function call here? The parameters of Method() could be rather complex. Moreover, this approach will involve quite a few functions, too!
c) Try modifying the iteration process. The real situation I encounter here is quite complex because it involves two layers of iterations: the first of them is a plain array iteration, while the second is a typical linked list iteration lying in a recursive function. The best I can do about the second layer of iteration for now, is to limit its iteration times and prevent the same element from being iterated more than once.
So I guess there could be some better way to tackle this problem? Maybe some awesome data structure will help?
Your question is a little light on detail, but from what you have written it seems that you are making the mistake of mixing concerns.
It is likely that your object can perform some action that causes it to either continue to exist or not. The decision that it should no longer exist is a separate concern to that of actually storing it in a container.
So let's split those concerns out:
#include <vector>

enum class ActionResult {
  Dies,
  Lives,
};

struct Object
{
  ActionResult performAction();
};

using Container = std::vector<Object>;
void actions(Container& cont)
{
  // note: end(cont) is re-evaluated each iteration, because erase()
  // invalidates any previously cached end iterator
  for (auto first = begin(cont); first != end(cont); )
  {
    auto result = first->performAction();
    switch (result)
    {
      case ActionResult::Dies:
        first = cont.erase(first); // object wants to die so remove it;
                                   // erase returns the next valid iterator
        break;
      case ActionResult::Lives:    // object wants to live so continue
        ++first;
        break;
    }
  }
}
If there are indeed only two results of the operation, lives and dies, then we could express this iteration idiomatically:
#include <algorithm>
// ...
void actions(Container& cont)
{
  auto actionResultsInDeath = [](Object& o)
  {
    auto result = o.performAction();
    return result == ActionResult::Dies;
  };

  cont.erase(std::remove_if(begin(cont), end(cont), actionResultsInDeath),
             end(cont));
}
Well, problem solved, at least in regard to the situation I'm interested in right now. In my situation, aList is really a linked list and the Object elements are accessed through pointers. If the size of aList is relatively small, then we have an elegant solution just like this:
void Object::DoSthBig()
{
  Object* pNext = GetNext();
  if (pNext)
    pNext->DoSthBig();
  DoSth();
}
This relies on the underlying hypothesis that each pNext remains valid during the process. But if the element-deletion operation has already been dealt with separately, then everything is fine.
Of course, this is a very special example and cannot be applied to other situations.

Why does my ConcurrentSkipListSet get stuck during multi-threaded adds?

I want to test the performance of ConcurrentSkipListSet vs. ConcurrentLinkedQueue, so I wrote a test:
ConcurrentSkipListSet<Integer> concurrentSkipListSet =
    new ConcurrentSkipListSet<>((o1, o2) -> { return 1; });
HashSet<Callable<Integer>> sets = new HashSet<>();
for (int i = 0; i < 1000; i++) {
    final int j = i;
    sets.add(() -> {
        concurrentSkipListSet.add(j);
        System.out.println(j);
        return null;
    });
}
long c = System.currentTimeMillis();
System.out.println(c);
ExecutorService service = Executors.newFixedThreadPool(10);
try {
    service.invokeAll(sets);
} catch (Exception e) {}
System.out.println(System.currentTimeMillis() - c);
I am confused: the program gets stuck after printing about 20~50 values of j, and it won't finish even in an hour. If I change the bound from i<1000 to i<10, it sometimes finishes in 3 millis and sometimes gets stuck after printing about 4~5 values.
A newCachedThreadPool performs the same as the newFixedThreadPool, in both IDEA and Eclipse.
Please help me analyze it, thank you.
Now I think it is not the thread pool's problem but rather the concurrentSkipListSet.add(j) call: when I changed the ConcurrentSkipListSet to a ConcurrentLinkedQueue or a synchronized HashSet, it worked well and finished in 168 or 170 millis.
The problem may be in the comparator you're supplying to the ConcurrentSkipListSet constructor. It always returns 1, which may lead to some kind of infinite loop in the ConcurrentSkipListSet implementation. You could use the ConcurrentSkipListSet constructor with no parameters to get the natural ordering of Integer.
Consider what's going on when you're always returning 1 from a comparator:
Suppose we have two objects, A and B. A sorting algorithm at some point may ask your comparator "is A greater than B?" by calling compare(A, B). You return 1, which means that indeed A > B and B should precede A in sorted order. Then at some point the algorithm may ask "is B greater than A?", and your compare(B, A) will also return 1, which means B > A and A should precede B in sorted order.
You can see that this comparator's behavior is completely inconsistent. For some algorithms this may lead to infinite loops; for instance, an algorithm may endlessly swap a pair of elements.
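To make the fix concrete, here is a small sketch (the class name is mine) following the suggestion above: drop the inconsistent comparator and let Integer's natural ordering, which does satisfy compare(a, b) == -compare(b, a), drive the skip list:

import java.util.concurrent.ConcurrentSkipListSet;

public class SkipListDemo {
  public static void main(String[] args) {
    // Broken: (o1, o2) -> 1 claims A > B and B > A at the same time.
    // Fixed: the no-argument constructor uses Integer's natural ordering.
    ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<>();
    for (int i = 0; i < 1000; i++) {
      set.add(i);
    }
    System.out.println(set.size()); // 1000, and the program terminates
  }
}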

C structure pointer dereferencing speed

I have a question regarding the speed of pointer dereferencing. I have a structure like so:
typedef struct _TD_RECT TD_RECT;
struct _TD_RECT {
  double left;
  double top;
  double right;
  double bottom;
};
My question is, which of these would be faster and why?
CASE 1:
TD_RECT *pRect;
...
for(i = 0; i < m; i++)
{
  if(p[i].x < pRect->left) ...
  if(p[i].x > pRect->right) ...
  if(p[i].y < pRect->top) ...
  if(p[i].y > pRect->bottom) ...
}
CASE 2:
TD_RECT *pRect;
double left = pRect->left;
double top = pRect->top;
double right = pRect->right;
double bottom = pRect->bottom;
...
for(i = 0; i < m; i++)
{
  if(p[i].x < left) ...
  if(p[i].x > right) ...
  if(p[i].y < top) ...
  if(p[i].y > bottom) ...
}
So in case 1, the loop is directly dereferencing the pRect pointer to obtain the comparison values. In case 2, new values were made on the function's local space (on the stack) and the values were copied from the pRect to the local variables. Through a loop there will be many comparisons.
In my mind, they would be equally slow, because the local variable is also a memory reference on the stack, but I'm not sure...
Also, would it be better to keep referencing p[] by index, or to increment p by one element and dereference it directly without an index?
Any ideas? Thanks :)
You'll probably find it won't make a difference with modern compilers. Most of them would perform common subexpression elimination of the expressions that don't change within the loop. It's not wise to assume that there's a simple one-to-one mapping between your C statements and assembly code. I've seen gcc pump out code that would put my assembler skills to shame.
But this is neither a C nor a C++ question, since the ISO standard doesn't mandate how it's done. The best way to check for sure is to generate the assembler code with something like gcc -S and examine the two cases in detail.
You'll also get more return on your investment if you steer away from this sort of micro-optimisation and concentrate more on the macro level, such as algorithm selection and such.
And, as with all optimisation questions, measure, don't guess! There are too many variables which can affect it, so you should be benchmarking different approaches in the target environment, and with realistic data.
It is not likely to be a hugely performance critical difference. You could profile doing each option multiple times and see. Ensure you have your compiler optimisations set in the test.
With regards to storing the doubles, you might get some performance benefit by using const. How big is your array?
With regards to using pointer arithmetic, this can be faster, yes.
You can instantly optimise if you know left < right in your rect (surely it must be). If x < left it can't also be > right so you can put in an "else".
Your big optimisation, if there is one, would come from not having to loop through all the items in your array and not having to perform 4 checks on all of them.
For example, if you indexed or sorted your array on x and y, you would be able, using binary search, to find all values that have x < left and loop through just those.
I think the second case is likely to be faster because you are not dereferencing the pointer to pRect on every loop iteration.
Practically, a compiler doing optimisation may notice this and there might be no difference in the code that is generated, but the possibility of pRect being an alias of an item in p[] could prevent this.
An optimizing compiler will see that the structure accesses are loop-invariant and so perform loop-invariant code motion, making your two cases look the same.
I will be surprised if even a totally non-optimized compile (-O0) produces different code for the two cases presented. In order to perform any operation on a modern processor, the data needs to be loaded into registers. So even when you declare automatic variables, these variables will not exist in main memory but rather in one of the processor's floating-point registers. This will be true even when you do not declare the variables yourself, and therefore I expect no difference in generated machine code even when you declare the temporary variables in your C++ code.
But as others have said, compile the code into assembler and see for yourself.

atomic swap with CAS (using gcc sync builtins)

Can the compare-and-swap function be used to swap variables atomically?
I'm using C/C++ via gcc on x86_64 RedHat Linux, specifically the __sync builtins.
Example:
int x = 0, y = 1;
y = __sync_val_compare_and_swap(&x, x, y);
I think this boils down to whether x can change between &x and x; for instance, if &x constitutes an operation, it might be possible for x to change between &x and x in the arguments. I want to assume that the comparison implicit above will always be true; my question is whether I can. Obviously there's the bool version of CAS, but then I can't get the old x to write into y.
A more useful example might be inserting or removing from the head of a linked list (gcc claims to support pointer types, so assume that's what elem and head are):
elem->next = __sync_val_compare_and_swap(&head, head, elem); //always inserts?
elem = __sync_val_compare_and_swap(&head, head, elem->next); //always removes?
Reference:
http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
The operation might not actually store the new value into the destination because of a race with another thread that changes the value at the same moment you're trying to. The CAS primitive doesn't guarantee that the write occurs - only that the write occurs if the value is already what's expected. The primitive can't know what the correct behavior is if the value isn't what is expected, so nothing happens in that case - you need to fix up the problem by checking the return value to see if the operation worked.
So, your example:
elem->next = __sync_val_compare_and_swap(&head, head, elem); //always inserts?
won't necessarily insert the new element. If another thread inserts an element at the same moment, there's a race condition that might cause this thread's call to __sync_val_compare_and_swap() to not update head (but neither this thread's nor the other thread's element is lost yet, if you handle it correctly).
But, there's another problem with that line of code - even if head did get updated, there's a brief moment of time where head points to the inserted element, but that element's next pointer hasn't been updated to point to the previous head of the list. If another thread swoops in during that moment and tries to walk the list, bad things happen.
To correctly update the list change that line of code to something like:
whatever_t* prev_head = NULL;
do {
    elem->next = head; // set up `elem->next` so the list will still be linked
                       // correctly the instant the element is inserted
    prev_head = __sync_val_compare_and_swap(&head, elem->next, elem);
} while (prev_head != elem->next);
Or use the bool variant, which I think is a bit more convenient:
do {
    elem->next = head; // set up `elem->next` so the list will still be linked
                       // correctly the instant the element is inserted
} while (!__sync_bool_compare_and_swap(&head, elem->next, elem));
It's kind of ugly, and I hope I got it right (it's easy to get tripped up in the details of thread-safe code). It should be wrapped in an insert_element() function (or even better, use an appropriate library).
Addressing the ABA problem:
I don't think the ABA problem is relevant to this "add an element to the head of a list" code. Let's say that a thread wants to add object X to the list and when it executes elem->next = head, head has value A1.
Then before the __sync_val_compare_and_swap() is executed, another set of threads comes along and:
removes A1 from the list, making head point to B
does whatever with object A1 and frees it
allocates another object, A2, that happens to be at the same address as A1 was
adds A2 to the list so that head now points to A2
Since A1 and A2 have the same identifier/address, this is an instance of the ABA problem.
However, it doesn't matter in this case since the thread adding object X doesn't care that the head points to a different object than it started out with - all it cares about is that when X is queued:
the list is consistent,
no objects on the list have been lost, and
no objects other than X have been added to the list (by this thread)
Nope. The CAS instruction on x86 takes a value from a register, and compares/writes it against a value in memory.
In order to atomically swap two variables, it'd have to work with two memory operands.
As for whether x can change between &x and x? Yes, of course it can.
Even without the &, it could change.
Even in a function such as Foo(x, x), you could get two different values of x, since in order to call the function, the compiler has to:
take the value of x, and store it in the first parameter's position, according to the calling convention
take the value of x, and store it in the second parameter's position, according to the calling convention
between those two operations, another thread could easily modify the value of x.
It seems like you're looking for the interlocked-exchange primitive, not the interlocked-compare-exchange. That will unconditionally atomically swap the holding register with the target memory location.
However, you still have a problem with race conditions between assignments to y. If y is a local, this will be safe, but if both x and y are shared you have a major problem and will need a lock to resolve it.