Multi-Threaded Binary Tree Algorithm - C++

So I've tried one method that locks each node as it looks at it, but this requires a lot of locking and unlocking, which of course carries quite a bit of overhead. I was wondering if anyone knew of a more efficient algorithm. Here is my first attempt:
typedef struct _treenode{
    struct _treenode *leftNode;
    struct _treenode *rightNode;
    int32_t data;
    pthread_mutex_t mutex;
} TreeNode;
pthread_mutex_t _initMutex = PTHREAD_MUTEX_INITIALIZER;
int32_t insertNode(TreeNode **_trunk, int32_t data){
    TreeNode **current;
    pthread_mutex_t *parentMutex = NULL, *currentMutex = &_initMutex;
    if(_trunk != NULL){
        current = _trunk;
        /* hand-over-hand locking: hold at most the current node and its parent */
        while(*current != NULL){
            pthread_mutex_lock(&(*current)->mutex);
            currentMutex = &(*current)->mutex;
            if((*current)->data < data){
                if(parentMutex != NULL)
                    pthread_mutex_unlock(parentMutex);
                parentMutex = currentMutex;
                current = &(*current)->rightNode;
            }else if((*current)->data > data){
                if(parentMutex != NULL)
                    pthread_mutex_unlock(parentMutex);
                parentMutex = currentMutex;
                current = &(*current)->leftNode;
            }else{
                pthread_mutex_unlock(currentMutex);
                if(parentMutex != NULL)
                    pthread_mutex_unlock(parentMutex);
                return 0;
            }
        }
        *current = malloc(sizeof(TreeNode));
        pthread_mutex_init(&(*current)->mutex, NULL);
        pthread_mutex_lock(&(*current)->mutex);
        (*current)->leftNode = NULL;
        (*current)->rightNode = NULL;
        (*current)->data = data;
        pthread_mutex_unlock(&(*current)->mutex);
        pthread_mutex_unlock(currentMutex);
    }else{
        return 1;
    }
    return 0;
}
int main(){
    int i;
    TreeNode *trunk = NULL;
    for(i=0; i<1000000; i++){
        insertNode(&trunk, rand() % 50000);
    }
}

You don't need to lock every node you visit. You can do something like this: lock a node only when you're about to do an insertion, do your insertion, and unlock. If another thread needs to insert at the same point and the node is locked, it should wait before traversing down any further. Once the node is unlocked it can continue traversing the updated part of the tree.
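A minimal sketch of that idea, assuming an insert-only tree; the function name insertNodeLazy and the convention that each node's mutex guards its two child links (with a separate mutex guarding the root pointer) are mine, not the answer's. We descend without locking and only take a lock when about to write, re-checking the link under the lock. Caveat: the unlocked traversal reads race with the locked writes, so for strictly defined behavior the child pointers would need to be atomic.
int32_t insertNodeLazy(TreeNode **link, pthread_mutex_t *linkMutex, int32_t data)
{
    for (;;) {
        TreeNode *node = *link;              // unlocked read (see caveat above)
        if (node == NULL) {
            pthread_mutex_lock(linkMutex);
            if (*link == NULL) {             // re-check under the lock
                TreeNode *fresh = malloc(sizeof(TreeNode));
                if (fresh == NULL) {
                    pthread_mutex_unlock(linkMutex);
                    return 1;
                }
                fresh->leftNode = NULL;
                fresh->rightNode = NULL;
                fresh->data = data;
                pthread_mutex_init(&fresh->mutex, NULL);
                *link = fresh;
                pthread_mutex_unlock(linkMutex);
                return 0;
            }
            pthread_mutex_unlock(linkMutex); // lost the race; descend instead
        } else if (node->data < data) {
            link = &node->rightNode;         // the link we may write next...
            linkMutex = &node->mutex;        // ...and the mutex guarding it
        } else if (node->data > data) {
            link = &node->leftNode;
            linkMutex = &node->mutex;
        } else {
            return 0;                        // duplicate value; nothing to do
        }
    }
}
It would be called as insertNodeLazy(&trunk, &rootMutex, value), with a process-wide rootMutex covering the empty-tree case.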

Another straightforward way is to have one lock for the complete tree.
Access to the tree is more serialized, but you only have one mutex and you lock only once per operation.
If the serialization is an issue, use a read/write lock, so that at least reads can proceed in parallel.

Use a read-write lock. Locking on individual nodes will become exceptionally difficult if you later decide to switch your tree implementation. Here's a little demo code using pthreads:
typedef struct {
    pthread_rwlock_t rwlock;
    TreeNode *root_node;
} Tree;

void Tree_init(Tree *tree) {
    pthread_rwlock_init(&tree->rwlock, NULL);
    tree->root_node = NULL;
}

int32_t Tree_insert(Tree *tree, int32_t data) {
    pthread_rwlock_wrlock(&tree->rwlock);
    int32_t ret = _insertNode(&tree->root_node, data);
    pthread_rwlock_unlock(&tree->rwlock);
    return ret;
}

int32_t Tree_locate(Tree *tree) {
    pthread_rwlock_rdlock(&tree->rwlock);
    int32_t ret = _locateNode(&tree->root_node);
    pthread_rwlock_unlock(&tree->rwlock);
    return ret;
}

void Tree_destroy(Tree *tree) {
    pthread_rwlock_destroy(&tree->rwlock);
    // yada yada
}

Lock the whole tree. There's no other way that will not get you into trouble sooner or later. Of course, if there's a lot of concurrent reads and writes, you will get a lot of blocking and slow everything down horribly.
Java introduced a concurrent skip list in version 1.6. Skip lists work like trees, but are (supposedly) a bit slower. However, they are based on singly linked lists, and therefore can theoretically be modified without locking using compare-and-swap. This makes for superb multi-threaded performance.
I googled "skip list" C++ compare-and-swap and came up with some interesting info but no C++ code. However, Java is open source, so you can get the algorithm if you are desperate enough. The Java class is: java.util.concurrent.ConcurrentSkipListMap.

Related

Trie structure, lock-free inserting

I tried to implement a lock-free Trie structure, but I am stuck on inserting nodes. At first I believed it was easy (my trie structure would not have any delete methods), but even swapping one pointer atomically can be tricky.
I want to swap a pointer to point to a structure (TrieNode) atomically, but only when it is nullptr, so as to be sure that I do not lose other nodes that another thread could insert in between.
struct TrieNode{
    int t = 0;
    std::shared_ptr<TrieNode> child{nullptr};
};

std::shared_ptr<TrieNode> root;

auto p = std::atomic_load(&root);          // root was never set, so p is null here
auto node = std::make_shared<TrieNode>();
node->t = 1;
auto tmp = std::shared_ptr<TrieNode>{nullptr};
// dereferencing the null p in &(p->child) is what produces 0xC0000005
std::cout << std::atomic_compare_exchange_strong(&(p->child), &tmp, node) << std::endl;
std::cout << node->t;
With this code I get exit code -1073741819 (0xC0000005).
EDIT: Thank you for all your comments. Maybe I did not specify my problem clearly, so I want to address it now. After around 10 hours of coding the next day, I changed a few things. Now I use ordinary pointers, and for now it is working. I have not yet tested whether it is race-free with multiple threads inserting words; I plan to do that today.
#include <atomic>
#include <string>

const int ALPHABET_SIZE = 4;
enum Alphabet {A, T, G, C, END};

class LFTrie{
private:
    struct TrieNode{
        std::atomic<TrieNode*> children[ALPHABET_SIZE+1];
    };

    std::atomic<TrieNode*> root = new TrieNode();

    // (definition assumed; not shown in the original post)
    static int WhatIndex(char c){       // map a nucleotide to its enum index
        switch(c){
            case 'A': return A;
            case 'T': return T;
            case 'G': return G;
            default:  return C;
        }
    }

public:
    void Insert(std::string word){
        auto p = root.load();
        int index;
        for(int i = 0; i <= word.size(); i++){
            if(i == word.size())
                index = END;
            else
                index = WhatIndex(word[i]);
            auto expected = p->children[index].load();
            if(!expected){
                auto node = new TrieNode();
                if(!p->children[index].compare_exchange_strong(expected, node))
                    delete node;        // another thread won the race
            }
            p = p->children[index];
        }
    }
};
Now I believe it will work with many threads inserting different words. And yes, in this solution I discard the node if the child pointer turns out not to be null. Sorry for the trouble (I am not a native speaker).
The CAS pattern should be something like:
auto expected = p->child;
while( !expected ){
    if (success at CAS(&p->child, &expected, make_null_replace()))
        break;
}
If you aren't paying attention to the return value/expected, and testing that you are replacing a locally stored null, you are in trouble.
On failure, you need to throw away the new node you made.
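In C++, against the std::atomic<TrieNode*> child slots from the edited code, that pattern can look like this (a sketch; attachIfNull is my name for the hypothetical helper):
#include <atomic>

// Attach `node` at `slot` only if the slot is still null. On CAS failure,
// `expected` receives the pointer some other thread installed; the caller
// must throw its own node away and continue with the winner's node.
template <typename T>
T* attachIfNull(std::atomic<T*>& slot, T* node)
{
    T* expected = nullptr;
    if (slot.compare_exchange_strong(expected, node))
        return node;       // we installed our node
    delete node;           // lost the race: discard our node
    return expected;       // descend into the node that won
}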

Stack overflow? Interesting behaviour during very deep recursion

While I was making my assignment on BSTs, linked lists and AVL trees I noticed... actually, it is as in the title.
I believe it is somehow related to a stack overflow, but I could not find out why it is happening.
Creation of the BST and linked list (graph)
Searching for all elements in the linked list and BST (graph)
And probably most interesting: comparison of the height of the BST and AVL (graph, based on an array of unique random integers)
On every graph something interesting begins around 33k elements.
Optimization O2 in MS Visual Studio 2019 Community.
Search function of Linked list is not recursive.
Memory for each "link" was allocated with "new" operator.
The X axis ends at 40k elements because at about 43k a stack overflow error happens.
Do you know why this happens? Actually, I'm curious what is happening. Looking forward to your answers! Stay healthy.
Here is some related code; although it is not exactly the same, I can assure you it behaves the same, and some of the actual code was based on it.
struct tree {
    tree() {
        info = 0;
        left = NULL;
        right = NULL;
    }
    int info;
    struct tree *left;
    struct tree *right;
};

struct tree *insert(struct tree*& root, int x) {
    if(!root) {
        root = new tree;
        root->info = x;
        root->left = NULL;
        root->right = NULL;
        return(root);
    }
    if(root->info > x)
        root->left = insert(root->left, x);
    else if(root->info < x)
        root->right = insert(root->right, x);
    return(root);
}

struct tree *search(struct tree*& root, int x) {
    struct tree *ptr;
    ptr = root;
    while(ptr) {
        if(x > ptr->info)
            ptr = ptr->right;
        else if(x < ptr->info)
            ptr = ptr->left;
        else
            return ptr;
    }
    return NULL; // not found
}

int bstHeight(tree*& tr) {
    if (tr == NULL) {
        return -1;
    }
    int lefth = bstHeight(tr->left);
    int righth = bstHeight(tr->right);
    if (lefth > righth) {
        return lefth + 1;
    } else {
        return righth + 1;
    }
}
The AVL tree is built from the BST read in order; the array of elements is then inserted into the tree object by bisection.
The spikes in time could be, and I am nearly sure they are, caused by using up some CPU cache (L2, for example), so leftover data had to be stored in slower memory.
The answer is thanks to @David_Schwartz.
The spike in the height of the BST is actually my own fault. For the "array of unique random integers" I used an array of already sorted unique items, then mixed them up by swapping elements with the rand() function. I had totally forgotten how devastating that can be when larger random numbers are expected: on MSVC, RAND_MAX is only 32767, so rand() never picks a swap index above ~32k and the tail of a larger array stays nearly sorted, which matches the graphs changing around 33k elements.
Thanks @rici for pointing it out.
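For reference, a shuffle that is not limited by RAND_MAX (a sketch using <random>; the function name is mine):
#include <algorithm>
#include <random>
#include <vector>

// Fisher-Yates via std::shuffle: every index can be chosen as a swap
// target regardless of the array size, unlike rand() % n with a
// 15-bit RAND_MAX.
std::vector<int> makeUniqueRandom(int n)
{
    std::vector<int> v(n);
    for (int i = 0; i < n; ++i) v[i] = i;   // unique, sorted
    std::mt19937 gen(std::random_device{}());
    std::shuffle(v.begin(), v.end(), gen);
    return v;
}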

C++: undetectable change of variable, artificial "volatile" mutex does not work

How is it possible that, on the line just after an if statement that checked the variables were unequal, the variables are already equal in the pull() method? I have already added a Mutex variable, but it did not help.
int fQ::pull(void){ // pull element from the queue
    while(MutexF);
    MutexF = 1;
    if (last != first){
        fQueue[first++]();
        first %= lengthQ;
        MutexF = 0;
        return 0;
    }
    else{
        MutexF = 0;
        return 1;
    }
}
STL containers are too heavy for me; I am preparing this for a tiny MCU, which is why I tried to avoid all this complex stuff (std::mutex, std::atomic, etc.). The multithreading is only needed for test purposes, standing in for the tiny MCU's interrupts for the time being. I intended not to use any STL/thread libraries at all.
https://github.com/WeSpeakEnglish/nortos/blob/master/C_plus_plus_implementation/main.cpp
https://github.com/WeSpeakEnglish/nortos/blob/master/C_plus_plus_implementation/nortos.h
First, you'd better use std::atomic and/or std::mutex for synchronization purposes, or at least std::atomic_flag. volatile has issues in general: it isn't suited for atomic operations and has a different purpose altogether.
Second, there is a bug in your code, and I don't know how to solve it properly with volatile:
while(MutexF);
MutexF = 1;
Imagine someone sets MutexF to 0 and then two threads simultaneously exit the while loop before setting MutexF = 1. What do you think is going to happen?
Perhaps you can synchronize two threads, one for pull and one for push, in this manner, but you'd better abandon such an approach.
#include <atomic> // std::atomic
#include <mutex>  // std::mutex

typedef void(*FunctionPointer)(void);

class fQ {
private:
    std::atomic<int> first;
    std::atomic<int> last;
    FunctionPointer * fQueue;
    int lengthQ;
    std::mutex mtx;
public:
    fQ(int sizeQ);
    ~fQ();
    int push(FunctionPointer);
    int pull(void);
};

fQ::fQ(int sizeQ){ // initialization of the queue
    fQueue = new FunctionPointer[sizeQ];
    last = 0;
    first = 0;
    lengthQ = sizeQ;
}

fQ::~fQ(){ // destruction of the queue
    delete [] fQueue;
}

int fQ::push(FunctionPointer pointerF){ // push element onto the queue
    mtx.lock();
    if ((last+1)%lengthQ == first){
        mtx.unlock();
        return 1;
    }
    fQueue[last++] = pointerF;
    last = last%lengthQ;
    mtx.unlock();
    return 0;
}

int fQ::pull(void){ // pull element from the queue
    mtx.lock();
    if (last != first){
        fQueue[first++]();
        first = first%lengthQ;
        mtx.unlock();
        return 0;
    }
    else{
        mtx.unlock();
        return 1;
    }
}
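Since std::mutex was ruled out for the tiny MCU, it is worth noting that the smallest standard building block that fixes the two-threads-exit-the-while-loop race is std::atomic_flag, whose test_and_set reads the old value and sets the flag in one indivisible step. A minimal sketch (the SpinLock name is mine):
#include <atomic>

// Minimal spinlock: test_and_set atomically returns the old value and
// sets the flag, so two threads can no longer both "exit the while loop"
// before claiming the lock, as happens in the volatile version.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};
pull() and push() would then call spin.lock()/spin.unlock() in place of mtx.lock()/mtx.unlock(). On a real single-core MCU with interrupts, though, briefly disabling interrupts around the critical section is usually the appropriate primitive rather than spinning.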

What are the correct memory orders to use when inserting a node at the beginning of a lock free singly linked list?

I have a simple linked list. There is no danger of the ABA problem, I'm happy with the Blocking category, and I don't care if my list is FIFO, LIFO or randomized, as long as inserting succeeds without making others fail.
The code for that looks something like this:
class Class {
    std::atomic<Node*> m_list;
    ...
};

void Class::add(Node* node)
{
    node->next = m_list.load(std::memory_order_acquire);
    while (!m_list.compare_exchange_weak(node->next, node,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire))
        ;
}
where I filled in the memory_orders used more or less at random.
What are the right memory orders to use here?
I've seen people use std::memory_order_relaxed in all places, one guy on SO used that too, but then std::memory_order_release for the success case of compare_exchange_weak -- and the genmc project uses memory_order_acquire / twice memory_order_acq_rel in a comparable situation, but I can't get genmc to work for a test case :(.
Using genmc, the excellent tool from Michalis Kokologiannakis, I was able to verify the required memory orders with the following test code. Unfortunately, genmc currently requires C code, but that doesn't matter for figuring out what the memory orders need to be, of course.
// Install https://github.com/MPI-SWS/genmc
//
// Then test with:
//
//   genmc -unroll 5 -- genmc_sll_test.c

// These header files are replaced by genmc (see /usr/local/include/genmc):
#include <pthread.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

#define PRODUCER_THREADS 3
#define CONSUMER_THREADS 2

struct Node
{
    struct Node* next;
};

struct Node* const deleted = (struct Node*)0xd31373d;

_Atomic(struct Node*) list;

void* producer_thread(void* node_)
{
    struct Node* node = (struct Node*)node_;
    // Insert node at the beginning of the list.
    node->next = atomic_load_explicit(&list, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&list, &node->next,
        node, memory_order_release, memory_order_relaxed))
        ;
    return NULL;
}

void* consumer_thread(void* param)
{
    // Replace the whole list with an empty list.
    struct Node* head = atomic_exchange_explicit(&list, NULL, memory_order_acquire);
    // Delete each node that was in the list.
    while (head)
    {
        struct Node* orphan = head;
        head = orphan->next;
        // Mark the node as deleted.
        assert(orphan->next != deleted);
        orphan->next = deleted;
    }
    return NULL;
}

pthread_t t[PRODUCER_THREADS + CONSUMER_THREADS];
struct Node n[PRODUCER_THREADS]; // Initially filled with zeroes -->
                                 // none of the Nodes is marked as deleted.

int main()
{
    // Start PRODUCER_THREADS threads that each prepend one node to the list.
    for (int i = 0; i < PRODUCER_THREADS; ++i)
        if (pthread_create(&t[i], NULL, producer_thread, &n[i]))
            abort();
    // Start CONSUMER_THREADS threads that each delete all nodes added so far.
    for (int i = 0; i < CONSUMER_THREADS; ++i)
        if (pthread_create(&t[PRODUCER_THREADS + i], NULL, consumer_thread, NULL))
            abort();
    // Wait till all threads have finished.
    for (int i = 0; i < PRODUCER_THREADS + CONSUMER_THREADS; ++i)
        if (pthread_join(t[i], NULL))
            abort();
    // Count the number of elements still in the list.
    struct Node* l = list;
    int count = 0;
    while (l)
    {
        ++count;
        l = l->next;
    }
    // Count the number of deleted elements.
    int del_count = 0;
    for (int i = 0; i < PRODUCER_THREADS; ++i)
        if (n[i].next == deleted)
            ++del_count;
    assert(count + del_count == PRODUCER_THREADS);
    //printf("count = %d; deleted = %d\n", count, del_count);
    return 0;
}
The output of which is
$ genmc -unroll 5 -- genmc_sll_test.c
Number of complete executions explored: 6384
Total wall-clock time: 1.26s
Replacing either the memory_order_release or the memory_order_acquire with memory_order_relaxed causes an assertion failure.
In fact, it can be checked that using exclusively memory_order_relaxed when just inserting nodes is sufficient to get them all cleanly into the list (although in a "random" order: nothing here is sequentially consistent, so the order in which they are added is not necessarily the order in which the threads tried to add them, if such a correlation exists for other reasons).
However, the memory_order_release is required so that when head is read with memory_order_acquire we can be certain that all non-atomic next pointers are visible in the "consumer" thread.
Note there is no ABA problem here because the values used for head and next cannot be "reused" before they are deleted by the consumer_thread function, which is therefore the only place where these nodes are allowed to be deleted, implying that there can only be one consumer thread (this test code does NOT check for the ABA problem, so it happens to work with 2 CONSUMER_THREADS as well).
The actual code is a garbage collection mechanism where multiple "producer" threads add pointers to a singly linked list when those can be deleted, but where it is only safe to actually do so in one specific thread (in that case there is thus only one "consumer" thread, which performs this garbage collection at a well-known place in a main loop).
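Translated back to the C++ add() from the question, the verified orders look like this (a sketch; takeAll is my name for the consumer side, standing in for the exchange from the test code):
#include <atomic>

struct Node { Node* next; };

class Class {
    std::atomic<Node*> m_list{nullptr};
public:
    void add(Node* node)
    {
        // Relaxed is fine for the loads: only the successful CAS publishes.
        node->next = m_list.load(std::memory_order_relaxed);
        // memory_order_release on success makes the non-atomic write to
        // node->next visible to the consumer's acquire exchange.
        while (!m_list.compare_exchange_weak(node->next, node,
                                             std::memory_order_release,
                                             std::memory_order_relaxed))
            ;   // on failure, node->next was reloaded; just retry
    }

    Node* takeAll()
    {
        // Pairs with the release above: after this acquire exchange all
        // next pointers written by the producers are visible.
        return m_list.exchange(nullptr, std::memory_order_acquire);
    }
};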

lock free arena allocator implementation - correct?

For a simple pointer-increment allocator (do they have an official name?) I am looking for a lock-free algorithm. It seems trivial, but I'd like to get some feedback on whether my implementation is correct.
Non-thread-safe implementation:
byte * head; // current head of remaining buffer
byte * end;  // end of remaining buffer

void * Alloc(size_t size)
{
    if (end - head < size)
        return 0; // allocation failure
    void * result = head;
    head += size;
    return result;
}
My attempt at a thread-safe implementation:
void * Alloc(size_t size)
{
    byte * current;
    do
    {
        current = head;
        if (end - current < size)
            return 0; // allocation failure
    } while (CMPXCHG(&head, current + size, current) != current);
    return current;
}
where CMPXCHG is an interlocked compare exchange with (destination, exchangeValue, comparand) arguments, returning the original value
Looks good to me - if another thread allocates between the get-current and cmpxchg, the loop attempts again. Any comments?
Your current code appears to work. It behaves the same as the code below, which is a simple pattern you can use for implementing any lock-free algorithm that operates on a single word of data and has no side effects:
do
{
    original = *data;                 // Capture.
    result = DoOperation(original);   // Attempt operation.
} while (CMPXCHG(data, result, original) != original);
EDIT: My original suggestion of an interlocked add won't quite work here, because you support trying to allocate and failing if not enough space is left. With InterlockedAdd you would have already modified the pointer, causing subsequent allocations to fail.
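For completeness, here is the same algorithm written with standard C++ atomics instead of a platform CMPXCHG intrinsic (a sketch; allocators like this are commonly called bump or arena allocators):
#include <atomic>
#include <cstddef>

// Bump allocator over a fixed buffer. compare_exchange_weak replays the
// size check whenever another thread moved `head` first, exactly like
// the CMPXCHG loop in the question.
class ArenaAllocator {
    std::atomic<char*> head;
    char* end;
public:
    ArenaAllocator(char* buffer, std::size_t size)
        : head(buffer), end(buffer + size) {}

    void* Alloc(std::size_t size)
    {
        char* current = head.load(std::memory_order_relaxed);
        do {
            if (static_cast<std::size_t>(end - current) < size)
                return nullptr;       // allocation failure
        } while (!head.compare_exchange_weak(current, current + size,
                                             std::memory_order_relaxed));
        return current;               // the old head is our block
    }
};
Note that compare_exchange_weak updates `current` with the freshly observed head on failure, so the size check is re-evaluated on every retry.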