Linux C/C++ synchronization method for both inter- and intra-process locking

This question came up while I was developing a registry system (C/C++, 2.6.32-642.6.2.el6.x86_64 #1 SMP) used to bookmark information for each database, which requires both inter- and intra-process locking. Normally, lockf(), flock(), and fcntl() are the obvious candidates for inter-process locking, but I have found out that they do not work as expected for intra-process locking (multiple threads in the same process).
I tested it using the following program:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <fcntl.h> /* For O_RDWR */
#include <unistd.h> /* For open(), creat() */
#include <errno.h>

int counter = 0;

void* counterThread(void* ptr)
{
    int lockfd = 0;
    int tmpCounter = 0;

    lockfd = open("/tmp/lockfile.txt", O_CREAT|O_WRONLY, 0666);
    if(lockfd == -1)
    {
        printf("lockfile could not be created, errno:%d\n", errno);
        return NULL;
    }

    if(lockf(lockfd, F_LOCK, 0) == -1)
    {
        printf("lockfile could not be locked, errno:%d\n", errno);
        return NULL;
    }

    counter++;
    tmpCounter = counter;

    if(lockf(lockfd, F_ULOCK, 0) == -1)
    {
        printf("lockfile could not be unlocked, errno:%d\n", errno);
        return NULL;
    }

    close(lockfd);
    printf("counter is %d, lockfile is %d\n", tmpCounter, lockfd);
    return NULL;
}

int main()
{
    int threadNum = 30000;
    pthread_t threads[30000];
    int i = 0;
    int rv = 0;

    for(; i < threadNum; i++)
    {
        rv = pthread_create(&threads[i], NULL, &counterThread, NULL);
        if(rv != 0)
        {
            printf("failed to create pthread %d\n", i);
            return -1;
        }
    }

    for(i = 0; i < threadNum; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
The output would be:
counter is 1, lockfile is 4
counter is 2, lockfile is 3
counter is 3, lockfile is 5
counter is 4, lockfile is 6
counter is 7, lockfile is 4
...
counter is 29994, lockfile is 3
counter is 29995, lockfile is 3
counter is 29996, lockfile is 3
counter is 29997, lockfile is 3
counter is 29998, lockfile is 3
The output sequence is random and sometimes skips numbers, so there is definitely a race condition happening. I think the reason is probably that an fd opened for the same file within the same process is somehow optimized to be reused. Because all of these locking mechanisms work at the granularity of the fd, the locking does not work in this case.
Given that background, I would like to ask the following questions:
Is there any way I could force open() to return a different fd for each thread of the same process so that the locking works?
Is there any good practice or convenient API in Linux for doing both inter- and intra-process locking? The following are the approaches I could think of (not verified yet), but I would like to know of easier ways:
(1) Use a mutex and a semaphore to serialize access to these lockfile APIs for the critical resources
(2) shm_open a shared memory segment, mmap it in the different processes, and place a semaphore/mutex inside it to lock the critical resources (a rough sketch of this is shown below)
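For reference, this is the rough (unverified) shape I have in mind for option (2); the segment name /counter_shm and the struct layout are just placeholders, and a real version would need a handshake so nobody locks the mutex before the creator has initialized it:
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Layout of the shared segment: one process-shared mutex plus the counter. */
struct shared_data {
    pthread_mutex_t lock;
    int counter;
};

int main(void)
{
    /* First process creates the segment; later processes just open it.
       Error handling is omitted for brevity. */
    int fd = shm_open("/counter_shm", O_CREAT | O_EXCL | O_RDWR, 0666);
    int creator = (fd != -1);
    if (!creator)
        fd = shm_open("/counter_shm", O_RDWR, 0666);
    ftruncate(fd, sizeof(struct shared_data));

    struct shared_data *shared = (struct shared_data *)mmap(NULL, sizeof(*shared),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (creator) {
        /* The mutex must be marked process-shared before anyone else uses it. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&shared->lock, &attr);
        pthread_mutexattr_destroy(&attr);
        shared->counter = 0;
    }

    /* Critical section: serializes threads of this process and other processes. */
    pthread_mutex_lock(&shared->lock);
    shared->counter++;
    printf("counter is %d\n", shared->counter);
    pthread_mutex_unlock(&shared->lock);

    munmap(shared, sizeof(*shared));
    close(fd);
    return 0;
}
(Compile with -pthread, and -lrt on older glibc for shm_open.)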
Thanks in advance:)

Related

Problem with multi-threading and waiting on events

I have a problem with my code:
#define _CRT_SECURE_NO_WARNINGS
#include <iostream>
#include <windows.h>
#include <string.h>
#include <math.h>

HANDLE event;
HANDLE mutex;
int runner = 0;

DWORD WINAPI thread_fun(LPVOID lpParam) {
    int* data = (int*)lpParam;
    for (int j = 0; j < 4; j++) { // this loop necessary in order to reproduce the issue
        if ((data[2] + 1) == data[0]) { // if it is last thread
            while (1) {
                WaitForSingleObject(mutex, INFINITE);
                if (runner == data[0] - 1) { // if all other thread reach event break
                    ReleaseMutex(mutex);
                    break;
                }
                printf("Run:%d\n", runner);
                ReleaseMutex(mutex);
                Sleep(10);
            }
            printf("Check Done:<<%d>>\n", data[2]);
            runner = 0;
            PulseEvent(event); // let all other threads continue
        }
        else { // if it is not last thread
            WaitForSingleObject(mutex, INFINITE);
            runner++;
            ReleaseMutex(mutex);
            printf("Wait:<<%d>>\n", data[2]);
            WaitForSingleObject(event, INFINITE); // wait till all other threads reach this stage
            printf("Exit:<<%d>>\n", data[2]);
        }
    }
    return 0;
}

int main()
{
    event = CreateEvent(NULL, TRUE, FALSE, NULL);
    mutex = CreateMutex(NULL, FALSE, NULL);
    SetEvent(event);
    int data[3] = { 2, 8 }; //0 amount of threads //1 amount of numbers
    HANDLE t[10000];
    int ThreadData[1000][3];
    for (int i = 0; i < data[0]; i++) {
        memcpy(ThreadData[i], data, sizeof(int) * 2); // copy amount of threads and amount of numbers to the threads data
        ThreadData[i][2] = i; // create thread id
        LPVOID ThreadsData = (LPVOID)(&ThreadData[i]);
        t[i] = CreateThread(0, 0, thread_fun, ThreadsData, 0, NULL);
        if (t[i] == NULL) return 0;
    }
    while (1) {
        DWORD res = WaitForMultipleObjects(data[0], t, true, 1000);
        if (res != WAIT_TIMEOUT) break;
    }
    for (int i = 0; i < data[0]; i++) CloseHandle(t[i]); // close all threads
    CloseHandle(event); // close event
    CloseHandle(mutex); // close mutex
    printf("Done");
}
The main idea is to wait until all threads except one reach the event and wait there; meanwhile, the last thread must release them from waiting.
But the code doesn't work reliably. About 1 time in 10 it ends correctly, and the other 9 times it just gets stuck in the while(1) loop. On different tries, the printf inside the loop (printf("Run:%d\n", runner);) prints different values of runner (0 and 3).
What can be the problem?
As we found out in the comments section, the problem was that although the event was created in the initial state of being non-signalled
event = CreateEvent(NULL, TRUE, FALSE, NULL);
it was being set to the signalled state immediately afterwards:
SetEvent(event);
Due to this, at least on the first iteration of the loop, when j == 0, the first worker thread wouldn't wait for the second worker thread, which caused a race condition.
Also, the following issues with your code are worth mentioning (although these issues were not the reason for your problem):
According to the Microsoft documentation on PulseEvent, that function should not be used, as it can be unreliable and is mainly provided for backward-compatibility. According to the documentation, you should use condition variables instead.
In your function thread_fun, the last thread is locking and releasing the mutex in a loop. This can be bad, because mutexes are not guaranteed to be fair and it is possible that this will cause other threads to never be able to acquire the mutex. Although this possibility is mitigated by you calling Sleep(10); once in every loop iteration, it is still not the ideal solution. A better solution would be to use a condition variable, so that the thread only checks for changes of the variable runner when another thread actually signals a possible change. Such a solution would also be better for performance reasons.
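To make that concrete, here is a minimal, stand-alone sketch of the condition-variable approach (this is not your original code; runner_lock, runner_cv, WORKERS and the two thread functions are invented names). Each waiting thread announces itself and then sleeps; the releasing thread only runs when runner actually changes:
#include <windows.h>
#include <stdio.h>

CRITICAL_SECTION runner_lock;   // protects runner
CONDITION_VARIABLE runner_cv;   // signalled whenever runner changes
int runner = 0;
const int WORKERS = 3;          // waiting threads; one extra thread releases them

DWORD WINAPI worker_wait(LPVOID param)
{
    EnterCriticalSection(&runner_lock);
    runner++;
    WakeAllConditionVariable(&runner_cv);        // tell the releaser that runner changed
    while (runner != 0)                          // wait until the releaser resets runner
        SleepConditionVariableCS(&runner_cv, &runner_lock, INFINITE);
    LeaveCriticalSection(&runner_lock);
    printf("worker %d released\n", (int)(INT_PTR)param);
    return 0;
}

DWORD WINAPI releaser(LPVOID param)
{
    EnterCriticalSection(&runner_lock);
    while (runner != WORKERS)                    // sleep until all workers have arrived
        SleepConditionVariableCS(&runner_cv, &runner_lock, INFINITE);
    runner = 0;                                  // reset and release everybody
    WakeAllConditionVariable(&runner_cv);
    LeaveCriticalSection(&runner_lock);
    printf("releaser done\n");
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&runner_lock);
    InitializeConditionVariable(&runner_cv);

    HANDLE threads[WORKERS + 1];
    for (int i = 0; i < WORKERS; i++)
        threads[i] = CreateThread(NULL, 0, worker_wait, (LPVOID)(INT_PTR)i, 0, NULL);
    threads[WORKERS] = CreateThread(NULL, 0, releaser, NULL, 0, NULL);

    WaitForMultipleObjects(WORKERS + 1, threads, TRUE, INFINITE);
    for (int i = 0; i < WORKERS + 1; i++)
        CloseHandle(threads[i]);
    DeleteCriticalSection(&runner_lock);
    return 0;
}
Unlike the mutex-polling loop, the threads here never spin: they are woken only when the shared state they care about has been changed.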

Cuda: how to reset GPU after "sticky" error? [duplicate]

I have a working app which uses CUDA / C++, but sometimes, because of memory leaks, it throws an exception. I need to be able to reset the GPU live; my app is a server, so it has to stay available.
I tried something like this, but it doesn't seem to work:
try
{
    // do process using GPU
}
catch (std::exception &e)
{
    // catching exception from cuda only
    cudaSetDevice(0);
    CUDA_RETURN_(cudaDeviceReset());
}
My idea is to reset the device each time I get an exception from the GPU, but I cannot manage to make it work. :(
Btw, for various reasons I cannot fix every problem in my CUDA code, so I need a temporary solution. Thanks!
The only method to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that initiated (i.e. issued the CUDA runtime API calls that led to) the error.
Therefore, for a single-process application, the only method is to terminate the application.
It should be possible to design a multi-process application where the initial ("parent") process makes no use of CUDA whatsoever and spawns a child process that uses the GPU. When the child process encounters an unrecoverable CUDA error, it must terminate.
The parent process can, optionally, monitor the child process. If it determines that the child process has terminated, it can re-spawn the process and restore CUDA functional behavior.
Sticky vs. non-sticky errors are covered elsewhere, such as here.
An example of a proper multi-process app that uses e.g. fork() to spawn a child process that uses CUDA is available in the CUDA sample code simpleIPC. Here is a rough example assembled from the simpleIPC example (for linux):
$ cat t477.cu
/*
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
// Includes
#include <stdio.h>
#include <assert.h>
// CUDA runtime includes
#include <cuda_runtime_api.h>
// CUDA utilities and system includes
#include <helper_cuda.h>
#define MAX_DEVICES 1
#define PROCESSES_PER_DEVICE 1
#define DATA_BUF_SIZE 4096
#ifdef __linux
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/version.h>
typedef struct ipcDevices_st
{
    int count;
    int results[MAX_DEVICES];
} ipcDevices_t;

// CUDA Kernel
__global__ void simpleKernel(int *dst, int *src, int num)
{
    // Dummy kernel
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dst[idx] = src[idx] / num;
}

void runTest(int index, ipcDevices_t* s_devices)
{
    if (s_devices->results[0] == 0) {
        simpleKernel<<<1,1>>>(NULL, NULL, 1); // make a fault
        cudaDeviceSynchronize();
        s_devices->results[0] = 1;
    }
    else {
        int *d, *s;
        int n = 1;
        cudaMalloc(&d, n*sizeof(int));
        cudaMalloc(&s, n*sizeof(int));
        simpleKernel<<<1,1>>>(d, s, n);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            s_devices->results[0] = 0;
        else
            s_devices->results[0] = 2;
    }
    cudaDeviceReset();
}
#endif
int main(int argc, char **argv)
{
    ipcDevices_t *s_devices = (ipcDevices_t *) mmap(NULL, sizeof(*s_devices),
                              PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
    assert(MAP_FAILED != s_devices);

    // We can't initialize CUDA before fork() so we need to spawn a new process
    s_devices->count = 1;
    s_devices->results[0] = 0;
    printf("\nSpawning child process\n");
    int index = 0;
    pid_t pid = fork();
    printf("> Process %3d\n", pid);
    if (pid == 0) { // child process
        // launch our test
        runTest(index, s_devices);
    }
    // Cleanup and shutdown
    else { // parent process
        int status;
        waitpid(pid, &status, 0);
        if (s_devices->results[0] < 2) {
            printf("first process launch reported error: %d\n", s_devices->results[0]);
            printf("respawn\n");
            pid_t newpid = fork();
            if (newpid == 0) { // child process
                // launch our test
                runTest(index, s_devices);
            }
            // Cleanup and shutdown
            else { // parent process
                int status;
                waitpid(newpid, &status, 0);
                if (s_devices->results[0] < 2)
                    printf("second process launch reported error: %d\n", s_devices->results[0]);
                else
                    printf("second process launch successful\n");
            }
        }
    }
    printf("\nShutting down...\n");
    exit(EXIT_SUCCESS);
}
$ nvcc -I/usr/local/cuda/samples/common/inc t477.cu -o t477
$ ./t477
Spawning child process
> Process 10841
> Process 0
Shutting down...
first process launch reported error: 1
respawn
Shutting down...
second process launch successful
Shutting down...
$
For Windows, the only changes needed should be to use a Windows IPC mechanism for host interprocess communication (see the sketch below).
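Purely as an illustration, and not taken from the simpleIPC sample, the Linux mmap/fork/waitpid pieces could map to a named file mapping plus CreateProcess, roughly like this (the mapping name and the "child" command-line convention are invented for the sketch; the child would do the runTest-style CUDA work):
#include <windows.h>
#include <stdio.h>

// Shared status word: 0 = not run, 1 = CUDA fault seen, 2 = success.
static volatile LONG *open_status(void)
{
    HANDLE map = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                    0, sizeof(LONG), "Local\\gpu_worker_status");
    return (volatile LONG *)MapViewOfFile(map, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(LONG));
}

// Launch ourselves again with a "child" argument and wait for the worker to exit.
static void run_child(const char *exe)
{
    char cmdline[MAX_PATH + 16];
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    sprintf(cmdline, "\"%s\" child", exe);
    if (CreateProcessA(NULL, cmdline, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
}

int main(int argc, char **argv)
{
    volatile LONG *status = open_status();
    if (argc > 1) {          // child: would run the CUDA work here and report the result
        *status = 2;         // 2 on success, 0/1 on a CUDA error, as in the Linux example
        return 0;
    }
    run_child(argv[0]);      // parent: first attempt
    if (*status < 2)         // child hit a sticky error: respawn once
        run_child(argv[0]);
    printf("final status: %ld\n", *status);
    return 0;
}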

Understanding unix child processes that use semaphore and shared memory

I'm going to do my best to ask this question with the understanding that I have.
I'm doing a programming assignment (let's just get that out of the way now) that uses C or C++ on a Unix server to fork four children and use semaphore and shared memory to update a global variable. I'm not sure I have an issue yet, but my lack of understanding has me questioning my structure. Here it is:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/sem.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define NUM_REPEATS 10
#define SEM_KEY 1111
#define SHM_KEY 2222

int globalCounter = 0;

/***** Test function for confirming a process type ******/
int checkProcessType(const char *whoami)
{
    printf("I am a %s. My pid is:%d my ppid is %d\n",
           whoami, getpid(), getppid());
    for (int i = 1; i <= 3; i++) {
        printf("%s counting %d\n", whoami, i);
    }
    return 1;
}

int main(void) {
    pid_t process_id;                 // PID (child or zero)
    int sharedMemID;                  // shared memory ID
    int sharedMemSize;                // shared memory size
    struct my_mem *sharedMemPointer;  // pointer to the attached shared memory

    // Definition of shared memory //
    struct my_mem {
        long counter;
        int parent;
        int child;
    };

    // Gathering size of shared memory in bytes //
    sharedMemSize = sizeof(struct my_mem);
    if (sharedMemSize <= 0) {
        perror("error collecting shared memory size: Exiting...\n");
        exit(0);
    }

    // Creating shared memory //
    sharedMemID = shmget(SHM_KEY, sharedMemSize, 0666 | IPC_CREAT);
    if (sharedMemID < 0) {
        perror("Creating shared memory has failed: Exiting...");
        exit(0);
    }

    // Attaching shared memory //
    sharedMemPointer = (struct my_mem *)shmat(sharedMemID, NULL, 0);
    if (sharedMemPointer == (struct my_mem *) -1) {
        perror("Attaching shared memory has failed. Exiting...\n");
        exit(0);
    }

    // Initializing shared memory //
    sharedMemPointer->counter = 0;
    sharedMemPointer->parent = 0;
    sharedMemPointer->child = 0;

    pid_t adder, reader1, reader2, reader3;

    adder = fork();
    if (adder > 0)
    {
        // In parent
        reader1 = fork();
        if (reader1 > 0)
        {
            // In parent
            reader2 = fork();
            if (reader2 > 0)
            {
                // In parent
                reader3 = fork();
                if (reader3 > 0)
                {
                    // In parent
                }
                else if (reader3 < 0)
                {
                    // Error
                    perror("fork() error");
                }
                else
                {
                    // In reader3
                }
            }
            else if (reader2 < 0)
            {
                // Error
                perror("fork() error");
            }
            else
            {
                // In reader2
            }
        }
        else if (reader1 < 0)
        {
            // Error
            perror("fork() error");
        }
        else
        {
            // In reader1
        }
    }
    else if (adder < 0)
    {
        // Error
        perror("fork() error");
    }
    else
    {
        // In adder
        // LOOP here for global var in critical section
    }
}
Just some info on what I'm doing (I think): I'm creating a chunk of shared memory that will contain a variable, let's call it counter, that will be updated strictly by adder and by the parent, which becomes a subtractor after all child processes are active. I'm still trying to figure out the semaphore code I will be using so that adder and subtractor execute in a critical section, but my main question is this.
How can I know where I am in this structure? My adder should have a loop that does some job (update the global var), and the parent/subtractor should have a loop for its job (also updating the global var). And all the readers can look at any time. Does the loop placement for the parent/subtractor matter? I basically have 3 locations where I know I'll be in the parent. But since all children need to be created first, does it have to be in the last conditional after my third fork, where I know I'm in the parent? When I use my test function I get scattered output, meaning child one can come after the parent's output, then child three, etc. It's never in any particular order, and from what I understand of fork() that's expected.
I really have like three questions going on, but I need to first wrap my head around the structure. So let me just try to say this again concisely without any junk, because I'm hung up on loop and critical-section placement that isn't even written yet.
More directly: when does the parent know that all children exist, and with this structure can one child do a task and somehow come back to it (i.e. the adder/first child adds to the global variable once, exits, and some other child can do its thing, etc.)?
I still feel like I'm not asking the right thing, and I believe this is due to still trying to grasp concepts. Hopefully my stammering will kind of show what I'm stuck on conceptually. If not I can clarify.
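For concreteness, this is the kind of System V semaphore usage I'm considering for the adder/subtractor critical section (untested, just a sketch matching the SEM_KEY and sys/sem.h includes above; the sem_op helper is my own name):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#define SEM_KEY 1111

/* Linux requires the caller to define this union for semctl(SETVAL). */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

static void sem_op(int semid, int delta)
{
    struct sembuf op = { 0, (short)delta, 0 };   /* semaphore #0, +1 or -1, no flags */
    semop(semid, &op, 1);
}

int main(void)
{
    /* One binary semaphore, created and set to 1 before fork(). */
    int semid = semget(SEM_KEY, 1, 0666 | IPC_CREAT);
    union semun arg; arg.val = 1;
    semctl(semid, 0, SETVAL, arg);

    pid_t pid = fork();
    for (int i = 0; i < 5; i++) {
        sem_op(semid, -1);                       /* enter critical section */
        /* ... update the shared-memory counter here ... */
        printf("%s holds the semaphore\n", pid == 0 ? "child" : "parent");
        sem_op(semid, +1);                       /* leave critical section */
        usleep(1000);
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0);
        semctl(semid, 0, IPC_RMID);              /* clean up the semaphore set */
    }
    return 0;
}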

Port program that uses CreateEvent and WaitForMultipleObjects to Linux

I need to port a multiprocess application that uses the Windows API functions SetEvent, CreateEvent and WaitForMultipleObjects to Linux. I have found many threads concerning this issue, but none of them provided a reasonable solution for my problem.
I have an application that forks into three processes and manages a thread worker pool in one of the processes via these events.
I had multiple candidate solutions for this issue. One was to create FIFO special files on Linux using mkfifo and use a select statement to wake the threads. The problem is that this solution behaves differently from WaitForMultipleObjects. For example, if 10 threads of the worker pool are waiting for the event and I call SetEvent five times, exactly five worker threads will wake up and do the work; with the FIFO variant on Linux, it would wake every thread that is sitting in the select statement waiting for data to be put in the FIFO. The best way to describe this is that the Windows API kind of works like a global semaphore with a count of one.
I also thought about using pthreads and condition variables to recreate this and share the variables via shared memory (shm_open and mmap), but I run into the same issue here!
What would be a reasonable way to recreate this behaviour on Linux? I found some solutions doing this inside of a single process, but what about doing this with between multiple processes?
Any ideas are appreciated (Note: I do not expect a full implementation, I just need some more ideas to get myself started with this problem).
You could use a semaphore (sem_init); they work in shared memory. There are also named semaphores (sem_open) if you want to initialize them from different processes. If you need to exchange messages with the workers, e.g. to pass the actual tasks to them, then one way to resolve this is to use POSIX message queues. They are named and work across processes. Here's a short example. Note that only the first worker thread actually initializes the message queue; the others use the attributes of the existing one. Also, it remains persistent until explicitly removed using mq_unlink, which I skipped here for simplicity.
Receiver with worker threads:
// Link with -lrt -pthread
#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void *receiver_thread(void *param) {
    struct mq_attr mq_attrs = { 0, 10, 254, 0 };
    mqd_t mq = mq_open("/myqueue", O_RDONLY | O_CREAT, 00644, &mq_attrs);
    if(mq < 0) {
        perror("mq_open");
        return NULL;
    }
    char msg_buf[255];
    unsigned prio;
    while(1) {
        ssize_t msg_len = mq_receive(mq, msg_buf, sizeof(msg_buf), &prio);
        if(msg_len < 0) {
            perror("mq_receive");
            break;
        }
        msg_buf[msg_len] = 0;
        printf("[%lu] Received: %s\n", pthread_self(), msg_buf);
        sleep(2);
    }
    return NULL;
}

int main() {
    pthread_t workers[5];
    for(int i=0; i<5; i++) {
        pthread_create(&workers[i], NULL, &receiver_thread, NULL);
    }
    getchar();
}
Sender:
#include <fcntl.h>
#include <stdio.h>
#include <mqueue.h>
#include <unistd.h>

int main() {
    mqd_t mq = mq_open("/myqueue", O_WRONLY);
    if(mq < 0) {
        perror("mq_open");
    }
    char msg_buf[255];
    unsigned prio;
    for(int i=0; i<255; i++) {
        int msg_len = sprintf(msg_buf, "Message #%d", i);
        mq_send(mq, msg_buf, msg_len, 0);
        sleep(1);
    }
}
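If you only need the counting "SetEvent five times wakes exactly five waiters" behaviour from the question, a named POSIX semaphore by itself may already be enough. A rough sketch (the name /myevent is made up; link with -pthread, and -lrt on older glibc):
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Both the "setter" process and the worker processes open the same named semaphore. */
    sem_t *sem = sem_open("/myevent", O_CREAT, 0644, 0);
    if (sem == SEM_FAILED) {
        perror("sem_open");
        return 1;
    }

    if (argc > 1) {        /* e.g. "./a.out post" acts like SetEvent */
        sem_post(sem);     /* each post wakes exactly one waiter */
    } else {               /* worker: blocks until someone posts */
        sem_wait(sem);
        printf("worker released\n");
    }

    sem_close(sem);
    return 0;
}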

C++ windows threading and mutex issue

I am a bit rusty with threaded programs, especially on Windows.
I have created a simple mex file in Matlab that is meant to read a number of files, with each file being read in its own thread.
The file doesn't do anything really useful but is a precursor to a more complicated version that will use all of the functionality I've put into this file.
Here is the code:
#include <windows.h>
#include "mex.h"
#include <fstream>

typedef unsigned char uchar;
typedef unsigned int uint;

using namespace std;

int N;
int nThreads;
const int BLOCKSIZE = 1024;
char * buffer;
char * out;
HANDLE hIOMutex;

DWORD WINAPI runThread(LPVOID argPos) {
    int pos = *(reinterpret_cast<int*>(argPos));
    DWORD dwWaitResult = WaitForSingleObject( hIOMutex, INFINITE );
    if (dwWaitResult == WAIT_OBJECT_0){
        char buf[20];
        sprintf(buf, "test%i.dat", pos);
        ifstream ifs(buf, ios::binary);
        if (!ifs.fail()) {
            mexPrintf("Running thread:%i\n", pos);
            for (int i=0; i<N/BLOCKSIZE; i++) {
                if (ifs.eof()){
                    mexPrintf("File %s exited at i=%i\n", buf, (i-1)*BLOCKSIZE);
                    break;
                }
                ifs.read(&buffer[pos*BLOCKSIZE], BLOCKSIZE);
            }
        }
        else {
            mexPrintf("Could not open file %s\n", buf);
        }
        ifs.close();
        ReleaseMutex( hIOMutex);
    }
    else
        mexPrintf("The Mutex failed in thread:%i \n", pos);
    return TRUE;
}

// 0 - N is data size
// 1 - nThreads is number of threads
// 2 - this is the output array
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) {
    N = mxGetScalar(prhs[0]);
    nThreads = mxGetScalar(prhs[1]);
    out = (char*)mxGetData(prhs[2]);

    buffer = (char*)malloc(BLOCKSIZE*nThreads);
    hIOMutex = CreateMutex(NULL, FALSE, NULL);

    HANDLE *hArr = (HANDLE*)malloc(sizeof(HANDLE)*nThreads);
    int *tInd = (int*)malloc(sizeof(int)*nThreads);
    for (int i=0; i<nThreads; i++){
        tInd[i] = i;
        hArr[i] = CreateThread( NULL, 0, runThread, &tInd[i], 0, NULL);
        if (!hArr[i]) {
            mexPrintf("Failed to start thread:%i\n", i);
            break;
        }
    }

    WaitForMultipleObjects( nThreads, hArr, TRUE, INFINITE);
    for (int i=0; i<nThreads; i++)
        CloseHandle(hArr[i]);
    CloseHandle(hIOMutex);

    mexEvalString("drawnow");
    mexPrintf("Finished all threads.\n");

    free(hArr);
    free(tInd);
    free(buffer);
}
I compile it like this in Matlab:
mex readFile.cpp
And then run it like this:
out = zeros(1024*1024,1,'uint8');
readFile(1024*1024,nFiles,out);
The problem is that when I set nFiles to be less than or equal to 64 everything works as expected and I get the following output:
Running thread:0
.
.
.
Running thread:62
Running thread:63
Finished all threads.
However when I set nFiles to 65 or larger I get:
Running thread:0
Running thread:1
Running thread:2
Running thread:3
The Mutex failed in thread:59
The Mutex failed in thread:60
The Mutex failed in thread:61
.
.
.
(up to nFiles-1)
Finished all threads.
I have also tested it without threading and it works fine.
I cannot see what I'm doing wrong or why the cutoff for using the mutex would be so arbitrary, so I am assuming there is something I am not taking into account.
Can anyone see where I have a blatant mistake relating to the error I'm seeing?
From the documentation for WaitForMultipleObjects: "The maximum number of object handles is MAXIMUM_WAIT_OBJECTS", which is 64.
This is also (almost) a duplicate of this thread. The summary is really just that yes, the limit is 64, and also to use the information in the remarks section of WaitForMultipleObjects to build up a tree of threads to wait on.
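For completeness, since your code waits with bWaitAll == TRUE anyway, a simpler alternative to the tree-of-waiter-threads idea is to wait in batches of at most MAXIMUM_WAIT_OBJECTS handles. A sketch of such a drop-in helper (the name waitForAll is made up) that could replace the single WaitForMultipleObjects call in mexFunction:
#include <windows.h>

// Wait for all `count` handles in batches of MAXIMUM_WAIT_OBJECTS (64).
// Because every handle must be signalled anyway, waiting batch by batch
// is equivalent to one big wait-all call.
void waitForAll(HANDLE *handles, int count)
{
    int waited = 0;
    while (waited < count) {
        int batch = count - waited;
        if (batch > MAXIMUM_WAIT_OBJECTS)
            batch = MAXIMUM_WAIT_OBJECTS;
        WaitForMultipleObjects(batch, handles + waited, TRUE, INFINITE);
        waited += batch;
    }
}
That also explains the "arbitrary" cutoff you observed: with more than 64 handles the single wait call fails immediately, mexFunction closes hIOMutex while threads are still running, and the later WaitForSingleObject calls on the closed mutex then report failures.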