What is the relationship between ulimit -s <value> and the stack size (at thread level) in the Linux implementation (or for that matter any OS)?
Is the claim that <number of threads> * <each thread stack size> must be less than <stack size assigned by ulimit command> a valid one?
In the program below, each thread allocates char[PTHREAD_STACK_MIN] and 10 threads are created. But when the ulimit is set to 10 * PTHREAD_STACK_MIN, the program does not abort and dump core. For some arbitrary value of the stack size (much less than 10 * PTHREAD_STACK_MIN), it does dump core. Why?
My understanding is that the stack size represents the total stack occupied by all the threads of the process.
Thread Function
#include <cstdio>
#include <error.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/resource.h>
using namespace std;
#include <pthread.h>
#include <bits/local_lim.h>
const unsigned int nrOfThreads = 10;
pthread_t ntid[nrOfThreads];
void* thr_fn(void* argv)
{
size_t _stackSz;
pthread_attr_t _attr;
int err;
err = pthread_attr_getstacksize(&_attr,&_stackSz);
if( 0 != err)
{
perror("pthread_getstacksize");
}
printf("Stack size - %lu, Thread ID - %llu, Process Id - %llu \n", static_cast<long unsigned int> (_stackSz), static_cast<long long unsigned int> (pthread_self()), static_cast<long long unsigned int> (getpid()) );
//check the stack size by actual allocation - equal to 1 + PTHREAD_STACK_MIN
char a[PTHREAD_STACK_MIN ] = {'0'};
struct timeval tm;
tm.tv_sec = 1;
while (1)
select(0,0,0,0,&tm);
return ( (void*) NULL);
}
Main Function
int main(int argc, char *argv[])
{
struct rlimit rlim;
int err;
err = getrlimit(RLIMIT_STACK,&rlim);
if( 0 != err)
{
perror("getrlimit");
return -1;
}
printf("Stacksize hard limit - %lu, Softlimit - %lu\n", static_cast <long unsigned int> (rlim.rlim_max) ,
static_cast <long unsigned int> (rlim.rlim_cur));
for(unsigned int j = 0; j < nrOfThreads; j++)
{
err = pthread_create(&ntid[j],NULL,thr_fn,NULL);
if( 0 != err)
{
perror("pthread_create ");
return -1;
}
}
for(unsigned int j = 0; j < nrOfThreads; j++)
{
err = pthread_join(ntid[j],NULL);
if( 0 != err)
{
perror("pthread_join ");
return -1;
}
}
perror("Join thread success");
return 0;
}
PS:
I am using Ubuntu 10.04 LTS, with the specification below.
Linux laptop 2.6.32-26-generic #48-Ubuntu SMP Wed Nov 24 10:14:11 UTC 2010 x86_64 GNU/Linux
On UNIX/Linux, getrlimit(RLIMIT_STACK) is only guaranteed to give the size of the main thread's stack. The OpenGroup's reference is explicit on that, "initial thread's stack":
http://www.opengroup.org/onlinepubs/009695399/functions/getrlimit.html
For Linux, there's a reference which indicates that RLIMIT_STACK is what will be used by default for any thread stack (for NPTL threading):
http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_create.3.html
Generally, since the programmer can decide (by using nonstandard attributes when creating the thread) where to put the stack and/or how much stack to use for a new thread, there is no such thing as a "cumulative process stack limit". It rather comes out of the total RLIMIT_AS address space size.
But you do have a limit on the number of threads you can create, sysconf(_SC_THREAD_THREADS_MAX), and a lower limit on the size a thread stack must have, sysconf(_SC_THREAD_STACK_MIN).
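As a minimal sketch of querying those two limits: the sysconf names are `_SC_THREAD_THREADS_MAX` and `_SC_THREAD_STACK_MIN`. The helper names `thread_limit` and `print_thread_limits` are made up for this illustration.

```cpp
#include <unistd.h>   // sysconf and the _SC_* names
#include <cstdio>

// Returns the value of a sysconf limit, or -1 if the system reports
// no definite limit (or does not recognise the name).
long thread_limit(int name)
{
    return sysconf(name);
}

// Prints both thread-related limits discussed above.
void print_thread_limits()
{
    std::printf("minimum thread stack size: %ld\n",
                thread_limit(_SC_THREAD_STACK_MIN));
    std::printf("maximum threads per process: %ld (-1 means no fixed limit)\n",
                thread_limit(_SC_THREAD_THREADS_MAX));
}
```

Calling `print_thread_limits()` from main shows what your system reports; a result of -1 for the thread count only means there is no fixed limit, not that an unlimited number of threads will actually fit.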
Also, you can query the default stacksize for new threads:
pthread_attr_t attr;
size_t stacksize;
if (!pthread_attr_init(&attr) && !pthread_attr_getstacksize(&attr, &stacksize))
    printf("default stacksize for a new thread: %zu\n", stacksize);
I.e. default-initialize a set of pthread attributes and ask for what stacksize the system gave you.
In a threaded program, the stacks of all threads except the initial one are allocated dynamically (glibc maps them with mmap) rather than carved out of the main stack, so RLIMIT_STACK has little or no bearing on how much stack space in total you can use for your threads.
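To illustrate that the per-thread stack size is chosen through the attribute object rather than through the process-wide limit, here is a minimal sketch. The names `run_with_stack` and `touch_stack` and the 256 KiB buffer size are assumptions made for this example only.

```cpp
#include <pthread.h>
#include <cstddef>

namespace {
const std::size_t kBufferBytes = 256 * 1024;   // illustrative size only

// Thread body: touches a sizeable on-stack buffer, then reports success.
void* touch_stack(void* ok)
{
    volatile char buffer[kBufferBytes];
    for (std::size_t i = 0; i < kBufferBytes; i += 4096) buffer[i] = 1;
    *static_cast<bool*>(ok) = true;
    return nullptr;
}
}

// Creates one thread whose stack is stack_bytes large and waits for it.
// Returns true if the thread was created and ran to completion.
bool run_with_stack(std::size_t stack_bytes)
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return false;
    bool ok = false;
    pthread_t tid;
    int err = pthread_attr_setstacksize(&attr, stack_bytes);
    if (err == 0) err = pthread_create(&tid, &attr, touch_stack, &ok);
    pthread_attr_destroy(&attr);
    if (err != 0) return false;
    return pthread_join(tid, nullptr) == 0 && ok;
}
```

For example, `run_with_stack(1024 * 1024)` gives the thread a 1 MiB stack, which comfortably holds the 256 KiB buffer. The requested size must leave room for the buffer plus whatever the runtime itself needs.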
Related
I am writing a piece of code to demonstrate multi-threaded writes to shared memory.
However, my code gets a strange 0xffffffff pointer and I can't work out why. I haven't written C++ for a while, so please let me know if I got something wrong.
I compile with the command:
g++ --std=c++11 shared_mem_multi_write.cpp -lpthread -g
I get output like this:
function base_ptr: 0x5eebff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0xffffffffffffffff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0xbdd7ff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x23987ff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x11cc3ff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x17bafff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x1da9bff, src_ptr: 0x7f21a9c4e010, size: 6220800
Segmentation fault (core dumped)
My OS is CentOS Linux release 7.6.1810 (Core) with gcc version 4.8.5, and the code is posted below:
#include <chrono>
#include <cstdio>
#include <cstring>
#include <functional>
#include <iostream>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <thread>
#include <vector>
#include <memory>
const size_t THREAD_CNT = 40;
const size_t FRAME_SIZE = 1920 * 1080 * 3;
const size_t SEG_SIZE = FRAME_SIZE * THREAD_CNT;
void func(char *base_ptr, char *src_ptr, size_t size)
{
printf("function base_ptr: %p, src_ptr: %p, size: %u\n", base_ptr, src_ptr, size);
while (1)
{
auto now = std::chrono::system_clock::now();
memcpy(base_ptr, src_ptr, size);
std::chrono::system_clock::time_point next_ts =
now + std::chrono::milliseconds(42); // 24 frame per seconds => 42 ms per frame
std::this_thread::sleep_until(next_ts);
}
}
int main(int argc, char **argv)
{
int shmkey = 666;
int shmid;
shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT);
char *src_ptr = new char[FRAME_SIZE];
char *shmpointer = static_cast<char *>(shmat(shmid, nullptr, 0));
std::vector<std::shared_ptr<std::thread>> t_vec;
t_vec.reserve(THREAD_CNT);
for (int i = 0; i < THREAD_CNT; ++i)
{
//t_vec[i] = std::thread(func, i * FRAME_SIZE + shmpointer, src_ptr, FRAME_SIZE);
t_vec[i] = std::make_shared<std::thread>(func, i * FRAME_SIZE + shmpointer, src_ptr, FRAME_SIZE);
}
for (auto &&t : t_vec)
{
t->join();
}
return 0;
}
You forgot to specify the access rights for the created SHM segment (http://man7.org/linux/man-pages/man2/shmget.2.html):
The value shmflg is composed of:
...
In addition to the above flags, the least significant 9 bits of shmflg specify the permissions granted to the owner, group, and others. These bits have the same format, and the same meaning, as the mode argument of open(2). Presently, execute permissions are not used by the system.
Change
shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT);
into
shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT | 0666);
It works for me now: https://wandbox.org/permlink/Am4r2GBvM7kSmpdO
Note that I use only a vector of threads (no shared pointers), as others suggested in the comments. You can reserve its space as well.
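A minimal sketch of the vector-of-threads pattern follows. Note that reserve() only sets the capacity and leaves size() at 0, so elements have to be added with emplace_back (or push_back) rather than assigned through `t_vec[i]`. The function name `square_in_threads` and the toy workload are made up for this illustration.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Fills out[i] with i * i from thread i; each thread writes only its own slot.
std::vector<std::size_t> square_in_threads(std::size_t thread_count)
{
    std::vector<std::size_t> out(thread_count, 0);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);          // capacity only, size() stays 0
    for (std::size_t i = 0; i < thread_count; ++i) {
        // emplace_back constructs the thread in place and grows size()
        workers.emplace_back([&out, i] { out[i] = i * i; });
    }
    for (std::thread& t : workers) t.join();
    return out;
}
```

Because every std::thread lives directly in the vector, no shared_ptr is needed, and joining is a plain loop over the elements.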
You forgot one very important thing: error handling!
Both the shmget and shmat functions can fail. If they fail, shmget returns -1 and shmat returns (void *) -1.
Now if you look at the first base_ptr value, it's 0x5eebff. That is exactly FRAME_SIZE - 1 (FRAME_SIZE is 0x5eec00), in other words -1 + 1 * FRAME_SIZE. That means shmat did return -1, i.e. it failed.
Since you keep on using this erroneous value, all bets are off.
You need to check for errors, and if one happens, print the value of errno to find out what has gone wrong:
void* ptr = shmat(shmid, nullptr, 0);
if (ptr == (void*) -1)
{
    std::cout << "Error getting shared memory: " << std::strerror(errno) << '\n';
    return EXIT_FAILURE;
}
Do something similar for shmget.
Now it's also easy to understand the 0xffffffffffffffff value. It's the two's complement hexadecimal notation for -1, and it's passed to the first thread that is created.
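Putting the permission bits and the error checks together, a minimal sketch could look like the following. It uses IPC_PRIVATE and mode 0600 instead of the question's key 666, purely as assumptions for a self-contained example; `shm_roundtrip` is a hypothetical helper name.

```cpp
#include <sys/ipc.h>
#include <sys/shm.h>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Creates a private segment, writes to it, reads it back, and cleans up.
// Returns true only if every step succeeded. bytes must be non-zero.
bool shm_roundtrip(std::size_t bytes)
{
    // IPC_PRIVATE (an assumption of this sketch) avoids key clashes;
    // 0600 grants read/write permission to the owner only.
    int shmid = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (shmid == -1) {
        std::fprintf(stderr, "shmget: %s\n", std::strerror(errno));
        return false;
    }
    void* raw = shmat(shmid, nullptr, 0);
    if (raw == reinterpret_cast<void*>(-1)) {
        std::fprintf(stderr, "shmat: %s\n", std::strerror(errno));
        shmctl(shmid, IPC_RMID, nullptr);
        return false;
    }
    char* p = static_cast<char*>(raw);
    std::memset(p, 0x5a, bytes);
    bool ok = (p[0] == 0x5a && p[bytes - 1] == 0x5a);
    shmdt(raw);
    shmctl(shmid, IPC_RMID, nullptr);   // mark the segment for removal
    return ok;
}
```

With both calls checked, a failure shows up as a readable message instead of a bogus pointer such as 0xffffffffffffffff being handed to the worker threads.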
I need some help with pthreads. I want to keep over 1000 threads running at any time, something like a thread pool. Here is the code:
/*
gcc -o test2 test2.cpp -static -lpthread -lstdc++
*/
#include <iostream>
#include <cstdlib>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cstring>
#include <stdexcept>
#include <cstdlib>
int NUM_THREADS = 2000;
int MAX_THREADS = 100;
int THREADSTACK = 65536;
struct thread_struct{
int arg1;
int arg2;
};
pthread_mutex_t mutex_;
static unsigned int thread_count = 0;
string exec(const char* cmd)
{
int DEBUG=0;
char buffer[5000];
string result = "";
FILE* pipe = popen(cmd, "r");
if (!pipe && DEBUG) throw runtime_error("popen() failed!");
try
{
while (!feof(pipe))
{
if (fgets(buffer, 128, pipe) != NULL)
{
result += buffer;
}
}
}
catch(...)
{
pclose(pipe);
throw;
}
pclose(pipe);
return result;
}
void *thread_test(void *arguments)
{
pthread_mutex_lock(&mutex_);
thread_count++;
pthread_mutex_unlock(&mutex_);
// long tid;
// tid = (long)threadid;
struct thread_struct *args = (thread_struct*)arguments;
/*
printf("ARG1=%d\n",args->arg1);
printf("ARG2=%d\n",args->arg2);
*/
int thread_id = (int) args->arg1;
/*
int random_sleep;
random_sleep = rand() % 10 + 1;
printf ("RAND=[%d]\n", random_sleep);
sleep(random_sleep);
*/
int random_sleep;
random_sleep = rand() % 10 + 5;
// printf ("RAND=[%d]\n", random_sleep);
char command[100];
memset(command,0,sizeof(command));
sprintf(command,"sleep %d",random_sleep);
exec(command);
random_sleep = rand() % 100000 + 500000;
usleep(random_sleep);
// simulation of a work between 5 and 10 seconds
// sleep(random_sleep);
// printf("#%d -> sleep=%d total_threads=%u\n",thread_id,random_sleep,thread_count);
pthread_mutex_lock(&mutex_);
thread_count--;
pthread_mutex_unlock(&mutex_);
pthread_exit(NULL);
}
int main()
{
// pthread_t threads[NUM_THREADS];
int rc;
int i;
usleep(10000);
srand ((unsigned)time(NULL));
unsigned int thread_count_now = 0;
pthread_attr_t attrs;
pthread_attr_init(&attrs);
pthread_attr_setstacksize(&attrs, THREADSTACK);
pthread_mutex_init(&mutex_, NULL);
for( i=0; i < NUM_THREADS; i++ )
{
create_thread:
pthread_mutex_lock(&mutex_);
thread_count_now = thread_count;
pthread_mutex_unlock(&mutex_);
// printf("thread_count in for = [%d]\n",thread_count_now);
if(thread_count_now < MAX_THREADS)
{
printf("CREATE thread [%d]\n",i);
struct thread_struct struct1;
struct1.arg1 = i;
struct1.arg2 = 999;
pthread_t temp_thread;
rc = pthread_create(&temp_thread, NULL, &thread_test, (void *)&struct1);
if (rc)
{
printf("Unable to create thread %d\n",rc);
sleep(1);
pthread_detach(temp_thread);
goto create_thread;
}
}
else
{
printf("Thread POOL full %d of %d\n",thread_count_now,MAX_THREADS);
sleep(1);
goto create_thread;
}
}
pthread_attr_destroy(&attrs);
pthread_mutex_destroy(&mutex_);
// pthread_attr_destroy(&attrs);
printf("Proccess completed!\n");
pthread_exit(NULL);
return 1;
}
After spawning 300 threads it begins to give errors: the return code from pthread_create() is 11, and after that it keeps executing them one by one.
What am I doing wrong?
On Linux, error code 11 corresponds to EAGAIN, which the pthread_create man page describes as:
Insufficient resources to create another thread.
A system-imposed limit on the number of threads was encountered.
Hence, to solve your problem, either create fewer threads or wait for running ones to finish before creating new ones.
You can also change the default thread stack size; see pthread_attr_setstacksize.
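Both suggestions can be combined in a small sketch: threads are created in bounded batches with an explicit stack size, and each batch is joined before the next one starts. Two details of the question's code seem relevant here: it passes NULL instead of &attrs to pthread_create, so THREADSTACK is never applied, and successfully created threads are never joined or detached. The function name `run_in_batches` and the trivial task are assumptions for illustration.

```cpp
#include <pthread.h>
#include <cstddef>
#include <vector>

namespace {
// Each task increments its own slot, so no locking is needed.
void* short_task(void* counter)
{
    ++*static_cast<int*>(counter);
    return nullptr;
}
}

// Runs `total` short tasks with at most `batch` threads alive at once.
// Returns how many tasks ran; stops early if no thread can be created.
int run_in_batches(int total, int batch, std::size_t stack_bytes)
{
    if (total <= 0 || batch <= 0) return 0;
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;
    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0) {
        pthread_attr_destroy(&attr);
        return 0;
    }
    std::vector<int> ran(total, 0);
    int started = 0;
    while (started < total) {
        std::vector<pthread_t> tids;
        for (int i = 0; i < batch && started < total; ++i) {
            pthread_t tid;
            // Pass &attr here; passing NULL would ignore the stack size.
            int rc = pthread_create(&tid, &attr, short_task, &ran[started]);
            if (rc != 0) break;        // e.g. EAGAIN: join what we have first
            tids.push_back(tid);
            ++started;
        }
        if (tids.empty()) break;       // could not create even one thread
        // Joining releases the resources held by each finished thread.
        for (pthread_t t : tids) pthread_join(t, nullptr);
    }
    pthread_attr_destroy(&attr);
    int done = 0;
    for (int r : ran) done += r;
    return done;
}
```

For instance, `run_in_batches(2000, 100, 256 * 1024)` mirrors the question's NUM_THREADS and MAX_THREADS with a 256 KiB stack per thread. The stack size is an arbitrary choice for the sketch and has to be large enough for whatever the real task does.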
I have a piece of pthread code listed as the function "thread" here. It basically creates a number of threads (usually 240 on Xeon Phi and 16 on CPU) and then join them.
If I call this thread() only once, it works perfectly on both the CPU and the Xeon Phi. If I call it one more time, it still works fine on the CPU, but on the Xeon Phi pthread_create() reports error 22 (which should be "invalid argument") for every 60th thread.
For example, thread 0, thread 60, thread 120 and so on of the 2nd run of thread(), which are also the 241st, 301st, 361st and so on threads ever created in the process, fail with error 22. But threads 1~59, 61~119, 121~240 and so on work perfectly.
Note that this problem happens only on Xeon Phi.
I have checked the stack sizes, and the argument themselves, but I didn't find the reason for this. The arguments are correct.
void thread()
{
...
int i, rv;
cpu_set_t set;
arg_t args[nthreads];
pthread_t tid[nthreads];
pthread_attr_t attr;
pthread_barrier_t barrier;
rv = pthread_barrier_init(&barrier, NULL, nthreads);
if(rv != 0)
{
printf("Couldn't create the barrier\n");
exit(EXIT_FAILURE);
}
pthread_attr_init(&attr);
for(i = 0; i < nthreads; i++)
{
int cpu_idx = get_cpu_id(i,nthreads);
DEBUGMSG(1, "Assigning thread-%d to CPU-%d\n", i, cpu_idx);
CPU_ZERO(&set);
CPU_SET(cpu_idx, &set);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &set);
args[i].tid = i;
args[i].ht = ht;
args[i].barrier = &barrier;
/* assign part of the relR for next thread */
args[i].relR.num_tuples = (i == (nthreads-1)) ? numR : numRthr;
args[i].relR.tuples = relR->tuples + numRthr * i;
numR -= numRthr;
/* assign part of the relS for next thread */
args[i].relS.num_tuples = (i == (nthreads-1)) ? numS : numSthr;
args[i].relS.tuples = relS->tuples + numSthr * i;
numS -= numSthr;
rv = pthread_create(&tid[i], &attr, npo_thread, (void*)&args[i]);
if (rv)
{
printf("ERROR; return code from pthread_create() is %d\n", rv);
printf ("%d %s\n", args[i].tid, strerror(rv));
//exit(-1);
}
}
for(i = 0; i < nthreads; i++)
{
pthread_join(tid[i], NULL);
/* sum up results */
result += args[i].num_results;
}
}
Here's a minimal example to reproduce your problem and show where your code most likely goes wrong:
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <stdio.h>
void *
foo(void *v)
{
printf("foo\n");
return NULL;
}
int
main(int argc, char **argv)
{
pthread_attr_t attr;
pthread_t thr;
cpu_set_t set;
void *v;
int e;
if (pthread_attr_init(&attr))
err(1, "pthread_attr_init");
CPU_ZERO(&set);
CPU_SET(255, &set);
if (pthread_attr_setaffinity_np(&attr, sizeof(set), &set))
err(1, "pthread_attr_setaffinity_np");
if ((e = pthread_create(&thr, &attr, foo, NULL)))
errx(1, "pthread_create: %d", e);
if (pthread_join(thr, &v))
err(1, "pthread_join");
return 0;
}
As I speculated in the comments to your question, pthread_attr_setaffinity_np doesn't check whether the CPU set is sane. Instead, that error gets caught in pthread_create, which is where your error 22 (EINVAL) comes from. Since the get_cpu_id functions in your code on github are obviously broken, that's where I'd start looking for the problem.
Tested on Linux, but that's where pthread_attr_setaffinity_np comes from, so it's probably a safe assumption.
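One way to catch a bad CPU index before it reaches pthread_create is to compare it with the set of CPUs the process is allowed to run on. This is only a sketch: it relies on the Linux-specific sched_getaffinity call, and `cpu_is_usable` and `first_usable_cpu` are hypothetical helper names.

```cpp
#include <sched.h>     // cpu_set_t, CPU_* macros, sched_getaffinity (GNU/Linux)

// Returns true if `cpu` is a CPU this process is currently allowed to run on.
bool cpu_is_usable(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return false;
    return CPU_ISSET(cpu, &allowed) != 0;
}

// Returns the lowest usable CPU index, or -1 if none was found.
int first_usable_cpu()
{
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (cpu_is_usable(cpu)) return cpu;
    return -1;
}
```

Checking `cpu_is_usable(cpu_idx)` right after computing the index in the thread-creation loop would point at the faulty mapping directly, instead of surfacing later as EINVAL.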
I have a problem with the nanosleep() function.
In a test project, it works as expected.
In the real project, it does not: it is as if the sleep time were zero.
As far as I can see, the biggest difference between the test and the real project is the number of threads: one in the test, two in the real one.
Could this be the reason?
If I put the nanosleep call in the code run by one thread, shouldn't that thread pause?
Thank you.
This happened to me too, and the problem was that I was setting timespec.tv_nsec to a value beyond 999999999. When you do that, nanosleep does not carry the excess over into tv_sec; it rejects the request and returns immediately, so it looks as if the sleep time were zero. You get no warning unless you check the return value, which is -1 with errno set to EINVAL. Make sure tv_nsec stays in the range 0 to 999999999 and put whole seconds in tv_sec.
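A minimal sketch of keeping tv_nsec in range and of checking the result follows; `to_timespec` and `checked_nanosleep` are names invented for this example.

```cpp
#include <time.h>
#include <cerrno>

// Splits a non-negative nanosecond count into a valid timespec,
// keeping tv_nsec in the range [0, 999999999].
timespec to_timespec(long long total_ns)
{
    timespec ts;
    ts.tv_sec  = static_cast<time_t>(total_ns / 1000000000LL);
    ts.tv_nsec = static_cast<long>(total_ns % 1000000000LL);
    return ts;
}

// Calls nanosleep and returns 0 on success or the errno value on failure.
int checked_nanosleep(time_t sec, long nsec)
{
    timespec request;
    request.tv_sec = sec;
    request.tv_nsec = nsec;
    if (nanosleep(&request, nullptr) == 0) return 0;
    return errno;
}
```

With these helpers, `checked_nanosleep(0, 1000000000L)` returns EINVAL without sleeping, while a request built with `to_timespec(1500000000LL)` has tv_sec 1 and tv_nsec 500000000 and sleeps for 1.5 s.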
On Linux 3.7 rc5+, it certainly works:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
double time_to_double(struct timeval *t)
{
    return t->tv_sec + (t->tv_usec/1000000.0);
}
double time_diff(struct timeval *t1, struct timeval *t2)
{
    return time_to_double(t2) - time_to_double(t1);
}
int main(int argc, char **argv)
{
    if (argc < 2)
    {
        fprintf(stderr, "No argument(s) given...\n");
        exit(1);
    }
    for(int i = 1; i < argc; i++)
    {
        /* each argument is a sleep time in nanoseconds */
        long x = strtol(argv[i], NULL, 0);
        struct timeval t1, t2;
        struct timespec tt, rem;
        /* 1 s = 1000000000 ns; tv_nsec must stay below that */
        tt.tv_sec = x / 1000000000;
        tt.tv_nsec = x % 1000000000;
        gettimeofday(&t1, NULL);
        nanosleep(&tt, &rem);
        gettimeofday(&t2, NULL);
        printf("Time = %16.11f s\n", time_diff(&t1, &t2));
    }
    return 0;
}
Run like this: ./a.out 10000 200000 100000000 2000000000
Gives:
Time = 0.00007009506 s
Time = 0.00026011467 s
Time = 0.10008978844 s
Time = 2.00009107590 s
I'm confused about the performance of my code: with a single thread it takes only 13 s, but with multiple threads it takes 80 s. I don't know whether a vector can only be accessed by one thread at a time; if so, I would probably have to use a struct array to store the data instead of a vector. Could anyone kindly help?
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iterator>
#include <string>
#include <ctime>
#include <bangdb/database.h>
#include "SEQ.h"
#define NUM_THREADS 16
using namespace std;
typedef struct _thread_data_t {
std::vector<FDT> *Query;
unsigned long start;
unsigned long end;
connection* conn;
int thread;
} thread_data_t;
void *thr_func(void *arg) {
thread_data_t *data = (thread_data_t *)arg;
std::vector<FDT> *Query = data->Query;
unsigned long start = data->start;
unsigned long end = data->end;
connection* conn = data->conn;
printf("thread %d started %lu -> %lu\n", data->thread, start, end);
for (unsigned long i=start;i<=end ;i++ )
{
FDT *fout = conn->get(&((*Query).at(i)));
if (fout == NULL)
{
//printf("%s\tNULL\n", s);
}
else
{
printf("Thread:%d\t%s\n", data->thread, fout->data);
}
}
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
if (argc<2)
{
printf("USAGE: ./seq <.txt>\n");
printf("/home/rd/SCRIPTs/12X18610_L5_I052.R1.clean.code.seq\n");
exit(-1);
}
printf("%s\n", argv[1]);
vector<FDT> Query;
FILE* fpin;
if((fpin=fopen(argv[1],"r"))==NULL) {
printf("Can't open Input file %s\n", argv[1]);
return -1;
}
char *key = (char *)malloc(36);
while (fscanf(fpin, "%s", key) != EOF)
{
SEQ * sequence = new SEQ(key);
FDT *fk = new FDT( (void*)sequence, sizeof(*sequence) );
Query.push_back(*fk);
}
unsigned long Querysize = (unsigned long)(Query.size());
std::cout << "myvector stores " << Querysize << " numbers.\n";
//create database, table and connection
database* db = new database((char*)"berrydb");
//get a table, a new one or existing one, walog tells if log is on or off
table* tbl = db->gettable((char*)"hg19", JUSTOPEN);
if(tbl == NULL)
{
printf("ERROR:table NULL error");
exit(-1);
}
//get a new connection
connection* conn = tbl->getconnection();
if(conn == NULL)
{
printf("ERROR:connection NULL error");
exit(-1);
}
cerr<<"begin querying...\n";
time_t begin, end;
double duration;
begin = clock();
unsigned long ThreadDealSize = Querysize/NUM_THREADS;
cerr<<"Querysize:"<<ThreadDealSize<<endl;
pthread_t thr[NUM_THREADS];
int rc;
thread_data_t thr_data[NUM_THREADS];
for (int i=0;i<NUM_THREADS ;i++ )
{
unsigned long ThreadDealStart = ThreadDealSize*i;
unsigned long ThreadDealEnd = ThreadDealSize*(i+1) - 1;
if (i == (NUM_THREADS-1) )
{
ThreadDealEnd = Querysize-1;
}
thr_data[i].conn = conn;
thr_data[i].Query = &Query;
thr_data[i].start = ThreadDealStart;
thr_data[i].end = ThreadDealEnd;
thr_data[i].thread = i;
}
for (int i=0;i<NUM_THREADS ;i++ )
{
if (rc = pthread_create(&thr[i], NULL, thr_func, &thr_data[i]))
{
fprintf(stderr, "error: pthread_create, rc: %d\n", rc);
return EXIT_FAILURE;
}
}
for (int i = 0; i < NUM_THREADS; ++i) {
pthread_join(thr[i], NULL);
}
cerr<<"done\n"<<endl;
end = clock();
duration = double(end - begin) / CLOCKS_PER_SEC;
cerr << "runtime: " << duration << "\n" << endl;
db->closedatabase(OPTIMISTIC);
delete db;
printf("Done\n");
return EXIT_SUCCESS;
}
Like all data structures in the standard library, the methods of vector are reentrant but not fully thread-safe. Different instances can be used by different threads independently, and a single instance can be read by several threads at once, but as soon as one thread modifies an instance you have to make sure no other thread accesses it at the same time. Your threads share one vector but only read from it after it has been filled, so that's not your problem.
What is probably your problem is the printf. printf is thread-safe, meaning you can call it from any number of threads at the same time, but at the cost of being wrapped in mutual exclusion internally.
The majority of the work in the threaded part of your program is done inside printf. So what probably happens is that all the threads are started and quickly get to the printf, where all but the first will stop. When the printf finishes and releases the mutex, the system considers scheduling the threads that were waiting for it. It probably does, so a rather slow context switch happens. And this repeats after every printf.
How exactly it happens depends on which actual locking primitive is being used, which depends on your operating system and standard library versions. The system should each time wake up only the next sleeper, but many implementations actually wake up all of them. So in addition to the printfs being executed in mostly round-robin fashion, incurring one context switch for each, there may be quite a few additional spurious wake-ups in which the thread just finds the lock is held and goes back to sleep.
So the lesson from this is that threads don't make things automagically faster. They only help when:
The thread spends most of its time in blocking system calls. In things like network servers, the threads wait for data from the socket, then for the data for the response to come from disk, and finally for the network to accept the response. In such cases, having many threads helps as long as they are mostly independent.
There are only as many threads as there are hardware threads. Currently the usual number is 4 (either quad-core or dual-core with hyper-threading). More threads can't physically run in parallel, so they provide no gain and incur a bit of overhead. 16 threads is thus overkill.
And they never help when they all manipulate the same objects, so they end up spending most of the time waiting for locks anyway. In addition to any of your own objects that you lock, keep in mind that input and output file handles have to be internally locked as well.
Memory allocation also needs to internally synchronize between threads, but modern allocators have separate pools for threads to avoid much of it; if the default allocator proves to be too slow with many threads, there are some specialized ones you can use.
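One way to take the stdout lock out of the hot path, sketched under the assumption that the output can be held in memory until the workers finish: each thread formats into its own string, and the buffers are printed after the join. The names `format_in_threads` and `suggested_thread_count` are made up for this illustration.

```cpp
#include <string>
#include <thread>
#include <vector>

// Each worker formats its lines into a private string; nothing is shared
// while the threads run, so they never contend for the stdout lock.
std::vector<std::string> format_in_threads(unsigned thread_count,
                                           unsigned lines_per_thread)
{
    std::vector<std::string> buffers(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&buffers, t, lines_per_thread] {
            std::string& out = buffers[t];      // this thread's slot only
            for (unsigned i = 0; i < lines_per_thread; ++i) {
                out += "Thread:" + std::to_string(t) + "\tline " +
                       std::to_string(i) + "\n";
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return buffers;   // the caller can now print each buffer with one call
}

// Picks a worker count no larger than the hardware supports
// (hardware_concurrency() may return 0 when it cannot tell).
unsigned suggested_thread_count()
{
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}
```

After the join, the main thread can write each buffer with a single fputs or fwrite call, so the number of lock acquisitions drops from one per line to one per thread.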