I am writing a piece of code to demonstrate multi-threaded writing to shared memory.
However, my code gets a strange 0xffffffff pointer and I can't work out why. I haven't written C++ code for a while, so please let me know if I've gotten something wrong.
I compile with the command:
g++ --std=c++11 shared_mem_multi_write.cpp -lpthread -g
I get output like this:
function base_ptr: 0x5eebff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0xffffffffffffffff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0xbdd7ff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x23987ff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x11cc3ff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x17bafff, src_ptr: 0x7f21a9c4e010, size: 6220800
function base_ptr: 0x1da9bff, src_ptr: 0x7f21a9c4e010, size: 6220800
Segmentation fault (core dumped)
My OS is CentOS Linux release 7.6.1810 (Core) with gcc version 4.8.5, and the code is posted below:
#include <chrono>
#include <cstdio>
#include <cstring>
#include <functional>
#include <iostream>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <thread>
#include <vector>
#include <memory>
const size_t THREAD_CNT = 40;
const size_t FRAME_SIZE = 1920 * 1080 * 3;
const size_t SEG_SIZE = FRAME_SIZE * THREAD_CNT;
void func(char *base_ptr, char *src_ptr, size_t size)
{
    printf("function base_ptr: %p, src_ptr: %p, size: %u\n", base_ptr, src_ptr, size);
    while (1)
    {
        auto now = std::chrono::system_clock::now();
        memcpy(base_ptr, src_ptr, size);
        std::chrono::system_clock::time_point next_ts =
            now + std::chrono::milliseconds(42); // 24 frames per second => 42 ms per frame
        std::this_thread::sleep_until(next_ts);
    }
}

int main(int argc, char **argv)
{
    int shmkey = 666;
    int shmid;
    shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT);
    char *src_ptr = new char[FRAME_SIZE];
    char *shmpointer = static_cast<char *>(shmat(shmid, nullptr, 0));
    std::vector<std::shared_ptr<std::thread>> t_vec;
    t_vec.reserve(THREAD_CNT);
    for (int i = 0; i < THREAD_CNT; ++i)
    {
        //t_vec[i] = std::thread(func, i * FRAME_SIZE + shmpointer, src_ptr, FRAME_SIZE);
        t_vec[i] = std::make_shared<std::thread>(func, i * FRAME_SIZE + shmpointer, src_ptr, FRAME_SIZE);
    }
    for (auto &&t : t_vec)
    {
        t->join();
    }
    return 0;
}
You forgot to specify access rights for the created SHM segment (http://man7.org/linux/man-pages/man2/shmget.2.html):
The value shmflg is composed of:
...
In addition to the above flags, the least significant 9 bits of shmflg specify the permissions granted to the owner, group, and others. These bits have the same format, and the same meaning, as the mode argument of open(2). Presently, execute permissions are not used by the system.
Change
shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT);
into
shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT | 0666);
It works for me now: https://wandbox.org/permlink/Am4r2GBvM7kSmpdO
Note that I use only a vector of threads (no shared pointers), as others suggested in the comments. You can still reserve its space as well.
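In sketch form, the thread handling then becomes (reusing func, shmpointer, src_ptr, and the constants from the question):

// Plain std::thread objects instead of shared_ptr<std::thread>;
// emplace_back constructs each thread in place.
std::vector<std::thread> t_vec;
t_vec.reserve(THREAD_CNT);
for (size_t i = 0; i < THREAD_CNT; ++i)
    t_vec.emplace_back(func, i * FRAME_SIZE + shmpointer, src_ptr, FRAME_SIZE);
for (auto &t : t_vec)
    t.join();

As a side effect, emplace_back also avoids indexing into a vector that was only reserve()d, never resized, which is what the original t_vec[i] assignment does.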
You forgot one very important thing: error handling!
Both the shmget and shmat functions can fail. If they fail they return the value -1.
Now if you look at the first base_ptr value, it's 0x5eebff. That just happens to be the same as FRAME_SIZE - 1 (FRAME_SIZE is 0x5eec00). That means shmat did return -1 and failed: 0x5eebff is -1 + FRAME_SIZE, the base pointer computed for the thread with i == 1.
Since you keep on using this erroneous value, all bets are off.
You need to check for errors, and if one occurs, print the value of errno to find out what has gone wrong:
void* ptr = shmat(shmid, nullptr, 0);
if (ptr == (void*) -1)
{
    std::cout << "Error getting shared memory: " << std::strerror(errno) << '\n';
    return EXIT_FAILURE;
}
Do something similar for shmget.
Now it's also easy to understand the 0xffffffffffffffff value. It's the two's complement hexadecimal notation for -1, and it's the base pointer passed to the first thread that is created (i == 0).
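For completeness, here is a minimal, self-contained sketch of both checks, reusing the key and segment size from the question (the error messages are just illustrative):

#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <sys/ipc.h>
#include <sys/shm.h>

int main()
{
    const key_t shmkey = 666;                     // same key as in the question
    const size_t SEG_SIZE = 1920 * 1080 * 3 * 40; // same segment size as in the question

    int shmid = shmget(shmkey, SEG_SIZE, IPC_CREAT | 0666);
    if (shmid == -1)
    {
        std::cout << "shmget failed: " << std::strerror(errno) << '\n';
        return EXIT_FAILURE;
    }

    char *ptr = static_cast<char *>(shmat(shmid, nullptr, 0));
    if (ptr == reinterpret_cast<char *>(-1))
    {
        std::cout << "shmat failed: " << std::strerror(errno) << '\n';
        return EXIT_FAILURE;
    }

    shmdt(ptr); // detach again; the segment itself persists until removed
    return 0;
}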
Related
I would like to use a memory mapped file to write data. I am using the following test code on an Ubuntu machine. The code is compiled with g++ -std=c++14 -O3.
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#include <cstdlib>
#include <cstdio>
#include <cassert>
int main(){
    constexpr size_t GB1 = 1 << 30;
    size_t capacity = GB1 * 4;
    size_t numElements = capacity / sizeof(size_t);

    int fd = open("./mmapfile", O_RDWR);
    assert(fd >= 0);

    int error = ftruncate(fd, capacity);
    assert(error == 0);

    void* ptr = mmap(0, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(ptr != MAP_FAILED);

    size_t* data = (size_t*)ptr;
    for(size_t i = 0; i < numElements; i++){
        data[i] = i;
    }
    munmap(ptr, capacity);
}
The data is correctly being written to the file. However, htop shows that half of the disk I/O bandwidth of the program is used by read accesses. My concern is that the code will not perform well if only half the bandwidth can be used for writes.
Why are there read accesses in the code?
Can they be avoided or are they expected?
The read accesses occur because, as the pages are touched for the first time, they need to be read in from disk. The OS is not clairvoyant and doesn't know that what it reads in will simply be overwritten.
To avoid the issue, don't use mmap(). Build the blocks in a buffer and write them out the old-fashioned way.
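A minimal sketch of that approach, under the same assumptions as the question (the ./mmapfile path and 4 GiB of sequential size_t values); the 1 MiB block size is an arbitrary choice:

#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstdlib>
#include <vector>

int main(){
    constexpr size_t GB1 = size_t(1) << 30;
    size_t capacity = GB1 * 4;
    size_t numElements = capacity / sizeof(size_t);

    int fd = open("./mmapfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    assert(fd >= 0);

    // Fill a modest in-memory block, then append it with write(); the pages
    // are never mapped, so the kernel has nothing to fault in from disk first.
    constexpr size_t blockElements = (size_t(1) << 20) / sizeof(size_t); // 1 MiB blocks
    std::vector<size_t> block(blockElements);
    size_t next = 0;
    while (next < numElements){
        size_t n = 0;
        for (; n < blockElements && next < numElements; ++n, ++next)
            block[n] = next;
        ssize_t written = write(fd, block.data(), n * sizeof(size_t));
        assert(written == (ssize_t)(n * sizeof(size_t)));
    }
    close(fd);
}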
I have read many SO (and other) questions, but I couldn't find one that helped me. I want to mmap two files at once and copy their content byte by byte (I know this seems ridiculous, but this is my minimal reproducible example). Therefore I loop through every byte, copy it, and after the size of one page in my files, I munmap the current page and mmap the next page. Imo there should only ever be one page (4096 bytes) of each file mapped, so there shouldn't be any memory problem.
Also, if the output file is too small, the space is allocated via posix_fallocate, which runs fine. So a lack of space on the hard drive can't be the problem either, imo.
But as soon as I go for somewhat larger files of ~140 MB, I get the "cannot allocate memory" error for the output file that I am writing into. Do you guys have any idea why this is?
#include <sys/types.h>
#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <bitset>
#include <fcntl.h>
#include <sys/stat.h>
#include <math.h>
#include <errno.h>
using namespace std;
int main()
{
    char file_input[] = "medium_big_file";
    char file_output[] = "foo_output";
    int fd_input = -1;
    int fd_output = -1;
    unsigned char *map_page_input, *map_page_output;
    struct stat stat_input, stat_output;

    if ((fd_input = open(file_input, O_RDONLY)) == -1 ||
        (fd_output = open(file_output, O_RDWR|O_CREAT, 0644)) == -1) {
        cerr << "Error on open()" << endl;
        return EXIT_FAILURE;
    }

    // get file size via stat()
    stat(file_input, &stat_input);
    stat(file_output, &stat_output);
    const size_t size_input = stat_input.st_size;
    const size_t size_output = stat_output.st_size;
    const size_t pagesize = getpagesize();
    size_t page = 0;
    size_t pos = pagesize;

    if (size_output < size_input) {
        if (posix_fallocate(fd_output, 0, size_input) != 0) {
            cerr << "file space allocation didn't work" << endl;
            return EXIT_FAILURE;
        }
    }

    while (pos + (pagesize * (page-1)) < size_input) {
        // check if input needs the next page
        if (pos == pagesize) {
            munmap(&map_page_input, pagesize);
            map_page_input = (unsigned char*)mmap(NULL, pagesize, PROT_READ,
                    MAP_FILE|MAP_PRIVATE, fd_input, page * pagesize);
            munmap(&map_page_output, pagesize);
            map_page_output = (unsigned char*)mmap(NULL, pagesize,
                    PROT_READ|PROT_WRITE, MAP_SHARED, fd_output, page * pagesize);
            page += 1;
            pos = 0;
            if (map_page_output == MAP_FAILED) {
                cerr << "errno: " << strerror(errno) << endl;
                cerr << "mmap failed on page " << page << endl;
                return EXIT_FAILURE;
            }
        }
        memcpy(&map_page_output[pos], &map_page_input[pos], 1);
        pos += 1;
    }

    munmap(&map_page_input, pagesize);
    munmap(&map_page_output, pagesize);
    close(fd_input);
    close(fd_output);
    return EXIT_SUCCESS;
}
The very first iteration of the loop attempts to unmap something that was never mapped, and passes a completely uninitialized pointer to munmap. Not once, but twice.
Finally, munmap expects a pointer to the mmap-ed memory, and not a pointer to a pointer to the mmap-ed memory.
The shown code fails to check the return status from munmap. If it did, it would've discovered that every call to munmap fails (hopefully; if the first call happened to pass an aligned pointer, a chunk of the stack might end up being unmapped, with the ensuing hilarity), so the shown code just keeps allocating more and more mappings and eventually runs out of memory.
You must fix both bugs.
You do not check the return code of munmap. It fails. It fails because you pass the address of the pointer variable instead of the pointer itself. Replace:
munmap(&map_page_input, pagesize);
with
munmap(map_page_input, pagesize);
Because munmap fails, you run into the maximum number of mappings per process.
munmap takes as its first argument the value returned by mmap. In your code munmap receives a pointer to a variable containing it, so you are not actually unmapping the area. Just remove the "&" in the munmap calls.
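Putting the three answers together, a corrected sketch of the whole program might look like this (same file names as in the question; the per-page mmap is kept for illustration, even though a single mapping of each file would be simpler):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>
using namespace std;

int main()
{
    int fd_input = open("medium_big_file", O_RDONLY);
    int fd_output = open("foo_output", O_RDWR | O_CREAT, 0644);
    if (fd_input == -1 || fd_output == -1) {
        cerr << "Error on open()" << endl;
        return EXIT_FAILURE;
    }

    struct stat st;
    if (fstat(fd_input, &st) != 0) {
        cerr << "Error on fstat()" << endl;
        return EXIT_FAILURE;
    }
    const size_t size_input = st.st_size;
    const size_t pagesize = getpagesize();

    if (posix_fallocate(fd_output, 0, size_input) != 0) {
        cerr << "file space allocation didn't work" << endl;
        return EXIT_FAILURE;
    }

    for (size_t off = 0; off < size_input; off += pagesize) {
        const size_t chunk = (size_input - off < pagesize) ? size_input - off : pagesize;
        unsigned char *in = (unsigned char*)mmap(NULL, chunk, PROT_READ,
                MAP_PRIVATE, fd_input, off);
        unsigned char *out = (unsigned char*)mmap(NULL, chunk,
                PROT_READ|PROT_WRITE, MAP_SHARED, fd_output, off);
        if (in == MAP_FAILED || out == MAP_FAILED) {
            cerr << "mmap failed: " << strerror(errno) << endl;
            return EXIT_FAILURE;
        }
        memcpy(out, in, chunk);
        // Pass the mapped addresses themselves (no '&'), and check the status.
        if (munmap(in, chunk) != 0 || munmap(out, chunk) != 0) {
            cerr << "munmap failed: " << strerror(errno) << endl;
            return EXIT_FAILURE;
        }
    }

    close(fd_input);
    close(fd_output);
    return EXIT_SUCCESS;
}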
I created 50 threads that read the same file at the same time; each thread then tries to write the file's content to a new file created with a different name.
The code was supposed to generate 50 different files.
But I got an unexpected result: it only generates 3~5 files.
Since all threads read the same file there should be no race condition, and each thread is meant to write its content to a different file.
Can somebody help me? Thank you!
My code is listed below; it is a modification of the code from Reference
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <string.h>
#include <iostream>
#include <vector>
#include <thread>
void copy(const char *src_path, const char *dst_path);
int main(int argc, char **argv)
{
    std::vector<std::thread> vth;
    char *tmp = "Copy.deb";
    for (int i = 0; i < 50; ++i)
    {
        char src[40];
        memset(src, '\0', sizeof(src));
        sprintf(src, "%d", i);
        strcat(src, tmp);
        vth.emplace_back(std::bind(copy, "Original.deb", src));
    }
    for (int i = 0; i < 50; ++i)
    {
        vth[i].join();
    }
    return 0;
}
void copy(const char *src_path, const char *dst_path)
{
    FILE *src, *dst;
    int buffer_size = 8 * 1024;
    char buffer[buffer_size];
    size_t length;
    src = fopen(src_path, "rb");
    dst = fopen(dst_path, "wb");
    while (!feof(src))
    {
        length = fread(buffer, 1, buffer_size, src);
        fwrite(buffer, 1, length, dst);
    }
    fclose(src);
    fclose(dst);
}
I believe your problem is that you are passing src (which is a pointer to a local variable on your main thread's stack) to your thread's entry function, but since your copy() function runs asynchronously in a separate thread, the char src[40] array that you are passing a pointer to has already been popped off of the main thread's stack (and likely overwritten by other data) before the copy() function gets a chance to read its contents.
The easy fix would be to make a copy of the string on the heap, so that you can guarantee the string will remain valid until the copy() function executes and reads it:
vth.emplace_back(std::bind(copy, "Original.deb", strdup(src)));
... and be sure to have your copy() function free the heap-allocation when it's done using it:
void copy(const char *src_path, const char *dst_path)
{
    FILE *src, *dst;
    int buffer_size = 8 * 1024;
    char buffer[buffer_size];
    size_t length;
    src = fopen(src_path, "rb");
    dst = fopen(dst_path, "wb");
    free(const_cast<char *>(dst_path)); // free the string previously allocated by strdup()
    [...]
Note that you don't currently have the same problem with the "Original.deb" argument since "Original.deb" is a string-literal and therefore stored statically in the executable, which means it remains valid for as long as the program is running -- but if/when you change your code to not use a string-literal for that argument, you'd likely need to do something similar for it as well.
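Alternatively, here is a sketch that sidesteps the manual strdup()/free() pairing entirely by passing std::string by value, so each thread owns its own copy of the name:

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Each thread receives its own std::string copies, so there is no
// dangling pointer and nothing to free by hand.
void copy(std::string src_path, std::string dst_path)
{
    FILE *src = fopen(src_path.c_str(), "rb");
    FILE *dst = fopen(dst_path.c_str(), "wb");
    if (src && dst)
    {
        char buffer[8 * 1024];
        size_t length;
        while ((length = fread(buffer, 1, sizeof(buffer), src)) > 0)
            fwrite(buffer, 1, length, dst);
    }
    if (src) fclose(src);
    if (dst) fclose(dst);
}

int main()
{
    std::vector<std::thread> vth;
    for (int i = 0; i < 50; ++i)
        vth.emplace_back(copy, std::string("Original.deb"),
                         std::to_string(i) + "Copy.deb");
    for (auto &t : vth)
        t.join();
    return 0;
}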
I have a debugging tool which, in order to register its acquired data, uses a data structure called DiskPool (code follows). At start, this data structure mmaps a certain amount of data (backed by a file on disk). Clients can allocate memory via a simple bump-pointer mechanism (implemented using std::atomic<size_t>).
As the volume of acquired data is massive, I have decided to keep a window over a time period instead of registering and keeping all the data. To fulfil this purpose I have to change the disk pool into a circular buffer, but without imposing considerable overhead, as that overhead affects the measurement.
I wanted to ask if anybody has an idea (for example, using the atomic interface of the STL)?
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <atomic>
#include <memory>
#include <signal.h>
#include <chrono>
#include <thread>
#include <cstdio>   // perror, printf
#include <cstdlib>  // exit, EXIT_FAILURE
#include <iostream> // std::cout

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

class DiskPool {
    char* addr_;              // Initialized by mmap()
    size_t len_;              // Given by the user, as many memory pages as needed
    std::atomic<size_t> top_; // Offset from addr_
    int fd_;

public:
    DiskPool(size_t l, const char* file) : len_(l), top_(0), fd_(-1)
    {
        struct stat st;
        fd_ = open(file, O_CREAT|O_RDWR, S_IREAD | S_IWRITE);
        if (fd_ == -1)
            handle_error("open");
        if (ftruncate(fd_, len_ * sysconf(_SC_PAGE_SIZE)) != 0)
            handle_error("ftruncate() error");
        else {
            fstat(fd_, &st);
            printf("the file has %ld bytes\n", (long) st.st_size);
        }
        addr_ = static_cast<char*>(mmap(NULL, (len_ * sysconf(_SC_PAGE_SIZE)),
                PROT_READ | PROT_WRITE, MAP_SHARED|MAP_NORESERVE, fd_, 0));
        if (addr_ == MAP_FAILED)
            handle_error("mmap failed.");
    }

    ~DiskPool()
    {
        close(fd_);
        if (munmap(addr_, len_) < 0) {
            handle_error("Could not unmap file");
            exit(1);
        }
        std::cout << "Successfully unmapped the file. " << std::endl;
    }

    void* allocate(size_t s)
    {
        size_t t = std::atomic_fetch_add(&top_, s);
        return addr_ + t;
    }

    void flush() { madvise(addr_, len_, MADV_DONTNEED); }
};
As an example, I created sample code that uses this disk pool to record data at the creation and destruction of an object (AutomaticLifetimeCollector).
#include <cstdint>   // uint64_t
#include <pthread.h> // pthread_self
#include <string>

static const std::string RECORD_FILE = "Data.txt";
static const size_t DISK_POOL_NUMBER_OF_PAGES = 10000;
static std::shared_ptr<DiskPool> diskPool =
    std::shared_ptr<DiskPool>(new DiskPool(DISK_POOL_NUMBER_OF_PAGES, RECORD_FILE.c_str()));

struct TaskRecord
{
    uint64_t tid;        // Thread id
    uint64_t tag;        // User-given identifier ("f1")
    uint64_t start_time; // nanoseconds
    uint64_t stop_time;
    uint64_t cpu_time;

    TaskRecord(int depth, size_t tag, uint64_t start_time) :
        tid(pthread_self()), tag(tag),
        start_time(start_time), stop_time(0), cpu_time(0) {}
};

class AutomaticLifetimeCollector
{
    TaskRecord* record_;

public:
    AutomaticLifetimeCollector(size_t tag) :
        record_(new(diskPool->allocate(sizeof(TaskRecord)))
                TaskRecord(2, tag, (uint64_t)1000000004L))
    {
    }

    ~AutomaticLifetimeCollector() {
        record_->stop_time = (uint64_t)1000000000L;
        record_->cpu_time = (uint64_t)1000000002L;
    }
};

inline void DelayMilSec(unsigned int pduration)
{
    std::this_thread::sleep_until(std::chrono::system_clock::now() +
                                  std::chrono::milliseconds(pduration));
}

std::atomic<bool> LoopsRunFlag {true};

void sigIntHappened(int signal)
{
    std::cout << "Application was terminated.";
    LoopsRunFlag.store(false, std::memory_order_release);
}

int main()
{
    signal(SIGINT, sigIntHappened);
    unsigned int i = 0;
    while (LoopsRunFlag)
    {
        AutomaticLifetimeCollector alc(i++);
        DelayMilSec(2);
    }
    diskPool->flush();
    return 0;
}
So, accounting only for handing out variable-sized slices of a fixed-size buffer, I believe a compare-and-swap (CAS) loop should work.
The basic idea is to read the value (which is atomic), do some computation with it, then write the new value only if it did not change since it was read. If it did change (because of another thread/process), the computation must be redone with the new value.
Since you have variable-sized objects, simply slicing the buffer into n array elements and advancing with (i + 1) % n won't work. And with (i + item_len) % capacity, an allocation would be split between the end and the start of the buffer; while that can be made correct and working, it is probably not what you want. So that means a conditional wrap, but the CPU should predict it pretty well.
#include <iostream>
#include <atomic>

std::atomic<size_t> next_index{0};
const size_t len = 100; // small for demo purposes

size_t alloc(size_t required_size)
{
    if (required_size > len) std::terminate(); // handle somehow; the request can never fit

    size_t i, ret_index, new_index;
    i = next_index.load();
    do
    {
        auto space = len - i;
        ret_index = required_size <= space ? i : 0; // wrap if needed
        new_index = ret_index + required_size;
    } while (!next_index.compare_exchange_weak(i, new_index)); // retry if i changed between the load and the CAS

    return ret_index;
}

int main()
{
    std::cout << alloc(4) << std::endl;  // 0 - 3
    std::cout << alloc(8) << std::endl;  // 4 - 11
    std::cout << alloc(32) << std::endl; // 12 - 43
    std::cout << alloc(32) << std::endl; // 44 - 75
    std::cout << alloc(32) << std::endl; // 0 - 31 (76 - 107 would overflow)
    std::cout << alloc(32) << std::endl; // 32 - 63
    std::cout << alloc(32) << std::endl; // 64 - 95
    std::cout << alloc(32) << std::endl; // 0 - 31 (96 - 127 would overflow)
}
Which should be fairly simple to plug into your class:

void* allocate(size_t s)
{
    if (s > len_ * sysconf(_SC_PAGE_SIZE)) std::terminate(); // handle somehow; the request can never fit

    size_t i, ret_index, new_index;
    i = top_.load();
    do
    {
        auto space = len_ * sysconf(_SC_PAGE_SIZE) - i;
        ret_index = s <= space ? i : 0; // wrap if needed
        new_index = ret_index + s;
    } while (!top_.compare_exchange_weak(i, new_index)); // retry if i changed between the load and the CAS

    return addr_ + ret_index;
}
len_ * sysconf(_SC_PAGE_SIZE) appears in a few places, so the byte count might be the more useful value to store in len_ itself.
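As a sketch (assuming nothing else relies on len_ being a page count), the constructor could compute it once:

// Sketch: len_ now holds the capacity in bytes, computed once.
DiskPool(size_t pages, const char* file)
    : len_(pages * sysconf(_SC_PAGE_SIZE)), top_(0), fd_(-1)
{
    // ftruncate(fd_, len_), mmap(NULL, len_, ...) and allocate()
    // can then all use len_ directly.
}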
What is the relationship between ulimit -s <value> and the stack size at the thread level in the Linux implementation (or, for that matter, any OS)?
Is "<number of threads> * <each thread's stack size> must be less than <stack size assigned by the ulimit command>" a valid justification?
In the program below, each thread allocates a char[PTHREAD_STACK_MIN] array, and 10 threads are created. But when ulimit is set to 10 * PTHREAD_STACK_MIN, it does not abort and core dump. For some random value of stacksize (much less than 10 * PTHREAD_STACK_MIN), it does core dump. Why so?
My understanding was that stacksize represents the stack occupied by all the threads of the process in summation.
Thread Function
#include <cstdio>
#include <error.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/resource.h>
using namespace std;
#include <pthread.h>
#include <bits/local_lim.h>

const unsigned int nrOfThreads = 10;
pthread_t ntid[nrOfThreads];

void* thr_fn(void* argv)
{
    size_t _stackSz;
    pthread_attr_t _attr;
    int err;
    err = pthread_attr_getstacksize(&_attr, &_stackSz);
    if (0 != err)
    {
        perror("pthread_getstacksize");
    }
    printf("Stack size - %lu, Thread ID - %llu, Process Id - %llu \n",
           static_cast<long unsigned int>(_stackSz),
           static_cast<long long unsigned int>(pthread_self()),
           static_cast<long long unsigned int>(getpid()));
    // check the stack size by actual allocation - equal to 1 + PTHREAD_STACK_MIN
    char a[PTHREAD_STACK_MIN] = {'0'};
    struct timeval tm;
    tm.tv_sec = 1;
    while (1)
        select(0, 0, 0, 0, &tm);
    return ((void*) NULL);
}
Main Function
int main(int argc, char *argv[])
{
    struct rlimit rlim;
    int err;
    err = getrlimit(RLIMIT_STACK, &rlim);
    if (0 != err)
    {
        perror("getrlimit");
        return -1;
    }
    printf("Stacksize hard limit - %lu, Softlimit - %lu\n",
           static_cast<long unsigned int>(rlim.rlim_max),
           static_cast<long unsigned int>(rlim.rlim_cur));
    for (unsigned int j = 0; j < nrOfThreads; j++)
    {
        err = pthread_create(&ntid[j], NULL, thr_fn, NULL);
        if (0 != err)
        {
            perror("pthread_create ");
            return -1;
        }
    }
    for (unsigned int j = 0; j < nrOfThreads; j++)
    {
        err = pthread_join(ntid[j], NULL);
        if (0 != err)
        {
            perror("pthread_join ");
            return -1;
        }
    }
    perror("Join thread success");
    return 0;
}
PS:
I am using Ubuntu 10.04 LTS version, with below specification.
Linux laptop 2.6.32-26-generic #48-Ubuntu SMP Wed Nov 24 10:14:11 UTC 2010 x86_64 GNU/Linux
On UNIX/Linux, getrlimit(RLIMIT_STACK) is only guaranteed to give the size of the main thread's stack. The OpenGroup's reference is explicit on that, "initial thread's stack":
http://www.opengroup.org/onlinepubs/009695399/functions/getrlimit.html
For Linux, there's a reference which indicates that RLIMIT_STACK is what will be used by default for any thread stack (for NPTL threading):
http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_create.3.html
Generally, since the programmer can decide (by using nonstandard attributes when creating the thread) where to put the stack and/or how much stack to use for a new thread, there is no such thing as a "cumulative process stack limit". It rather comes out of the total RLIMIT_AS address space size.
But you do have a limit on the number of threads you can create, sysconf(_SC_THREAD_THREADS_MAX), and you do have a lower limit for the minimum size a thread stack must have, sysconf(_SC_THREAD_STACK_MIN).
Also, you can query the default stacksize for new threads:
pthread_attr_t attr;
size_t stacksize;
if (!pthread_attr_init(&attr) && !pthread_attr_getstacksize(&attr, &stacksize))
    printf("default stacksize for a new thread: %zu\n", stacksize);
I.e. default-initialize a set of pthread attributes and ask what stack size the system gave you.
In a threaded program, stacks for all threads (except the initial one) are allocated out of the heap, so RLIMIT_STACK has little or no relation to how much stack space you can use for your threads.
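To illustrate the point that the programmer can decide, here is a minimal sketch that gives a new thread its own stack size through its attributes (the 1 MiB figure is just an example):

#include <cstdio>
#include <limits.h>
#include <pthread.h>

void* thr_fn(void*) { return NULL; }

int main()
{
    pthread_attr_t attr;
    size_t stacksize = 1024 * 1024; // request a 1 MiB stack for this thread only
    if (stacksize < PTHREAD_STACK_MIN)
        stacksize = PTHREAD_STACK_MIN; // must be at least the platform minimum

    pthread_attr_init(&attr);
    int err = pthread_attr_setstacksize(&attr, stacksize);
    if (err != 0)
        fprintf(stderr, "pthread_attr_setstacksize: %d\n", err);

    pthread_t tid;
    err = pthread_create(&tid, &attr, thr_fn, NULL);
    if (err != 0)
        fprintf(stderr, "pthread_create: %d\n", err);
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}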