My system's page size is 4096 bytes. I need to write data into shared memory, and this data has a size of 7168 bytes (7 KB).
I used ftruncate to allocate 8192 bytes (2 * page_size) so that there would be sufficient memory.
shmem_fd = shm_open( TRIAL_SHMEM_FILE, O_RDWR, S_IRUSR | S_IWUSR);
if( shmem_fd == -1 )
{
printf("Create_shmem, open failed:%s",strerror( errno));PASLOG return false;
}
if( ftruncate( shmem_fd, 8192) == -1 )
{
printf("Create_shmem, ftruncate failed:%s",strerror( errno));PASLOG return false;
}
I am writing the structure as below. [767*10] bytes is less than [2*page_size], but the code below causes a segmentation fault.
If I write only [767*5], which fits within one page_size, there is no crash. I am unable to determine the actual cause of the crash. Is there a different way to proceed?
// data to be written into shared memory
list_data item[10]; // struct size is 767 bytes
for (uiCounter=DEFAULT_VALUE_ZERO; uiCounter < 10; ++uiCounter)
{
memset(&item[uiCounter], 0, sizeof(list_data));
}
list_data* list_shmem;
list_shmem = (list_data *) mmap(NULL, sizeof(list_data) * 10, PROT_READ | PROT_WRITE, MAP_SHARED, shmem_fd, 0 );
if(list_shmem == MAP_FAILED)
{
printf("mmap failsed: %s", strerror(errno));
return false;
}
// write to shared mem
for (uiCounter = DEFAULT_VALUE_ZERO; uiCounter < 10; ++uiCounter)
{
memcpy ( list_shmem, &item[uiCounter], sizeof(person) );
++list_shmem;
}
munmap(list_shmem, sizeof(list_data) * 10);
There are a couple of issues with your code:
You pass the wrong address to munmap in:
list_data* list_shmem;
list_shmem = (list_data *) mmap(...);
for (uiCounter = DEFAULT_VALUE_ZERO; uiCounter < 10; ++uiCounter)
{
memcpy ( list_shmem, &item[uiCounter], sizeof(person) );
++list_shmem; // <---- invalidates list_shmem original value
}
munmap(list_shmem, sizeof(list_data) * 10);
You specify the wrong size to memcpy in:
memcpy ( list_shmem, &item[uiCounter], sizeof(person) );
A fix is:
memcpy ( list_shmem, &item[uiCounter], sizeof(item[uiCounter]) );
One fix for both issues would be to use the standard algorithm std::copy instead of the hand-coded loop:
std::copy(item + DEFAULT_VALUE_ZERO, item + 10, list_shmem);
Bonus point:
list_data item[10]; // struct size is 767 bytes
for (uiCounter=DEFAULT_VALUE_ZERO; uiCounter < 10; ++uiCounter)
{
memset(&item[uiCounter], 0, sizeof(list_data));
}
Is the same as:
list_data item[10] = {};
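Putting both fixes together, the write-and-unmap part might look like the sketch below (it assumes shmem_fd and list_data from the question, plus <algorithm> for std::copy in addition to the headers the question's code already uses). The key points are that the pointer returned by mmap is never advanced, so munmap gets the original address, and that std::copy transfers whole list_data objects:
list_data item[10] = {};   // zero-initialized, as in the bonus point above
list_data *list_shmem = (list_data *) mmap(NULL, sizeof(list_data) * 10,
                                           PROT_READ | PROT_WRITE, MAP_SHARED,
                                           shmem_fd, 0);
if (list_shmem == MAP_FAILED)
{
    printf("mmap failed: %s", strerror(errno));
    return false;
}
// copy all 10 elements; list_shmem itself is never modified
std::copy(item, item + 10, list_shmem);
// unmap with the same address and length that mmap returned
munmap(list_shmem, sizeof(list_data) * 10);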
Related
I am trying to simulate the error scenario in a C++ Linux application where the heap is not large enough to satisfy an allocation.
But even though I use setrlimit to reduce the heap memory available to the process, the heap allocation still succeeds.
struct rlimit the_limit = { 1, 1 };
if (-1 == setrlimit(RLIMIT_DATA, &the_limit)) {
perror("setrlimit failed");
}
try
{
char *n = new char[5600];
if (n==NULL)
{
cout <<"\nAllocation Failure\n";
}
}
catch (std::bad_alloc& ba)
{
std::cerr << "bad_alloc caught: " << ba.what() << '\n';
}
Most C++ standard libraries, including the one supplied with g++, start off with some heap memory preallocated.
5600 bytes is a small request and, as such, on my Linux system it gets satisfied from the preallocated memory, as evidenced by an strace:
Modified example:
#include <stdio.h>
#include <sys/resource.h>
int main()
{
struct rlimit the_limit = { 1, 1 };
if (-1 == setrlimit(RLIMIT_DATA, &the_limit)) { perror("setrlimit failed"); }
puts("ALLOC");
#if __cplusplus
try { char *n = new char[5600]; } catch (...) { perror("alloc failure"); }
#else
{ char *n = malloc(1); if(!n) perror("alloc failure"); }
#endif
}
End of example's strace:
...
write(1, "ALLOC\n", 6ALLOC
) = 6
exit_group(0) = ?
Either increasing the request size, e.g. in my case to at least 1<<16, or switching to plain C, causes the allocation request to be served from the OS, and then the limit does apply:
End of strace with an 1<<16 allocation request:
write(1, "ALLOC\n", 6ALLOC
) = 6
brk(0x561bcc5d4000) = 0x561bcc5b2000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
dup(2) = 3
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 14), ...}) = 0
write(3, "alloc failure: Cannot allocate m"..., 38alloc failure: Cannot allocate memory
) = 38
close(3) = 0
exit_group(0) = ?
Note that generic allocator implementations generally use sbrk and/or mmap to get memory directly from the OS, and as you can glean from the setrlimit manpage, RLIMIT_DATA only applies to an mmap-backed allocation if you're on Linux >= 4.7.
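To see that kernel-version dependence directly, a variant that bypasses the allocator's preallocated pool by asking the OS itself via mmap might look like this. This is only a sketch: the 1 MiB size is arbitrary, and the ENOMEM outcome is only expected on Linux >= 4.7, where RLIMIT_DATA also covers anonymous private mappings.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
int main(void)
{
    struct rlimit the_limit = { 1, 1 };
    if (setrlimit(RLIMIT_DATA, &the_limit) == -1)
        perror("setrlimit failed");
    /* Go straight to the OS instead of through malloc/new. */
    void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        perror("mmap failure");   /* expected on Linux >= 4.7 */
    else
        puts("mmap succeeded despite the limit (pre-4.7 behaviour)");
    return 0;
}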
So I have this little code. It loops through memory regions, saves each one to a byte array, then uses it and finally deletes (deallocates) it. This all happens in a non-main thread, hence the use of critical sections.
Code looks like this:
SIZE_T addr_min = (SIZE_T)sysInfo.lpMinimumApplicationAddress;
SIZE_T addr_max = (SIZE_T)sysInfo.lpMaximumApplicationAddress;
while (addr_min < addr_max)
{
MEMORY_BASIC_INFORMATION mbi = { 0 };
if (!::VirtualQueryEx(hndl, (LPCVOID)addr_min, &mbi, sizeof(mbi)))
{
continue;
}
if (mbi.State == MEM_COMMIT && ((mbi.Protect & PAGE_GUARD) == 0) && ((mbi.Protect & PAGE_NOACCESS) == 0))
{
SIZE_T region_size = mbi.RegionSize;
PVOID Base_Address = mbi.BaseAddress;
BYTE * dump = new BYTE[region_size + 1];
EnterCriticalSection(...);
memset(dump, 0x00, region_size + 1);
//this is where it crashes, same thing with memcpy
//Access violation reading "dump"'s address:
//memmove(unsigned char * dst=0x42aff024, unsigned char *
//src=0x7a768000, unsigned long count=1409024)
std::memmove(dump, Base_Address, region_size);
LeaveCriticalSection(...);
//Do Stuff with dump, that only involves reading from it
if (dump){
delete[] dump;
dump = NULL;
}
}
addr_min += mbi.RegionSize;
}
The code works fine most of the time, but sometimes it just crashes in memcpy/memmove. Under the Visual Studio debugger it shows that the crash is because there is an error reading "dump"; how is that possible if I just defined and allocated memory for it? Thanks!
Also, could it be because the memory can change in the middle of memcpy?
I need to write an application in C/C++ on Linux that receives a stream of bytes from a socket and processes them. The total could be close to 1TB. If I had an unlimited amount of memory, I would just put it all in memory so my application could easily process the data. It's much easier to do many things on a flat memory space, such as memmem(), memcmp() ... With a circular buffer, the application has to be extra smart to stay aware of the wrap-around.
I have about 8GB of memory, but luckily, due to locality, my application never needs to go back more than 1GB from the latest data it received. Is there a way to have a 1TB buffer with only the latest 1GB of data mapped to physical memory? If so, how do I do it?
Any ideas? Thanks.
Here's an example. It sets up a full terabyte mapping that is initially inaccessible (PROT_NONE). You, the programmer, maintain a window that can only extend and move upwards in memory. The example program uses a one-and-a-half-gigabyte window, advancing it in steps of 1,023,739,137 bytes (mapping_use() makes sure the available pages cover at least the desired region), and it actually modifies every page in every window, just to be sure.
#define _GNU_SOURCE
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>
typedef struct mapping mapping;
struct mapping {
unsigned char *head; /* Start of currently accessible region */
unsigned char *tail; /* End of currently accessible region */
unsigned char *ends; /* End of region */
size_t page; /* Page size of this mapping */
};
/* Discard mapping.
*/
void mapping_free(mapping *const m)
{
if (m && m->ends > m->head) {
munmap(m->head, (size_t)(m->ends - m->head));
m->head = NULL;
m->tail = NULL;
m->ends = NULL;
m->page = 0;
}
}
/* Move the accessible part up in memory, to [from..to).
*/
int mapping_use(mapping *const m, void *const from, void *const to)
{
if (m && m->ends > m->head) {
unsigned char *const head = ((unsigned char *)from <= m->head) ? m->head :
((unsigned char *)from >= m->ends) ? m->ends :
m->head + m->page * (size_t)(((size_t)((unsigned char *)from - m->head)) / m->page);
unsigned char *const tail = ((unsigned char *)to <= head) ? head :
((unsigned char *)to >= m->ends) ? m->ends :
m->head + m->page * (size_t)(((size_t)((unsigned char *)to - m->head) + m->page - 1) / m->page);
if (head > m->head) {
munmap(m->head, (size_t)(head - m->head));
m->head = head;
}
if (tail > m->tail) {
#ifdef USE_MPROTECT
mprotect(m->tail, (size_t)(tail - m->tail), PROT_READ | PROT_WRITE);
#else
void *result;
do {
result = mmap(m->tail, (size_t)(tail - m->tail), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE | MAP_NORESERVE, -1, (off_t)0);
} while (result == MAP_FAILED && errno == EINTR);
if (result == MAP_FAILED)
return errno = ENOMEM;
#endif
m->tail = tail;
}
return 0;
}
return errno = EINVAL;
}
/* Initialize a mapping.
*/
int mapping_create(mapping *const m, const size_t size)
{
void *base;
size_t page, truesize;
if (!m || size < (size_t)1)
return errno = EINVAL;
m->head = NULL;
m->tail = NULL;
m->ends = NULL;
m->page = 0;
/* Obtain default page size. */
{
long value = sysconf(_SC_PAGESIZE);
page = (size_t)value;
if (value < 1L || (long)page != value)
return errno = ENOTSUP;
}
/* Round size up to next multiple of page. */
if (size % page)
truesize = size + page - (size % page);
else
truesize = size;
/* Create mapping. */
do {
errno = ENOTSUP;
base = mmap(NULL, truesize, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE, -1, (off_t)0);
} while (base == MAP_FAILED && errno == EINTR);
if (base == MAP_FAILED)
return errno;
/* Success. */
m->head = base;
m->tail = base;
m->ends = (unsigned char *)base + truesize;
m->page = page;
errno = 0;
return 0;
}
static void memtouch(void *const ptr, const size_t size)
{
if (ptr && size > 0) {
unsigned char *mem = (unsigned char *)ptr;
const size_t step = 2048;
size_t n = size / (size_t)step - 1;
mem[0]++;
mem[size-1]++;
while (n-->0) {
mem += step;
mem[0]++;
}
}
}
int main(void)
{
const size_t size = (size_t)1024 * (size_t)1024 * (size_t)1024 * (size_t)1024;
const size_t need = (size_t)1500000000UL;
const size_t step = (size_t)1023739137UL;
unsigned char *base;
mapping map;
size_t i;
if (mapping_create(&map, size)) {
fprintf(stderr, "Cannot create a %zu-byte mapping: %m.\n", size);
return EXIT_FAILURE;
}
printf("Have a %zu-byte mapping at %p to %p.\n", size, (void *)map.head, (void *)map.ends);
fflush(stdout);
base = map.head;
for (i = 0; i <= size - need; i += step) {
printf("Requesting %p to %p .. ", (void *)(base + i), (void *)(base + i + need));
fflush(stdout);
if (mapping_use(&map, base + i, base + i + need)) {
printf("Failed (%m).\n");
fflush(stdout);
return EXIT_FAILURE;
}
printf("received %p to %p.\n", (void *)map.head, (void *)map.tail);
fflush(stdout);
memtouch(base + i, need);
}
mapping_free(&map);
return EXIT_SUCCESS;
}
The approach is twofold. First, an inaccessible (PROT_NONE) mapping is created to reserve the necessary contiguous virtual address space. If we omitted this step, a malloc() call or similar could acquire pages within this range, which would defeat the entire purpose: a single terabyte-long mapping.
Second, when the accessible window extends into the region, either mprotect() (if USE_MPROTECT is defined) or mmap() is used to make the required pages accessible. Pages no longer needed are completely unmapped.
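Condensed to its essentials, the reserve-then-commit pattern is the following minimal sketch, without the bookkeeping of mapping_create()/mapping_use() above; TOTAL and WINDOW are illustrative sizes, not the ones used in the full example:
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <stddef.h>
#define TOTAL  ((size_t)1 << 30)   /* reserve 1 GiB of address space */
#define WINDOW ((size_t)1 << 20)   /* make a 1 MiB window usable */
int main(void)
{
    /* 1. Reserve contiguous address space, inaccessible and unbacked. */
    unsigned char *base = (unsigned char *)mmap(NULL, TOTAL, PROT_NONE,
                              MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
                              -1, (off_t)0);
    if ((void *)base == MAP_FAILED)
        return 1;
    /* 2. Commit the current window; mprotect(base, WINDOW, PROT_READ | PROT_WRITE)
          would work here too, as in the USE_MPROTECT variant. */
    if (mmap(base, WINDOW, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE | MAP_NORESERVE,
             -1, (off_t)0) == MAP_FAILED)
        return 1;
    base[0] = 42;   /* the window is now readable and writable */
    /* 3. Unmap pages behind the window once they are no longer needed. */
    munmap(base, WINDOW);
    /* Finally release the whole reservation. */
    munmap(base, TOTAL);
    return 0;
}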
Compile and run the full example using
gcc -Wall -Wextra -std=c99 example.c -o example
time ./example
or, to use mmap() only once and mprotect() to move the window,
gcc -DUSE_MPROTECT=1 -Wall -Wextra -std=c99 example.c -o example
time ./example
Note that you probably don't want to run the test if you don't have at least 4GB of physical RAM.
On this particular machine (i5-4200U laptop with 4GB of RAM, 3.13.0-62-generic kernel on Ubuntu x86_64), quick testing didn't show any kind of performance difference between mprotect() and mmap(), in execution speed or resident set size.
If anyone bothers to compile and run the above and finds that one of them has a repeatable benefit or drawback (resident set size or time used), I'd very much like to know about it. Please also mention the kernel and CPU you used.
I'm not sure which details I should expand on, since this is pretty straightforward, really, and the Linux man pages project man 2 mmap and man 2 mprotect pages are quite descriptive. If you have any questions on this approach or program, I'd be happy to try and elaborate.
Recently I've been playing about with using shared memory for IPC. One thing I've been trying to implement is a simple ring buffer with 1 process producing and 1 process consuming. Each process has its own sequence number to track its position. These sequence numbers are updated using atomic ops to ensure the correct values are visible to the other process. The producer will block once the ring buffer is full. The code is lock free in that no semaphores or mutexes are used.
Performance-wise, I'm getting roughly 20 million messages per second on my rather modest VM, so I'm pretty happy with that :)
What I'm curious about is how 'correct' my code is. Can anyone spot any inherent issues or race conditions? Here's my code. Thanks in advance for any comments.
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>
#include <string.h>
#define SHM_ID "/mmap-test"
#define BUFFER_SIZE 4096
#define SLEEP_NANOS 1000 // 1 micro
struct Message
{
long _id;
char _data[128];
};
struct RingBuffer
{
size_t _rseq;
char _pad1[64];
size_t _wseq;
char _pad2[64];
Message _buffer[BUFFER_SIZE];
};
void
producerLoop()
{
int size = sizeof( RingBuffer );
int fd = shm_open( SHM_ID, O_RDWR | O_CREAT, 0600 );
ftruncate( fd, size+1 );
// create shared memory area
RingBuffer* rb = (RingBuffer*)mmap( 0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
close( fd );
// initialize our sequence numbers in the ring buffer
rb->_wseq = rb->_rseq = 0;
int i = 0;
timespec tss;
tss.tv_sec = 0;
tss.tv_nsec = SLEEP_NANOS;
while( 1 )
{
// as long as the consumer isn't running behind keep producing
while( (rb->_wseq+1)%BUFFER_SIZE != rb->_rseq%BUFFER_SIZE )
{
// write the next entry and atomically update the write sequence number
Message* msg = &rb->_buffer[rb->_wseq%BUFFER_SIZE];
msg->_id = i++;
__sync_fetch_and_add( &rb->_wseq, 1 );
}
// give consumer some time to catch up
nanosleep( &tss, 0 );
}
}
void
consumerLoop()
{
int size = sizeof( RingBuffer );
int fd = shm_open( SHM_ID, O_RDWR, 0600 );
if( fd == -1 ) {
perror( "argh!!!" ); return;
}
// lookup producers shared memory area
RingBuffer* rb = (RingBuffer*)mmap( 0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
// initialize our sequence numbers in the ring buffer
size_t seq = 0;
size_t pid = -1;
timespec tss;
tss.tv_sec = 0;
tss.tv_nsec = SLEEP_NANOS;
while( 1 )
{
// while there is data to consume
while( seq%BUFFER_SIZE != rb->_wseq%BUFFER_SIZE )
{
// get the next message and validate the id
// id should only ever increase by 1
// quit immediately if not
Message msg = rb->_buffer[seq%BUFFER_SIZE];
if( msg._id != pid+1 ) {
printf( "error: %d %d\n", msg._id, pid ); return;
}
pid = msg._id;
++seq;
}
// atomically update the read sequence in the ring buffer
// making it visible to the producer
__sync_lock_test_and_set( &rb->_rseq, seq );
// wait for more data
nanosleep( &tss, 0 );
}
}
int
main( int argc, char** argv )
{
if( argc != 2 ) {
printf( "please supply args (producer/consumer)\n" ); return -1;
} else if( strcmp( argv[1], "consumer" ) == 0 ) {
consumerLoop();
} else if( strcmp( argv[1], "producer" ) == 0 ) {
producerLoop();
} else {
printf( "invalid arg: %s\n", argv[1] ); return -1;
}
}
Seems correct to me at first glance. I realize that you are happy with the performance, but a fun experiment might be to use something more lightweight than __sync_fetch_and_add. AFAIK it is a full memory barrier, which is expensive. Since there is a single producer and a single consumer, a release and a corresponding acquire operation should give you better performance. Facebook's Folly library has a single-producer single-consumer queue that uses the new C++11 atomics here: https://github.com/facebook/folly/blob/master/folly/ProducerConsumerQueue.h
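A sketch of what the release/acquire idea could look like with C++11 std::atomic follows; the names mirror the question's RingBuffer, but this is illustrative rather than a drop-in replacement:
#include <atomic>
#include <cstddef>
struct RingBufferAtomic
{
    std::atomic<std::size_t> _rseq;
    char                     _pad1[64];
    std::atomic<std::size_t> _wseq;
    char                     _pad2[64];
    // Message _buffer[BUFFER_SIZE]; as in the question
};
// Producer side: after filling _buffer[w % BUFFER_SIZE], publish it.
// _wseq is written only by the producer, so its own load can be relaxed;
// the release store makes the message contents visible to the consumer.
inline void publish(RingBufferAtomic &rb)
{
    std::size_t w = rb._wseq.load(std::memory_order_relaxed);
    rb._wseq.store(w + 1, std::memory_order_release);
}
// Consumer side: an acquire load of _wseq guarantees that everything the
// producer wrote before its release store is visible before the message is
// read. The consumer publishes its own progress the same way via _rseq.
inline std::size_t writer_progress(const RingBufferAtomic &rb)
{
    return rb._wseq.load(std::memory_order_acquire);
}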
If I create a shared memory object from a 64-bit application and open it from a 32-bit application, it fails.
//for 64 bit
shared_memory_object( create_only, "test" , read_write) ;
// for 32 bit
shared_memory_object (open_only, "test", read_write);
The file created by the 64-bit application is at the path below:
/private/tmp/boost_interprocess/AD21A54E000000000000000000000000/test
whereas the file searched for by the 32-bit application is at
/private/tmp/boost_interprocess/AD21A54E00000000/test
Thus the 32-bit application cannot read the file.
I am using boost 1.47.0 on Mac OS X.
Is it a bug? Do I have to change some settings or use some macros to fix it? Has anyone encountered this problem before?
Is it important that the shared memory be backed by a file? If not, you might consider using the underlying Unix shared memory APIs: shmget, shmat, shmdt, and shmctl, all declared in sys/shm.h. I have found them to be very easy to use.
// create some shared memory
int id = shmget(0x12345678, 1024 * 1024, IPC_CREAT | 0666);
if (id >= 0)
{
void* p = shmat(id, 0, 0);
if (p != (void*)-1)
{
initialize_shared_memory(p);
// detach from the shared memory when we are done;
// it will still exist, waiting for another process to access it
shmdt(p);
}
else
{
handle_error();
}
}
else
{
handle_error();
}
Another process would use something like this to access the shared memory:
// access the shared memory
int id = shmget(0x12345678, 0, 0);
if (id >= 0)
{
// find out how big it is
struct shmid_ds info = { { 0 } };
if (shmctl(id, IPC_STAT, &info) == 0)
printf("%d bytes of shared memory\n", (int)info.shm_segsz);
else
handle_error();
// get its address
void* p = shmat(id, 0, 0);
if (p != (void*)-1)
{
do_something(p);
// detach from the shared memory; it still exists, but we can't get to it
shmdt(p);
}
else
{
handle_error();
}
}
else
{
handle_error();
}
Then, when all processes are done with the shared memory, use shmctl(id, IPC_RMID, 0) to release it back to the system.
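For example, a minimal sketch of that cleanup, where id is the identifier returned by shmget in the snippets above and handle_error() is the same placeholder used there:
// mark the segment for removal; it is destroyed once the last process detaches
if (shmctl(id, IPC_RMID, 0) == -1)
    handle_error();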
You can use the ipcs and ipcrm tools on the command line to manage shared memory. They are useful for cleaning up mistakes when first writing shared memory code.
All that being said, I am not sure about sharing memory between 32-bit and 64-bit programs. I recommend trying the Unix APIs and if they fail, it probably cannot be done. They are, after all, what Boost uses in its implementation.
I found the solution to the problem and, as expected, it is a bug.
The bug is in the tmp_dir_helpers.hpp file.
inline void get_bootstamp(std::string &s, bool add = false)
{
...
std::size_t char_counter = 0;
long fields[2] = { result.tv_sec, result.tv_usec };
for(std::size_t field = 0; field != 2; ++field){
for(std::size_t i = 0; i != sizeof(long); ++i){
const char *ptr = (const char *)&fields[field];
bootstamp_str[char_counter++] = Characters[(ptr[i]&0xF0)>>4];
bootstamp_str[char_counter++] = Characters[(ptr[i]&0x0F)];
}
...
}
Whereas it should have been something like this (because sizeof(long) is 8 bytes in the 64-bit build but only 4 bytes in the 32-bit build, the two builds produce bootstamp strings of different lengths, 32 vs. 16 hex characters, which is exactly the difference between the two paths above):
long long fields[2] = { result.tv_sec, result.tv_usec };   // long long instead of long
for(std::size_t field = 0; field != 2; ++field){
for(std::size_t i = 0; i != sizeof(long long); ++i)   // sizeof(long long) here as well
I have created a ticket in boost for this bug.
Thank you.