Invalidating memory page - C++

I am implementing a reader for huge compressed raster files. Decompression is performed partially, on the fly: only requested regions of the raster are decompressed and stored in a memory cache. The reader works similarly to memory-mapping a file, except that the data is not mapped to memory 1:1, it is decompressed.
It is implemented using anonymous memory mapping:
char* raster_cache = static_cast<char*>(mmap(0, UNCOMPRESSED_RASTER_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
Reading an area that is not cached yet raises a segmentation violation signal, which is caught and handled using libsigsegv (see my previous question):
struct CacheHandlerData
{
    std::mutex mutex;
    // other data needed for decompression
};

int cache_sigsegv_handler(void* fault_address, void* user_data)
{
    void* page_address = reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(fault_address) & ~(PAGE_SIZE - 1));

    CacheHandlerData* data = static_cast<CacheHandlerData*>(user_data);
    std::lock_guard<std::mutex> lock(data->mutex);

    unsigned char cached = 0;
    mincore(page_address, 1, &cached);

    if (!cached)
    {
        mprotect(page_address, PAGE_SIZE, PROT_WRITE);
        // decompress whole page
        mprotect(page_address, PAGE_SIZE, PROT_READ);
    }

    return 1;
}
The problem is that cached pages stay in memory forever. Because I write to the pages, they are marked as dirty and are never invalidated.
QUESTION: Is there some possibility to mark pages as not dirty?
If the system were running out of memory, the pages would then be dropped from memory just like a normal disk cache. It would also be necessary to call mprotect(page_address, PAGE_SIZE, PROT_NONE) for the removed pages, so that accessing them again causes a segmentation violation.
Thank you.
EDIT: I could use a temporary-file-backed mapping instead of an anonymous one. Pages would then be swapped to disk when the system runs out of memory. But this solution loses the benefits of using compressed data (smaller on-disk size, probably faster reading).
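For reference, a minimal sketch of what a manual eviction path could look like, assuming madvise(MADV_DONTNEED) is acceptable for discarding the contents of an anonymous private page (evict_page is a hypothetical helper, not part of the reader above, and PAGE_SIZE is the constant used there); this still does not let the kernel decide on its own when to evict:

#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper: drop one cached page and re-arm the fault handler for it.
void evict_page(char* raster_cache, std::size_t page_index)
{
    void* page_address = raster_cache + page_index * PAGE_SIZE;

    // Discard the page contents; the kernel may then free the physical page.
    madvise(page_address, PAGE_SIZE, MADV_DONTNEED);

    // Make the next access fault again so the handler re-decompresses the page.
    mprotect(page_address, PAGE_SIZE, PROT_NONE);
}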

Using memcpy on mmap'ed region crashes, a for loop does not

I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCI-e slot connecting to it. In that PCIe slot is an FPGA board which exposes some registers and a 64K memory area via PCIe.
On the ARM CPU of the Tegra board, a minimal Linux installation is running.
I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area.
The distinct register files and the memory block are all assigned addresses which are aligned and do not overlap with regards to 4KB memory pages.
I explicitly map whole pages with mmap, using the result of getpagesize(), which also is 4096.
I can read/write from/to those exposed registers just fine.
I can read from the memory area (64KB), doing uint32 word-by-word reads in a for loop, just fine. I.e. read contents are correct.
But if I use std::memcpy on the same address range, the Tegra CPU always freezes. I do not see any error message; with GDB attached, I also don't see a thing in Eclipse when trying to step over the memcpy line, it just stops hard. And I have to reset the CPU using the hardware reset button, as the remote console is frozen.
This is a debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise; I did not try that explicitly.
Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in the first place in this scenario - and I can just carry on using my for loops and think nothing of it?
EDIT: Another effect has been observed: The original code snippet was missing a "vital" printf in the copying for loop, that came before the memory read. That removed, I don't get back valid data. I now updated the code snippet to have an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.
Here are the (I think) important excerpts of what's going on, with minor modifications so that they make sense in this "de-fluffed" form.
// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped

//--------------------------------
// doing the memory mapping
//
const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );

const uint32_t physAddrNum = (uint32_t) physicalAddr;
const uint32_t offsetInPage = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize = mappedPagesCount * pageSize;
const off_t targetOffset = physAddrNum & ~(off_t)(pageSize - 1);

m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );

// addr passed as null means: we supply pages to map. Supplying non-null addr would mean, Linux takes it as a "hint" where to place.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
    m_userSpaceMappedAddr = (volatile void*) ( uint32_t(mapAtPageStart) + offsetInPage );
}

//--------------------------------
// Accessing the mapped memory
//
// void* m_rawData: <== m_userSpaceMappedAddr
// uint32_t* destination: points to a stack object
// int length: size in 32bit words of the stack object (a struct with only U32's in it)

// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );

// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i = 0; i < length; ++i)
{
    // This extra read makes the data gotten in the 2nd read below valid.
    // Commented out, the data read into destination will not be valid.
    uint32_t tmp = ((const volatile uint32_t*)m_rawData)[i];
    (void)tmp; // pacify compiler

    destination[i] = ((const volatile uint32_t*)m_rawData)[i];
}
Based on the description, it looks like your FPGA logic is not responding correctly to load instructions that read from locations on your FPGA, and that is causing the CPU to lock up. It's not crashing; it is permanently stalled, hence the need for the hard reset. I had this problem too when debugging my PCIE logic on an FPGA.
Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.
Your loop is doing 32-bit loads, but memcpy is doing at least 64-bit loads, which changes how your logic responds. For example, it may need to respond with two completion TLPs, returning 32 bits of the response in the first 128-bit TLP of the completion and the next 32 bits in the second 128-bit TLP.
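If the device only tolerates aligned 32-bit accesses, one option is to keep the copy explicit rather than relying on memcpy. A sketch (copy_u32_words is a hypothetical helper, not from the question):

#include <cstdint>
#include <cstddef>

// Copy 'count' 32-bit words from a device mapping using only aligned 32-bit
// volatile loads, so the compiler cannot widen or merge the accesses.
static void copy_u32_words(uint32_t* dst, const volatile uint32_t* src, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i] = src[i];   // each iteration issues exactly one 32-bit load
    }
}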
What I found super-useful was to add logic to log all the PCIE transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIE TLP per line. It even has documentation.
When the PCIE interface is not working well enough, I stream the log to a UART in hex which can be decoded by pcieflat.
This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.
Alternatively, if you have an integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to the PCIE protocol.

Deny access to shared memory from non-forked processes

I need to create a shared memory segment which contains some secret data. I use the shmget and shmat functions to access the segment, with 0600 permissions. I want to share this segment with forked processes only. I created another application which tried to access this segment and it was unsuccessful, so it looks like it's working as I want.
But when I run the application which created the segment again, it can access the segment. How is that possible? Is it a good idea to store secret data in shared memory?
You can mmap() a shared and anonymous memory region by providing the MAP_SHARED and MAP_ANONYMOUS flags in the parent process. That memory will be accessible only to that process and its children. As the memory segment is anonymous, no other processes will be able to refer to it, let alone access or map it:
void *shared_mem = mmap(NULL, n_bytes, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
The parent process should create the shared memory segment using mmap(). That memory segment is inherited by any child process created by fork(). A child process can simply use the shared_mem pointer inherited from the parent to refer to that memory segment:
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
    void *shared_mem = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);

    pid_t pid = fork();

    if (pid > 0) {
        // parent
        // use shared_mem here
    } else if (pid == 0) {
        // child
        // use shared_mem here
    } else {
        // error
    }

    return 0;
}
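If both the parent and the child will write to the region, they still need to synchronize. One common pattern (a sketch, not part of the answer above; names are placeholders) is to place a process-shared mutex inside the mapping itself:

#include <sys/mman.h>
#include <pthread.h>

// The shared region holds the lock together with the data it protects.
struct Shared {
    pthread_mutex_t lock;
    int value;
};

Shared* create_shared_region()
{
    void* mem = mmap(NULL, sizeof(Shared), PROT_READ|PROT_WRITE,
                     MAP_SHARED|MAP_ANONYMOUS, -1, 0);
    Shared* s = static_cast<Shared*>(mem);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); // usable across fork()
    pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    return s;   // call before fork(); both processes then lock s->lock around access
}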
The shared segment doesn't belong to a process; it belongs to a user. Effectively, setting 0600 only allows that user (and root) read/write access, but any other process running as this user will have the same access.
Create a specific user, to be "used" (logged in) only for this purpose.
Is it a good idea to have secret data in a shared memory segment?
Think of the segment as a file - maybe a bit less easy to access (need to know IPC) - except it will disappear when the system shuts down.
Is it a good idea to store secrets in a file? Maybe not if the data is clear text.
In a file or in a shared mem segment, data encryption would be an improvement.
See this page for detailed explanations on how you can control a shmem segment.
OTOH, if all you need is for a process to exchange information with its children, look at process piping. In that case the secret data is stored within the processes' heap/stack memory, and it is more difficult for external processes owned by the same user to reach. But the user "owning" the process may still read the process's memory (via a core dump for instance) and search for the secret data. Much less easy, but still possible.
Note that in this case, if the secret data is available in the parent process before fork() is performed, the children will automatically inherit it.
Again, anyway, think encryption.
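To illustrate the piping approach, a minimal sketch (assuming a single child and a small fixed-size secret; the names here are placeholders, not from the answer):

#include <unistd.h>
#include <sys/types.h>

int main()
{
    int fds[2];
    if (pipe(fds) == -1)            // fds[0]: read end, fds[1]: write end
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        // child: receive the secret from the parent
        char secret[32] = {0};
        close(fds[1]);
        read(fds[0], secret, sizeof(secret) - 1);
        close(fds[0]);
        // ... use the secret ...
    } else if (pid > 0) {
        // parent: the secret never leaves this process tree
        const char secret[] = "example-secret";
        close(fds[0]);
        write(fds[1], secret, sizeof(secret));
        close(fds[1]);
    }
    return 0;
}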

Successive calls to mmap, any caching?

I read a file into a vector as in:
void readBytes(string filename, vector<uint32_t> &v)
{
    // fstat file, get filesize, etc.
    uint32_t *filebuf = (uint32_t*)mmap(0, filesize, PROT_READ,
                                        MAP_FILE|MAP_PRIVATE,
                                        fhand, 0);
    v = std::vector<uint32_t>(filebuf, filebuf + numrecords);
    munmap(filebuf, filesize);
}
In main() I have two successive calls (purely as a test):
vector<uint32_t> v(10000);
readBytes(filename, v);
readBytes(filename, v);
// ...
The second call almost always gives a faster clock time:
Profile time [1st call]: 0.000214141 sec
Profile time [2nd call]: 0.000094109 sec
A look at the system calls indicates that the memory chunks are different:
mmap(NULL, 40000, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe843ac8000
mmap(NULL, 40000, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7fe843ac7000
Why is the second call faster? Coincidence? What, if anything, is cached?
Assuming you're talking about something *NIX-ish, there's probably a page cache, whose job is precisely to cache this sort of data to get this speedup. Unless something else came along between calls to evict those pages from the cache, they'll still be there.
So, the first call potentially has to:
- allocate pages
- map the pages into your process address space
- copy the data from those pages into your vector (possibly faulting the data in from disk as it goes)
The second call probably finds the pages still in the cache, and only has to:
- map the pages into your process address space
- copy the data from those pages into your vector (they're pre-faulted this time, so it's a simple memory operation)
In fact, I've skipped a step: the open/fstat step in your comment is probably also accelerated, via the inode cache.
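If you want to observe this, a rough sketch (assuming Linux; check_residency is a hypothetical helper, not from the question) is to ask the kernel which pages of the mapping are already resident before touching them:

#include <sys/mman.h>
#include <unistd.h>
#include <vector>
#include <cstdio>

// Hypothetical helper: report how many pages of an mmap'ed file region are
// currently resident in the page cache.
static void check_residency(void* addr, size_t filesize)
{
    const size_t page = sysconf(_SC_PAGESIZE);
    const size_t npages = (filesize + page - 1) / page;
    std::vector<unsigned char> resident(npages);

    if (mincore(addr, filesize, resident.data()) == 0)
    {
        size_t in_cache = 0;
        for (unsigned char r : resident)
            in_cache += (r & 1);
        std::printf("%zu of %zu pages resident\n", in_cache, npages);
    }
}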
Remember that your program sees virtual memory. There is a mapping table ("page tables") that maps virtual addresses seen by your program to the real physical memory. And the OS will ensure that the two mmap() calls map two different virtual addresses seen by your program to the same physical memory. So the data only has to be loaded from disk once.
More detail:
- First mmap(): the OS just records the mapping.
- When you actually try to read the data: a "page fault" happens, since the data isn't in memory. The OS catches that, reads the data from disk into its disk cache, and updates the page tables so that your program can read directly from that disk cache, then it resumes your program automatically.
- First munmap(): the OS disables the mapping and updates your page tables so you can't read the file any more. Note that the file is still in the OS's disk cache.
- Second mmap(): the OS just records the mapping.
- When you actually try to read the data: a "page fault" happens, since the data isn't mapped. The OS catches that, notices that the data is already in its disk cache, and updates the page tables so that your program can read directly from that disk cache, then it resumes your program automatically.
- Second munmap(): the OS disables the mapping and updates your page tables so you can't read the file any more. The file is still in the OS's disk cache.

Minimize memory usage with Boost's file_mapping and mapped_region?

For this problem I am loading a large three-dimensional volume from file into a program, but only need to look at three planes (x,y,z) at a time usually. I am currently using Boost::Interprocess::File_Mapping to create a map of the file (32 GB) and loading it onto my system which has 24 GB of RAM. The current method uses a single Boost::Interprocess::Mapped_Region for the file. The memory usage quickly approaches 99%.
I am new to the world of memory-mapped file I/O and want to know how best to segment the file to reduce memory usage. Would creating smaller regions (one per Z plane, for instance) improve the results? I would like to use as little memory as possible without causing adverse effects.
Am I going about this the correct way, or is there a more straightforward method for performing this?
On Windows, this normally works OK. I've created a test application (sorry, I hate Boost because I think its quality is appalling; my sample uses ATL instead, but the underlying Windows APIs are the same):
HRESULT TestMain( LPCTSTR strFileName )
{
    CAtlFile file;
    HRESULT hr = file.Create( strFileName, GENERIC_READ, FILE_SHARE_READ, OPEN_EXISTING );
    if( FAILED( hr ) )
        return hr;

    CAtlFileMapping<BYTE> mapping;
    hr = mapping.MapFile( file );
    if( FAILED( hr ) )
        return hr;

    size_t sz = mapping.GetMappingSize();
    BYTE res = 0;
    for( size_t i = 0; i < sz; i++ )
        res ^= mapping[ i ];

    printf( "Read the complete file, %Iu bytes, the XOR is %.2X\n", sz, int( res ) );
    return S_OK;
}
When asked to read a 12 GB file on my machine with 8 GB RAM, I saw the effect you're describing (resource monitor memory data for my process: commit 25 MB, private 20 MB, working set and shareable 6.5 GB, which is the amount of my free RAM). However, multiple sources on the Internets say those numbers mean nothing and don't affect performance, because unused physical pages will be discarded as soon as any process requests more memory, and this process is very cheap (unless of course you're writing to your memory-mapped file).
Or, if you're really unhappy about this behavior, you can free unused portions yourself, by calling VirtualUnlock, as described here: https://stackoverflow.com/a/1882478/126995
Or, you can only map the portions of the file you need.
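For the "map only the portions you need" approach, a rough sketch with Boost.Interprocess (assuming a read-only file and a page-aligned plane offset; readPlane, planeOffset and planeSize are placeholders for your geometry, not from the question) could look like:

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

// Map just one plane of the volume instead of the whole 32 GB file.
void readPlane( const char* path, std::size_t planeOffset, std::size_t planeSize )
{
    bip::file_mapping file( path, bip::read_only );
    bip::mapped_region region( file, bip::read_only, planeOffset, planeSize );

    const double* voxels = static_cast<const double*>( region.get_address() );
    // ... work with planeSize / sizeof(double) voxels here ...
    (void)voxels;
}   // the mapping is released when 'region' goes out of scope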
But the best thing you can do is to optimize the layout of your data. If in your data file you're keeping voxels as double voxels[x][y][z], store them as struct { double voxels[8][8][8] } blocks[x/8][y/8][z/8] instead. This way, each block is exactly 4 KB, which is the page size, and if you only need to access e.g. an XZ plane, you'll save a lot of I/O bandwidth, by orders of magnitude. Just don't mess up the alignment, i.e. if you have a header before your data, make sure the size of the header is 4 KB * n where n is an integer.
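As an illustration of that blocked layout (a sketch only; N_X/N_Y/N_Z and the accessor are assumptions, not from the question), the index arithmetic for an 8x8x8 block of doubles (8*8*8*8 bytes = 4096 bytes, exactly one page) might look like:

#include <cstddef>

// Assumed volume dimensions, multiples of 8 for simplicity.
constexpr std::size_t N_X = 512, N_Y = 512, N_Z = 512;

struct Block { double voxels[8][8][8]; };   // exactly 4096 bytes = one page

// Blocks laid out as blocks[x/8][y/8][z/8] over the mapped file.
inline double readVoxel( const Block* blocks, std::size_t x, std::size_t y, std::size_t z )
{
    const std::size_t bx = x / 8, by = y / 8, bz = z / 8;
    const Block& b = blocks[ (bx * (N_Y / 8) + by) * (N_Z / 8) + bz ];
    return b.voxels[ x % 8 ][ y % 8 ][ z % 8 ];
}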

Memory access time slow with VirtualAllocExNuma on Windows 7/64

In our application we are running on a dual-Xeon server with memory configured as 12 GB local to each processor and a memory bus connecting the two Xeons. For performance reasons, we want to control where we allocate a large (>6 GB) block of memory. Below is simplified code -
DWORD processorNumber = GetCurrentProcessorNumber();
UCHAR nodeNumber = 255;
GetNumaProcessorNode((UCHAR)processorNumber, &nodeNumber);

// get amount of physical memory available on the node
ULONGLONG availableMemory = MAXLONGLONG;
GetNumaAvailableMemoryNode(nodeNumber, &availableMemory);

// make sure that we don't request too much. Initial limit will be 75% of available memory
_allocateAmt = qMin(requestedMemory, availableMemory * 3 / 4);

// allocate the cached memory region now
HANDLE handle = (HANDLE)GetCurrentProcess();
cacheObject = (char*) VirtualAllocExNuma(handle, 0, _allocateAmt,
                                         MEM_COMMIT | MEM_RESERVE,
                                         PAGE_READWRITE | PAGE_NOCACHE,
                                         nodeNumber);
The code as is, works correctly using VS2008 on Win 7/64.
In our application this block of memory functions as a cache store for static objects (1-2 MB each) that are normally stored on the hard drive. My problem is that when we transfer data into the cache area using memcpy, it takes more than 10 times as long as when we allocate memory using new char[xxxx]. And no other code changes.
We are at a loss to understand why this is happening. Any suggestions as to where to look?
PAGE_NOCACHE is murder on perf; it disables the CPU cache. Was that intentional?
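If uncached access wasn't intended, a sketch of the same allocation with ordinary write-back caching (only the protection flags change; allocateOnNode is a hypothetical wrapper, with allocateAmt and nodeNumber standing in for the values computed in the question's snippet):

#include <windows.h>

// Same NUMA-aware allocation as in the question, but with cacheable pages.
char* allocateOnNode(SIZE_T allocateAmt, UCHAR nodeNumber)
{
    HANDLE handle = GetCurrentProcess();
    return (char*) VirtualAllocExNuma(handle, NULL, allocateAmt,
                                      MEM_COMMIT | MEM_RESERVE,
                                      PAGE_READWRITE,      // no PAGE_NOCACHE
                                      nodeNumber);
}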