Memory Based Data Server for local IPC - C++

I am going to be running an app (or apps) that requires about 200 MB of market data each time it runs.
This is a trivial amount of data to keep in memory these days, so for speed that's what I want to do.
Over the course of a day's session I will probably run, re-run, rewrite and re-run one or more applications over and over.
So, the question is: how do I hold the data in memory all day, so that even if an app crashes I do not have to reload the data by opening the data file on disk and reading it in again?
My initial idea is to write a data server app that does nothing more than read the data into shared memory so that it is available for use. If I do that, I guess I could use memory mapping for the IPC by calling
CreateFile()
CreateFileMapping()
MapViewOfFile()
Is there a better IPC/approach?

If you have enough memory and nothing else asks for it, that might reduce your startup time. To guarantee access to the memory, you probably want a memory mapped file in named shared memory, as described in the Windows documentation on named shared memory. You can have a simple program create the shared-memory section and manage it, so you can guarantee it remains in memory.
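For illustration, a minimal sketch of such a keeper process, assuming a 200 MB payload and a made-up section name Local\MarketData (your client apps would call OpenFileMapping and MapViewOfFile with the same name):

    // Minimal sketch: create a pagefile-backed named section, load the market
    // data into it once, then simply stay alive so the section keeps existing.
    // The section name and size are assumptions, not anything from the question.
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        const unsigned long long kSize = 200ull * 1024 * 1024;   // 200 MB
        HANDLE hMap = CreateFileMappingW(
            INVALID_HANDLE_VALUE,                // backed by the paging file, not a disk file
            nullptr, PAGE_READWRITE,
            static_cast<DWORD>(kSize >> 32),
            static_cast<DWORD>(kSize),
            L"Local\\MarketData");               // hypothetical name
        if (!hMap) { std::fprintf(stderr, "CreateFileMapping failed: %lu\n", GetLastError()); return 1; }

        void* view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
        if (!view) { std::fprintf(stderr, "MapViewOfFile failed: %lu\n", GetLastError()); return 1; }

        // ... read the data file from disk into `view` exactly once ...

        std::puts("Data loaded; press Enter to shut the server down.");
        std::getchar();                          // keep the section alive all day

        UnmapViewOfFile(view);
        CloseHandle(hMap);
        return 0;
    }

As long as this process (or any process holding a handle or view) stays alive, the section and its contents stay in memory, crashes of the consumer apps notwithstanding.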

Just memory map the data file. Unless your computer is low on memory, the file will stay in file cache even when the program exits. The next time it starts up, access will be fast.
If your in-memory data is different from the on-disk data, just use two files. On restart, compare a timestamp and a file revision written into the memory file against the disk file; that way your program will know which one has the most recent data.
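A rough sketch of reading such a revision, assuming a hypothetical DataHeader layout at the start of each file (error handling trimmed):

    // Map a file read-only and read a revision counter from its first bytes.
    // The DataHeader layout is an assumption, not something from the answer above.
    #include <windows.h>
    #include <cstdint>

    struct DataHeader {
        std::uint64_t revision;      // bumped on every update (assumed convention)
        std::uint64_t timestamp;     // e.g. FILETIME ticks of the last write
    };

    std::uint64_t ReadRevision(const wchar_t* path)
    {
        HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 0;

        HANDLE map = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        CloseHandle(file);                       // the mapping keeps the file referenced
        if (!map) return 0;

        auto* hdr = static_cast<const DataHeader*>(
            MapViewOfFile(map, FILE_MAP_READ, 0, 0, sizeof(DataHeader)));
        std::uint64_t rev = hdr ? hdr->revision : 0;

        if (hdr) UnmapViewOfFile(hdr);
        CloseHandle(map);
        return rev;
    }

Mapping both files like this and keeping whichever reports the newer revision matches the comparison described above.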

Related

How do on-disk databases handle file reading and writing at the file-system level?

Suppose I were to write my own database in C++, and suppose I used a binary tree or a hash map as the underlying data structure. How would I handle updates to this data structure?
1) Should I first create the binary tree and then somehow persist it to disk? And every time the data has to be updated, do I need to open this file and update it? Wouldn't that be a costly operation?
2) Is there a way to work directly on the binary tree without loading it into memory and then persisting it again?
3) How do SQLite and MySQL deal with this?
4) My main question is: how do databases persist huge amounts of data and concurrently make updates to it, without opening and closing the file each time?
Databases see the disk or file as one big block device and manage its blocks in M-way balanced trees. They insert/update/delete records in these blocks and flush dirty blocks back to disk. They maintain allocation tables of free blocks so the database does not need to be rewritten on each access. Because RAM is expensive but fast, pages are kept in a RAM cache. Separate indexes (either separate files or just blocks) provide quick access by key. Blocks are often the native allocation size of the underlying filesystem (e.g. the cluster size). Undo/redo logs are maintained for atomicity. Etc.
Much more could be said, and this question really belongs on Computer Science Stack Exchange. For more information read Horowitz & Sahni, "Fundamentals of Data Structures", p. 496.
As to your questions:
1) You open the file once and keep it open while your database manager is running. You allocate storage as needed and maintain an M-way tree as described above.
2) Yes. You read blocks that you keep in a cache.
3) and 4) See above.
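To make 2) a little more concrete, here is a deliberately naive block-cache sketch, assuming 4 KB blocks and ignoring eviction, 64-bit seeks and logging:

    // Blocks are read from the database file on demand, kept in RAM, and dirty
    // ones are written back later. A real engine adds eviction, 64-bit offsets
    // (_fseeki64/fseeko) and the undo/redo logging described above.
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    constexpr std::size_t kBlockSize = 4096;     // would match the filesystem cluster size

    struct Block {
        std::vector<std::uint8_t> bytes;
        bool dirty = false;
    };

    class BlockCache {
    public:
        explicit BlockCache(std::FILE* f) : file_(f) {}

        Block& Get(std::uint64_t blockNo) {      // read-through cache
            auto it = cache_.find(blockNo);
            if (it != cache_.end()) return it->second;
            Block b;
            b.bytes.resize(kBlockSize);
            std::fseek(file_, static_cast<long>(blockNo * kBlockSize), SEEK_SET);
            std::fread(b.bytes.data(), 1, kBlockSize, file_);
            return cache_.emplace(blockNo, std::move(b)).first->second;
        }

        void FlushDirty() {                      // write back only the modified blocks
            for (auto& [no, b] : cache_) {
                if (!b.dirty) continue;
                std::fseek(file_, static_cast<long>(no * kBlockSize), SEEK_SET);
                std::fwrite(b.bytes.data(), 1, kBlockSize, file_);
                b.dirty = false;
            }
        }

    private:
        std::FILE* file_;
        std::unordered_map<std::uint64_t, Block> cache_;
    };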
Typically, you would not do file I/O to access the data. Use mmap to map the data into the virtual address space of the process and let the OS block cache take care of the reads and writes.
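A minimal POSIX sketch of that suggestion (error handling trimmed):

    // Map the whole database file and let the kernel page cache service the
    // reads and writes through the returned pointer.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void* map_database(const char* path, size_t* out_size)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0) return nullptr;

        struct stat st {};
        fstat(fd, &st);
        *out_size = static_cast<size_t>(st.st_size);

        // Dirty pages are written back by the kernel; msync() can force it.
        void* base = mmap(nullptr, *out_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                               // the mapping stays valid after close
        return base == MAP_FAILED ? nullptr : base;
    }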

Memory mapped IO concept details

I'm trying to figure out the best way to write files on Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these constructs exactly (tools like DiskMon won't work here, for good reasons), so I decided to ask. What I basically want to know is: how can I best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While a worker is doing its thing, it needs pieces of the file. So I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
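Roughly, the flow I have in mind looks like this (single-threaded for brevity; the file name, mapping size and 1 MB view window are made up, and error handling plus the STORAGE_PROPERTY_QUERY sector-size lookup are omitted):

    #include <windows.h>

    int main()
    {
        // (1) Open the backing file once per process.
        HANDLE file = CreateFileW(L"store.bin", GENERIC_READ | GENERIC_WRITE,
                                  FILE_SHARE_READ, nullptr, OPEN_ALWAYS,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);

        // (2) Create the mapping object; growing the file later means closing
        //     this handle and calling CreateFileMapping again with a larger size.
        const unsigned long long kMappingSize = 64ull * 1024 * 1024;
        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                            static_cast<DWORD>(kMappingSize >> 32),
                                            static_cast<DWORD>(kMappingSize), nullptr);

        // (3) Map only the piece a worker needs; the offset must be a multiple
        //     of the allocation granularity (64 KB).
        const unsigned long long offset = 0;
        const SIZE_T length = 1 * 1024 * 1024;   // 1 MB window
        void* view = MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE,
                                   static_cast<DWORD>(offset >> 32),
                                   static_cast<DWORD>(offset), length);

        // ... the worker reads and writes through `view` here ...

        UnmapViewOfFile(view);                   // dirty pages are queued for write-back
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }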
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure whether they will fail if you have, say, 100 workers.
I do understand that (written) data is immediately visible to other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff and afterwards update a (single-physical-block) header indicating that readers should use another pointer from now on - is it then guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING, not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.

Shared memory or named pipes in RAM?

I want to communicate between two different programs: a modded Ambilight program which outputs LED information, and my own program that reads this information.
I have read about named pipes and shared memory, but it is not clear to me where the data is stored. Since I will be exchanging a lot of data, I do not want to write it to disk every time. I am using a Raspberry Pi and the SD card should last for some more time ;)
So the basic question is: with what method can I exchange information with the other end without writing to the disk? I am not sure whether shared memory lives in RAM; I want to make this clear.
Another idea I read about is /dev/shm, which should be a RAM disk. Can I also use named pipes at this location, and will the information then be kept in RAM?
What's the best way to go? Thanks :)
I have read about named pipes and shared memory, but it is not clear to me
where the data is stored.
In both cases the data is stored in memory (named pipes look like they live on the filesystem, but the actual data is kept in memory).
Which method is better depends on the actual application. Pipes have a fairly limited buffer (most likely 64 KB) and writing to them blocks when the buffer is full. Shared memory can be arbitrarily large, but on the downside, shared memory is, well, just that - plain memory. You have to take care of synchronization etc. yourself.
Shared memory, named pipe and Unix domain socket IPC won't write to your SD card unless you allocate more memory than the available physical RAM, which is either 256 MB or 512 MB depending on your Raspberry Pi model. If you do, the system will start swapping and will probably slow down.
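For what it's worth, a small sketch of the shared-memory route; the object name and payload size are made up. shm_open() objects live in tmpfs (visible under /dev/shm), so nothing here touches the SD card:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstring>

    int main()
    {
        const char*  kName = "/ambilight_leds";  // made-up name; both programs must agree on it
        const size_t kSize = 3 * 240;            // e.g. RGB bytes for 240 LEDs

        int fd = shm_open(kName, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return 1;
        ftruncate(fd, kSize);                    // size the object once

        void* mem = mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (mem == MAP_FAILED) return 1;

        // Writer: publish the latest LED frame; the reader maps the same name.
        // Synchronization (a process-shared mutex, a sequence counter, ...) is
        // your job, as noted above.
        std::memset(mem, 0, kSize);

        munmap(mem, kSize);
        // shm_unlink(kName);                    // remove the object when no longer needed
        return 0;
    }

(On older glibc you may need to link with -lrt.)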

How to asynchronously flush a memory mapped file?

I am using memory mapped files to have read/write access to a large number of image files (~10000 x 16 MB) under Windows 7 64-bit. My goals are:
Having as much data cached as possible.
Being able to allocate new images and write to those as fast as possible.
Therefore I am using memory mapped files to access the files. Caching works well, but the OS does not flush dirty pages until I am nearly out of physical memory. Because of that, allocating and writing to new files becomes quite slow once physical memory is filled.
One solution would be to call FlushViewOfFile() regularly, but this function does not return until the data has been written to disk.
Is there a way to asynchronously flush a file mapping? The only solution I have found is to call UnmapViewOfFile() and then MapViewOfFile() again, but with this approach I cannot be sure to get the same data pointer back. Can someone suggest a better approach?
Edit:
Reading the WINAPI documentation a little longer, it seems that I found a suitable solution to my problem:
Calling VirtualUnlock() on a memory range that is not locked results in a flushing of dirty pages.
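A small sketch of that trick (treat it as an experiment: it relies on the working-set behaviour described above rather than on a documented flush API):

    #include <windows.h>

    void RequestAsyncWriteback(void* viewBase, SIZE_T bytes)
    {
        // VirtualUnlock on pages that were never locked fails with
        // ERROR_NOT_LOCKED, but as a side effect the pages leave the working
        // set and the dirty ones can be written back in the background.
        if (!VirtualUnlock(viewBase, bytes) && GetLastError() != ERROR_NOT_LOCKED) {
            FlushViewOfFile(viewBase, bytes);    // some other failure: fall back to a blocking flush
        }
    }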
I have heard that the FlushViewOfFile() function does NOT wait until the data is physically written to the file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366563(v=vs.85).aspx
The FlushViewOfFile function does not flush the file metadata, and it does not wait to return until the changes are flushed from the underlying hardware disk cache and physically written to disk.
After calling FlushFileBuffers(...), your data will be physically written to disk.
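So a fully synchronous, durable flush would look roughly like this (the HANDLE is assumed to be the file handle the mapping was created from):

    #include <windows.h>

    bool FlushDurably(void* viewBase, SIZE_T bytes, HANDLE fileHandle)
    {
        // FlushViewOfFile hands the dirty pages to the file system;
        // FlushFileBuffers then blocks until they are physically on disk.
        if (!FlushViewOfFile(viewBase, bytes))
            return false;
        return FlushFileBuffers(fileHandle) != FALSE;
    }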

Are memory mapped files bad for constantly changing data?

I have a service that is responsible for collecting a constantly updating stream of data off the network. The intent is that the entire data set must be available for use (read only) at any time. This means that everything from the newest data message that arrives down to the oldest should be accessible to client code.
The current plan is to use a memory mapped file on Windows, primarily because the data set is enormous, spanning tens of GiB. There is no way to know which part of the data will be needed, but when it's needed, the client might need to jump around at will.
Memory mapped files fit the bill. However, I have seen it said (written) that they are best for data sets that are already defined and not constantly changing. Is this true? Can the scenario that I described above work reasonably well with memory mapped files?
Or am I better off keeping the memory mapped file for almost all of the history of the incoming data, but storing the most recent, say, 100 MB in a separate memory buffer? Every time this buffer becomes full, I would move it to the memory mapped file and then clear it.
Any data set that is defined and doesn't change is best!
Memory mapped files generally win over anything else - most OSs will cache the accesses in RAM anyway.
And the performance will be predictable: you don't fall off a cliff when you start to swap.
Sounds like a database fits your description. Paging is something most commercial ones do well out of the box.
From your problem statement, I see the following requirements:
1) data must always be available
2) data is written once; I assume it is append-only and never overwritten
3) the read access pattern is random, i.e. jumping around
4) there also appears to be an implicit latency requirement
It seems to me that the memory mapped file was chosen to address 3) and 4). If your data set fits into memory, this may well be a reasonable solution. However, if it is too large to fit in memory, a memory mapped file may run into performance problems due to frequent page faults.
You did not describe how the "jumping around" is done. If it is possible to build an index, you may be able to save the data into multiple files, keep the index in memory, use the index to load and serve data, and also cache the most frequently used data. The basic idea is similar to a disk-based hash. This is probably a more scalable solution.
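A loose sketch of that idea; the names and the (file, offset, length) record layout are made up. The index lives entirely in RAM, while the payload stays in the data files on disk:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct RecordLocation {
        std::uint32_t fileIndex;   // which data file
        std::uint64_t offset;      // byte offset within that file
        std::uint32_t length;      // record size in bytes
    };

    class DataStore {
    public:
        explicit DataStore(std::vector<std::string> paths) : paths_(std::move(paths)) {}

        void AddToIndex(std::uint64_t key, RecordLocation loc) { index_[key] = loc; }

        bool Read(std::uint64_t key, std::vector<char>& out) {
            auto it = index_.find(key);
            if (it == index_.end()) return false;
            const RecordLocation& loc = it->second;
            std::ifstream f(paths_[loc.fileIndex], std::ios::binary);
            f.seekg(static_cast<std::streamoff>(loc.offset));
            out.resize(loc.length);
            return static_cast<bool>(f.read(out.data(), loc.length));
        }

    private:
        std::vector<std::string> paths_;                          // data files on disk
        std::unordered_map<std::uint64_t, RecordLocation> index_; // kept in memory
    };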
Since you tagged this Win32, I'm assuming you're working on a 32-bit machine, in which case you simply don't have enough address space to memory map your entire data set. This means you will have to create and destroy mappings into the file as you "jump around", which is going to make this less efficient than you might expect.
In practice, you typically have a bit more than 1 GB of contiguous address space to map the file into on a 32-bit Windows box, and you can end up with less if your address space is fragmented.
That being said, doing this with memory maps does have a benefit if you are memory (not address space) constrained: when you memory map a file as read-only (as opposed to explicitly reading it into memory), the OS will not keep a second copy in the file system cache.
The file can be mapped as read-only in the thread that presents the data, while a background worker thread has the file mapped as read-write to do the appending.
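A minimal sketch of that arrangement, assuming both views come from the same mapping object; which thread uses which view is a convention you enforce yourself:

    #include <windows.h>

    struct DualViews {
        const void* readOnly;    // used by the thread that presents the data
        void*       readWrite;   // used by the background appender
    };

    DualViews MapBothViews(HANDLE mapping, SIZE_T bytes)
    {
        // Two views of the same mapping object see the same pages, so anything
        // the appender writes is immediately visible through the read-only view.
        DualViews v{};
        v.readOnly  = MapViewOfFile(mapping, FILE_MAP_READ,  0, 0, bytes);
        v.readWrite = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, bytes);
        return v;
    }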