How do in-memory databases (such as Redis) communicate with other applications?

I'm in the process of implementing a local database that lets me transfer information and data between C and Fortran programs (C for control flow, Fortran for matrix computation).
I understand the idea of an in-memory database, but how do programs get data from it? Do I need to open local ports and make a regular network connection to it? Or is there a lower-level system call that lets programs communicate with it directly?
In my head I'm going back and forth between two designs: one big C program that runs the database inside of it alongside the Fortran matrix computation (which wouldn't operate directly on the database), or just storing the data in a binary file and reopening it between programs.
I also understand that using someone else's software would be easier and faster, but I want to do it myself to increase my understanding and programming chops.

I can't speak to Redis, but I can tell you about my company's implementation: eXtremeDB.
eXtremeDB is written mostly in C (C++ for the SQL, some assembly, e.g. for spinlocks). We offer native and SQL APIs for many languages that can be used interchangeably.
For a mixed-language scenario like you describe, the database would be created in shared (named) memory that is mapped into each process's address space. As such, it is an 'embedded' database: the database runtime is a set of shared libraries that get linked with the application. So this fits your description of "a big C program that runs the database inside"; substitute 'Fortran' for 'C' as appropriate.
Accordingly, the processes have direct, very fast access to the stored data through the published interfaces (i.e. no inter-process communication overhead, compared to a client/server architecture). The database runtime controls concurrent access. "Access" can be through SQL (SELECT * FROM table WHERE ...) or through a native API. The native API is faster, of course, and for a roll-your-own approach much more tractable (implementing an SQL engine is kind of a big deal).
You'd probably want to implement 'load' and 'store' interfaces to save and reload the in-memory database between runs. This is pretty simple: the in-memory database lives in one contiguous piece of memory (e.g. use your OS's shared memory operations to allocate 5 MB of shared memory and map it into the local address space), which can simply be streamed out to persistent media. This implies that your database runtime will need a sub-allocator to dole out smaller chunks for storing objects. If there are relationships between objects in the shared-memory database, make sure they are stored as offsets rather than direct pointer references, because there's no guarantee that the database will be mapped at the same starting address on a subsequent run.
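To make that concrete, here is a minimal sketch of the layout just described, assuming a POSIX system (shm_open/mmap). The segment name, header fields, and bump allocator are invented for illustration, and the locking a real runtime would need is omitted:

    /* Hypothetical shared-memory database skeleton (POSIX). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DB_SHM_NAME "/mydb"            /* made-up segment name */
    #define DB_SIZE (5 * 1024 * 1024)      /* 5 MB, as in the text */

    /* Header at the start of the segment; all "pointers" are offsets
       from the base address, so the segment can map at a different
       address on the next run. */
    typedef struct {
        uint32_t magic;     /* sanity check when re-attaching */
        uint32_t brk;       /* sub-allocator watermark (offset of free space) */
        uint32_t first_obj; /* offset of the first stored object, 0 if none */
    } db_header;

    /* Trivial bump sub-allocator: hands out offsets, not pointers. */
    static uint32_t db_alloc(void *base, uint32_t nbytes)
    {
        db_header *h = (db_header *)base;
        if (h->brk + nbytes > DB_SIZE) return 0;   /* out of space */
        uint32_t off = h->brk;
        h->brk += nbytes;
        return off;
    }

    int main(void)
    {
        int fd = shm_open(DB_SHM_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, DB_SIZE) < 0) { perror("ftruncate"); return 1; }

        void *base = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        db_header *h = (db_header *)base;
        if (h->magic != 0xDBDBDBDBu) {     /* a fresh segment is zeroed */
            h->magic = 0xDBDBDBDBu;
            h->brk = sizeof(db_header);
        }

        /* Store one object; a Fortran process mapping the same segment
           would see it at base + offset. */
        uint32_t off = db_alloc(base, 64);
        if (off) {
            strcpy((char *)base + off, "row 1");
            h->first_obj = off;
        }

        /* 'store' is just streaming the contiguous segment to disk. */
        FILE *f = fopen("mydb.img", "wb");
        if (f) { fwrite(base, 1, DB_SIZE, f); fclose(f); }

        munmap(base, DB_SIZE);
        close(fd);
        return 0;
    }

A 'load' is the reverse: read the file back into a freshly mapped segment. Because all internal references are offsets, the contents remain valid wherever the segment lands.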

Related

QSharedMemory with no Qt application

I have an application A, and I want to share some information with an application B.
Application A writes information every ~150 ms.
Application B reads information at arbitrary times.
I searched and found QSharedMemory, which looks great, but application B will not be developed by my company, so I can't choose the programming language.
Is QSharedMemory a good idea?
How can I do that?
QSharedMemory is a thin wrapper around named and unnamed platform shared memory. When named, there's simply a file that the other application can memory-map and use from any programming language, as long as said language supports binary buffers.
I do wonder if it wouldn't be easier, though, if you used a pipe for IPC. QLocalSocket encapsulates that on Qt's end, and the other side simply uses a native pipe.
Shared memory makes sense only in certain scenarios, like, say, pushing images between applications when they may not change all that much from one update to the next - where the cost of pushing the entire image every time would be prohibitive given the small-on-average bandwidth of changes. The image doesn't need to be a visual image; it may be an industrial process image, etc.
In many cases, shared memory is a premature pseudo-optimization that makes things much harder than necessary, and can, in case of a multitude of communicating processes, become a pessimization - you do pay the cost in virtual memory for each shared memory segment.
Sounds like you need to implement a simple server; using local sockets, it should be pretty fast in terms of bandwidth and easy to develop. The server would store data from A and deliver it to B upon request.
Obviously, it won't work "with no application" in between. Whether you go for shared memory or a local socket, you will need some server code running at all times to service A and B. If A is running all the time, the server can well be part of it, but it can also be standalone.
It would be preferable to use a local socket, because the socket API is more portable across programming languages; A and B can then be implemented in arbitrary languages and frameworks and communicate at the socket protocol level. With QSharedMemory it won't be as portable in your scenario.
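For reference, a bare-bones sketch of such a server in C, assuming a POSIX system (AF_UNIX sockets). The socket path and the one-byte 'W'/'R' protocol are made up for illustration; a real design would need message framing and proper error handling:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/ab-ipc.sock"   /* hypothetical path */

    int main(void)
    {
        char latest[4096] = "";            /* last value written by A */
        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        if (srv < 0) { perror("socket"); return 1; }

        struct sockaddr_un addr;
        memset(&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof addr.sun_path - 1);
        unlink(SOCK_PATH);                 /* remove a stale socket file */
        if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(srv, 8) < 0) { perror("bind/listen"); return 1; }

        for (;;) {
            int c = accept(srv, NULL, NULL);
            if (c < 0) continue;
            char cmd;
            if (read(c, &cmd, 1) == 1) {
                if (cmd == 'W') {          /* A pushes the new state */
                    ssize_t n = read(c, latest, sizeof latest - 1);
                    if (n > 0) latest[n] = '\0';
                } else if (cmd == 'R') {   /* B pulls the current state */
                    write(c, latest, strlen(latest));
                }
            }
            close(c);
        }
    }

Since clients only speak the socket protocol, B can be written in any language that can open a local socket.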

how to cache 1000s of large C++ objects

Environment:
Windows 8 64-bit, Windows Server 2008 64-bit
Visual Studio 2012 Professional, 64-bit
list<CMyObject> L; // I have 1000s of large CMyObject in my program that I cache, shared by different threads in my Windows service program.
For our SaaS middleware product, we cache in memory 1000s of large C++ objects (read-only const objects, each about 4 MB in size), which runs the system out of memory. Can we associate a disk file (or some other OS-managed persistent mechanism) with our C++ objects? There is no need for sharing / inter-process communication.
The disk file will suffice if it works for the duration of the process (our Windows service program). The read-only const C++ objects are shared by different threads in the same Windows service.
I was even considering using a document database (like MongoDB) to store the objects, which would then be loaded / unloaded at each use. Though hopefully faster than reading our serialized file, it would still hurt performance.
The purpose is to retain the caching of C++ objects for performance reasons and avoid having to load / unload the serialized C++ objects every time. It would be great if this disk file were OS-managed and required minimal tweaking in our code.
The only thing that is OS-managed in the manner you describe is the swap file. You could create a separate application (call it a "cache helper") that loads all the objects into memory and waits for requests. Since it does not actively use its memory pages, the OS will eventually displace those pages to the swap file, paging them back in only if/when needed.
Communication with the application can be done through named pipes or sockets.
The disadvantages of this approach are that the cache's performance will be highly volatile, and it may degrade the performance of the whole server.
I'd recommend writing your own caching algorithm/application, as you may later need to adjust its properties.
One solution is of course to simply load every object and let the OS deal with swapping it in from / out to disk as required (or load dynamically, but never discard unless the object is actually being destroyed). This approach works well if some objects are used much more frequently than others, and loading from swap space is almost certainly faster than anything you can write yourself. The exception is if you do know beforehand which objects are more or less likely to be used next, and can "throw out" the right objects when memory is low.
You can certainly also use a memory-mapped file. This allows you to read from and write to the file as if it were memory (and the OS will cache the content in RAM as memory is available). On Windows, you use CreateFileMapping or OpenFileMapping to create/open the file mapping, and then MapViewOfFile to map the file into memory. When finished, use UnmapViewOfFile to unmap the memory, and then CloseHandle to close the file mapping.
The only worry with a file mapping is that it may not appear at the same address in memory next time around, so you can't keep raw pointers inside the mapping and load the same data as binary next time. It would of course work fine to create a new file mapping each time.
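By way of illustration, a condensed sketch of those calls in C; the file name and the 16 MB size are placeholders, and error handling is abbreviated:

    #include <windows.h>
    #include <string.h>

    int main(void)
    {
        HANDLE file = CreateFileA("cache.bin", GENERIC_READ | GENERIC_WRITE,
                                  0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                                  NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        /* Back the mapping with a 16 MB file (placeholder size). */
        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE,
                                            0, 16 * 1024 * 1024, NULL);
        if (!mapping) { CloseHandle(file); return 1; }

        char *view = (char *)MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS,
                                           0, 0, 0);  /* map the whole file */
        if (view) {
            /* Reads/writes through 'view' go to the file; the OS caches
               hot pages in RAM. Remember: no raw pointers inside the
               mapping, since the base address may differ next time. */
            memcpy(view, "object data...", 15);
            UnmapViewOfFile(view);
        }
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }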
So your thousands of massive objects have constructors, destructors, virtual functions and pointers. This means you can't easily page them out yourself. The OS can do it for you, though, so your most practical approach is simply to add more physical memory, possibly an SSD swap volume, and use that 64-bit address space. (I don't know how much is actually addressable on your OS, but presumably enough to fit your ~4 GB of objects.)
Your second option is to find a way to just save some memory. This might mean using a specialized allocator to reduce slack, or removing layers of indirection. You haven't given enough information about your data for me to make concrete suggestions on this.
A third option, assuming you can fit your program in memory, is simply to speed up your deserialization. Can you change the format to something you can parse more efficiently? Can you somehow deserialize objects quickly on-demand?
The final option, and the most work, is to manually manage a swap file. As a first step, it would be sensible to split your massive polymorphic classes in two: a polymorphic flyweight (with one instance per concrete subtype) and a flattened aggregate context structure. The aggregate is the part you can swap in and out of your address space safely.
Now you just need a memory-mapped paging mechanism, some kind of cache that tracks which pages are currently mapped, and possibly a smart pointer that replaces your raw pointers with a page+offset pair and maps data in on demand. Again, you haven't given enough information about your data structures and access patterns to make more detailed suggestions.
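As a rough illustration of that last idea, here is a toy page+offset scheme in C, assuming POSIX mmap. The handle type, page size, and file name are invented, and a real cache would need an eviction policy, reference counting, and thread safety:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PAGE_SZ (4 * 1024 * 1024)  /* one ~4 MB aggregate per page, say */
    #define NPAGES  16                 /* illustrative */

    typedef struct { uint32_t page; uint32_t offset; } obj_handle;

    static void *page_cache[NPAGES];   /* NULL = not currently mapped */
    static int swap_fd = -1;

    /* Map the page holding 'h' on demand; return a raw pointer into it. */
    static void *resolve(obj_handle h)
    {
        if (!page_cache[h.page]) {
            void *p = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
                           MAP_SHARED, swap_fd, (off_t)h.page * PAGE_SZ);
            if (p == MAP_FAILED) return NULL;
            page_cache[h.page] = p;
        }
        return (char *)page_cache[h.page] + h.offset;
    }

    /* Drop a page to free address space; the data stays in the file. */
    static void evict(uint32_t page)
    {
        if (page_cache[page]) {
            munmap(page_cache[page], PAGE_SZ);
            page_cache[page] = NULL;
        }
    }

    int main(void)
    {
        swap_fd = open("swapfile.bin", O_RDWR | O_CREAT, 0600);
        if (swap_fd < 0) return 1;
        if (ftruncate(swap_fd, (off_t)NPAGES * PAGE_SZ) < 0) return 1;

        obj_handle h = { 3, 0 };       /* "object 3" lives in page 3 */
        char *ctx = resolve(h);        /* the flattened aggregate part */
        if (ctx) ctx[0] = 42;
        evict(h.page);
        close(swap_fd);
        return 0;
    }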

Sharing data between applications - share memory vs D-Bus vs File operation

Consider a scenario in which two applications have to share data between them. I can think of three ways:
Shared memory (I am allowed to use Boost)
D-Bus (glib / Qt implementation allowed)
File operations on a common file between the two applications
Q1. Which should be my approach, considering the data to be shared is going to be very large (some 10K song names, for example)?
Q2. Will doing file operations affect speed compared to the others, since the hard disk would be involved?
Q3. Is there any other approach available with better speed?
Language of implementation - C++
You may want to consider using the QtSql module to use a database, specifically SQLite.
The SQLite database is a cross-platform, in-process database engine. It lets you store structured data easily and access it concurrently and safely from multiple processes; the processes can even be written in different languages.
SQLite works fine with millions of records, is very fast and reliable.
The main limitation is concurrent writes: SQLite uses database-level locking, so no other process can read or write to the database during a write operation.
The other benefit to using QtSql is that in the future you can easily make the programs work across a network by using a database server such as PostgreSQL or MySQL.
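To show how little code this takes, here is a minimal sketch against the SQLite C API directly (QtSql wraps the same engine); the file name and schema are placeholders:

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("shared.db", &db) != SQLITE_OK) return 1;

        /* SQLite serializes writers with database-level locking, so two
           processes can run this concurrently without corrupting the file. */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS songs("
            "  id INTEGER PRIMARY KEY, name TEXT);"
            "INSERT INTO songs(name) VALUES('song-0001');",
            NULL, NULL, NULL);

        sqlite3_stmt *stmt;
        if (sqlite3_prepare_v2(db, "SELECT id, name FROM songs", -1,
                               &stmt, NULL) == SQLITE_OK) {
            while (sqlite3_step(stmt) == SQLITE_ROW)
                printf("%d %s\n", sqlite3_column_int(stmt, 0),
                       (const char *)sqlite3_column_text(stmt, 1));
            sqlite3_finalize(stmt);
        }
        sqlite3_close(db);
        return 0;
    }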

Store a huge amount of data in memory

I am looking for a way to store several gigabytes of data in memory. The data is loaded into a tree structure. I want to be able to access this data through my main function, but I'm not interested in reloading the data into the tree every time I run the program. What is the best way to do this? Should I create a separate program for loading the data and then call it from the main function, or are there better alternatives?
I'd say the best alternative would be using a database - which would be then your "separate program for loading the data".
If you are using a POSIX-compliant system, then take a look at mmap.
Windows has equivalent functionality for memory-mapping a file (CreateFileMapping/MapViewOfFile).
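For example, a minimal POSIX sketch that maps a data file instead of re-reading it; the file name is a placeholder, and any node links inside the file would need to be offsets rather than pointers:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("tree.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* The OS pages the file in lazily and keeps hot pages cached
           across runs, so a restart does not re-read everything. */
        const char *data = mmap(NULL, st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... walk the structure in place via offsets ... */

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }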
You could probably solve this using shared memory: have one long-lived process build the tree and expose its address, and then other processes that start up can attach to that same memory for querying. Note that you will need to make sure the tree is safe to read by multiple simultaneous processes; if the reads are really just pure reads, that should be easy enough.
You should look into a technique called a Memory mapped file.
I think the best solution is to configure a cache server and put data there.
Look into Ehcache:
"Ehcache is an open source, standards-based cache used to boost performance, offload the database and simplify scalability. Ehcache is robust, proven and full-featured and this has made it the most widely-used Java-based cache."
It's written in Java, but should support any language you choose:
"The Cache Server has two APIs: RESTful resource oriented, and SOAP. Both support clients in any programming language."
You must be running a 64-bit system to use more than 4 GB of memory. If you build the tree and set it as a global variable, you can access the tree and data from any function in the program. I suggest you perhaps try an alternative method that requires less memory consumption. If you post what type of program and what type of tree you're building, I can perhaps help you find an alternative method.
Since you don't want to keep reloading the data, file storage and databases are out of the question, but several gigabytes of memory seems like a hefty price.
Also note that on Windows systems you can access the memory of another program using ReadProcessMemory(); all you need is a pointer to the location of the memory.
You could alternatively implement the data loader as an executable program and the main program as a DLL that is loaded and unloaded on demand. That way you can keep the data in memory and still be able to modify the processing code without reloading all the data or resorting to cross-process memory sharing.
Also, if you can operate on the raw data from disk without preprocessing it (e.g. building it into a tree, or manipulating pointers to its internals), you may want to memory-map the data and avoid loading unused portions of it.

Fastest small datastore on Windows

My app keeps track of the state of about 1000 objects. Those objects are read from and written to a persistent store (serialized) in no particular order.
Right now the app uses the registry to store each object's state. This is nice because:
It is simple
It is very fast
An individual object's state can be read/written without needing to read some larger entity (like pulling out a snippet from a large XML file)
There is a decent editor (RegEdit) which allow easily manipulating individual items
Having said that, I'm wondering if there is a better way. SQLite seems like a possibility, but you don't get the same multiple-reader/multiple-writer support that you get with the registry, and there's no simple way to edit existing entries.
Any better suggestions? A bunch of flat files?
If what you mean by 'multiple-reader/multiple-writer' is that you keep a lot of threads writing to the store concurrently, SQLite is threadsafe (you can have concurrent SELECTs, and concurrent writes are handled transparently). See the FAQ at http://www.sqlite.org/faq.html and search for 'threadsafe'.
If you do begin to experiment with SQLite, you should know that "out of the box" it might not seem as fast as you would like, but it can quickly be made to be much faster by applying some established optimization tips:
SQLite optimization
Depending on the size of the data and the amount of RAM available, one of the biggest performance gains comes from having SQLite use an all-in-memory database rather than writing to disk.
For in-memory databases, pass ":memory:" as the filename argument to sqlite3_open (the documented special name); the older trick of passing NULL also works, provided TEMP_STORE is defined appropriately.
On the other hand, if you tell SQLite to use the hard disk, you get a benefit similar to your current use of RegEdit for manipulating the program's data "on the fly."
The way you could simulate your current RegEdit technique with SQLite would be to use the sqlite3 command-line tool to connect to the on-disk database. You can run UPDATE statements on the data from the command line while your main program is running (and/or while it is paused in break mode).
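For completeness, opening the in-memory variant from the C API is a one-liner; a minimal sketch (the table name is a placeholder):

    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        /* ":memory:" is SQLite's documented special filename for a
           private in-memory database; nothing touches disk. */
        if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db,
            "CREATE TABLE state(key TEXT PRIMARY KEY, val BLOB);",
            NULL, NULL, NULL);
        sqlite3_close(db);
        return 0;
    }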
I doubt any sane person would go this route these days; however, some of what you describe could be done with Windows Structured/Compound Storage. I only mention this since you're asking about Windows, and this is/was an official Windows way to do it.
This is how DOC files were put together (but not the new DOCX format). From MSDN it will appear really complicated, but I've used it; it isn't the worst API in Win32.
It is not simple.
It is fast; I would guess faster than the registry.
An individual object's state can be read/written without needing to read some larger entity.
There is no decent editor; there are only some really basic tools (VC++ 6.0 had the "DocFile Viewer" under Tools; yes, that's what that thing did). I found a few more online.
You get a file instead of registry keys.
You gain some old-school Windows developer geek-cred.
Other random thoughts:
I think XML is the way to go (despite the random-access issue). Heck, INI files may work. The registry gives you very fine-grained security if you need it; people seem to forget this when they claim files are better. An embedded DB seems like overkill if I'm understanding what you're doing.
Do you need to persist the objects on each change event, or can you keep them in memory and store them on shutdown? If the latter, just load them up and serialize them at the end; assuming your app runs for a long time (and you don't share that state with another program), in-memory is going to be a winner.
If you've got fixed-size structures, you could consider just using a memory-mapped file and allocating memory from that.
If the only thing you do is serialize/deserialize individual objects (no fancy queries), then use a B-tree database, for example Berkeley DB. It is very fast at storing and retrieving chunks of data by key (I assume your objects have some id that can be used as a key), and access by multiple processes is supported.
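To illustrate, a minimal Berkeley DB sketch in C that stores and fetches one serialized object by integer key; the file name is a placeholder and error handling is abbreviated:

    #include <db.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DB *dbp;
        if (db_create(&dbp, NULL, 0) != 0) return 1;
        if (dbp->open(dbp, NULL, "objects.db", NULL,
                      DB_BTREE, DB_CREATE, 0600) != 0) return 1;

        /* Store one object blob under its id. */
        DBT key, val;
        memset(&key, 0, sizeof key);
        memset(&val, 0, sizeof val);
        int id = 42;
        char blob[] = "serialized object bytes";
        key.data = &id;   key.size = sizeof id;
        val.data = blob;  val.size = sizeof blob;
        dbp->put(dbp, NULL, &key, &val, 0);

        /* Fetch it back by key. */
        DBT out;
        memset(&out, 0, sizeof out);
        if (dbp->get(dbp, NULL, &key, &out, 0) == 0)
            printf("%.*s\n", (int)out.size, (char *)out.data);

        dbp->close(dbp, 0);
        return 0;
    }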