Preloading data into RAM for fast transaction - c++

My thinking is this: we preload the clients' data (account number, net balance) into RAM in advance. Whenever a transaction is processed, the transaction record is written to a FIFO data structure in RAM and the client's data in RAM is updated as well. After a certain period, the records are written to the on-disk database, so that data is not lost from RAM because of its volatility.
Doing so should save I/O time and hence reduce the time spent looking up clients' data, giving faster transactions.
I have heard about in-memory databases, but I do not know whether my idea is the same thing. Also, is there a better idea than what I am thinking of?

In my opinion, there are several aspects to think about and research to get a step forward. Preloading the data and working on it in memory is usually faster than being bound to disk or database page access patterns. However, you instantly lose durability. Therefore, three approaches are valid in different situations:
disk-synchronous (good old database way, after each transaction data is guaranteed to be in permanent storage)
in-memory (good as long as the system is up and running, faster by orders of magnitude, risk of losing transaction data on errors)
delayed (basically in-memory, but from time to time data is flushed to disk)
It is worth noting that the delayed approach is directly supported on Linux through memory-mapped files, which are on the one hand often as fast as ordinary memory (unless too many pages have to be read in) and on the other hand synced to disk automatically (but not instantly).
As you tagged C++, this is possibly the simplest way of getting your idea running.
Note, however, that once you assume failures (hardware faults, reboots, etc.) you no longer have transactions in the strict sense, because it is non-trivial to tell exactly when the data has actually been written.
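A minimal sketch of the memory-mapped variant, assuming a POSIX system such as Linux. The Account layout and the MappedAccounts class are hypothetical names made up for illustration; open, ftruncate, mmap, msync and munmap are the standard POSIX calls. The explicit sync() is the one place where the program waits for the kernel to report the dirty pages as written back, which is as close as this approach gets to knowing when the data has been written:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size record; the file is simply an array of these.
struct Account {
    std::uint64_t number;
    std::int64_t net_balance;
};

class MappedAccounts {
public:
    MappedAccounts(const char* path, std::size_t count) : count_(count) {
        fd_ = ::open(path, O_RDWR | O_CREAT, 0600);
        if (fd_ < 0) return;
        // Resize the file to exactly `count` records; newly added bytes read as zero.
        if (::ftruncate(fd_, static_cast<off_t>(bytes())) != 0) return;
        void* p = ::mmap(nullptr, bytes(), PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0);
        if (p != MAP_FAILED) data_ = static_cast<Account*>(p);
    }

    ~MappedAccounts() {
        if (data_) ::munmap(data_, bytes());
        if (fd_ >= 0) ::close(fd_);
    }

    MappedAccounts(const MappedAccounts&) = delete;
    MappedAccounts& operator=(const MappedAccounts&) = delete;

    // True if the file was opened and mapped successfully.
    bool ok() const { return data_ != nullptr; }

    // Plain memory access; no bounds check, the caller must keep i < count.
    Account& at(std::size_t i) { return data_[i]; }

    // Blocks until the kernel reports the mapped pages as written back.
    bool sync() { return data_ != nullptr && ::msync(data_, bytes(), MS_SYNC) == 0; }

private:
    std::size_t bytes() const { return count_ * sizeof(Account); }

    std::size_t count_;
    int fd_ = -1;
    Account* data_ = nullptr;
};
```

Between two sync() calls the kernel writes pages back whenever it sees fit, so after a crash the file may contain any mix of old and new records.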
As a side note: Sometimes, this problem is solved by writing (reliably) to a log file (sequential access, therefore faster than directly to the data files). Search for the word Compaction in the context of databases: This is the operation to merge a log with the usually used on-disk data structures and happens from time to time (when the log gets too large).
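The log idea can be sketched roughly as follows. Ledger, its text log format ("account delta" per line) and the method names are assumptions made up for this example, not something from a particular database. Balances live in RAM, every change is appended to a sequential log first, and the log is replayed on startup. Note that std::ofstream::flush only hands the data to the operating system; a real implementation would also have to force it to disk (for example with fsync on the underlying descriptor) before treating the write as reliable:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

// Hypothetical in-memory ledger backed by an append-only log file.
class Ledger {
public:
    explicit Ledger(const std::string& log_path) : log_path_(log_path) {
        replay();                              // rebuild balances from the log
        log_.open(log_path_, std::ios::app);   // then keep appending to it
    }

    // Appends the transaction to the log, then updates the balance in RAM.
    // Returns false if the log write failed; the balance is left untouched.
    bool apply(std::uint64_t account, std::int64_t delta) {
        log_ << account << ' ' << delta << '\n';
        log_.flush();  // reaches the OS, NOT guaranteed to be on disk yet
        if (!log_) return false;
        balances_[account] += delta;
        return true;
    }

    // Served purely from memory; unknown accounts have balance 0.
    std::int64_t balance(std::uint64_t account) const {
        auto it = balances_.find(account);
        return it == balances_.end() ? 0 : it->second;
    }

private:
    // Re-applies every complete record; stops at the first unreadable one.
    void replay() {
        std::ifstream in(log_path_);
        std::uint64_t account = 0;
        std::int64_t delta = 0;
        while (in >> account >> delta) balances_[account] += delta;
    }

    std::string log_path_;
    std::unordered_map<std::uint64_t, std::int64_t> balances_;
    std::ofstream log_;
};
```

Compaction would correspond to writing the current balances out as a snapshot and starting a fresh, empty log; that step is left out here.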
To the last aspect of the question: Yes, in-memory databases work in main memory. Still, depending on the guarantees (ACID?) they give, some operations still involve hard disk or NVRAM.

Related

C++: Is it more efficient to store data or continually read it

Ok so I'm working on a game project. Just finished rebuilding a game engine I designed some time ago. I'm looking at making a proprietary file type to store data rather than using a database like sqlite.
Looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time.
My question is: Is it more efficient overall to load the data from the file and store it in a data manager class to be reused? Or is it more efficient overall to continually pull from the file?
Assuming the file follows some form of consistent structure for its data, we're looking at the largest "table" being something like 30 columns with roughly 1000 rows of data.
Here's a handy chart of "Latency Numbers Every Computer Programmer Should Know"
The far right hand side of the chart (red) has the time it takes to read 1 MB from disk. The green column has the same value read from RAM.
What this shows us is that you should do almost anything to avoid having to directly interact with the disk. Keeping data in RAM is good. Keeping data on disk is bad. (Memory mapped files might provide a way to handle this.)
This aside, reinventing the wheel is almost always the wrong solution. Sqlite works and works well. If it's not ideally suited for your needs, there are other file types out there.
If you're "looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time", you'll find that's easiest to do if you reuse preexisting solutions to common problems.
Repeatedly reading from a file is generally not a good idea. Modern operating systems do keep large I/O caches (so if you keep reading the same data it won't really hit the disk), but syscalls are of course far more expensive than accessing memory directly. Whether this is actually going to be a performance problem in your specific case is impossible to judge from the information you provided. On the other hand, if you have a lot of data to access, keeping it all in memory can be wasteful and slow to load and, under memory pressure, can lead to paging.
The easy way out of this conundrum is to map the file in memory; the data is automatically fetched from disk when required and, unless the system is under memory pressure, frequently accessed pages remain cached in RAM, guaranteeing you fast access.
Of course this is feasible only if the data you need to map is smaller than the address space, but given the example you provided (30 columns/1000 rows, which is really small) it shouldn't be a problem at all.
If you can hold the data in RAM, that is more efficient, because it is quicker for your computer to access values in RAM, a cache, or the CPU's registers than to fetch them from the hard drive. Reading from the hard drive costs a lot of time in the operating system and its drivers, so holding the data in memory is more efficient.
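For a table of the size mentioned in the question, "load once, then reuse" can be as small as the sketch below. DataManager and the comma-separated line format are assumptions for illustration only; the question does not say what the file format looks like. The class reads every row a single time and serves all later lookups from memory:

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical data manager: parses the whole input once, keeps it in RAM.
class DataManager {
public:
    // Assumes one row per line with fields separated by commas.
    explicit DataManager(std::istream& in) {
        std::string line;
        while (std::getline(in, line)) {
            Row row;
            std::stringstream fields(line);
            std::string field;
            while (std::getline(fields, field, ',')) row.push_back(field);
            rows_.push_back(std::move(row));
        }
    }

    std::size_t row_count() const { return rows_.size(); }

    // Bounds-checked access; throws std::out_of_range for a bad index.
    const Row& row(std::size_t i) const { return rows_.at(i); }

private:
    std::vector<Row> rows_;
};
```

Passing a std::ifstream opened on the data file works the same way as the std::istream used here; 1000 rows of 30 short fields fit comfortably in memory.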

SQLite C++ API Transactions slow

I have a problem with SQLite. It seems that every call takes ~300ms to execute. After some testing I noticed that the delay is caused by transactions. 8 normal inserts with implicit transactions take about 2 seconds, however, if I start a transaction before the inserts and commit it after, I can do almost a million inserts in the same time. Calls affected include DROP TABLE, CREATE TABLE, INSERT and I assume others, too (probably all that implicitly begin a transaction).
Some more info:
Downloaded the source amalgamation from the SQLite website (3200100)
Compiled it using Visual Studio into a static library (Not using any compiler flags, although I have been playing around with them without luck)
I am using sqlite3_open16 followed by sqlite3_prepare16_v3 and then sqlite3_step to start execution and/or receive the first result
No multithreading, no access from multiple processes, database file is exclusively opened by this program
If I create the file on my SSD (960 EVO) instead the "transaction delay" goes from 300ms down to 10ms. Still an absurdly high value, though, but I feel like the speed of my disk shouldn't influence whatever is slowing the transactions down?
The function that is blocking is sqlite3_step (It also annoys me that I have to call a function with that name just to execute a DROP TABLE, for example, but not that it matters)
Edit: During the transaction, the CPU usage is 100%.
On a side note, is it possible to "help" SQLite with organizing data if you know that every single row of your table will be exactly, say, 64 Byte?
I hope you can help me with this or possibly recommend an alternative (relational, c++ api, file based, highly performant)
Thank you very much!
SQLite makes lots of effort to ensure it doesn't suffer data corruption, so with an implicit transaction, you are limited by your hard disk speed.
With an explicit transaction, the data is written to other locations and committed to disk only once, which is much faster.
From sqlite speed
With synchronization turned on, SQLite executes an fsync() system call (or the equivalent) at key points to make certain that critical data has actually been written to the disk drive surface.
When creating a transaction, the data is written to other files, and only when all the data is committed, will the fsync cost be paid, and all together. That is a price for that part of the configuration. A positive from this, is I have never suffered from sqlite data loss through corruption.
I feel like the speed of my disk shouldn't influence whatever is slowing the transactions down?
This is an important trade-off. If you want improved data integrity, then the speed of your disk is relevant.
How long does committing a transaction take?
From sqlite faq :19 why are transactions slow
SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second.
You can:
Use transactions to bind more work. The cost is per transaction, so can be bulked up.
Use temporary tables. Temporary tables do not suffer the performance, and will run at full speed.
NOT RECOMMENDED. Use PRAGMA synchronous=OFF to disable the synchronous write.

Which Key value, Nosql database can ensure no data loss in case of a power failure?

At present, we are using Redis as an in-memory, fast cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent store beyond its capacity and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second of data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single node Aerospike. Does it guarantee no data loss in case of power failures? That is, in response to a write operation, once Aerospike sends success to the client, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option. But we are not considering it, as it has a considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system as strong consistency is a must for us. The data should not be lost in node failures or split brain scenarios.
If not aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way, some write the data directly to the physical disk while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
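The buffering trade-off described above can be sketched as follows. BatchWriter and its Sink callback are hypothetical names for this example; the sink stands in for whatever performs the durable write (write plus fsync, a database commit, and so on). Records collected in the buffer share one expensive durable write, and at_risk() reports how many records would be lost if the machine failed right now:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical write buffer: many records per durable write.
class BatchWriter {
public:
    // The sink must return true only once the whole batch is safely stored.
    using Sink = std::function<bool(const std::vector<std::string>&)>;

    BatchWriter(std::size_t batch_size, Sink sink)
        : batch_size_(batch_size), sink_(std::move(sink)) {}

    // Buffers one record and flushes when the batch is full.
    // Returns true only if nothing is left waiting in the buffer afterwards.
    bool add(std::string record) {
        pending_.push_back(std::move(record));
        if (pending_.size() < batch_size_) return false;
        return flush();
    }

    // Hands all buffered records to the sink; keeps them if the sink fails.
    bool flush() {
        if (pending_.empty()) return true;
        if (!sink_(pending_)) return false;
        pending_.clear();
        return true;
    }

    // Number of records that exist only in memory at this moment.
    std::size_t at_risk() const { return pending_.size(); }

private:
    std::size_t batch_size_;
    Sink sink_;
    std::vector<std::string> pending_;
};
```

A batch size of 1 corresponds to waiting for the disk on every write; larger batches raise throughput and widen the window in which acknowledged-but-unflushed data can be lost. A timer that calls flush() at short intervals would bound that window in time rather than in record count.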
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today for both availability and redundancy. Distributed systems do not mean you lose consistency, rather they can help to ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
I work for Aerospike. You can choose to have your namespace stored in memory, on disk or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk. The ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication-factor >= 2. The write then goes to the buffer on the client and the replica and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy would still exist on the other node and no data would be lost.
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
I believe Aerospike would serve your purpose. You can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is present at /etc/aerospike/aerospike.conf
For details, please refer to the official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
I believe you're going to be at the mercy of the latency of whatever the storage medium is, or the latency of the network fabric in the case of cluster, regardless of what DBMS technology you use, if you must have a guarantee that the data won't be lost. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But, I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies are going to cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage for either the in-memory database or for the write-ahead transaction log used to recover from. It will have the absolute lowest latency (comparable to conventional DRAM). And, if your in-memory database will fit in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay from a transaction log) and comparable performance to the original IMDB performance because you're back to a single write versus 2+ writes for adding a write-ahead log and/or replicating to another node in a cluster. But, your in-memory database system has to be able to support direct recovery of an in-memory database (not just from a transaction log). But, again, two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf

Caching strategies for Windows end-user applications?

I'm working on what is essentially the runtime for a large administrative application. The actual logic that is being executed, as well as the screens being shown and the data operated upon is stored in a central database. In order to improve performance, the runtime keeps data queried from the database in various caches.
However, it is not always clear how these caches should be managed. Currently, some caches are flushed whenever the runtime goes idle, whereas other caches are never flushed, or only flushed if some configurable but arbitrary limit is reached. We'd obviously want to keep as much data as possible in memory, yet I'm unsure how to do this in a way that plays nicely with Citrix, something that's very important to our customers.
I've been looking into using a resource notification (CreateMemoryResourceNotification()) and flushing caches if it signals that memory is running low, but I'm afraid that using just that would make things behave very badly when running 20+ instances under Citrix, with one instance gobbling up all memory and the rest constantly flushing their caches.
I could set hard limits on cache size with CreateJobObject() and friends, but that could cause the runtime to fail with out-of-memory errors should an instance have a legitimate need for a lot of memory.
I could prevent such problems by using a separate heap for cached data, but there's not a clear separation between cached and non-cached data, so that seems awfully fragile.
TL;DR: anyone got any good ideas for managing in-memory caches under Windows?
I've been looking into using a resource notification (CreateMemoryResourceNotification()) and flushing caches if it signals that memory is running low, but I'm afraid that using just that would make things behave very badly when running 20+ instances under Citrix, with one instance gobbling up all memory and the rest constantly flushing their caches.
I could set hard limits on cache size with CreateJobObject() and friends, but that could cause the runtime to fail with out-of-memory errors should an instance have a legitimate need for a lot of memory.
Can't you make a hybrid solution of some kind, so that the runtime tries to keep its cache limited to a fixed size, but with the possibility to grow bigger if there is a legitimate need to do so and then try to shrink the cache to a reasonable size if the occasion is there?
One instance gobbling up all memory while the others repeatedly flush their caches can perhaps be avoided by distributing the memory resource notification to all instances when it arrives. This way, they all take a good look at their caches when one instance gets the notification.
And last, of course sometimes a trade-off between performance and memory usage has to be made. Here again, if the instances can communicate in some way, they may be able to adjust their maximum cache size based on the number of instances and the amount of memory available on the machine they run on. This way, if more instances are started, they all give in a little bit to accommodate the newcomer, without the risk of overloading the memory of the server.
What strategy are you going to use to determine what needs to be cached? Are you going to keep a last-used timestamp and flush old items when room needs to be made for new ones?
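One way to combine the soft limit with last-used eviction is a least-recently-used cache that is allowed to grow past its limit and is trimmed back on demand. The sketch below is only an illustration: LruCache and its method names are made up, and instead of storing timestamps it keeps the entries in a list ordered by recency, which gives the same eviction order. trim() is what the runtime would call when it goes idle or when a low-memory notification arrives:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache with a soft limit: it may exceed the limit while busy.
class LruCache {
public:
    explicit LruCache(std::size_t soft_limit) : soft_limit_(soft_limit) {}

    // Inserts or updates an entry and marks it as most recently used.
    void put(const std::string& key, std::string value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    // Returns a pointer to the cached value, or nullptr on a miss.
    // A hit moves the entry to the front (most recently used).
    const std::string* get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);
        return &it->second->second;
    }

    // Evicts least recently used entries until at most `target` remain.
    void shrink_to(std::size_t target) {
        while (items_.size() > target) {
            index_.erase(items_.back().first);
            items_.pop_back();
        }
    }

    // Shrinks back to the soft limit, e.g. when idle or when memory is low.
    void trim() { shrink_to(soft_limit_); }

    std::size_t size() const { return items_.size(); }

private:
    using Item = std::pair<std::string, std::string>;
    std::size_t soft_limit_;
    std::list<Item> items_;  // front = most recently used
    std::unordered_map<std::string, std::list<Item>::iterator> index_;
};
```

If the instances can talk to each other, the limit passed to shrink_to() could be recomputed from the number of running instances, as suggested above.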

how we can ensure caching to reduce file-system write cycle for SQLite databases

I would like to implement caching for an SQLite database. My primary objective is to write data to RAM and, when the cache is full, flush all of it to the on-disk database. Is this possible at all? If so, could I have some sample code?
Thanks
SQLite already does its own caching, which is likely to be more efficient than anything you can implement - you can read about the interface to it here. You may be interested in other optimisations - there is a FAQ here.
You might want to check out the SQLite fine-tuning commands (pragmas).
Since sqlite is transactional, it relies on fsync to ensure a particular set of statements have completed when a transaction is committed. The speed and implementation of fsync varies from platform to platform.
So, by batching several statements within a transaction, you can get a significant increase in speed since several blocks of data will be written before fsync is called.
An older sqlite article here illustrates this difference between doing several INSERTs inside and outside transactions.
However, if you are writing an application needing concurrent access to data, note that in SQLite's default rollback-journal mode, reads (SELECT statements) can be blocked while a write transaction is committing its changes; in WAL mode, readers can continue alongside a writer. You may want to explore using your in-memory cache to retrieve data while a write transaction is taking place.
With that said, it's also possible that sqlite's caching scheme will handle that for you.
Why do you want to do this? Are you running into performance issues? Or do you want to prevent other connections from seeing data until you commit it to disk?
Regarding syncing to disk, there is a tradeoff between database integrity and speed. Which you want depends on your situation.
Use transactions. Advantages: High reliability and simple. Disadvantages: once you start a transaction, no one else can write to the database until you COMMIT or ROLLBACK. This is usually the best solution. If you have a lot of work to do at once, begin a transaction, write everything you need, then COMMIT. All your changes will be cached in RAM until you COMMIT, at which time the database will explicitly sync to disk.
Use PRAGMA journal_mode=MEMORY and/or PRAGMA synchronous=OFF. Advantages: High speed and simple. Disadvantages: The database is no longer safe against power loss and program crashes. You can lose your entire database with these options. However, they avoid explicitly syncing to disk as often.
Write your changes to an in-memory database and manually sync when you want. Advantages: High speed and reliable. Disadvantages: Complicated, and another program can write to the database without you knowing about it. By writing to an in-memory database, you never need to sync to disk until you want to. Other programs can write to the database file, and if you're not careful you can overwrite those changes. This option is probably too complicated to be worth it.