Business logic/rules - processing in the database or in memory - business-rules

Because of my overabundance of hubris, I am working on a program that would process some of the data better than the system we are currently using. My question is: when implementing business rules (i.e. if this piece of data matches this pattern, send it to this queue), is it best practice to:
simply load all the rules into memory from a DB when the program starts
positive: very fast
negative: this program would have a lot of rules, so it could be a memory hog
keep all the rules in the database and let the matching be done in the database
positive: not using a ton of memory
negative: lots of database calls
keep a flag in memory that refers to a specific rule in the database
positive: not a ton of memory
negative: still a lot of database calls
Any thoughts?

You forgot about the hybrid of your two extremes -- a smarter cache (smarter than everything in memory).
Initialize the cache with no rules (or a few of the most popular).
The app requests a rule from the cache.
If it exists in the cache, return it.
If not, load it from the database, store it in cache, and return it to the user.
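A minimal sketch of such a cache-aside rule cache in C++ (the Rule struct and the loadRuleFromDb helper are invented placeholders for whatever your system actually uses):

```cpp
// Cache-aside sketch: look in memory first, fall back to the database
// on a miss, and remember the result for next time.
#include <optional>
#include <string>
#include <unordered_map>

struct Rule {                       // placeholder for your real rule type
    std::string pattern;
    std::string targetQueue;
};

// Hypothetical stand-in for a real database lookup (a SELECT by rule id).
std::optional<Rule> loadRuleFromDb(const std::string& ruleId) {
    return Rule{".*ERROR.*", "error-queue"};
}

class RuleCache {
public:
    std::optional<Rule> get(const std::string& ruleId) {
        auto it = cache_.find(ruleId);
        if (it != cache_.end())
            return it->second;                  // hit: served from memory

        auto rule = loadRuleFromDb(ruleId);     // miss: one database call
        if (rule)
            cache_.emplace(ruleId, *rule);      // cache it for next time
        return rule;
    }

private:
    std::unordered_map<std::string, Rule> cache_;
};
```

You can bound memory use by evicting entries (e.g. least-recently-used) once the cache reaches a size limit.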

As with everything performance-related, you need to try options and measure their performance. It is very hard to tell in advance which one will work best for you.
One of the latest trends is in-memory databases: BI and analytics are done in a database that holds the entire needed dataset in memory. We're talking about gigabytes of RAM.
You could consider this option knowing it is not exotic any more. RAM is cheap these days. Maybe it will work for you. You need to try it out.

Related

Preloading data into RAM for fast transactions

My thinking is that we preload clients' data (account number, net balance) in advance; whenever a transaction is processed, the transaction record is written into RAM in a FIFO data structure and the clients' data in RAM is also updated; then, after a certain period, the records are written to the on-disk database to prevent data loss caused by RAM's volatility.
By doing so, I/O time should be saved and hence less time spent seeking clients' data, achieving the aim (faster transactions).
I have heard about in-memory databases, but I do not know if my idea is the same thing. Also, is there a better idea than what I am thinking of?
In my opinion, there are several aspects to think about / research to get a step forward. Pre-loading and working on data is usually faster than being bound to disk / database page access schemata. However, you instantly lose durability. Therefore, three approaches are valid in different situations:
disk-synchronous (good old database way, after each transaction data is guaranteed to be in permanent storage)
in-memory (good as long as the system is up and running, faster by orders of magnitude, risk of losing transaction data on errors)
delayed (basically in-memory, but from time to time data is flushed to disk)
It is worth noting that delayed is directly supported on Linux through memory-mapped files, which are, on the one hand, often as fast as ordinary memory (unless you read and access too many pages) and, on the other hand, synced to disk automatically (but not instantly).
As you tagged C++, this is possibly the simplest way of getting your idea running.
Note, however, that when assuming failures (hardware, reboot, etc.) you won't have transactions at all, because it is non-trivial to tell concretely when the data is actually written.
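A minimal sketch of the delayed approach with a memory-mapped file on Linux (file name and record content are invented; error handling is stripped down):

```cpp
// "Delayed" durability via mmap: writes are plain memory accesses,
// and the kernel flushes the pages to disk at some later point.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

int main() {
    const size_t kSize = 4096;                       // one page, for the example
    int fd = open("records.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, kSize);                            // make the file large enough to map

    char* data = static_cast<char*>(
        mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    std::strcpy(data, "txn: client=42 amount=100");  // a "write" is just a memory store

    msync(data, kSize, MS_SYNC);                     // optional: force a durability point

    munmap(data, kSize);
    close(fd);
    return 0;
}
```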
As a side note: sometimes this problem is solved by writing (reliably) to a log file (sequential access, therefore faster than writing directly to the data files). Search for the word Compaction in the context of databases: this is the operation that merges a log with the regular on-disk data structures and happens from time to time (when the log gets too large).
To the last aspect of the question: Yes, in-memory databases work in main memory. Still, depending on the guarantees (ACID?) they give, some operations still involve hard disk or NVRAM.

Which key-value / NoSQL database can ensure no data loss in case of a power failure?

At present, we are using Redis as a fast, in-memory cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent store beyond its capacity, and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second's data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single-node Aerospike. Does it guarantee no data loss in case of power failures? i.e. in response to a write operation, once Aerospike sends success to the client, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option, but we are not considering it as it has a considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system as strong consistency is a must for us. The data should not be lost in node failures or split brain scenarios.
If not Aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way: some write the data directly to the physical disk while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
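To make the trade-off concrete, here is a tiny C++ illustration (file name invented): the write() call only puts the data in the OS buffer, and only the fsync() call waits for the disk to confirm it.

```cpp
// A write is only durable once the disk confirms it; forcing that
// confirmation after every write is what costs throughput.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main() {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

    const char* record = "write #1\n";
    write(fd, record, std::strlen(record));  // lands in the OS page cache first

    fsync(fd);  // without this, a power failure can still lose the record

    close(fd);
    return 0;
}
```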
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today for both availability and redundancy. Distributed systems do not mean you lose consistency, rather they can help to ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
I work for Aerospike. You can choose to have your namespace stored in memory, on disk or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk. The ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single-node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication-factor >= 2. The write then goes to the buffer on the client and the replica and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy would still exist on the other node and no data would be lost.
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
I believe Aerospike would serve your purpose; you can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is present at /etc/aerospike/aerospike.conf.
For details, please refer to the official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
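For illustration only, a namespace stanza along these lines (names and sizes are invented, and option names vary between server versions, so verify against the documentation above) enables hybrid storage, keeping the data in RAM while persisting it to a file:

```
namespace mycache {
    replication-factor 2
    memory-size 4G

    storage-engine device {
        file /opt/aerospike/data/mycache.dat
        filesize 16G
        data-in-memory true    # keep a copy in RAM, persist to the file
    }
}
```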
I believe you're going to be at the mercy of the latency of whatever the storage medium is, or the latency of the network fabric in the case of a cluster, regardless of what DBMS technology you use, if you must have a guarantee that the data won't be lost. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies are going to cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage for either the in-memory database or for the write-ahead transaction log used to recover from. It will have the absolute lowest latency (comparable to conventional DRAM). And, if your in-memory database will fit in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay from a transaction log) and comparable performance to the original IMDB performance because you're back to a single write versus 2+ writes for adding a write-ahead log and/or replicating to another node in a cluster. But, your in-memory database system has to be able to support direct recovery of an in-memory database (not just from a transaction log). But, again, two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf

How does memcached ensure that its pages in memory won't be swapped out by other pages?

So I am trying to speed up my database-driven website, and I came across memcached, which I can apparently use to "cache" frequent key/values from the database in memory. My question is: how can data be forced to stay in memory? Normally, if the server is running other applications, there is a possibility that the memcached hashtable (or part of it) will be written back to the hard disk if a page from another application replaces a page from memcached. So in industry, do they have separate machines that only run memcached? Or do they tweak the operating system's internals so that pages from memcached won't be swapped out?
The question can be generalized to any other application: knowing how to force data to stay in memory can be very beneficial.
By default, memcached pages will be swapped out to the hard disk if your machine is using more physical memory than it has available. However, you can pass the -k parameter so that the pages memcached is using are locked into physical memory with mlockall(). See this blog post for more information on how to do this.
http://threebrothers.org/brendan/blog/using-memcached-k-prevent-paging/
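For the generalized case, pinning a process's memory is done with mlockall(), which is also what the -k switch mentioned above relies on; a minimal Linux sketch:

```cpp
// Lock all of this process's pages into physical RAM so they are never
// swapped out. Usually requires CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK.
#include <sys/mman.h>
#include <cstdio>

int main() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {  // current and future mappings
        std::perror("mlockall");
        return 1;
    }
    // ... allocate and use memory; none of it will be paged to disk ...
    munlockall();                                   // undo the lock when done
    return 0;
}
```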
The OS writers have written good algorithms for paging over the years; algorithms that have stood the test of time. The paging algorithm tries to optimize which pages it swaps out.
'it would be very beneficial to force the data to stay in memory' - are you sure about that? Would you rather have your Apache code swapped out, rather than a page of data that was not used for an hour? If Apache swaps out, who would use that data?
It is difficult to design something that beats the native OS paging algorithms. It can be done, but it usually requires more than just locking memory for one program.
I would recommend setting the memcached size appropriately so that the machine does not page.
If you run memcached solely on that system, which many people do, there is not much need to fiddle with the -k switch. Or that machine can be set to swap = 0.

Decreasing performance writing large binary file

In one of our applications we create records and store them in a binary file. Once the writing operation is completed, we read this binary file back. The issue is that if this binary file is smaller than 100 MB, performance is good enough, but once the file grows larger, performance takes a hit.
So I thought of splitting this large binary file (> 100 MB) into smaller ones (< 100 MB). But it seems this solution does not improve performance. So I was wondering what a better approach to handle this scenario might be?
It would be a really great help if you could comment on this.
Thanks
Maybe you could try using an SQLite database instead.
It is always quite difficult to provide accurate answers with only a glimpse of the system, but have you actually checked the actual throughput?
As a first solution, I would simply recommend using a dedicated disk (so there are no concurrent read/write actions from other processes), and a fast one at that. This way it would be just the cost of a hardware upgrade, and we all know hardware is usually cheaper than software ;) You may even go to a RAID controller to maximize throughput.
If you are still limited by disk throughput, there are new technologies out there using Flash: USB keys (though it may not seem very professional) or the "new" Solid State Drives may provide more throughput than a mechanical disk.
Now, if the disk approaches are not fast enough or you can't get your hands on good SSDs, you have other solutions, but they involve software changes, and I propose them off the top of my head.
A socket approach: a second utility listens on a port and you send it the data there. On a local machine it's relatively fast, and you parallelize the work too, so even if the size of the data grows, you will still begin processing it fairly quickly.
A memory mapping approach: write to a dedicated area in live memory and have the utility read from that area (Boost.Interprocess may help, there are other solutions).
Note that if the read is sequential, I find it more "natural" to try a 'pipe' approach (à la Unix) so that the two processes execute concurrently. In a traditional pipe, the data may never hit the disk at all.
A shame, isn't it, that in this age of overwhelming processing power, we are still struggling with our disk IO ?
If your app reads the data sequentially, migrating to a DB would not help increase performance. If random access is used, you should consider moving the data into a DB, especially if different indices are used. You should check whether enough resources are available; if the file is loaded completely into memory, virtual memory management could have an impact on performance (swapping, paging). Depending on your OS settings, a limit for file I/O buffers could be reached. The file system itself could be fragmented.
To get a higher-quality answer you should provide information about hardware, OS, memory and file system, and the way your data file is used. Then you could get hints about kernel tuning, etc.
So what is the retrieval mechanism here? How does your application know which of the smaller files to look in to find a record? If you have split up the big file without implementing some form of keyed lookup - indexing, partitioning - you have not addressed the problem, just re-arranged it.
Of course, if you have implemented some form of indexing then you have started down the road of building your own database.
Without knowing more regarding your application it would be rash for us to offer specific advice. Maybe the solution would be to apply an RDBMS solution. Possibly a NoSQL approach would be better. Perhaps you need a text indexing and retrieval engine.
So...
How often does your application need to retrieve records? How does it decide which records to get? What is your definition of poor performance? Why did you (your project) decide to use flat files rather than a database in the first place? What sort of records are we talking about?

How can we ensure caching to reduce file-system write cycles for SQLite databases?

I would like to implement caching in an SQLite database. My primary objective is to write data to RAM, and when the cache is filled, to flush all the data to the on-disk database. I would like to know whether this is possible at all. If it is possible, can I have some sample code?
Thanks
SQLite already does its own caching, which is likely to be more efficient than anything you can implement - you can read about the interface to it here. You may be interested in other optimisations - there is a FAQ here.
You might want to checkout the SQLite fine-tuning commands (pragmas)
Since sqlite is transactional, it relies on fsync to ensure a particular set of statements have completed when a transaction is committed. The speed and implementation of fsync varies from platform to platform.
So, by batching several statements within a transaction, you can get a significant increase in speed since several blocks of data will be written before fsync is called.
An older sqlite article here illustrates the difference between doing several INSERTs inside and outside transactions.
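As a rough sketch with the SQLite C API (table, data and file names are invented for the example), batching inserts inside one transaction means the sync to disk happens once per COMMIT rather than once per statement:

```cpp
// Batch many INSERTs inside a single transaction so the database only
// syncs to disk once, at COMMIT time.
#include <sqlite3.h>
#include <string>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("cache.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS records(id INTEGER, payload TEXT)",
                 nullptr, nullptr, nullptr);

    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);
    for (int i = 0; i < 10000; ++i) {
        std::string sql =
            "INSERT INTO records VALUES(" + std::to_string(i) + ", 'data')";
        sqlite3_exec(db, sql.c_str(), nullptr, nullptr, nullptr);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);  // one sync for all rows

    sqlite3_close(db);
    return 0;
}
```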
However, if you are writing an application needing concurrent access to data, note that when sqlite starts a write transaction, all reads (select statements) will be blocked. You may want to explore using your in memory cache to retrieve data while a write transaction is taking place.
With that said, it's also possible that sqlite's caching scheme will handle that for you.
Why do you want to do this? Are you running into performance issues? Or do you want to prevent other connections from seeing data until you commit it to disk?
Regarding syncing to disk, there is a tradeoff between database integrity and speed. Which you want depends on your situation.
Use transactions. Advantages: High reliability and simple. Disadvantages: once you start a transaction, no one else can write to the database until you COMMIT or ROLLBACK. This is usually the best solution. If you have a lot of work to do at once, begin a transaction, write everything you need, then COMMIT. All your changes will be cached in RAM until you COMMIT, at which time the database will explicitly sync to disk.
Use PRAGMA journal_mode=MEMORY and/or PRAGMA synchronous=OFF. Advantages: High speed and simple. Disadvantages: The database is no longer safe against power loss and program crashes. You can lose your entire database with these options. However, they avoid explicitly syncing to disk as often.
Write your changes to an in-memory database and manually sync when you want. Advantages: High speed and reliable. Disadvantages: Complicated, and another program can write to the database without you knowing about it. By writing to an in-memory database, you never need to sync to disk until you want to. Other programs can write to the database file, and if you're not careful you can overwrite those changes. This option is probably too complicated to be worth it.
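If you do decide to explore the third option anyway, SQLite's online backup API can copy an in-memory database to a file whenever you choose to sync; a rough sketch (destination file name invented):

```cpp
// Work in an in-memory SQLite database and periodically copy it to disk
// using the online backup API.
#include <sqlite3.h>

void flush_to_disk(sqlite3* mem_db) {
    sqlite3* file_db = nullptr;
    sqlite3_open("snapshot.db", &file_db);

    // Copy the whole in-memory database ("main") into the on-disk one.
    sqlite3_backup* bk = sqlite3_backup_init(file_db, "main", mem_db, "main");
    if (bk) {
        sqlite3_backup_step(bk, -1);   // -1 = copy all remaining pages
        sqlite3_backup_finish(bk);
    }
    sqlite3_close(file_db);
}

int main() {
    sqlite3* mem_db = nullptr;
    sqlite3_open(":memory:", &mem_db);
    sqlite3_exec(mem_db, "CREATE TABLE t(x)", nullptr, nullptr, nullptr);
    sqlite3_exec(mem_db, "INSERT INTO t VALUES(1)", nullptr, nullptr, nullptr);

    flush_to_disk(mem_db);             // manual "sync when you want"
    sqlite3_close(mem_db);
    return 0;
}
```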