Optimal PostgreSQL isolation level for multiprocessing app - django

I have an app that spins up multiple processes to read large amounts of data from several PostgreSQL tables to do number crunching, and then stores the results in separate tables.
When I tested this with just a single process, it was blazing fast and was using almost 100% CPU, but when I tried using 8 processes on an 8 core machine, all processes registered about 1% CPU and the whole task seemed to take even longer.
When I checked pg_stat_activity, I saw several connections listed as "<IDLE> in transaction". Following some advice here, I looked at pg_locks and saw hundreds of "AccessShareLock" locks on the dozens of read-only tables. Based on the docs, I believe this is the default, but I think it is causing the processes to step on each other's toes, negating any benefit of multiprocessing.
Is there a more efficient isolation level to use, or better way to tune PostgreSQL to allow faster read-only access to several processes, so each doesn't need to lock the table? Specifically, I'm using Django as my ORM.

I'm not sure what throttles your multiple cores, but it has nothing to do with the isolation level, even if you have concurrent write operations. Per the documentation:
The main advantage of using the MVCC model of concurrency control
rather than locking is that in MVCC locks acquired for querying
(reading) data do not conflict with locks acquired for writing data,
and so reading never blocks writing and writing never blocks reading.
PostgreSQL maintains this guarantee even when providing the strictest
level of transaction isolation through the use of an innovative
Serializable Snapshot Isolation (SSI) level.
Bold emphasis mine.
Of course, reading also never blocks reading.
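To make the AccessShareLock point concrete, here is a minimal Python sketch of a small part of PostgreSQL's table-level lock conflict rules, as described in the explicit-locking chapter of the docs. Only three lock modes are modeled, and the names `CONFLICTS` and `conflicts` are mine for illustration, not a PostgreSQL or Django API.

```python
# Partial model of PostgreSQL table-level lock conflicts.
# Only three modes are modeled as the "held" side; this is an illustration,
# not a complete copy of the conflict table in the documentation.
ALL_MODES = {
    "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
    "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
}

CONFLICTS = {
    # Taken by a plain SELECT; conflicts only with ACCESS EXCLUSIVE.
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    # Taken by INSERT / UPDATE / DELETE.
    "ROW EXCLUSIVE": {"SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE"},
    # Taken by e.g. DROP TABLE or TRUNCATE; conflicts with every mode.
    "ACCESS EXCLUSIVE": set(ALL_MODES),
}


def conflicts(held, requested):
    """Return True if `requested` must wait for `held` (conflicts are symmetric)."""
    if held in CONFLICTS:
        return requested in CONFLICTS[held]
    if requested in CONFLICTS:
        return held in CONFLICTS[requested]
    raise KeyError(f"neither {held!r} nor {requested!r} is modeled in this sketch")


if __name__ == "__main__":
    print(conflicts("ACCESS SHARE", "ACCESS SHARE"))      # two readers
    print(conflicts("ACCESS SHARE", "ROW EXCLUSIVE"))     # reader vs. writer
    print(conflicts("ACCESS SHARE", "ACCESS EXCLUSIVE"))  # reader vs. DROP TABLE
```

So hundreds of AccessShareLock rows in pg_locks are expected with many readers and do not by themselves mean the readers are waiting on each other.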
Maybe you need to reconfigure resource allocation on your server? The default configuration is regularly too conservative. On the other hand, some parameters should not be set too high in a multi-user environment. work_mem comes to mind. Check the list for Performance Optimization in the Postgres Wiki.
And finally:
Django as my ORM.
ORMs often try to stay platform-independent and fail to get the full potential out of a particular RDBMS. They are primitive crutches and don't play well with performance optimization.

Related

Can Redis 6 take advantage of multi-core CPUs?

Since Redis 6 supports multi-threaded I/O, does it make sense to deploy Redis on machines with more than 2 cores? Will it be able to take advantage of the additional cores, or are 2 cores still ideal (one for the main thread, and the other to handle bgsave and other housekeeping operations)?
Similarly, on AWS ElastiCache does it make sense to use instance types with >2 vCPUs?
Based on the release notes, I guess it does.
Here's a piece of small information from there:
Despite Redis’ well-deserved reputation for high performance, its single-threaded architecture has been controversial among engineers who wondered if Redis could be even faster. Redis 6 rings in a new era: while it retains a core single-threaded data-access interface, I/O is now threaded.
By delegating the time spent reading and writing to I/O sockets over to other threads, the Redis process can devote more cycles to manipulating, storing, and retrieving data—boosting overall performance. This improvement retains the transactional characteristics of previous versions, so you don’t have to rethink your applications to take advantage of the increased performance. Similarly, Redis’ single-threaded DEL command can now be configured to behave like the multi-thread UNLINK command that has been available since Redis version 4.
The performance of a local variable is almost always unbeatable. Finally, even a database as high performance as Redis will be much slower than accessing something from the stack or heap. Redis 6 adds a new technique for sophisticated client libraries to implement a client-side caching layer to store a subset of data in your own process. This implementation is smart enough to manage multiple updates to the same data and keep your data as in-sync as possible, while retaining the advantages of Redis with the speed of local variables.
You could also check/compare it with redis-benchmark or memtier harness for your instance/workload profile.

Which Key value, Nosql database can ensure no data loss in case of a power failure?

At present, we are using Redis as an in-memory, fast cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent store beyond its capacity and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second of data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single-node Aerospike. Does it guarantee no data loss in case of power failures? That is, in response to a write operation, once Aerospike sends success to the client, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option. But we are not considering it as it has a considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system as strong consistency is a must for us. The data should not be lost in node failures or split brain scenarios.
If not Aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way, some write the data directly to the physical disk while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
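The trade-off described above can be sketched in a few lines of Python. This is only an illustration using a plain append-only file and `os.fsync`; the function name `append_records` and the `sync_every` parameter are made up for the example, and no real database works exactly like this. Note also that `fsync` only asks the OS to push data to the device, so the final guarantee still depends on the hardware.

```python
import os
import tempfile


def append_records(path, records, sync_every):
    """Append records to `path`, forcing them to disk after every `sync_every` records.

    Returns the number of fsync calls made. sync_every=1 is the "wait for the
    disk on every write" case; larger values imitate a buffer that is flushed
    in batches, where the unsynced tail can be lost on power failure.
    """
    syncs = 0
    pending = 0
    with open(path, "ab") as f:
        for rec in records:
            f.write(rec)
            pending += 1
            if pending >= sync_every:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push its cache to the device
                syncs += 1
                pending = 0
        if pending:                   # sync whatever is left at the end
            f.flush()
            os.fsync(f.fileno())
            syncs += 1
    return syncs


if __name__ == "__main__":
    records = [b"x" * 100 + b"\n"] * 200
    with tempfile.TemporaryDirectory() as d:
        print("fsync calls, sync every write:", append_records(os.path.join(d, "safe.log"), records, 1))
        print("fsync calls, sync every 50:   ", append_records(os.path.join(d, "fast.log"), records, 50))
```

Both files end up with the same bytes; the difference is how many times the program had to wait for the disk, and how many records are at risk at any moment.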
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today for both availability and redundancy. Distributed systems do not mean you lose consistency, rather they can help to ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
I work for Aerospike. You can choose to have your namespace stored in memory, on disk or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk. The ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single-node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication-factor >= 2. The write then goes to the buffer on the master and the replica and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy would still exist on the other node and no data would be lost.
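As a rough illustration of "ack only after both nodes have the write", here is a toy Python model. It is not Aerospike's API or its actual protocol; `Node`, `replicated_write` and `surviving_copies` are hypothetical names, and each node is just a dict standing in for a volatile write buffer.

```python
class Node:
    """Toy stand-in for a database node with a volatile write buffer."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}
        self.powered = True

    def write(self, key, value):
        if not self.powered:
            return False
        self.buffer[key] = value
        return True

    def lose_power(self):
        self.powered = False
        self.buffer.clear()  # anything not yet flushed to disk is gone


def replicated_write(nodes, key, value):
    """Return True (ack to the client) only if every node accepted the write."""
    results = [node.write(key, value) for node in nodes]
    return all(results)


def surviving_copies(nodes, key):
    """Names of the powered nodes that still hold `key`."""
    return [node.name for node in nodes if node.powered and key in node.buffer]


if __name__ == "__main__":
    a, b = Node("a"), Node("b")
    print(replicated_write([a, b], "user:1", "alice"))  # acked: both nodes have it
    a.lose_power()
    print(surviving_copies([a, b], "user:1"))           # the copy on b survives
```

With a single node in the list, the same power loss would leave no surviving copy of an already-acked write, which is the single-node risk described above.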
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
I believe Aerospike would serve your purpose. You can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is present at /etc/aerospike/aerospike.conf
For details please refer official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
I believe you're going to be at the mercy of the latency of whatever the storage medium is, or the latency of the network fabric in the case of a cluster, regardless of what DBMS technology you use, if you must have a guarantee that the data won't be lost. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies are going to cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage for either the in-memory database or for the write-ahead transaction log used to recover from. It will have the absolute lowest latency (comparable to conventional DRAM). And, if your in-memory database will fit in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay from a transaction log) and comparable performance to the original IMDB performance because you're back to a single write versus 2+ writes for adding a write-ahead log and/or replicating to another node in a cluster. But, your in-memory database system has to be able to support direct recovery of an in-memory database (not just from a transaction log). But, again, two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf

How can we ensure caching to reduce file-system write cycles for SQLite databases

I would like to implement caching in an SQLite database. My primary objective is to write data to RAM and, when the cache is full, flush all the data to the on-disk database. Is this possible at all? If so, could I have some sample code?
Thanks
SQLite already does its own caching, which is likely to be more efficient than anything you can implement - you can read about the interface to it here. You may be interested in other optimisations - there is a FAQ here.
You might want to check out the SQLite fine-tuning commands (pragmas)
Since sqlite is transactional, it relies on fsync to ensure a particular set of statements have completed when a transaction is committed. The speed and implementation of fsync varies from platform to platform.
So, by batching several statements within a transaction, you can get a significant increase in speed since several blocks of data will be written before fsync is called.
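A minimal sketch of the difference, using Python's standard `sqlite3` module. The table name `samples` and the helper names are made up for the example. The connection is opened with `isolation_level=None` so that the module does not open transactions on its own and the BEGIN/COMMIT boundaries are exactly the ones written here.

```python
import sqlite3


def open_db(path=":memory:"):
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so transactions are controlled only by explicit BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS samples (value INTEGER)")
    return conn


def insert_one_by_one(conn, values):
    # Every INSERT is its own transaction, so on a disk-backed database
    # each one pays for its own sync.
    for v in values:
        conn.execute("INSERT INTO samples (value) VALUES (?)", (v,))


def insert_batched(conn, values):
    # One transaction around all INSERTs: a single commit at the end.
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO samples (value) VALUES (?)", [(v,) for v in values])
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


if __name__ == "__main__":
    conn = open_db()  # use a file path instead of :memory: to see the timing gap
    insert_one_by_one(conn, range(1000))
    insert_batched(conn, range(1000))
    print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])
    conn.close()
```

On an in-memory database both variants are fast; the gap only shows up when `path` points at a real file, because that is where the per-commit sync happens.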
An older sqlite article here illustrates this difference between doing several INSERTs inside and outside transactions.
However, if you are writing an application that needs concurrent access to data, note that in SQLite's default rollback-journal mode, readers (SELECT statements) are blocked while a write transaction holds its exclusive lock, for example while it commits. You may want to explore using your in-memory cache to retrieve data while a write transaction is taking place.
With that said, it's also possible that sqlite's caching scheme will handle that for you.
Why do you want to do this? Are you running into performance issues? Or do you want to prevent other connections from seeing data until you commit it to disk?
Regarding syncing to disk, there is a tradeoff between database integrity and speed. Which you want depends on your situation.
Use transactions. Advantages: High reliability and simple. Disadvantages: once you start a transaction, no one else can write to the database until you COMMIT or ROLLBACK. This is usually the best solution. If you have a lot of work to do at once, begin a transaction, write everything you need, then COMMIT. All your changes will be cached in RAM until you COMMIT, at which time the database will explicitly sync to disk.
Use PRAGMA journal_mode=MEMORY and/or PRAGMA synchronous=OFF. Advantages: High speed and simple. Disadvantages: The database is no longer safe against power loss and program crashes. You can lose your entire database with these options. However, they avoid explicitly syncing to disk as often.
Write your changes to an in-memory database and manually sync when you want. Advantages: High speed and reliable. Disadvantages: Complicated, and another program can write to the database without you knowing about it. By writing to an in-memory database, you never need to sync to disk until you want to. Other programs can write to the database file, and if you're not careful you can overwrite those changes. This option is probably too complicated to be worth it.
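For the third option, one way to do the manual sync from Python is the `Connection.backup` method of the standard `sqlite3` module (available since Python 3.7). The sketch below is only one possible approach; the helper name `flush_to_disk` and the table name `log` are invented for the example. As warned above, the copy replaces whatever the file contained, so changes made to the file by other programs are lost.

```python
import os
import sqlite3
import tempfile


def flush_to_disk(mem_conn, disk_path):
    """Copy the whole in-memory database into the file at `disk_path`."""
    mem_conn.commit()  # make sure pending changes are part of the copy
    disk_conn = sqlite3.connect(disk_path)
    try:
        mem_conn.backup(disk_conn)  # overwrites the contents of the target database
    finally:
        disk_conn.close()


if __name__ == "__main__":
    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE log (msg TEXT)")
    mem.executemany("INSERT INTO log VALUES (?)", [("first",), ("second",)])

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "cache.db")
        flush_to_disk(mem, path)  # call this whenever you decide the cache is "full"
        disk = sqlite3.connect(path)
        print(disk.execute("SELECT COUNT(*) FROM log").fetchone()[0])
        disk.close()
    mem.close()
```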

building a web crawler

I'm currently developing a custom search engine with a built-in web crawler. For some reason I'm not into multi-threading, so my indexer has been coded in a single-threaded manner so far. Now I have a small dilemma with the crawler I'm building. Can anybody suggest which is better: crawl 1 page then index it, or crawl 1000+ pages and cache them, then index?
Networks are slow (relative to the CPU). You will see a significant speed increase by parallelizing your crawler. Otherwise, your app will spend the majority of its time waiting on network IO to complete. You can either use multiple threads and blocking IO or a single thread with asynchronous IO.
Also, most indexing algorithms will perform better on batches of documents versus indexing one document at a time.
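Here is a small sketch of the "single thread with asynchronous IO" variant combined with batch indexing, written with Python's `asyncio`. The `fetch` function is a hypothetical placeholder that only sleeps to imitate network latency, and `index_batch` is a toy inverted index, so treat this as an outline of the structure rather than a working crawler.

```python
import asyncio


async def fetch(url):
    """Hypothetical fetch: sleeps to imitate network latency instead of doing real I/O."""
    await asyncio.sleep(0.05)
    return f"<html>content of {url}</html>"


async def crawl(urls, max_in_flight=10):
    """Fetch all URLs concurrently in one thread, at most `max_in_flight` at a time."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with sem:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(pairs)


def index_batch(pages, index):
    """Toy inverted index: map each whitespace-separated token to the URLs containing it."""
    for url, body in pages.items():
        for word in set(body.lower().split()):
            index.setdefault(word, set()).add(url)
    return index


if __name__ == "__main__":
    urls = [f"http://example.com/page{i}" for i in range(100)]
    pages = asyncio.run(crawl(urls))   # the waits overlap instead of adding up
    index = index_batch(pages, {})     # index the whole batch in one pass
    print(len(pages), "pages,", len(index), "distinct tokens")
```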
Better? In terms of what? In terms of speed I can't foresee a noticeable difference. In terms of robustness (recovering from a catastrophic failure) it's probably better to index each page as you crawl it.
I would strongly suggest getting "in" to multi-threading if you are serious about your crawler. Basically, you would want to have at least one indexer and at least one crawler (potentially multitudes for both) running at all times. Among other things, this minimizes start-up and shutdown overhead (e.g. initializing and freeing data structures).
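The "crawlers and an indexer running at all times" layout is a producer/consumer pipeline. A minimal Python sketch with the standard `threading` and `queue` modules might look like this; `fetch` is again a hypothetical stand-in that does no network access, and the "index" is just a word count per URL.

```python
import queue
import threading


def fetch(url):
    """Hypothetical fetch; a real crawler would do an HTTP request here."""
    return f"page body for {url}"


def crawler(url_q, page_q):
    # Pull URLs until the None sentinel arrives, pushing fetched pages downstream.
    while True:
        url = url_q.get()
        if url is None:
            break
        page_q.put((url, fetch(url)))


def indexer(page_q, index):
    # Consume pages as they arrive; here "indexing" is just counting words.
    while True:
        item = page_q.get()
        if item is None:
            break
        url, body = item
        index[url] = len(body.split())


def run(urls, n_crawlers=4):
    url_q, page_q, index = queue.Queue(), queue.Queue(), {}
    crawlers = [threading.Thread(target=crawler, args=(url_q, page_q)) for _ in range(n_crawlers)]
    indexer_thread = threading.Thread(target=indexer, args=(page_q, index))
    for t in crawlers + [indexer_thread]:
        t.start()
    for u in urls:
        url_q.put(u)
    for _ in crawlers:
        url_q.put(None)   # one stop sentinel per crawler
    for t in crawlers:
        t.join()
    page_q.put(None)      # all pages are queued; tell the indexer to stop
    indexer_thread.join()
    return index


if __name__ == "__main__":
    print(run([f"http://example.com/{i}" for i in range(10)]))
```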
Not using threads is OK.
However if you still want performance, you need to deal with Asynchronous IO.
I would recommend checking out Boost.Asio. Using asynchronous IO will make your dilemma "irrelevant", as it would not matter. Also, as a bonus, if you do decide to use threads in the future, it's trivial to tell Boost.Asio to apply multiple threads to the problem.

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are general factors I found to be of major impact on throughput:
Access method (which in your case I guess is BTREE).
Level of persistency with which you configured BDB (for example, in my case the 'DB_TXN_WRITE_NOSYNC' environment flag improved write performance by an order of magnitude, but it compromises persistency)
Does the working set fit in cache?
Number of Reads Vs. Writes.
How spread out your access is (remember that BTREE has a page level locking - so accessing different pages with different threads is a big advantage).
Access pattern - meaning how likely threads are to lock one another, or even deadlock, and what your deadlock resolution policy is (this one may be a killer).
Hardware (disk & memory for cache).
This amounts to the following point:
Scaling a solution based on BDB so that it offers greater concurrency comes down to two options: either minimize the number of locks in your design or add more hardware.
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test and run it with increasing amounts of threads hammering and see what seems best.
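A skeleton for such a test could look like the Python sketch below. `DictStore` is only a stand-in (a dict behind one lock) so that the example runs anywhere; the numbers it produces say nothing about Berkeley DB. The idea is to swap in an object with the same `put`/`get` methods that wraps your real database handle, and to make the worker's access pattern match your application's.

```python
import threading
import time


class DictStore:
    """Stand-in for the real database: a dict guarded by a single lock."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def hammer(store, n_threads, ops_per_thread, value=b"x" * 1024):
    """Run put+get pairs of ~1KB records from n_threads threads; return operations per second."""

    def worker(tid):
        for i in range(ops_per_thread):
            key = (tid, i)
            store.put(key, value)
            store.get(key)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return (2 * n_threads * ops_per_thread) / elapsed


def sweep(make_store, thread_counts, ops_per_thread=1000):
    """Measure throughput with a fresh store for each thread count."""
    return {n: hammer(make_store(), n, ops_per_thread) for n in thread_counts}


if __name__ == "__main__":
    for n, ops in sweep(DictStore, [1, 2, 4, 8, 16]).items():
        print(f"{n:2d} threads: {ops:,.0f} ops/s")
```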
What I did when working against a database of unknown performance was to measure turnaround time on my queries. I kept upping the thread count until turnaround time got worse, and dropping the thread count until turnaround time improved (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
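One way to read the approach above is as a simple hill-climbing loop over the worker count, driven by a moving average of turnaround time. The Python sketch below is my own interpretation, not the original implementation; the class name `AdaptiveConcurrency`, its parameters and the one-step-at-a-time policy are all assumptions made for illustration.

```python
from collections import deque


class AdaptiveConcurrency:
    """Nudge the worker count up or down based on average turnaround time."""

    def __init__(self, start=4, lo=1, hi=64, window=20):
        self.workers = start
        self.lo, self.hi = lo, hi
        self.samples = deque(maxlen=window)  # recent turnaround times, in seconds
        self.prev_avg = None
        self.direction = +1                  # +1: add a worker, -1: remove one

    def record(self, turnaround_seconds):
        self.samples.append(turnaround_seconds)

    def adjust(self):
        """Call periodically at run time; returns the new worker count."""
        if not self.samples:
            return self.workers
        avg = sum(self.samples) / len(self.samples)
        if self.prev_avg is not None and avg > self.prev_avg:
            self.direction = -self.direction  # the last change made things worse: reverse
        self.prev_avg = avg
        self.samples.clear()
        self.workers = max(self.lo, min(self.hi, self.workers + self.direction))
        return self.workers


if __name__ == "__main__":
    ctl = AdaptiveConcurrency(start=4)
    for observed in [1.0, 0.8, 1.2, 1.0]:  # made-up average turnaround times
        ctl.record(observed)
        print(ctl.adjust())
```

Because it keeps probing in both directions, a loop like this keeps adapting when the database gets faster, the hardware changes, or another process starts loading the system.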
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.