We want to shard our PostgreSQL DB due to high disk load. First, we looked at the django-sharding library, but:
It would require a lot of rewriting in our backend.
Migrating all tables to 64-bit primary keys is hard work on 300-400 GB tables.
Generating ids with a Postgres-specific algorithm makes it impossible to move data from shard to shard. On top of that, we have a large database with old ids, and updating all of them is a big problem too.
Generating ids with special tables forces an extra SELECT query against the main database every time we insert data. We have a high write load, so that's not good.
Considering all this, we decided to look at Postgres-level sharding solutions. We found two options - Citus and PostgresXL. Citus would make us change the data format too much and rewrite a big chunk of the backend at the same time, so we are about to try PostgresXL as the more transparent solution. But reading the docs, I can't understand some things and would be grateful for recommendations:
Are there any other sharding options besides Citus and PostgresXL? It would be good not to have to change much in our database when migrating.
Some questions about PostgresXL:
Do I understand correctly that it's not a Postgres extension but a standalone fork? So I should build all its parts from source and then move the data over in some way?
How do Postgres and PostgresXL versions map to each other? We have PostgreSQL 9.4, and I don't see such a version of PostgresXL (9.2 or 9.5, nothing in between?). So can I use, for example, streaming replication for the migration?
Either way, what is the best way to migrate the data? If I have a 2 TB database with a heavy write load, can I migrate it somehow without stopping it for a long period of time?
Thanks.
First off, to save yourself a LOT of headache, have you looked at options like Amazon's Aurora, DynamoDB, Redshift, etc.? They are VERY cost effective at scale, as well as optimized and managed for you.
Actually, Amazon's plain Postgres databases can handle MASSIVE amounts of reads or writes. We can go into 2,000-6,000 IOPS on reads and another 2,000-6,000 IOPS on writes without issue. I would really look into this as an option. Azure, Oracle, and Google also have competing services.
Also be aware that Postgres-XL, beyond all reason, has no HA support. If you lose a single node, you lose everything; the nodes cannot fail over.
it's a standalone fork?
Yes. They are very different apps and are developed separately from each other.
How are Postgres and PostgresXL versions compatible?
They aren't compatible. You cannot just migrate Postgres to Postgres-XL; they work VERY differently.
Generating ids with Postgres Specific algorithm makes it impossible to move data from shard to shard
Not following this, but with sharding you are not supposed to move data from one shard to another. The key being used generally needs to be something specific and unique to split/segregate your data on, like a date, a "type" field, or some other (hopefully ordered) field(s)/column(s). This breaks things up but comes with obvious pain-in-the-a$$ limitations.
Are there any other sharding workarounds except for Citus and PostgresXL? It would be good not to change much in our database on migrating.
Tons of options, but right off the bat, going from a standard RDS to a NoSQL or MPP database is going to be a major migration, take a lot of effort, and have a LOT of limitations no matter what you do.
Next, Postgres-XL and Citus are MPP (massively parallel processing) clustering apps, not sharding specifically. That is part of what they can do, but it is not their focus.
Other options for MPP:
pgPool -- (not great for heavy writes)
HAProxy -- (have not done it, but I have read about it; lots of work to set up and maintain)
MySQL Cluster -- (huge pain to use the OSS version and major $$$ for the commercial version)
Greenplum
Teradata
Vertica
what is the best solution to migrate data?
You are very unlikely to find a simple migration path for this kind of switch. Expect to export the data yourself from the existing RDS, import it into the new DB, and likely write something yourself to get it the way you want it.
Our staging server, a t2.micro instance on AWS, was going down constantly. On investigating, we found that when manage.py migrate is run, CPU usage shoots up to 99%. It was easily reproducible on a local machine. We are running Django 1.9 with a PostgreSQL database. I am not sure whether we are doing something wrong or it is meant to be that way. We have around 18 apps in the project, but running migrate app_name also results in the same behaviour. Attaching the screenshots of CPU usage.
Also, I profiled the migrate function, here is a graph:
Are you depending on migrate to run regularly? Because once the project is nearing and then entering production state, there shouldn't be many migrations to run. Or do you mean that migrate takes this long, even if migrate --list shows that there is nothing to migrate?
Also, to know what Postgres is doing, you should set up logging of queries including their time. You can filter to log only longer running queries:
http://www.postgresql.org/docs/9.5/static/runtime-config-logging.html
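For example, a minimal sketch of the relevant settings in postgresql.conf (the 200 ms threshold is just an assumption; tune it to your workload):

    # postgresql.conf - log every statement that runs longer than 200 ms
    log_min_duration_statement = 200
    # leave full statement logging off and rely on the duration threshold
    log_statement = 'none'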
Run those queries through the EXPLAIN ANALYZE SQL command:
psql> EXPLAIN ANALYZE <complete query>;
http://www.postgresql.org/docs/9.5/static/using-explain.html
You need to provide the information you get from explain to get further help.
EDIT:
Also, you could try to squash migrations if you have a lot of migration files. I could imagine that Django works itself through all of them, one by one. So if you have many apps with many files depending on each other, you can imagine what happens.
https://docs.djangoproject.com/en/1.9/topics/migrations/#squashing-migrations
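For example (the app name and migration number are placeholders; squashmigrations itself is a standard manage.py command):

    python manage.py squashmigrations myapp 0012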
EDIT 2:
Moving this from the comment into the answer:
Does migrate --list also consume that much CPU? If not, then you could run it first, see whether there really is a need to migrate and only run migrate on those apps that have open migrations.
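A quick way to check this (showmigrations is the newer equivalent of migrate --list; some_app is a placeholder):

    python manage.py showmigrations      # lists applied ([X]) and unapplied ([ ]) migrations
    python manage.py migrate some_app    # then migrate only the apps with open migrations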
I think this would be the best. If you can profile in more detail, you might actually address the Django community for help. I could imagine that you have an interesting setup with which to find out how to tune the Django migrations to do less (actually unnecessary) work. But I don't know the migrations code too much so I cannot tell.
But this also depends on how many apps we are talking about, and how many migration files. If you have less than 30 apps (including 3rd party), I think it should work fine and there is something else wrong (IMHO!).
Also, you have not shown the resource usage of your server. If the slowness is due to swapping/too much RAM usage you really might be able to boost it by supplying more RAM (to the process).
I believe migrations consume a lot, especially with many models and many apps: more apps means more dependencies and more migration complexity.
I would recommend starting a new instance which only runs the migration and shuts down afterwards. This way your web server stays reachable.
This does not address the problem statement exactly, but a part of it. I went through the AWS documentation and found that t2.micro instances are designed to handle short CPU bursts (~1 min) happening after reasonably long intervals. From the t2.micro documentation:
A CPU Credit provides the performance of a full CPU core for one minute. Traditional Amazon EC2 instance types provide fixed performance, while T2 instances provide a baseline level of CPU performance with the ability to burst above that baseline level. The baseline performance and ability to burst are governed by CPU credits.
Given this, running migrations shouldn't be an issue even if it consumes 100% of the CPU. We investigated more and found that there were crons running on the server which were not supposed to be there.
I would like to move a database in a Django project from one backend to another (in this case Azure SQL to PostgreSQL, but I want to think of it as a generic situation). I can't use a dump since the databases are different.
I was thinking of something at the Django level, like dumpdata, but depending on the amount of available memory and the size of the db it sometimes appears unreliable and crashes.
I have seen solutions that try to break the process into smaller parts that the memory can handle but it was a few years ago, so I was hoping to find other solutions.
So far my searches have failed since they always lead to 'south', which refers to schema migration and not moving data.
I have not implemented this before, but what about the following:
Django supports multiple databases, so just configure DATABASES in your settings file to include both the old Azure SQL database and the new PostgreSQL database. Then create a small script that makes use of bulk_create, reading the data from one DB and writing it to the other.
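I have not tested this, but a minimal sketch could look like the following (the model name and the 'azure'/'postgres' database aliases are assumptions standing in for whatever you configure in DATABASES; run it from manage.py shell):

    # copy_data.py - paste into "python manage.py shell"
    from myapp.models import Entry  # hypothetical model; repeat for each model you need

    BATCH_SIZE = 1000
    batch = []

    # Read rows from the old backend and write them to the new one in batches,
    # keeping primary keys so foreign keys still line up.
    for obj in Entry.objects.using('azure').order_by('pk').iterator():
        batch.append(obj)
        if len(batch) >= BATCH_SIZE:
            Entry.objects.using('postgres').bulk_create(batch)
            batch = []

    if batch:
        Entry.objects.using('postgres').bulk_create(batch)

Copy the models in dependency order (parents before children) so foreign keys resolve on the target database.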
Sorry for this question; I don't know if I've understood the concept, but SQLite is serverless. This means the database is on a local machine, stored in one file, and that file can only be accessed in one mode at a time: if one client reads it, it is in read-only mode for other clients, and if a client writes, then all clients are in write mode, so only one mode at once!
So imagine I've made a Django application, a blog for example; how does this work with SQLite? If a client visits the blog he gets read mode to see the page and the blog entries, and if a registered client tries to add a comment then the file goes into write mode, so how can SQLite handle this?
So, is SQLite here just like BaseHTTPServer (the development server shipped with Django), only for testing and learning purposes?
Different databases manage concurrency in different ways, but in sqlite, the method used is a global database-level lock. Only one thread or process can make changes to a sqlite database at a time; all other, concurrent processes will be forced to wait until the currently running process has finished.
As your number of users grows, SQLite's simple locking strategy will lead to ever greater lock contention, and you will need to migrate your data to another database, such as MySQL (which can do row-level locking, at least with the InnoDB engine) or PostgreSQL (which uses multiversion concurrency control). If you anticipate getting a substantial number of users (say, more than 1 request per second for a good part of the day), you should migrate off of SQLite; the sooner you do so, the easier it will be.
SQLite is not like BaseHTTPServer or anything basic like that. It's a fully featured embedded database. Quite fast too. Its SQL language might not have the most bells and whistles, but it's flexible enough. I haven't run into cases where I needed something it cannot do for the projects I was involved in (which aren't your typical web apps, truth be told).
Anyone that claims SQLite is good or bad for production without discussing the actual design is not telling you much. SQLite is pretty fast. In some cases, literally orders of magnitude faster than, say, Postgres, which comes up as a go-to alternative among Djangonauts. As someone pointed out, it also supports lots of concurrency. It's a matter of whether your app falls under the 'some cases' or not.
Now, there is one significant factor that has to be taken into account. SQLite is an in-process database. This is really important. If you are using something like gevent, you may run into edge cases where your app breaks. E.g., trying to do a transaction where you have a context switch in middle of it can possibly break the transaction in horrible ways. In other words, 'concurrency' really depends on your app, because SQLite is part of your app.
What you can't do with SQLite, though, in terms of scaling, is you can't make clusters of SQLite servers like you can with some of the other database engines, because it's in-process. Your app may or may not need to go to such lengths in terms of scaling, but my guess is that vast majority of apps out there don't anyway (wild guess).
On the other hand, being in-process means adding custom functions and aggregates to it is pretty trivial. I'm not sure if Django's ORM makes that any more difficult than it has to be, but you can come up with pretty good designs taking advantage of those features.
This issue in database theory is called concurrency, and SQLite does support it on Windows versions > Win98 and elsewhere, according to the FAQ:
http://www.sqlite.org/faq.html#q5
We are aware of no other embedded SQL database engine that supports as much concurrency as SQLite. SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
Basically, do not worry about concurrency; any database worth its salt takes care of it just fine. More information on how SQLite3 manages this can be found here. You, as a developer rather than a database designer, needn't care about it unless you are interested in the inner workings.
SQLite will only work effectively in production in some specific situations. It's quite easy to get MySQL or PostgreSQL up and running, even on Windows, and have a database that works in most situations.
The real problem is that SQLite3 isn't threaded in Django, so only one page view can happen at a time on your server; see this bug (since fixed): https://code.djangoproject.com/ticket/12118
I don't use SQLite3 even in development.
EDIT: I keep getting downvoted here but the Django documentation itself recommended not using SQLite3 in Production at the time I wrote this answer. The documentation still contains the following caveat:
SQLite provides an excellent development alternative for applications that are predominantly read-only or require a smaller installation footprint.
If you do not have a small-footprint/read-only Django instance, do NOT use SQLite3. Feel free to continue to downvote this answer.
It is not impossible to use Django with SQLite as the database in production, depending primarily on your website/webapp traffic and how hard you hit your db (alongside what kind of operations you perform on it, i.e. reads/writes/etc.). In fact, approaching the end of 2019, I have used it in several low-volume applications with fewer than 5k daily interactions (these are more common than you might think).
Simply put, for the current state of tech, SQLite3 supports unlimited concurrent reads (or as many as your machine/workers can handle), BUT only a single process can write to it at any point in time. Bear in mind that a well-designed query/operation against the db will last only milliseconds!
Coming from experience using SQLite as the only db for a simple, non-routine (by non-routine I mean that a typical user would not be using this app on a daily basis year-round) production web app for overseas job matching that deals with ~5,000 registered students (stats show consistently fewer than 2k requests per day that involve hitting the database during peak season - 40% writes, 60% reads), I've had no problems whatsoever with timeouts or performance issues.
It really boils down to being pragmatic about the development and the URS (client spec). If it becomes the next unicorn, one can always migrate from SQLite to another RDBMS. For instance, see David d C e Freitas's take on migration in Quick easy way to migrate SQLite3 to MySQL?
Additionally, the SQLite website itself uses an SQLite db as its backend; see below:
The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015) it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time.
Bear in mind that the above quote is of course mainly referring to read operations, so the values may not be applicable to write-heavy sites.
The job matching application I described above, built with SQLite as the db, is quite write-heavy if you've noticed the numbers: on average, 40% are short-lived write operations (i.e. form submissions, etc.), but bear in mind that my volume hitting the db is only 2k requests per day during peak season.
Then again, if you realize that your sqlite.db is causing a lot of timeouts and bad user experience (408 on form submission!), especially with Django throwing the OperationalError: database is locked error (and then users have to key in the whole thing again), you can always increase the timeout in your settings.py, as per the Django docs, as a temporary solution while you prepare to migrate the db:
'OPTIONS': {
    # ...
    'timeout': 20,
    # ...
}
Again, it all boils down to pragmatic development and facing the reality that the site may not attract as much activity as hoped and is prone to over-engineering from the get-go.
Many times, going for a simple solution enables a faster time to market (essentially, a quick way to test the waters). And of course, be prepared: if the piranhas do come in swarms, then it's time to upgrade to another RDBMS.
With Django's ORM, in most cases you don't need to touch your models.py when migrating to another supported SQL db. Be VERY mindful, though, that SQLite does not support some of the more advanced functions, or even fields, that its bigger cousins MySQL and Postgres do.
Late to the party, but the question is still relevant as of mid-2018.
A "client" of a blog site is a different term than a "database client". The SQLite documentation refers to a client as a process opening a database file. Such a process, say a Django app, may handle many web app clients ("users") simultaneously, and it is still going to be just one client from the standpoint of SQLite.
The important consideration when choosing SQLite over a proper RDBMS is whether your architecture comprises more than one software component connecting to a database. In such a case, using SQLite may be a major performance bottleneck, because each app needs to access the same DB file, possibly over a network.
If multiple apps (database clients) are not the case, SQLite is a great production choice in 99% of cases. The remaining 1% are apps using specific DB features, apps under enormous load, etc.
Know your architecture.
The answer to this question depends on the application that you want to deploy in production:
According to the usage guidance on the SQLite website, SQLite works great in production as the database engine for most websites having low to medium traffic (which is to say, most websites).
They argue that the amount of web traffic that SQLite can handle depends on how heavily your website uses the database. Any site that gets fewer than 100K hits/day should work fine with SQLite. However, this 100K hits/day figure is a conservative estimate, not a hard upper bound.
In summary, SQLite might be a great choice for applications with fewer users and lighter database use. Thus, use SQLite for websites with low to medium database interaction, and MySQL or PostgreSQL for websites with heavier database interaction.
Reference: sqlite.org
I'm slowly getting into the position where one of my Django sites needs some robustness behind it. It's currently running on a single VPS with a SQLite database and memcached. It's about as un-scaled as things can get.
If I bought another VPS account, what would I want to do?
Move to MySQL/PostgreSQL with replication? What's easiest? Does replication protect me from one server exploding? Are there concurrency downsides?
How do I load-balance between the two servers?
I'd put memcached on the new server too. If I put both IPs into the configuration, would that keep a copy of data on both servers? (I'm thinking of what happens to session data - currently stored in memcached)
I'm currently using Cherokee as the httpd - I'm sure this has its own set of issues. If you've any tips, let me know.
Am I going at this the wrong way? Is there an easier way to have faster, more robust django sites?
First step: switch from SQLite to a real production database (I like Postgres). This should happen long before you even think about a second VPS. SQLite essentially does not support concurrency at all. Personally, I wouldn't even consider deploying a live site on SQLite in the first place.
If your site is running on SQLite and is functioning, my guess is you are still quite a long ways from actually outgrowing your single VPS (unless it's already heavily loaded otherwise).
If/when you do need to add a second server, how you configure things depends on where you're actually seeing a bottleneck. Chances are it'll be the database, in which case a good step might be simply moving the database onto its own server (presuming you can guarantee low latency between the two VPSes) and loading the database server with as much RAM as you can afford. In general disk performance suffers most in a VPS, so another step to consider might be putting the DB onto raw metal.
I'd probably look at those steps before I'd think about DB replication or multiple web-tier servers, but it really depends on profiling your actual case (and how you value performance vs reliability).
Watching the Django Deployment Workshop by Jacob Kaplan-Moss should give you a good overview.
MySQL supports master-slave and master-master setups; I don't use PostgreSQL.
You can use nginx as your load balancer; HAProxy is an option too (SO uses it).
Memcached distributes the objects over the servers; if one crashes, the data on it is lost.
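For reference, a hedged sketch of what pointing Django at both memcached servers could look like (the IPs are placeholders); keys are sharded across the pool rather than mirrored, which is why a crashed node loses its share of the data:

    # settings.py - hypothetical two-node memcached pool
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            # Keys are distributed (hashed) across these servers, not duplicated.
            'LOCATION': ['10.0.0.1:11211', '10.0.0.2:11211'],
        }
    }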
I don't know Cherokee, but nginx is great.