using SQLite in Django in production?

using SQLite in Django in production? - django

Sorry for this question, I dont know if i've understood the concept, but SQLite is Serverless, this means the database in in a local machine, and it's stored in one file, this file is only accessible on one mode: if one client reads it, it's made only for reading mode for other clients, and if a client writes, then all clients have the write mode, so only in one mode at once!
so imagine that i've made a django application, a blog for example; then how is this made using sqlite? since if a client enters to the blog he gots the reading mode to see the page and the blog entries, and if a registred client tries to add a comment then the file will be made as write mode, so how can sqlite handle this?
so, does SQLite is here just like the BaseHTTPServer (the server shipped with django), for testing and learning purpose?

Different databases manage concurrency in different ways, but in sqlite, the method used is a global database-level lock. Only one thread or process can make changes to a sqlite database at a time; all other, concurrent processes will be forced to wait until the currently running process has finished.
As your number of users grows; sqlite's simple locking strategy will lead to increasingly great lock contention, and you will need to migrate your data to another database, such as MySQL (Which can do row level locking, at least with InnoDB engine) or PostgreSQL (Which uses Multiversion Concurrency Control). If you anticipate that you will get a substantial number of users (on the level of say, more than 1 request per second for a good part of the day), you should migrate off of sqlite; and the sooner you do so, the easier it will be.

SQLite is not like BaseHTTPServer or anything basic like that. It's a fully featured embedded database. Quite fast too. Its SQL language might not have the most bells and whistles, but it's flexible enough. I haven't run into cases where I needed something it cannot do for the projects I was involved in (which aren't your typical web apps, truth be told).
Anyone that claims SQLite is good or bad for production without discussing the actual design is not telling you much. SQLite is pretty fast. In some cases, literally orders of magnitude faster than, say, Postgres, which comes up as a go-to alternative among Djangonauts. As someone pointed out, it also supports lots of concurrency. It's a matter of whether your app falls under the 'some cases' or not.
Now, there is one significant factor that has to be taken into account. SQLite is an in-process database. This is really important. If you are using something like gevent, you may run into edge cases where your app breaks. E.g., trying to do a transaction where you have a context switch in middle of it can possibly break the transaction in horrible ways. In other words, 'concurrency' really depends on your app, because SQLite is part of your app.
What you can't do with SQLite, though, in terms of scaling, is you can't make clusters of SQLite servers like you can with some of the other database engines, because it's in-process. Your app may or may not need to go to such lengths in terms of scaling, but my guess is that vast majority of apps out there don't anyway (wild guess).
On the other hand, being in-process means adding custom functions and aggregates to it is pretty trivial. I'm not sure if Django's ORM makes that any more difficult than it has to be, but you can come up with pretty good designs taking advantage of those features.

This issue in database theory is called concurrency and SQLite does support it in Windows versions > Win98 and elsewhere according to the FAQ:
http://www.sqlite.org/faq.html#q5
We are aware of no other embedded SQL database engine that supports as
much concurrency as SQLite. SQLite allows multiple processes to have
the database file open at once, and for multiple processes to read the
database at once. When any process wants to write, it must lock the
entire database file for the duration of its update. But that normally
only takes a few milliseconds. Other processes just wait on the writer
to finish then continue about their business. Other embedded SQL
database engines typically only allow a single process to connect to
the database at once.
Basically, do not worry about concurrency, any database worth its salt takes care of just fine. More information on as how SQLite3 manages this can be found here. You, as a developer, not a database designer, needn't care about it unless you are interested in the inner-workings.

SQLite will only work effectively in production from some specific situations. It's quite easy to get MySQL or PostgreSQL up and running, even on Windows, and have a database that works in most situations.
The real problem is that SQLite3 isn't threaded in Django so only one PAGE view can happen at a time on your server, see this bug https://code.djangoproject.com/ticket/12118 Fixed
I don't use SQLite3 even in development.
EDIT: I keep getting downvoted here but the Django documentation itself recommended not using SQLite3 in Production at the time I wrote this answer. The documentation still contains the following caveat:
SQLite provides an excellent development alternative for applications that are predominantly read-only or require a smaller installation footprint.
If you do not have a small foot print/read-only Django instance, do NOT use SQLite3. Feel free to continue to downvote this answer.

It is not impossible to use Django with Sqlite as database in production, primarily depending on your website/webapp traffic and how hard you hit your db (alongside what kind of operations you perform on it i.e. reads/writes/etc). In fact, approaching end of 2019, I have used it in several low volume applications with less than 5k daily interactions (these are more common than you might think).
Simply put for the current state of tech , at the moment Sqlite-3 supports unlimited concurrent reads (or as far as your machine / workers can handle), BUT only a single process can write to it at any point in time. Bear in mind, a well designed query/ops to the db will last only miliseconds!
Coming from experience in using sqlite as the only db for simple non-routine (by non-routine i mean that a typical user would not be using this app on a daily basis year-round) production web app for overseas job matching that deal with ~5000 registered students (stats show consistently less than 2k requests per day that involves hitting the database during peak season - 40% write 60% read), I've had no problems whatsoever with timeouts/performance issues.
It really boils down to being pragmatic about the development and the URS (client spec). If it becomes the next unicorn , one can always migrate the SQLITE to another RDBMS. For instance, see David d C e Freitas's take on migration in Quick easy way to migrate SQLite3 to MySQL?
Additionally the SQLITE website uses sqlite db at its backend .. see below...
The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015) it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time.
Bear in mind that the above quote is of course mainly referring to read operations, so the values may not be a applicable for write-heavy sites.
The example I gave above on the job matching application I built using sqlite as db is quite write heavy if you've noticed the numbers ... on average, 40% are short lived write operations (i.e. form submissions, etc etc) but bear in mind my volume hitting the db is only 2k per day during peak season.
Then again, if you realize that your sqlite.db is causing alot of timeout and bad user experience (408 !!! on form submission...), especially with Django throwing the OperationalError: database is locked error. (and then they have to key in the whole thing again)...You can always increase the timeout in your settings.py as per django docs as a temporary solution while you prepare for migrating the db.
'OPTIONS': {
# ...
'timeout': 20,
# ...
}
Again, it all boils down to pragmatic development and facing reality that the site may not attract as much activity as hoped , and is prone to over-engineering from the get-go.
There are many times that going for a simple solution enables faster time to market , essentially, to quickly test waters , and of course, be prepared If the piranhas do come in swarms and then its time to upgrade to another RDBMS.
With Django's ORM, for most cases you dont need to touch your models.py during migration to other supported sql db. Be VERY mindfull though that Sqlite does not support some more advanced functions or even fields that its bigger cousins MYSQL and POSTGRES do.

Late to the party, but the question is still relavant as of mid 2018.
"Client" of a blog site is a different term that a "database client". SQLite documentation refers to a client as a process opening a database file. Such process, say a django app, may handle many web app clients ("users") simultaneously and it still is going to be just one client from the standpoint of SQLiite.
The important consideration for choosing SQLite over proper RDBMS is whether your architecture is comprised of more than one software component connecting to a database. In such case, using SQLite may be a major performance bottleneck due to the fact that each app needs to access the same DB file, possibly over a network.
If multiple apps(database clients) is not the case, SQLite is a great production choice in 99% of cases. The remaining 1% is apps using specific DB features, apps under enormous load, etc.
Know your architecture.

The anwer to this question depends on the application that you want to deploy in production:
According to the how to use from the SQLite website, SQLite works great in production as the database engine for most website having low to medium traffic (which is to say, most websites).
They argue that the amount of web traffic that SQLite can handle depends on how heavily you use the database of your website. It is known that any site that gets fewer than 100K hits/day should work fine with SQLite. However, this 100K hits/day figure is a conservative estimate, not a hard upper bound.
In summary, SQLite might be a great choice for applications with fewer users and databases uses. Thus, use SQLite for website with fewer or medium interactions with the database and MySQL or PostgreSQL for website with higher interactions with the database.
Reference: sqlite.org

Related

Sharding existing postgresql database with PostgresXL

We want to shard our PostgreSQL DB, due to high disk load. Firstly, we looked at django-sharding library, but:
Very much rewriting in our backend
Migrating all tables to 64-bit primary keys is hard work on 300-400gb tables
Generating ids with Postgres Specific algorithm makes it impossible to move data from shard to shard. More than that, we have a large database with old ids. Updating all of them is a big problem too.
Generating ids with special tables makes us do a special SELECT query to main database every time we insert data. We have high write load, so it's not good.
Considring all these, we decided too look on Postgres database sharding solutions. We found 2 opportunities - Citus and PostgresXL. Citus makes us change data format too much and rewrite a big bunch of backend at the same time, so we are about to try PostgresXL as more transparent solution. But reading the docs, I can't understand some things and will be greatfull for recomendations:
Are there any other sharding workarounds except for Citus and PostgresXL? It would be good not to change much in our database on migrating.
Some questions about PostgresXL:
Do I understand correctly, that it's not Postgres extension, it's a standalone fork? So I should build all its parts from sources and than move data in some way?
How are Postgres and PostgresXL versions compatible? We have PostgreSQL 9.4. I don't see such a version in PostgresXL (9.2 or 9.5 no middle?). So can I use, for example, streaming replication for migration?
If yes/no, what is the best solution to migrate data? If I have 2Tb database with heavy write, can I migrate it somehow without stopping for a long period of time?
Thanks.

First off to save your self a LOT of headache have you looked at options Like Amazon's Auora, Dynomo, Red Shift, etc services? They are VERY cost effective at scale, as well as optimized and managed for you.
Actually Amazon's straight Postgress databases can handle MASSIVE amounts of reads or writes. We can go into 2,000- 6,000 IOPS on reads and another 2,000 to 6,000 IOPS in writes without issue. I would really look into this as the option. Azure, Oracle, and Google also have competing services.
Also be aware that Postgres-XL beyond all reason has no HA support. If you lose a single node you lose everything. The nodes can not fail over.
it's a standalone fork?
Yes, They are very different apps and developed separate from each other.
How are Postgres and PostgresXL versions compatible?
They arn't compatible. You can not just migration Postgres to Postgresl-XL. They work VERY differently.
Generating ids with Postgres Specific algorithm makes it impossible to >move data from shard to shard
Not following this, but with sharing you are not supposed to move data from one shard to another. The key being used generally needs to be something specific and unique to split/segregate your data on. Like a date, or a "type" field, or some other (hopefully ordered) field(s)/column(s). This breaks things up but has obvious pain in the a$$ limitations.
Are there any other sharding workarounds except for Citus and
PostgresXL? It would be good not to change much in our database on >>migrating.
Tons of options, but right off the bat going from a standard RDS, to a NoSql, or MPP database is going to be a major migration, a lot of effort, and have a LOT of limitations no matter what you do.
Next Postress-XL and Citus are MPP (massive parallel processing) clustering apps, not sharing specifically. That is part of what they can do, but it is not their focus.
Other options for MPP
pgPool -- (not great for heavy writes )
haProxy -- ( have not done it but read about it. Lost of work to setup and maintain. )
MySql Cluster -- (Huge pain to use the OSS version and major $$$ for the commercial version)
Green Plumb
Teradata
Vertica
what is the best solution to migrate data?
Very unlikely to find a simple migration for this kind of switch. You can expect to likely need to export the data your self from the existing RDS and import it to the new DB and will likely have to write something your self to get it the way you want it.

Advice on caching for Django/Postgres application

I am building a Django web application and I'd like some advice on caching. I know very little about caching. I've read the caching chapter in the Django book, but am struggling to relate it to my real-world situation.
My application will be a web front-end on a Postgres database containing a largeish amount of data (150GB of server logs).
The database is read-only: the purpose of the application is to give users a simple way to query the data. For example, the user might ask for all rows from server X between dates A and B.
So my database needs to support very fast read operations, but it doesn't need to worry about write operations (much - I'll add new data once every few months, and it doesn't matter how long that takes).
It would be nice if clients making the same request could use a cache, rather than making another call to the Postgres database.
But I don't know what sort of cache I should be looking at: a web cache, or a database cache. Or even if Postgres is the best choice (I'd just like to use it because it works well with Django, and is so robust). Could anyone advise?
The Django book says memcached is the best cache with Django, but it runs in memory, and the results of some of these queries could be several GB, so memcached might fill up the machine's memory quickly. But perhaps I don't fully understand how memcached operates.

Your query should in no way return several GB of data. There's no practical reason to do so, as the user cannot absorb that much data at a time. Your result set should be paged, such that the user sees only 10, 25, whatever results at a time. That then allows you to also limit your query to only fetch 10, 25, whatever records at a time starting from a particular index based on the page number.
Caching search result pages is not a particularly good idea, regardless, though. For one, the odds that different users will ever conduct exactly the same search are pretty minimal, and you'll end up wasting RAM to cache result sets that will never be used again. Also, something like logs should be real-time. If you return a cached result set, there might be new, relevant results that are not included, obscuring the usefulness of your search.

As mentioned above you have limitations on what problems caching can solve. As you are building this application, then I see no reason why you couldn't just plug in Django Haystack and Whoosh and see how it performs, then switching to some of the other more Enterprise search backends is a breeze.

Should I implement revisioning using database triggers or using django-reversion?

We're looking into implementing audit logs in our application and we're not sure how to do it correctly.
I know that django-reversion works and works well but there's a cost of using it.
The web server will have to make two roundtrips to the database when saving a record even if the save is in the same transaction because at least in postgres the changes are written to the database and comitting the transaction makes the changes visible.
So this will block the web server until the revision is saved to the database if we're not using async I/O which is currently the case. Even if we would use async I/O generating the revision's data takes CPU time which again blocks the web server from handling other requests.
We can use database triggers instead but our DBA claims that offloading this sort of work to the database will use resources that are meant for handling more transactions.
Is using database triggers for this sort of work a bad idea?
We can scale both the web servers using a load balancer and the database using read/write replicas.
Are there any tradeoffs we're missing here?
What would help us decide?

You need to think about the pattern of db usage in your website.
Which may be unique to you, however most web apps read much more often than they write to the db. In fact it's fairly common to see optimisations done, to help scaling a web app, which trade off more complicated 'save' operations to get faster reads. An example would be denormalisation where some data from related records is copied to the parent record on each save so as to avoid repeatedly doing complicated aggregate/join queries.
This is just an example, but unless you know your specific situation is different I'd say don't worry about doing a bit of extra work on save.
One caveat would be to consider excluding some models from the revisioning system. For example if you are using Django db-backed sessions, the session records are saved on every request. You'd want to avoid doing unnecessary work there.
As for doing it via triggers vs Django app... I think the main considerations here are not to do with performance:
Django app solution is more 'obvious' and 'maintainable'... the app will be in your pip requirements file and Django INSTALLED_APPS, it's obvious to other developers that it's there and working and doesn't need someone to remember to run the custom SQL on the db server when you move to a new server
With a db trigger solution you can be certain it will run whenever a record is changed by any means... whereas with Django app, anyone changing records via a psql console will bypass it. Even in the Django ORM, certain bulk operations bypass the model save method/save signals. Sometimes this is desirable however.
Another thing I'd point out is that your production webserver will be multiprocess/multithreaded... so although, yes, a lengthy db write will block the webserver it will only block the current process. Your webserver will have other processes which are able to server other requests concurrently. So it won't block the whole webserver.
So again, unless you have a pattern of usage where you anticipate a high frequency of concurrent writes to the db, I'd say probably don't worry about it.

Designing backend software for multiplayer cross platform app

I am currently in the initial design phase of my first app.
In my app there will be individual sessions containing 1-5 users.
I need to be able to keep track of each users gps location and be able to push and pull them to each of the users. Each user will have the most recently reported location of every other user in the session.
There will be other calculations done on the data set but that will be client side, the server should only need to handle pushing and pulling of user locations (and the usernames).
I'm predicting due to the nature of the app 90% of sessions should not last more than 2 hours, with the possibility of the server ending sessions that are older then 24-48 hours (once real world testing of the app begins I would have a better idea of how long sessions should last).
I was thinking of using django to build an API, and to store all the data in the program itself and not to use a database as this should be faster and I don't think it is necessary to store the data since it has such a short lifetime.
Is this a good starting point? Is there anything I should be thinking about or considering? I'm completely new to designing backend software.

While performance might not even be an issue in the beginning, there are some things you can do once you hit a certain load:
Keep all your session data in one model, even if you're denormalizing (putting redundant information into your database) your database a bit. That way you only have to do one read to the database and no expensive JOINs
Use the Django caching framework (https://docs.djangoproject.com/en/dev/topics/cache/) to cache views, so multiple reads of the same data don't have to hit the database
Before you start optimizing, profile your code to see where your performance bottlenecks really are. Sometimes you'll be surprised which operations are expensive, and which aren't.

Django -- I have a small app ready, Should I go on private VPS or Google App Engine?

I have my first app, not that big, but it is the first step. (next big one on the way)
Now if I want to put it on my own Linode VPS, I have to configure mod_python or mod_wsgi, as well as memcache, Ngix, mySQL or Postgresql, etc. to make it work. If I put it GAE, All I have to do is convert the models to use GAE's API.
What I like about GAE is scaling. (if they can really do it)
Then I'd only worry about developing my apps and doing SEO work on them instead of worrying about load share/balance, cache, db / IO redundancy, etc.
I don't want to do any porting later on. (I have to decide now and stick with it)
So, if you have any experience on this, what do you recommend:
1- Use VPS(s) for everthing
2- Use VPS(s) plus Amazon S3
3- Use VPS(s) plus Amazon S3 & SimpleDB
4- Use GAE
Also: Would I be able to get away with not having JOIN rights when using the BigTable?
Note: I don't have any spatial need now, but for a location table I might need that later on.
I'd like to know what do you think!

There's business risk and technical risk.
Business risk is that you might have to move hosts later for some external reason. VPS's, EC2, etc require more upfront investment, but keep you independent. Tools like Chef can help with the configuration effort.
Technical risk is that your application may not be easily implemented on the platform. Since most VPS options allow you to install arbitrary software, they minimize this, again at the cost of more configuration effort on your part. AFAIK, the largest constraint GAE enforces on you is it's difficult to do long running background tasks. (Working without JOINs and other aspects of de-normalized data requires a different way of thinking, but this approach is fairly common in web applications no matter where they run once the SQL database is larger than a single host can support.)
If you can live with both these risks, GAE would appear to save you a substantial amount of effort. If you cannot live with these risks, you should tailor your own environment.
As an aside, I find S3 to be worth it no matter your environment. It's far simpler than ensuring your local server static file storage is reliably backed up, and you never have to worry about capacity. It's best if you use it for data that is uploaded but rarely overwritten or deleted (think facebook photo albums).

I don't want to do any porting later on. (I have to decide now and stick with it)
If that's the case, wouldn't you prefer to control deployment from the outset? It could be a great pain to port back from GAE later down the line if you hit its limits (whether they be technological limits or simply business decisions by Google that run counter to your plans for the future of your app).
Also configuring mod_wsgi, installing postgres etc. isn't particularly difficult, and you don't have to worry about things like load balancing and db redundancy for a while yet.
If it were me, I'd prefer the long-term certainty of a traditional server over the quick win of GAE. It all depends on your vision for the app, however.

I may be biased, but if you can live with GAE's limitations it really saves you a lot of work and worry about system administration issues (and to some extent scaling) -- plus, it's free as long as your resource consumption is low (basically meaning your traffic is low).
Can you do without joins? I don't know, as I don't know your app -- I'm a SQL fanatic, myself, yet for simple enough needs I haven't found it too hard to adapt. As I see it, the main limitation of non-relational DBs is that they're nowhere as nice as relational ones for "ad hoc" queries... you typically have to write a lot of procedural code instead of a nice SELECT or two:-(. But, that's more of a "data mining later" issue than one connected with serving your web app -- probably best solved by regularly bulk-downloading data from the web app's online storage to a "data warehouse" kind of setup, anyway, even if such storage was relational in the first place;-).

Before deciding, it might be worth a quick prototype adaptation of your app to GAE. You might run into stoppers that force the decision. Possible stopper issues include
Your schema doesn't make the transition to BigTable
You're depending on some C-based library that GAE doesn't support
You have a few long-running requests that exceed the thresholds that GAE imposes

The answer depends on the complexity and nature of your model layer, really. If it's complex or tightly bound to the rest of your code, porting is likely to be a significant effort. If it's fairly straightforward, or easy to tear out and replace, I would say go for it.
These days, I mostly write new code for GAE, but the fact that I can simply deploy with a single command has really lowered the barrier I feel towards writing cool new apps. Not having to worry about deployment and hosting is quite liberating.

All I have to do is convert the models to use GAE's API.
I am sorry, you are totally mistaken.
You also need to rewrite all the views code that uses the ORM. There are no joins. So you have to deal with and write a lot of procedural code instead of the nifty SQL that provides U whatever you want.
Querying is slow. You need to override save method of each model to store additional information of that model which may take a lot of time to compute when need. You also need to work on memcache to make the queries fast enough.
And then, Guido has said Django 1.1 is going to be included in a future version of Appengine. I am hoping they will have an out of the box generic ORM to BigTable mapper.
That said, if your app is simple without many joins needed, you could use the appengine patch project to use the current version of django on Appengine. Here is how.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js