I have an application that, unoptimized, will require many writes to the PostgreSQL database in response to real-time information - as many as one per second!
Therefore, I'd like to buffer this stream of data - either in Redis/redisco or memcached - and then do a single bulk_create in my PostgreSQL database every ~5 minutes.
As I understand it, Django's memcached backend stores data in memory, and the cache can be invalidated when a write is needed.
Alternatively, I was considering putting the information in Redis, perhaps using redisco models, and doing a bulk_create to the database every ~5 minutes.
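Roughly, the Redis variant I have in mind would look like this (just a sketch - the Reading model and the use of the plain redis-py client instead of redisco are placeholders):

    import json
    import redis                      # plain redis-py client here; I may use redisco instead
    from myapp.models import Reading  # placeholder model for the incoming real-time data

    r = redis.Redis()
    QUEUE_KEY = "pending_readings"

    def record(data):
        # Called once per incoming event (~1/sec): just push onto a Redis list.
        r.rpush(QUEUE_KEY, json.dumps(data))

    def flush_to_postgres():
        # Run every ~5 minutes (cron, celery beat, ...): drain the list, bulk insert.
        pipe = r.pipeline()
        pipe.lrange(QUEUE_KEY, 0, -1)
        pipe.delete(QUEUE_KEY)
        items, _ = pipe.execute()
        Reading.objects.bulk_create(
            [Reading(**json.loads(item)) for item in items]
        )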
3 part question:
Which option would be better for scaling long term?
What are some pros/cons of each?
Finally, does anyone have any references/tutorials I could read up on?
Thanks!
Premature optimization is the root of all evil. PostgreSQL is capable of handling heavy mixed read/write workloads. Start there and explore other options only as you need to. With a high-end server you will be able to get up to about 14,000 writes per second (depending on query specifics) on PostgreSQL 9.2 when it comes out; with 9.1 you max out at about 3,000 writes per second, and the difference has to do with locking behavior.
Don't optimize yet. If you start to get into the hundreds of writes per second, then maybe it would be worth it. However, especially if these are simple writes, you are better off keeping your architecture simple.
Related
I'm working on designing a middle layer for an application that will receive up to ~5,000 requests every few seconds and needs to retrieve information from a database. I've been looking at using the Play Framework (I use Scala for my REST API design), as it is advertised as fully async and built on Akka. However, the main bottleneck of any solution seems to be the reads/writes to the database. Many databases cannot support simultaneous reads/writes at that scale. How, then, is such high concurrency achieved for an app like this? I would guess Facebook/Twitter/(name another big company) have achieved this for their applications, since millions of people may be using them concurrently.
As Tim's comment says, caching may or may not be able to help in your case. If not, I would also recommend looking into horizontally scalable databases, for example CockroachDB if you want a transactional SQL database. Otherwise there are many NoSQL choices, MongoDB among them. And if you really want to stick with traditional SQL systems, you'll have to scale your servers vertically (buy the most expensive hardware) and work with read replicas.
A huge component is your data model and query access pattern. If each query is incrementing a shared counter that has to be synchronized, there will be a ton of contention; if, at the other end of the spectrum, each query touches completely separate data, there will be a lot less contention.
I think there are a couple of dimensions I would consider:
Data Schema and Access Patterns (discussed above)
Language Choice
This is important because, in a web-server context using prefork, each process may by default have its own connection to the database. In an environment like Python or Ruby you may need hundreds of processes to handle your load. Contrast this with Akka or another async networking-based runtime (Node, Python gevent/asyncio, Go, etc.), where a single instance with a small thread pool can handle a large number of requests. Each has its tradeoffs.
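As a rough illustration of the second model in Python - one process, one event loop, and a small shared connection pool serving many concurrent requests (asyncpg, the DSN, and the pool sizes are assumptions for the sketch):

    import asyncio
    import asyncpg  # assumed async PostgreSQL driver

    async def handle_request(pool, user_id):
        # Hundreds of these coroutines can be in flight at once,
        # but they share only a handful of database connections.
        async with pool.acquire() as conn:
            return await conn.fetchrow(
                "SELECT * FROM users WHERE id = $1", user_id
            )

    async def main():
        # One process, one small pool - contrast with prefork, where
        # each worker process would hold its own connection.
        pool = await asyncpg.create_pool(
            dsn="postgresql://localhost/app", min_size=5, max_size=10
        )
        await asyncio.gather(
            *(handle_request(pool, i) for i in range(1000))
        )
        await pool.close()

    asyncio.run(main())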
Distributed Systems
Depending on your data schema and access patterns, 5,000 requests per second to an RDBMS is completely achievable. It would probably require relatively beefy hardware, but I've personally done it a number of times. Getting to larger scales requires more computers in order to distribute the work/load. If your workload is read heavy and you can tolerate potentially stale reads, a read replica is one option. With another machine in the mix, reads are distributed over two machines, but writes are still directed at a single machine (the leader). Caching is another option.
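A bare-bones sketch of that read/write split (psycopg2 and the two DSNs are assumptions; most ORMs expose the same idea as a database router or read-replica setting):

    import psycopg2  # assumed driver; any client library splits reads/writes the same way

    LEADER_DSN = "host=db-leader dbname=app"    # all writes go here
    REPLICA_DSN = "host=db-replica dbname=app"  # potentially stale reads go here

    leader = psycopg2.connect(LEADER_DSN)
    replica = psycopg2.connect(REPLICA_DSN)

    def save_event(user_id, payload):
        # Writes always go to the single leader.
        with leader, leader.cursor() as cur:
            cur.execute(
                "INSERT INTO events (user_id, payload) VALUES (%s, %s)",
                (user_id, payload),
            )

    def recent_events(user_id):
        # Reads that can tolerate replication lag go to the replica.
        with replica, replica.cursor() as cur:
            cur.execute(
                "SELECT payload FROM events WHERE user_id = %s ORDER BY id DESC LIMIT 50",
                (user_id,),
            )
            return cur.fetchall()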
At much higher workloads, some sort of partitioning needs to occur in order to overcome the constraints of a single machine (see, for example, Vitess: https://github.com/vitessio/vitess).
Many of the big contenders have solutions for horizontally scaling their databases. This has many drawbacks as well and will require careful planning.
The one thing I'd recommend is that if the 5,000 requests per second is a projection for the near future rather than today's load, start with the minimal amount of hardware necessary (a single instance); query patterns and operations get exponentially more complicated with a distributed database.
I have Neo4j v2.1.6 (default configuration) and Neo4j.rb v4.1.0. All queries are slow, around 50 ms, and I have only 5 nodes in the db.
For example:
User.find_by(person_id: 826268332)
CYPHER 47ms MATCH (n:`User`) WHERE (n.person_id = {n_person_id}) RETURN n LIMIT {limit_1} | {:n_person_id=>826268332, "limit_1"=>1}
Where could the problem be?
I'm one of the core maintainers of Neo4j.rb, along with Brian Underwood, who replied above. This is not exactly a full answer, since we would need to know more about your system to give one, but I'm posting it here because it's too much for a comment.
My money is on something wrong with your DB or your system. We had a similar issue reported -- slow queries when working locally, with no determinable cause -- for a user running Windows. See "Neo4j.rb version 3.0 slow performance RoR, over 1024ms for all queries". We weren't able to pin it down. Locally, running that exact same query, I see 13 ms the first time I run it and ~3 ms every time after that. Indexing won't make a difference in a DB that small.
Ways to limit the chance of a problem and generally improve performance:
Use Ruby MRI 2.2.0
Use Neo4j 2.1.6 or 2.2.0
Use Mac or Linux, not Windows
Require the oj and oj_mimic_json gems in your app
You will see longer responses for a query like that if your db and app server are in two different networks.
Regarding the comment that this simple query is much faster in MongoDB and PostgreSQL: yes, it's going to be. Both of those return simple queries faster than Neo4j.rb for at least two reasons:
The Ruby gems for connecting to those DBs do not use a REST interface; they use custom binary protocols.
Both of those are optimized for returning single records quickly; Neo4j is optimized for returning large groups of records quickly.
Before releasing Neo4j.rb 4.0, I did a ton of benchmarks against PostgreSQL and MongoDB and found the same results: they crush us when returning single objects. (PostgreSQL is amazing technology in general.) As soon as you start looking for related objects, though, things balance out, and as you add complexity, the difference becomes even more significant. I don't have any numbers to share, unfortunately, but I'll write a blog post about it sometime soon if I have the time.
That is strange. In the neo4j gem I often see simple queries run in around 1-5 ms.
For debugging, what if you did this?
User.where(yeti_person_id: 826268332).first
Also, what does this give you?
puts User.where(yeti_person_id: 826268332).to_cypher
We are developing an online school diary application using Django. The prototype is ready, and the project will go live next year with about 500 students.
Initially we used SQLite and hoped that for the initial implementation this would perform well enough.
The data tables are such that, to obtain the details of a school day (periods, classes, teachers, classrooms), many tables are used, and the database access takes 67 ms on a reasonably fast PC.
Most of the data is static once the year starts, with perhaps minor changes to classrooms. I thought of extracting the timetable for each student for each term day so that no table joins would be needed. I put this data into a text file for one student; the file is 100 KB in size. The time taken to read this data and process it for a day's timetable is about 8 ms. If I pre-load the data on login and store it in the session, it takes 7 ms at login and 2 ms for each query.
With 500 students, what would be the impact on the web server using this approach, and what other options are there (putting the student text files into some sort of memory cache rather than the session, for example)?
There will not be a great deal of data entry (students adding notes, teachers likewise), so it will mostly be checking the timetable status and looking to see what events exist for that day or week.
What is your expected response time, and what is your expected number of requests per minute? One twentieth of a second for the database access (which is likely to be the slow part) of a request doesn't sound like a problem to me. SQLite should perform fine in a read-mostly situation like this. So I'm not convinced you even have a performance problem.
If you want faster response you could consider:
First, ensuring that you have the best response time by checking your indexes and profiling individual retrievals to look for performance bottlenecks.
Pre-computing the static parts of the system and storing the HTML. You can put the HTML right back into the database or store it as disk files.
Using the database as a backing store only (to preserve the state of the system when the server is down) and reading the entire thing into in-memory structures at system start-up (see the sketch after this list). This eliminates disk access for the data, although it limits you to one physical server.
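As a rough Django sketch of that third option (Student, term_days(), and render_day() are placeholder names for your own models and helpers, not anything from your project):

    # timetable/apps.py - load everything into memory once, at start-up
    from django.apps import AppConfig

    TIMETABLES = {}  # (student_id, iso_date) -> pre-rendered timetable HTML

    class TimetableConfig(AppConfig):
        name = "timetable"

        def ready(self):
            from .models import Student
            from .rendering import render_day  # your own helper
            for student in Student.objects.all():
                for day in student.term_days():  # assumed method on the model
                    TIMETABLES[(student.pk, day.isoformat())] = render_day(student, day)

A view lookup is then a plain dictionary access, with no joins and no disk reads, at the cost of being tied to one server process. (Django discourages heavy database work in ready(); a post-startup warm-up task would be the tidier variant.)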
This sounds like premature optimization. 67 ms is scarcely longer than the ~50 ms threshold at which we humans can perceive a delay.
SQLite's representation of your data is going to be more efficient than a text format, and unlike a text file that you have to parse, the operating system can efficiently cache just the portions of your database that you're actually using in RAM.
You can lock down ~50MB of RAM to cache a parsed representation of the data for all the students, but you'll probably get better performance using that RAM for something else, like the OS disk cache.
I agree with some of the other answers which suggest using MySQL or PostgreSQL instead of SQLite. SQLite is not designed to be used as a production database for server applications. It is great for storing data for single-user applications such as mobile apps or even desktop applications, but it falls short very quickly in server applications. With Django it is trivial to switch to any other full-fledged database backend.
If you switch to one of those, you should not really have any performance issues, especially if you do all the necessary joins using select_related and prefetch_related.
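For the timetable lookup described in the question, that would look roughly like this (the model and field names here are guesses at your schema, purely for illustration):

    # One query for the periods plus their FK targets, and one extra query for the M2M.
    periods = (
        Period.objects
        .filter(school_class__students=student, date=day)
        .select_related("teacher", "classroom", "subject")   # joined in the same query
        .prefetch_related("school_class__students")          # fetched in one extra query
    )
    for p in periods:
        print(p.subject.name, p.teacher.name, p.classroom.number)  # no further queries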
If you still need more performance, then considering that "most of the data is static", you might actually want to convert the Django site into a static site (a collection of HTML files) and serve those using nginx or something similar. The simplest way I can think of doing that is to write a cron job which loops over all the needed URL configs, requests each page from Django, and then saves it as an HTML file. If you want to go in that direction, you might also want to take a look at Python's static site generators: Hyde and Pelican.
This approach will certainly be much faster than any caching system; however, you will lose any dynamic components of the site. If you need them, then caching seems like the best and fastest solution.
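The cron-job variant could be as small as this (the URL list, port, and output directory are only examples; it assumes the Django site is already running and the requests library is installed):

    import pathlib
    import requests

    PAGES = ["/timetable/2015-09-01/", "/timetable/2015-09-02/"]  # example URLs
    OUT = pathlib.Path("/var/www/static_site")

    for page in PAGES:
        resp = requests.get("http://127.0.0.1:8000" + page)  # ask Django to render the page
        resp.raise_for_status()
        out_file = OUT / page.strip("/") / "index.html"      # mirror the URL structure on disk
        out_file.parent.mkdir(parents=True, exist_ok=True)
        out_file.write_text(resp.text)

Point nginx at /var/www/static_site and it will serve those files without touching Django at all.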
You should use MySQL or PostgreSQL for your production database. sqlite3 isn't a good idea.
You should also avoid pre-loading data at login. Since your records can be inserted in advance, write Django management commands and run the import into your chosen database beforehand, and design your models such that, when a user logs in, they can already access and view/edit their related data (which was pre-inserted before the application even went live). Hard-coding data operations at login does not smell right at all from an application design point of view.
https://docs.djangoproject.com/en/dev/howto/custom-management-commands/
The benefit of designing your Django models and using custom management commands to insert the records before your application goes live is that you can use the Django ORM to make the appropriate relationships between users and their records.
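A minimal example of such a command (the CSV layout and the Student/Period models are only illustrative):

    # myapp/management/commands/import_timetable.py
    # Run before go-live with: python manage.py import_timetable timetable.csv
    import csv
    from django.core.management.base import BaseCommand
    from myapp.models import Student, Period  # placeholder models

    class Command(BaseCommand):
        help = "Pre-load timetable records so they already exist when students log in."

        def add_arguments(self, parser):
            parser.add_argument("csv_path")

        def handle(self, *args, **options):
            with open(options["csv_path"]) as f:
                for row in csv.DictReader(f):
                    student = Student.objects.get(admission_no=row["admission_no"])
                    Period.objects.get_or_create(
                        student=student,
                        date=row["date"],
                        slot=row["slot"],
                        defaults={"subject": row["subject"], "room": row["room"]},
                    )
            self.stdout.write(self.style.SUCCESS("Import complete"))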
I suspect - based on your description of what you need above - that you need to rethink the approach you are taking with this application.
With 500 students, we shouldn't even be talking about caching. If you want response speed, you should deal with the following issues, in priority order:
Use a production quality database
Design your application use case correctly and design your application model right
Pre-load any data you need to the production database
Front-end optimization comes first (CSS/JS compression, etc.)
Use the Django Debug Toolbar to figure out if any of your SQL is slow, and optimize specifically those queries
Implement caching (memcached, etc.) as needed
As a general guideline.
Note: I'm using Postgres 9.x and Django ORM
I have some functions in my application which open a transaction, run a few queries, then spend a couple of full seconds doing other things (third-party API access, etc.), and then run a few more queries. The queries aren't very expensive, but I've been concerned that, by having many transactions open for so long, I'll eventually bog down my database or run out of connections or something. How big of a deal is this, performance-wise?
Keeping a transaction open has pros and cons.
On the plus side, every transaction has an overhead cost. If you can do a couple of related things in one transaction, you normally win performance.
However, you acquire locks on rows or whole tables along the way (especially with any kind of write operation). These are automatically released at the end of the transaction. If other processes might wait for the same resources, it is a very bad idea to call external processes while the transaction stays open.
Maybe you can make the call to the third-party API before you acquire any locks, and then do all the queries in swift succession afterwards?
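With the Django ORM mentioned in the question, that usually means keeping the external call outside of transaction.atomic(), roughly like this (call_third_party_api, Order, and LogEntry are placeholders):

    from django.db import transaction

    def process(order_id):
        # Slow external work first - no transaction open, no locks held.
        api_result = call_third_party_api(order_id)  # placeholder for your API call

        # Then one short transaction: locks are held only for these few queries.
        with transaction.atomic():
            order = Order.objects.select_for_update().get(pk=order_id)
            order.status = api_result["status"]
            order.save()
            LogEntry.objects.create(order=order, payload=api_result)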
Read about checking locks in the Postgres Wiki.
While not an exact answer, I can't recommend this presentation highly enough.
“PostgreSQL When It’s Not Your Job” at DjangoCon US
It is from this year's DjangoCon, so there should be a video also, hopefully soon.
Plus, check out the author's blog; it's a gold mine of useful information on Postgres as a whole and Django in particular. You'll find interesting info about transaction handling there.
SQLite is a great little database, but I am having an issue with it on Windows. It can take up to 50 seconds to perform a query on a 100MB database the first time the application is launched. Subsequent loads take 10% of that time.
After some discussions on the SQLite mailing list, I am told
"The bug is in Windows. It aggressively pre-caches big database files
-- reads in big chunks of the files -- to make it look as if programs
like Outlook are better than they really are. Unfortunately although
this speeds up some programs it makes others act jerky because they
have no control over how much is read when they ask for just a few
bytes of file."
This problem is compounded because there is no way to get progress information from SQLite while all this is happening, so my users think something is broken. (I could display a dummy progress report, but that would be really cheesy for a sharp tool.)
I believe there is a way to turn the pre-caching off globally, but is there some way around this programmatically?
I don't know how to fix the caching problem, but 50 seconds sounds extreme. If the query itself takes 10% of that, that leaves 45 seconds to load a 100 MB file. Even if Windows reads in the entire file in one go, that shouldn't take more than a couple of seconds at normal hard drive speeds.
Is the file very fragmented or something?
It sounds to me like there's more than just precaching at play here.
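That said, if it really is first-touch read-ahead, one programmatic workaround is to warm the file into the OS cache yourself before opening the database; reading it in chunks also gives you real numbers to show in a progress bar instead of a dummy one. A rough sketch (Python purely for illustration - the same loop works in whatever language the application is written in):

    import os
    import sqlite3

    DB_PATH = "app.db"       # example path
    CHUNK = 1024 * 1024      # read 1 MB at a time

    def warm_cache(path, report):
        total = os.path.getsize(path)
        done = 0
        with open(path, "rb") as f:
            while f.read(CHUNK):
                done = min(done + CHUNK, total)
                report(done / total)  # genuine progress for the UI

    warm_cache(DB_PATH, lambda fraction: print(f"loading {fraction:.0%}"))
    conn = sqlite3.connect(DB_PATH)   # the first query should now hit a warm cache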
I'm having the same problem with my first query, too. The problem returns after the database hasn't been queried for a long time, so it seems to be a memory-caching problem. My software runs 24/7, and every once in a while the user performs the SELECT query. I am also performing the query on a database of the same size.