I have a weird legacy database use-case: I have multiple databases, with (1) exactly the same schema, but (2) very different datasets. Databases, entire databases, with this schema, are being added to the total dataset every week.
Is there a way to (1) introspect the server to find out what databases are available, and if so, is there a way to (2) route to the correct database by URL, rather than by the current per-model solution (since my models don't change, only the associated underlying tables)?
Can this introspection be made dynamic, so every time someone hits the home page I can show them the list of available databases?
A generic solution is preferable, of course, but a MySQL-only solution is currently acceptable.
(The use case is the European Molecular Biology Lab's genome library, which is published every few months as a suite of MySQL database dumps, one database per species, with a core schema of about twenty tables that map nicely to six or so apps. The schema is stable and hasn't changed in years.)
Yes, you are able to run any raw SQL, and SHOW DATABASES is no exception. But it will be hard to change the list of available databases and to switch between them; I'm afraid this would require modifying or monkey-patching Django's internals.
Update: Wait! I've looked into the code behind django.db.connections and found that if you just extend settings.DATABASES at runtime, you'll be able to use SomeModel.objects.using('some-new-database').all() in your code. I haven't tested it, but I believe this should work!
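Building on that update, here is a rough, untested sketch of both ideas together (introspecting the server with raw SQL and registering aliases at runtime). The alias naming, the credential reuse, and the example model are illustrative assumptions, and whether connections picks up the new entry may depend on your Django version:

from django.conf import settings
from django.db import connection, connections


def discover_databases():
    """MySQL-only: list the databases visible on the server via raw SQL."""
    with connection.cursor() as cursor:
        cursor.execute("SHOW DATABASES")
        return [row[0] for row in cursor.fetchall()]


def register_database(alias):
    """Clone the default connection settings but point NAME at another database."""
    if alias not in settings.DATABASES:
        db = dict(settings.DATABASES["default"], NAME=alias)
        settings.DATABASES[alias] = db
        # Some Django versions cache the settings on the connection handler,
        # so mirror the entry there as well.
        connections.databases[alias] = db


# Example usage in a view (SomeModel and the alias are placeholders):
# for name in discover_databases():
#     register_database(name)
# rows = SomeModel.objects.using("homo_sapiens_core").all()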
Our shop has recently started taking on an SOA approach to application development. We are seeing some great benefits with the separation of concerns, reusability, and other benefits of SOA/microservices.
However, one big item we're stuck on is aggregating, filtering, and paginating results across services. Let me describe the issue with a scenario.
Say we have 3 services:
PersonService - Stores information on people (names, addresses, etc)
ItemService - Stores information on items that are purchasable.
PaymentService - Stores information regarding payments that people have made for different items.
Now, say we want to build a reporting/admin tool that can display / report on multiple services in aggregate. For instance, we want to display a paginated list of Payments, along with the Person and Item that each payment was for. This is pretty straightforward: Grab the list of payments, then query PersonService and ItemService for the respective Person and Item records.
However, the issue comes into play when we want to then filter down that data: For instance, displaying a paginated list of payments made by people with the first name 'Bob', who have purchased the item 'Car'. This makes things much more complicated, because we need to filter results from 3 different services without knowing how many results each service is going to return.
From a performance perspective, querying all of the services over and over again to narrow down the results would be costly, so I've been researching better solutions. However, I cannot find concrete solutions to this problem (or at least a "best practice"). In a monolithic application, we'd simply use SQL joins across the different tables. I'm having a ton of trouble figuring out how/if something similar is possible across services.
My question to the community is: What would your approach be? Things I've considered:
Using some sort of search index (Elasticsearch, Solr) that contains all data for all services (updated via events pushed out by services), and then querying the search index for results.
Attempting to understand how projects like GraphQL and Neo4j may assist us with these issues.
I stick with Sam Newman, who says something like this in Chapter 4, "The Shared Database", of his book:
Remember when we talked about the core principles behind good microservices? Strong cohesion and loose coupling -- with database integration, we lose both things. Database integration makes it very easy for services to share data, but does nothing about sharing behaviour. Our internal representation is exposed over the wire to our consumers, and it can be very difficult to avoid making breaking changes, which inevitably leads to fear of any changes at all. Avoid at (nearly) all costs.
This is the point I make when I curse at Content-Management-Systems.
In my view a microservice is autonomous, which it cannot be if it shares things or consumes shared things. The only exception I make here is Domain-Objects; those represent the shared understanding of the business model and must be used solely in communication between microservices.
It depends on the microservice itself if an ER or AggregationOriented database (divided into document based or graph based) better suits the needs.
The funny thing is, by being loosely coupled and autonomous you are able to do just that!
If a PaymentService shares the behaviour of "how many payments for Person A",
it needs to know Person A in order to fulfil this. But everything it knows about Person A must originate from the PersonService, maybe at runtime (the PaymentService may just store an id) or event-based (the PaymentService stores the data it needs, up to the Domain-Object User, which gets updated, triggered, and supplied by the PersonService). The PaymentService itself does not share users.
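A minimal sketch of that event-based option (the event shape, storage, and handler names are assumptions, not anything defined by the services above): the PaymentService keeps only the person fields it needs, refreshed whenever the PersonService publishes an update.

from dataclasses import dataclass


@dataclass
class PersonSnapshot:
    person_id: str
    first_name: str  # only the fields the PaymentService actually needs


class PaymentService:
    def __init__(self):
        self._persons = {}   # local, service-owned copy, keyed by person_id
        self._payments = []  # tuples of (payment_id, person_id, amount)

    def on_person_updated(self, event):
        """Subscribed to the PersonService's 'person.updated' events."""
        self._persons[event["person_id"]] = PersonSnapshot(
            person_id=event["person_id"],
            first_name=event["first_name"],
        )

    def payments_for_first_name(self, first_name):
        """Answer 'how many payments for people named X' without a runtime call."""
        ids = {p.person_id for p in self._persons.values() if p.first_name == first_name}
        return [pay for pay in self._payments if pay[1] in ids]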
The answer to this question is that you need a separate Read Database or Materialized View that aggregates data from multiple databases, and makes it ready for fast retrieval. See the CQRS pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
The data in the Materialized View might not be "the most up to date", meaning there might be a small delay between when the change is made by the respective microservice and the time the "Materialized View" is updated, but this is fine, as retrieving the data fast is more important than whether the data is stale for a few seconds or even minutes (there are systems where the Materialized View can take 2-5 minutes to be updated, and yet that might still be acceptable).
The best pattern to implement this Read Database or Materialized View from CQRS, is typically the Event Sourcing pattern, where we can listen to a queue for new updates and update the Read Database immediately. See the Event Sourcing pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
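As a hedged illustration of the pattern (the event names, the denormalized row layout, and the use of SQLite as the read store are assumptions made for brevity), a consumer can fold events into one flat reporting table, so the "Bob bought a Car" query becomes a single filtered, paginated SELECT:

import sqlite3

# The read database: one denormalized table built purely for reporting queries.
read_db = sqlite3.connect("report_read_model.db")
read_db.execute(
    "CREATE TABLE IF NOT EXISTS payment_report ("
    " payment_id TEXT PRIMARY KEY,"
    " person_first_name TEXT,"
    " item_name TEXT,"
    " amount REAL)"
)


def handle_event(event):
    """Called for every message pulled off the event stream/queue."""
    if event["type"] == "payment.created":
        read_db.execute(
            "INSERT OR REPLACE INTO payment_report VALUES (?, ?, ?, ?)",
            (event["payment_id"], event["person_first_name"],
             event["item_name"], event["amount"]),
        )
        read_db.commit()


def paginated_report(first_name, item, page, page_size=20):
    """Filter and paginate entirely inside the read model."""
    return read_db.execute(
        "SELECT * FROM payment_report"
        " WHERE person_first_name = ? AND item_name = ?"
        " ORDER BY payment_id LIMIT ? OFFSET ?",
        (first_name, item, page_size, (page - 1) * page_size),
    ).fetchall()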
Storing this data in elasticsearch/solr/cognitivesearch type service in addition to SQL could help solve some of these problems.
In your given example, the person document in the search index (Elasticsearch/Solr/Cognitive Search) would have a property called "items" that contains a list of the items paid for by that person.
That way, you can filter across objects and get a paginated list that is sorted by any property of the person. You can add similar information to other documents to better suit your business needs.
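A rough sketch of what that could look like with the official Python client (8.x-style calls; the index name, document shape, and field names are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each person document carries the items they have paid for, pushed in by events.
es.index(index="persons", id="p-1", document={
    "first_name": "Bob",
    "items": ["Car", "Bike"],
})

# "Payments made by people named Bob who bought a Car", sorted and paginated.
results = es.search(
    index="persons",
    query={"bool": {"must": [
        {"term": {"first_name.keyword": "Bob"}},
        {"term": {"items.keyword": "Car"}},
    ]}},
    sort=[{"first_name.keyword": "asc"}],
    from_=0,
    size=20,
)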
Using a GraphDatabase would seem to solve your problem from 10,000 ft, but you will run into pagination problems when you operate at scale. Graph databases do not do pagination well (they will have to visit all the nodes anyway, even when you only need a paginated list) and will cause timeouts/performance issues.
You can use replication tables.
Most relational databases have a replication feature.
If you have a PersonService that has a person table and a PaymentService that has a payment table, then create a ReportService that has both person and payment tables, which are filled by the replication feature.
I'll provide more background information first. The question is asked again in the last bullet of "My Thoughts & Questions" section.
Background
I am working on a legacy system which looks kind of like a batch processing system: A job is passed along a series of worker programs that each works on part of the whole job until it is finished. The job status/management information is generated by each worker program along the way and written into local text files with names like "status_info.txt" and "mgmt_info.txt".
This processing system does NOT use any databases, just pure text files on the local file system. Changing their code to use a database would also be kind of expensive, which I want to avoid.
I am trying to add a GUI to this system for primarily two purposes. First, let the users view (a read operation) the job status and management information so they can get a big picture of what's going on and whether there is any error in any steps. Second, allow the users to redo one or more steps by changing (a write operation) the job status and management information.
My Current Solution
I am thinking of using Django to develop the GUI because:
Django is fast for development; Django is web-based, so almost no installation is required;
I want to enable remote monitoring of the system so a web-based, in-browser GUI makes more sense;
I've used some Django before, so I have some experience.
However, I see Django mostly works with a real database: SQLite, MySQL, PostgreSQL, etc. The user-defined models are mapped to tables in these databases by Django automatically. However, the legacy system only produces text files.
My Thoughts & Questions
Fortunately, I noticed that the text files are all in one of the two formats:
Multiple lines of strings;
Multiple lines of key-value pairs.
Both formats look easy to match to a database table design. For example, a "multiple lines of strings" can be considered as a DB table of a single column of text, while a "multiple lines of key-value pairs" as a two-column table.
Therefore, my question is: Can I build my models upon local text files instead of a real database, and somehow override some code somewhere in Django that acts as the interface between the core framework and the external database, so these text files play the role of a "database" to Django and the reading/writing operations happen on these files? I've searched the internet and Stack Overflow but haven't had any luck. I'd appreciate it if you can give me some helpful links.
What Not to Do.
If you are going to reproduce an RDBMS using files, you are in for a lot, and I mean a lot, of grief and hard work. Even the simplest RDBMS like SQLite has thousands of man-hours of work invested in it. If you were to bring your files into Django or any other framework, you would need to write a custom backend for it.
What To Do
Create Django models backed by an RDBMS and import the files into it. Alternatively, since this data appears to be mostly key-value pairs, you might be able to use MongoDB or Redis.
You can use inotify to monitor the file system to detect when a new file has been created by the batch processing system. When that happens, you can invoke a Django management command to process that file and import its data into the database.
The rest of it is a straightforward Django app.
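A hedged sketch of that import step as a management command (the JobInfo model, its fields, and the key=value file format are assumptions based on the question's description):

# myapp/management/commands/import_status.py
from pathlib import Path

from django.core.management.base import BaseCommand

from myapp.models import JobInfo  # hypothetical model with job/key/value fields


class Command(BaseCommand):
    help = "Import a key-value status/mgmt text file into the database."

    def add_arguments(self, parser):
        parser.add_argument("path", help="e.g. /jobs/42/status_info.txt")

    def handle(self, *args, **options):
        path = Path(options["path"])
        rows = []
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            key, _, value = line.partition("=")  # assumed 'key=value' lines
            rows.append(JobInfo(job=path.stem, key=key.strip(), value=value.strip()))
        JobInfo.objects.bulk_create(rows)
        self.stdout.write(f"Imported {len(rows)} rows from {path}")

The inotify watcher would then simply shell out to manage.py import_status <path> whenever a new file appears.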
I would like to move a database in a Django project from a backend to another (in this case azure sql to postgresql, but I want to think of it as a generic situation). I can't use a dump since the databases are different.
I was thinking of something at the Django level, like dumpdata, but depending on the amount of available memory and the size of the db, it sometimes proves unreliable and crashes.
I have seen solutions that try to break the process into smaller parts that the memory can handle but it was a few years ago, so I was hoping to find other solutions.
So far my searches have failed since they always lead to 'south', which refers to schema migration and not moving data.
I have not implemented this before, but what about the following:
Django supports multiple databases... so just configure DATABASES in your settings file to include both the old Azure SQL database and the new PostgreSQL database. Then create a small script that makes use of bulk_create, reading the data from one DB and writing it to the other.
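An untested sketch of such a script (the 'azure' and 'default' aliases are placeholders for whatever you configure in DATABASES); batching keeps memory bounded, which addresses the dumpdata concern above, though foreign-key ordering between models still needs care:

from django.apps import apps

BATCH = 1000  # small batches so memory use stays bounded


def copy_all_models(source="azure", target="default"):
    for model in apps.get_models():
        qs = model.objects.using(source).order_by("pk")
        total = qs.count()
        for start in range(0, total, BATCH):
            rows = list(qs[start:start + BATCH])
            model.objects.using(target).bulk_create(rows, batch_size=BATCH)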
This is more of an architectural question than a technological one per se.
I am currently building a business website/social network that needs to store large volumes of data and use that data to draw analytics (consumer behavior).
I am using Django and a PostgreSQL database.
Now my question is: I want to expand this architecture to include a data warehouse. The ideal would be: the operational DB would be the current Django PostgreSQL database, and the data warehouse would be something additional, preferably in a multidimensional model.
We are still in a very early phase, we are going to test with 50 users, so something primitive such as a one-column table for starters would be enough.
I would like to know if somebody has experience with this situation and could recommend a framework to create a data warehouse, all while maintaining the operational DB with the Django models for ease of use (if possible).
Thank you in advance!
Here are some cool Open Source tools I used recently:
Kettle - great ETL tool, you can use this to extract the data from your operational database into your warehouse. Supports any database with a JDBC driver and makes it very easy to build e.g. a star schema.
Saiku - nice Web 2.0 frontend built on Pentaho Mondrian (MDX implementation). This allows your users to easily build complex aggregation queries (think Pivot table in Excel), and the Mondrian layer provides caching etc. to make things go fast. Try the demo here.
My answer does not necessarily apply to data warehousing. In your case I see the possibility of implementing a NoSQL database solution alongside an OLTP relational store, which in this case is PostgreSQL.
Why consider NoSQL? In addition to the obvious scalability benefits, NoSQL offers a number of advantages that will probably apply to your scenario, for instance the flexibility of having records with different sets of fields, and key-based access.
Since you're still in the "trial" stage, you might find it easier to decide on a NoSQL database solution depending on your hosting provider. For instance, AWS has SimpleDB, Google App Engine provides its own Datastore, etc. However, there are plenty of other NoSQL solutions you can go for that have nice Python bindings.
Sorry for this question, I don't know if I've understood the concept, but SQLite is serverless; this means the database is on a local machine, stored in one file, and this file is only accessible in one mode at a time: if one client reads it, it's in read-only mode for the other clients, and if a client writes, then all clients are in write mode, so only one mode at once!
So imagine that I've made a Django application, a blog for example; how does this work using SQLite? If a client visits the blog, he gets the read mode to see the page and the blog entries, and if a registered client tries to add a comment, the file is switched to write mode, so how can SQLite handle this?
So, is SQLite here just like BaseHTTPServer (the development server shipped with Django), only for testing and learning purposes?
Different databases manage concurrency in different ways, but in sqlite, the method used is a global database-level lock. Only one thread or process can make changes to a sqlite database at a time; all other, concurrent processes will be forced to wait until the currently running process has finished.
As your number of users grows, sqlite's simple locking strategy will lead to increasing lock contention, and you will need to migrate your data to another database such as MySQL (which can do row-level locking, at least with the InnoDB engine) or PostgreSQL (which uses multiversion concurrency control). If you anticipate that you will get a substantial number of users (on the level of, say, more than 1 request per second for a good part of the day), you should migrate off of sqlite; and the sooner you do so, the easier it will be.
SQLite is not like BaseHTTPServer or anything basic like that. It's a fully featured embedded database. Quite fast too. Its SQL language might not have the most bells and whistles, but it's flexible enough. I haven't run into cases where I needed something it cannot do for the projects I was involved in (which aren't your typical web apps, truth be told).
Anyone that claims SQLite is good or bad for production without discussing the actual design is not telling you much. SQLite is pretty fast. In some cases, literally orders of magnitude faster than, say, Postgres, which comes up as a go-to alternative among Djangonauts. As someone pointed out, it also supports lots of concurrency. It's a matter of whether your app falls under the 'some cases' or not.
Now, there is one significant factor that has to be taken into account. SQLite is an in-process database. This is really important. If you are using something like gevent, you may run into edge cases where your app breaks. E.g., trying to do a transaction where you have a context switch in the middle of it can possibly break the transaction in horrible ways. In other words, 'concurrency' really depends on your app, because SQLite is part of your app.
What you can't do with SQLite, though, in terms of scaling, is make clusters of SQLite servers like you can with some of the other database engines, because it's in-process. Your app may or may not need to go to such lengths in terms of scaling, but my guess is that the vast majority of apps out there don't anyway (wild guess).
On the other hand, being in-process means adding custom functions and aggregates to it is pretty trivial. I'm not sure if Django's ORM makes that any more difficult than it has to be, but you can come up with pretty good designs taking advantage of those features.
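For example, adding a custom scalar function on every new SQLite connection takes only a few lines; here is a sketch using Django's connection_created signal and the underlying sqlite3 connection's create_function (the LOG10 function is just an illustration):

import math

from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def add_sqlite_functions(sender, connection, **kwargs):
    # connection.connection is the raw sqlite3 connection Django wraps.
    if connection.vendor == "sqlite":
        # Now raw SQL and annotations can call LOG10() like a built-in.
        connection.connection.create_function("LOG10", 1, math.log10)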
This issue in database theory is called concurrency and SQLite does support it in Windows versions > Win98 and elsewhere according to the FAQ:
http://www.sqlite.org/faq.html#q5
We are aware of no other embedded SQL database engine that supports as much concurrency as SQLite. SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
Basically, do not worry about concurrency; any database worth its salt takes care of it just fine. More information on how SQLite3 manages this can be found here. You, as a developer rather than a database designer, needn't care about it unless you are interested in the inner workings.
SQLite will only work effectively in production in some specific situations. It's quite easy to get MySQL or PostgreSQL up and running, even on Windows, and have a database that works in most situations.
The real problem is that SQLite3 isn't threaded in Django, so only one page view can happen at a time on your server; see this (now fixed) bug: https://code.djangoproject.com/ticket/12118
I don't use SQLite3 even in development.
EDIT: I keep getting downvoted here but the Django documentation itself recommended not using SQLite3 in Production at the time I wrote this answer. The documentation still contains the following caveat:
SQLite provides an excellent development alternative for applications that are predominantly read-only or require a smaller installation footprint.
If you do not have a small-footprint/read-only Django instance, do NOT use SQLite3. Feel free to continue to downvote this answer.
It is not impossible to use Django with SQLite as the database in production, primarily depending on your website/webapp traffic and how hard you hit your db (alongside what kind of operations you perform on it, i.e. reads/writes/etc.). In fact, approaching the end of 2019, I have used it in several low-volume applications with less than 5k daily interactions (these are more common than you might think).
Simply put, for the current state of tech, at the moment SQLite3 supports unlimited concurrent reads (or as many as your machine/workers can handle), BUT only a single process can write to it at any point in time. Bear in mind that a well-designed query/operation on the db will last only milliseconds!
Coming from experience using sqlite as the only db for a simple, non-routine (by non-routine I mean that a typical user would not be using this app on a daily basis year-round) production web app for overseas job matching that deals with ~5000 registered students (stats show consistently less than 2k requests per day that involve hitting the database during peak season - 40% write, 60% read), I've had no problems whatsoever with timeouts/performance issues.
It really boils down to being pragmatic about the development and the URS (client spec). If it becomes the next unicorn, one can always migrate the SQLite db to another RDBMS. For instance, see David d C e Freitas's take on migration in Quick easy way to migrate SQLite3 to MySQL?
Additionally, the SQLite website uses an SQLite db as its backend... see below:
The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015) it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time.
Bear in mind that the above quote is of course mainly referring to read operations, so the values may not be applicable to write-heavy sites.
The example I gave above, the job-matching application I built using sqlite as the db, is quite write-heavy if you've noticed the numbers... on average, 40% are short-lived write operations (i.e. form submissions, etc.), but bear in mind my volume hitting the db is only 2k per day during peak season.
Then again, if you realize that your sqlite.db is causing a lot of timeouts and bad user experience (408!!! on form submission... and then they have to key in the whole thing again), especially with Django throwing the OperationalError: database is locked error, you can always increase the timeout in your settings.py as per the Django docs as a temporary solution while you prepare for migrating the db:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'OPTIONS': {
            # ...
            'timeout': 20,  # seconds to wait on the lock before giving up
            # ...
        },
    },
}
Again, it all boils down to pragmatic development and facing the reality that the site may not attract as much activity as hoped, and is prone to over-engineering from the get-go.
There are many times that going for a simple solution enables faster time to market, essentially to quickly test the waters, and of course being prepared, if the piranhas do come in swarms, to upgrade to another RDBMS.
With Django's ORM, for most cases you don't need to touch your models.py during migration to another supported SQL db. Be VERY mindful though that SQLite does not support some more advanced functions, or even fields, that its bigger cousins MySQL and Postgres do.
Late to the party, but the question is still relevant as of mid-2018.
A "client" of a blog site is a different term than a "database client". SQLite documentation refers to a client as a process opening a database file. Such a process, say a Django app, may handle many web app clients ("users") simultaneously, and it is still going to be just one client from the standpoint of SQLite.
The important consideration for choosing SQLite over a proper RDBMS is whether your architecture comprises more than one software component connecting to a database. In such a case, using SQLite may be a major performance bottleneck, because each app needs to access the same DB file, possibly over a network.
If multiple apps (database clients) are not in play, SQLite is a great production choice in 99% of cases. The remaining 1% is apps using specific DB features, apps under enormous load, etc.
Know your architecture.
The answer to this question depends on the application that you want to deploy in production:
According to the usage guidance on the SQLite website, SQLite works great in production as the database engine for most websites having low to medium traffic (which is to say, most websites).
They argue that the amount of web traffic SQLite can handle depends on how heavily your website uses its database. It is known that any site that gets fewer than 100K hits/day should work fine with SQLite. However, this 100K hits/day figure is a conservative estimate, not a hard upper bound.
In summary, SQLite might be a great choice for applications with fewer users and lighter database use. Thus, use SQLite for websites with fewer or medium interactions with the database, and MySQL or PostgreSQL for websites with heavier database interaction.
Reference: sqlite.org