How can I approach data split across multiple databases? - django

I'm putting together a proposal for the development of a web application.
The app is to be launched in multiple countries, and some of the client's partners and (allegedly; I'm no lawyer) some of the countries involved have rules about where personal data can be stored. The upshot is that there is a hard requirement that particular data about certain countries' users is stored on servers in that country. (It sounds like they're OK with me caching data in any country, though -- so I intend to have a Redis in-memory store in the main data centre.) Some of the data (credit card details, for example) will additionally be encrypted, but this seems to make no difference to them in terms of where it can be stored.
With the current set of requirements, users from one country won't actually ever interact with users from another country, so one obvious option is to run different instances of the application in each country, entirely self-contained. This is simpler from an architectural point of view, but harder to manage, and would have overall higher server costs. It might get complicated if for example the client wants reports on all users across all countries, or eventually they want to merge the databases, and users' primary keys have to change. Not impossible, but it'd likely be a pain.
Probably better would be to have a central database with all information the client deems it acceptable to host in a single spot (North America somewhere), and then satellite databases in each country holding the information the client needs to be kept "at home".
So the main database would have the main users table, consisting of only a PK and a country code, and would have lots of other tables. Each local database would have a "user details" table, with a foreign key (to the main users table on the main database) and a bunch of other columns of personally identifiable information, as well as username, email address, password, etc.
The client may then push to have other data stored in the satellite locations, some of which may be one-to-many with a user or many-to-many with a user.
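To make that concrete, here is roughly the shape I have in mind in Django terms (just a sketch; the database aliases, models and field names are placeholders I made up):

# settings.py - one alias per satellite database
DATABASES = {
    "default": {  # main database, North America
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "main",
    },
    "fr": {  # satellite database hosted in France
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "satellite_fr",
        "HOST": "db.fr.example.com",
    },
}

# models.py
from django.db import models

class User(models.Model):
    """Main database: just a PK and a country code."""
    country = models.CharField(max_length=2)

class UserDetails(models.Model):
    """Satellite database for the user's country. Django does not enforce
    foreign keys across databases, so the link back to the main users
    table is a plain integer column."""
    user_id = models.BigIntegerField(unique=True)
    username = models.CharField(max_length=150, unique=True)
    email = models.EmailField()
    # ... password hash and the other personally identifiable columns

def details_for(user):
    """Fetch the PII row from the satellite database for the user's country."""
    return UserDetails.objects.using(user.country.lower()).get(user_id=user.id)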
My questions:
1. How can this be handled with Django? Can it, or should I look at other frameworks?
2. Can the built-in User model be adapted to look in all the satellite databases for the matching user record on login, and, once logged in, to retrieve the user's data from those databases without too much trouble?
3. Are there any guidelines you can give me to make sure code stays simple and things stay efficient?
4. Will this be significantly easier if the satellite database only has one-to-one data with the main User table? I imagine that having one-to-many or many-to-many data in those satellite databases would be a major pain (or at least inefficient), or am I wrong?

To answer your questions in order:
1. This looks like something you could do in Django (I like Django, so I may not be the most impartial voice here) - maybe the following points will convince you (or not).
2. A microservice approach? Multiple instances of the "user" microservice, each with its own database (I heard you about the costs, but maybe?).
3. You can do plenty with Django authentication backends, including writing your own - the built-in "remote user" backend is a good example to crib from; a rough sketch follows below. Also read up on stateless authentication (JWT).
4. Look at points 1 and 2.
5. Consider not using the built-in Django User model if it doesn't suit you.
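For point 3, here is a minimal sketch of what a cross-database authentication backend might look like. The SATELLITE_ALIASES list and the UserDetails model are assumptions carried over from the question, not a drop-in solution:

# auth_backends.py
from django.contrib.auth.backends import BaseBackend
from django.contrib.auth.hashers import check_password

from myapp.models import User, UserDetails  # hypothetical app and models

SATELLITE_ALIASES = ["fr", "de"]  # made-up database aliases, one per country

class SatelliteBackend(BaseBackend):
    """Try each satellite database until a matching user record is found."""

    def authenticate(self, request, username=None, password=None):
        for alias in SATELLITE_ALIASES:
            try:
                # assumes UserDetails has username and password (hash) columns
                details = UserDetails.objects.using(alias).get(username=username)
            except UserDetails.DoesNotExist:
                continue
            if check_password(password, details.password):
                # Hand back the lightweight row from the main database;
                # that is what Django's session machinery will hold on to.
                return User.objects.get(pk=details.user_id)
        return None

    def get_user(self, user_id):
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            return None

Register it in AUTHENTICATION_BACKENDS and Django will try it on login. The per-login fan-out across satellites is the obvious cost; caching a username-to-country mapping (in that Redis store, say) would cut it to one query.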

Related

Handling multiple users concurrently populating a PostgreSQL database

I'm currently trying to build a web app that would allow many users to query an external API (I cannot retrieve all the data served by this API at regular intervals to populate my PostgreSQL database, for various reasons). I've read several things about ACID and MVCC but still, I'm not sure there won't be any problem if several users are populating/reading my PostgreSQL database at the very same time. So here I'm asking for advice (I'm very new to this field)!
Let's say my users query the external API to retrieve articles. They make their search via a form, the back end gets it, queries the api, populates the database, then query the database to return some data to the front end.
Would it be okay to simply create a unique table to store the articles returned by the API when users are querying it ?
Shall I rather store the articles returned by the API and associate each of them to the user that requested it (the Article model will contain a foreign key mapping to a User model)?
Or shall I give each user a table (data isolation would be good but that sounds very inefficient)?
Thanks for your help!
Would it be okay to simply create a unique table to store the articles returned by the API when users are querying it ?
Yes. If the articles have unique keys (doi?) you could use INSERT...ON CONFLICT DO NOTHING to handle the (presumably very rare) case that an article is requested by two people nearly simultaneously.
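In Django terms (a sketch, assuming an Article model with a unique doi column), bulk_create with ignore_conflicts=True generates exactly that ON CONFLICT DO NOTHING on PostgreSQL:

from django.db import models

class Article(models.Model):
    doi = models.CharField(max_length=255, unique=True)
    title = models.TextField()

def store_articles(api_results):
    """Store whatever the external API returned, silently skipping
    articles that a concurrent request already inserted."""
    Article.objects.bulk_create(
        [Article(doi=r["doi"], title=r["title"]) for r in api_results],
        ignore_conflicts=True,
    )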
Shall I rather store the articles returned by the API and associate each of them to the user that requested it (the Article model will contain a foreign key mapping to a User model)?
Do you want to? Is there a reason to? Do you care who requested each article? It sounds like you anticipate storing only the first person to request each article, and not every request.
Or shall I give each user a table (data isolation would be good but that sounds very inefficient)?
Right, you would be hitting the API a lot more often (assuming some large fraction of articles are requested more than once) and storing a lot of duplicates. It might not even solve the problem, if one person hits "submit" twice in a row, or has multiple tabs open, or writes a bot to hit your service in parallel.

Cross-service references in DB

I am building a service-oriented system with multiple services and applications.
Currently I am not sure how to handle DB references between resources from multiple services and databases.
For example, I have a users service, where I can define all users and their roles.
Next I have a products service, where I can define my products, their prices and other information.
I also have an invoicing service, which is used to create invoices. This service will use information from the previous two services, linking products and users to an invoice. Now I am not sure what the best approach for this is.
Do I just save the product ID and user ID that I got from the other two services, without any referential integrity?
If I do this, then I will have a problem when generating reports, because at generation time I will need to send a lot of requests to the products service to get the names and prices of the products on an invoice. The same goes for users.
Do I create some products table in my invoicing application and store the name and price of each product at the moment of invoice creation?
If I go with this approach, then when the price or name of a product changes, I will have inconsistent data across my applications.
Is there some well-known pattern for this kind of problem? That is, what is the best solution?
Cross-service references in the database are a common challenge for data integrity between multiple web services, especially when real-time access is required.
There are two approaches for your case:
1- Database replication across your servers
I suppose that you have each application hosted on a separate server, so I will call your servers Users_server, Products_server and Invoices_server.
In your example, your invoice web service needs to fetch data from the Users and Products servers; in this case you can create a replica of your Users database and Products database on your Invoices_server.
This way you can run your join queries on the same server and get data from multiple databases.
Query example:
SELECT *
FROM UsersDB.User u
JOIN InvoicesDB.Invoice i ON u.Id = i.ClientId
2- Main database replication
As a first step, you replicate all your databases onto one main server, call it Base_server, which basically contains all the databases from all your services.
Then you can build an internal web service for your application that provides the needed data in just one call; this answers your question about generating reports.
In other words, you make one call to the main Base service instead of making two or three calls to your separate services.
Note: as backend developers we use this organization as a best practice when building a large bundle-based application; we create a base bundle and then create service bundles which rely on it.
If your services are already live, we may need more details about the technology and the database types you are using in order to give you a more accurate solution.
Just because you are using SOA doesn't mean you abandon database integrity. Continue to use referential integrity where your database design requires it.
At the service level, you can have each service be responsible for returning identity information for the entities which it owns. This identity information may or may not be the actual primary key from the database, but it will be used by the clients of the service as though it were the actual primary key.
When a client wants to create an invoice, it will call the User service and receive a User entity, which will contain a User Identifier. It will call the Product service and receive a set of products, each with a product identifier. It will then call the Invoice service to create an invoice, passing the user identifier and the product identifiers. This will likely return an invoice identifier.
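A rough sketch of that flow from the client's side (the three service client objects and their methods are hypothetical):

def create_invoice(user_service, product_service, invoice_service,
                   username, product_codes):
    """Assemble an invoice purely from identifiers owned by other services."""
    user = user_service.get_user(username)  # entity carrying a user identifier
    products = [product_service.get_product(c) for c in product_codes]
    # Pass only the identifiers; the invoicing service owns the invoice data.
    return invoice_service.create_invoice(
        user_id=user["id"],
        product_ids=[p["id"] for p in products],
    )  # likely returns an invoice identifier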
You can (and probably should) enforce integrity by making the productId and userId foreign keys in your invoice table. Then your DB makes sure the referenced entities exist. Reports should join tables, not query services for each item. I am assuming a central DB shared across the system.

How to restrict certain rows in a Django model to a department?

This looks like it should be easy but I just can't find it.
I'm creating an application where I want to give admin site access to people from different departments. Those people will read and write the same tables, BUT they must only access rows belonging to their department! I.e. they must not see any records produced by the other departments and should be able to modify only the records from their own department. If they create a record, it should automatically "belong" to the department of the user which created it (they will create records only from the admin site).
I've found django-guardian, but it looks like an overkill - I don't really want to have arbitrary per-record permissions.
Also, the number of records will potentially be large, so any kind of front-end permission checking on a per-record basis is not suitable - it must be done by DB-side filtering. Other than that, I'm not really particular how it will be done. E.g. I'm perfectly fine with mapping departments to auth groups.
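To show the kind of DB-side filtering I mean, something like this admin sketch is what I picture (the Record model and the user-to-department mapping via a profile are made up):

from django.contrib import admin
from myapp.models import Record  # hypothetical model with a department FK

class DepartmentScopedAdmin(admin.ModelAdmin):
    def get_queryset(self, request):
        # The filter runs in the database, not per record in Python.
        qs = super().get_queryset(request)
        if request.user.is_superuser:
            return qs
        return qs.filter(department=request.user.profile.department)

    def save_model(self, request, obj, form, change):
        # New records automatically belong to the creator's department.
        if not change:
            obj.department = request.user.profile.department
        super().save_model(request, obj, form, change)

admin.site.register(Record, DepartmentScopedAdmin)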

Making sharding simple with Django

I have a Django project based on multiple PostgreSQL servers.
I want users to be sharded across those database servers using the same sharding logic used by Instagram:
User ID => logical shard ID => physical shard ID => database server => schema => user table
The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
Finally, the user table in the appropriate schema is queried.
How can this be achieved as simply as possible in Django?
Ideally, I would love to be able to write Django code such as:
Fetching an instance
# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk=user_id)
Fetching related objects
# this gets the user's posted articles, located in the same logical shard:
articles = user.articles.all()
Creating an instance
# this selects a random logical shard and creates the user there:
user = User.objects.create(name="Arthur", title="King")
# or:
user = User(name="Arthur", title="King")
user.save()
Searching users by name
# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
# relevant database servers.
users = User.objects.filter(title="King")
To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.
Django provides support for automatic database routing, which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk=user_id) because the router does not have access to the query parameters, so it does not know what the user ID is; it just knows that the code is trying to read the User model.
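The only workaround I see so far (a sketch, assuming Instagram's 41-bit timestamp / 13-bit shard / 10-bit sequence layout and a shardNNNN alias naming scheme, both of which are my assumptions): compute the shard from the ID before the query and bypass the router with .using():

SHARD_ID_BITS = 13
SEQUENCE_BITS = 10

def shard_alias(user_id):
    """Extract the logical shard ID embedded in the user ID and map it to a
    Django database alias; the logical-to-physical mapping would hang off
    this alias in settings."""
    logical_shard = (user_id >> SEQUENCE_BITS) & ((1 << SHARD_ID_BITS) - 1)
    return "shard%04d" % logical_shard

user = User.objects.using(shard_alias(user_id)).get(pk=user_id)

But that litters every call site with routing logic, which is exactly what I was hoping Django could hide.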
I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.
In short: how can I shard data in Django, as simply as possible?
Thanks a lot for your kind help.
Note
There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).
Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.
Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Whereas the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.
In addition, they released some code as part of a sample application demonstrating how they shard.
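The gist of the table-name rewrite, as I understand it (my sketch; crc32 is a stand-in for whatever consistent hashing scheme they actually use):

import zlib

NUM_SHARDS = 64  # made-up shard count

def sharded_table(base_name, key):
    """Map a row key to one of the per-shard tables in the single schema."""
    shard_id = zlib.crc32(str(key).encode()) % NUM_SHARDS
    return "%s_%d" % (base_name, shard_id)

# sharded_table("comments_post", post_id) -> "comments_post_<shard_id>"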
You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.
If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.
You might find SkyTools useful - read up on PL/Proxy. It's how Skype shards their databases.
It is better to use professional sharding middleware, for example Apache ShardingSphere.
The project contains two products: ShardingSphere-JDBC, a Java driver, and ShardingSphere-Proxy, which works with all programming languages. It can support Python and Django as well.

SOA/Web Service Pagination

In SOA we should not be building or holding state (or designing dependencies) between client and server. This is understood. But what patterns can be followed in the case that a client wants to consume a real-time service that may return an open-ended number of 'rows'?
Web applications, similar to SOA but allowing for state (sessions) have solved this with pagination. Pagination requires (in most cases, especially with SQL) that the server holds the data and that the client request the data in chunks.
If we were to consider pagination-like scenarios for web services, what patterns would these follow that would still allow the tenets of SOA to be adhered to (or as close as possible)?
Some rules for the thinkers:
1) Backed by a SQL database (therefore there is no concept of a row number in a select set)
2) It is important to not skip a row or duplicate a row in a set during pagination
3) Data may be inserted and deleted at any time into the database by other clients
4) There is no need to consider the dataset a live (update-able) dataset
Personally, I think that rules 1 and 2 above already spell out the solution by constraining the solution space with the requirements.
My proposed solution would have the data (as much as is selected) stored in a read-only store/cache where it can be assigned a row number within the result set, allowing pagination to occur on this data snapshot. I would have infrastructure to store snapshots (servers, external caches, memcached or ehcache - this must scale quite large). The result of such a query would be a snapshot ID, and clients could retrieve the data from the snapshot using a snapshot API (web services) and the snapshot ID. Results would be processed in a read-only, forward-only manner, x records at a time, where x is something reasonable.
Competing thoughts and ideas, criticisms or accolades would be greatly appreciated.
Paginated results in a web service are actually quite easy to achieve.
All you have to do is add two parameters to the web service call: Page Size, Page Number.
Page Size is the number of results to include in a page. Page Number is the number of the page of results you are looking for.
Your web service then goes back to the database (or cache), retrieves the results, figures out which results fit on the requested page, and returns only those results.
The client then has to make a single request per page of results they want from the service.
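In code, the service method boils down to a LIMIT/OFFSET query (a sketch using a DB-API connection; the articles table is made up):

def get_page(conn, page_size, page_number):
    """Return one page of results; page_number is 1-based."""
    offset = (page_number - 1) * page_size
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title FROM articles ORDER BY id LIMIT %s OFFSET %s",
            (page_size, offset),
        )
        return cur.fetchall()

Note that plain LIMIT/OFFSET can skip or duplicate rows when other clients insert or delete between calls, which is exactly what rules 2 and 3 in the question rule out; that is where the caching-table idea below comes in.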
What you propose with memcached will also work with a caching table. The first service call would (1) INSERT the results INTO the caching table with a snapshot ID and (2) return the first page from the caching table along with the snapshot ID. Subsequent calls would return pages based on page size and page number by querying the caching table using the snapshot ID.
I should think this could also be optimized by using an in-memory caching table, but that depends on whether your database supports INSERT INTO ... SELECT from a disk table into an in-memory table. That might get complicated in a clustered environment, though.
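A sketch of that caching-table variant (the result_cache table, its columns, and the to_jsonb payload trick are made up for illustration; this is PostgreSQL-flavoured):

import uuid

def snapshot_results(cur, source_query):
    """(1) INSERT the full result set into the caching table under a fresh
    snapshot ID with explicit row numbers; (2) return the snapshot ID."""
    snapshot_id = str(uuid.uuid4())
    cur.execute(
        "INSERT INTO result_cache (snapshot_id, row_num, payload) "
        "SELECT %s, row_number() OVER (), to_jsonb(t) "
        "FROM (" + source_query + ") AS t",  # source_query must be trusted
        (snapshot_id,),
    )
    return snapshot_id

def snapshot_page(cur, snapshot_id, page_size, page_number):
    """Read one page from the snapshot; rows never shift under the client."""
    first = (page_number - 1) * page_size + 1
    cur.execute(
        "SELECT payload FROM result_cache "
        "WHERE snapshot_id = %s AND row_num BETWEEN %s AND %s "
        "ORDER BY row_num",
        (snapshot_id, first, first + page_size - 1),
    )
    return [row[0] for row in cur.fetchall()]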
Such a cache is stateful by its very nature if you are retaining a client-specific copy between requests, whether storage is in a session object, database table or memcached data store. Given the requirements, though, you have no choice but to cache results in some form or another, or else you risk returning deleted or no-longer-relevant records as legitimate results.
SOA is not meant for such low-level functionality.
SOA is meant to glue together business areas, not frontends to backends. Your application talking to the back end using web services does not make it a "SOA" application; that is nonsense, since SOA is meaningless in the context of one isolated system.
From that point of view it is clear that, in SOA, the caller should not know about the SQL table you are paginating; that's an implementation detail that SOA should hide. On the other hand, the server should not know about the client's state, because it should be agnostic to the details of its clients in order to be truly open.
So, just understand that pagination is not SOA. Do as you wish, just understand that the web service you are using to paginate is an internal artifact of your application, not to be used by external clients on a SOA bus. Also remember that it cannot be transactionally consistent without state on the server. Probably the problem is that you have only one service layer for both the application's UI and the SOA bus; you need to separate them.
Using this web service on a SOA bus would be bad. It cannot stay consistent as the user paginates, and as other applications couple to it they become tied to the specific SQL.
... then you might as well have granted direct SQL access to the table for all that matters.
SOA is for business messages between systems, not to glue an application's frontend to the backend.
Same problem, resolved using the Navision approach.
$ws->getList($first_record_id, $limit)
This returns a page of $limit elements that starts from the passed ID:
SELECT * FROM collection WHERE collection.id > $first_record_id ORDER BY id ASC LIMIT $limit
Navision uses a Key (each element has a key), but in MySQL an auto-increment id works better.
In this case pagination is intended for handling large result sets, not for frontend pagination...
I am not sure if SOA is of concern here. The problem you have seems to be with paginating your APIs. I will point you to how Twitter handles their pagination: dev.twitter.com/rest/public/timelines