Cross-service references in DB - web-services

I am building service oriented system, with multiple services and application.
Current I am not sure how to handle DB references between resources from multiple services and databases.
For example, I have a users service, where I can define all users and their roles.
Next I have, products service, where I can define my products, their prices and other information.
I also have invoicing service, which is used to create invoices. This service will use information from previous two services. It will link products and users to invoice. Now I am not sure what is the best approach for this?
Do I just save product ID and user ID that it got from other two services, without any referential integrity?
If I do this, then I will have problem when generating reports, because at time of generation I will need to send a lot of requests to products service, to get names and prices of product in invoice. Same for users.
Do I create some table products in my invoicing application, and store name and price of product at the moment of invoice creation?
If I go with this approach, then in case that price or name of product changes, I will have inconsistent data across my applications?
Is there some well-known pattern for this kind of problem, that is what is the best solution.

Cross-service references in DB is a common challenge for Data integrity between multiple web services, And specially when we are talking about Real time access.
There is two approaches for your case :
1- Databases Replication across your servers
I suppose that you have each application hosted on a separate server, So i can name your servers as Users_server, Products_server and Invoices_Server.
In your example, your Invoice web service need to grab data from Users & Products Servers, in this case you can create a Replication of your Users Database and Products Database on your Invoices_server.
This way you can run your Join queries on the same server and get data from multiple databases.
Query example :
SELECT *
FROM UsersDB.User u
JOIN InvoicesDB.Invoice i ON u.Id = i.ClientId
2- Main Database Replication
1st step you have to replicate all your databases into one main server we can call it Base_server, which basically contain all your databases from all your services.
Then you can build an internal web service for your application to provide needed data in just "One Call", this answer your question about generating reports.
In other words, you will make one call to the mane Base service instead of making 2 or 3 calls to your separate services.
Note: As a Backend developer we use this organization as a best practice while building a large bundle based application, we create a base bundle and then create service_bundle which rely on the base bundle.
If your services are already live, we may need more details about the technology and databases type you using in order to give you a more accurate solution.

Just because you are using SOA doesn't mean you abandon database integrity. Continue to use referential integrity where your database design requires it.
At the service level, you can have each service be responsible for returning identity information for the entities which it owns. This identity information may or may not be the actual primary key from the database, but it will be used by the clients of the service as though it were the actual primary key.
When a client wants to create an invoice, it will call the User service and receive a User entity, which will contain a User Identifier. It will call the Product service and receive a set of products, each with a product identifier. It will then call the Invoice service to create an invoice, passing the user identifier and the product identifiers. This will likely return an invoice identifier.

You can (probably should) enforce the integrity making the productId and userId foreign keys in your invoice table. Then your DB makes sure the referenced entities exist. Reports should join tables, not query services for each item. I assume a central DB shared across the system.

Related

Using RLS with Analysis Service Live Connection in a PBIE "App Owns Data" scenario

I'm kind of new to PBI and I'm looking if it's the right tool for my case.
I would like to use Power BI Embedded in a web application for our customer (where they're logged in) which do not have any Power BI account/licence.
The database on which the reports are based are on-premise so we're would use Analysis Service Live Connection to access them.
Each customer should have his own report.
Is it possible to use RLS in that case?
Does that mean we've to create a role for each of them?
What username should be given in the EffectiveIdentity? Is it 'free text' that is used by PBI to get the username in the DAX?
If each customer will have his own report, then why do you need RLS at all? Just make the report to show what the user is supposed to see. Or you want to have a single report (or set of reports), which is shared between the users and they should see only their data? I will assume it is the later one.
I will start with the last question - the effective identity is not a "free text". It must be a valid user name, having rights to access the data, as specified in the documentation:
The effective identity that is provided for the username property must be a Windows user with permissions on the Analysis Services server.
The you can define RLS in your Analysis Service model, by adding a "users security" table, where you specify which rows should be visible to each user. Define relationships between this users security table and other tables in the model, and then let RLS to filter the data in the security table. The relationships with the rest of the model will apply cascade filtering on the data, so only relevant rows will be visible to the user. See Implement row-level security in an Analysis Services tabular model for example.
So the answer of your second question is no, you don't need a separate role for each user, because the filtering is based on the username and for every user it filters the same thing the same way.

Is there a one-to-many relationship between service & sku

We have exported our GCP billing information to bigquery. When querying over it I've noticed that there seems to be a one-to-many relationship between service and sku. In other words, every instance of a sku seems to have the same service.
If this is indeed the case I can build some nice drill-down type reports whereby the report user drill from service to sku. before I build such a report though I want to confirm that this is indeed the case.
Can multiple instances of a SKU map to different services?
Ah, seems the answer is "no, there isn't a one-to-many relationship between service and sku"
That's a shame.

How can I approach data split across multiple databases?

I'm putting together a proposal for the development of a web application.
The app is to be launched in multiple countries, and some of the client's partners and (allegedly; I'm no lawyer) some of the countries involved have rules about where personal data can be stored. The upshot is that there is a hard requirement that particular data about certain countries' users is stored on servers in that country. (It sounds like they're OK with me caching data in any country, though -- so I intend to have a Redis in-memory store in the main data centre.) Some of the data (credit card details, for example) will additionally be encrypted, but this seems to make no difference to them in terms of where it can be stored.
With the current set of requirements, users from one country won't actually ever interact with users from another country, so one obvious option is to run different instances of the application in each country, entirely self-contained. This is simpler from an architectural point of view, but harder to manage, and would have overall higher server costs. It might get complicated if for example the client wants reports on all users across all countries, or eventually they want to merge the databases, and users' primary keys have to change. Not impossible, but it'd likely be a pain.
Probably better would be to have a central database with all information the client deems it acceptable to host in a single spot (North America somewhere), and then satellite databases in each country holding the information the client needs to be kept "at home".
So the main database would have the main users table, consisting of only a PK and a country code, and would have lots of other tables. Each local database would have a "user details" table, with a foreign key (to the main users table on the main database) and a bunch of other columns of personally identifiable information, as well as username, email address, password, etc.
The client may then push to have other data stored in the satellite locations, some of which may be one-to-many with a user or many-to-many with a user.
My questions:
How can this be handled with Django? Can it, or should I look at other frameworks?
Can the built-in User model be edited to look in all the satellite databases for the matching User model on log in, and when logged in to retrieve the user data from those databases without too much trouble?
Are there any guidelines you can give me to make sure code stays simple and things stay efficient?
Will this be significantly easier if the satellite database only has one-to-one data with the main User table? I imagine that having one-to-many or many-to-many data in those satellite databases would be a major pain (or at least inefficient), or am I wrong?
To answer your questions accordingly:
Looks like something that you could do in Django (I like Django so I may not be the best to opinion here) - maybe the following will convince you (or not).
A microservice approach? Multiple instances of the "user" resource multiservice each with it's own database (I heard you about the costs but maybe?).
You can do plenty with Django Authentication Backends (including wirting your own) - there is a "remote" auth backend you could use as an example. Read about stateless authentication (JWT).
Look at points 1 and 2.
Consider not using the built-in Django user model is it doesn't suit you.

Making sharding simple with Django

I have a Django project based on multiple PostgreSQL servers.
I want users to be sharded across those database servers using the same sharding logic used by Instagram:
User ID => logical shard ID => physical shard ID => database server => schema => user table
The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
Finally, the user table in the appropriate schema is queried.
How can this be achieved as simply as possible in Django ?
Ideally, I would love to be able to write Django code such as:
Fetching an instance
# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)
Fetching related objects
# this gets the user's posted articles, located in the same logical shard:
articles = user.articles
Creating an instance
# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()
Searching users by name
# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
# relevant database servers.
users = User.objects.filter(title = "King")
To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.
Django provides support for automatic database routing which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk = user_id) because the router does not have access to the query parameters, so it does not know what the user ID is, it just knows that the code is trying to read the User model.
I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.
In short: how can I shard data in Django, as simply as possible?
Thanks a lot for your kind help.
Note
There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).
Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.
Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Where as the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.
In addition, they released some code as part of a sample application demonstrating how they shard.
You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.
If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.
You might find SkyTools useful - read up on PL/Proxy. It's how Skype shard their databases.
it is better to use professional sharding middleware, for example: Apache ShardingSphere.
The project contains 2 productions, ShardingSphere-JDBC for java driver, and ShardingSphere-Proxy for all programing languages. It can support python and Django as well.

SOA/Web Service Pagination

In SOA we should not be building or holding state (or designing dependencies) between client and server. This is understood. But what patterns can be followed in the case that a client wants to consume a real-time service that may return an open ended number of 'rows'?
Web applications, similar to SOA but allowing for state (sessions) have solved this with pagination. Pagination requires (in most cases, especially with SQL) that the server holds the data and that the client request the data in chunks.
If we where to consider pagination-like scenarios for web services, what patterns would these follow that would still allow the tenets of SOA to be adhered (or as close as possible).
Some rules for the thinkers:
1) Backed by a SQL database (therefore there is no concept of a row number in a select set)
2) It is important to not skip a row or duplicate a row in a set during pagination
3) Data may be inserted and deleted at any time into the database by other clients
4) There is no need to consider the dataset a live (update-able) dataset
Personally, I think that 1 and 2 above already spell our the solution by constraining the solution space with the requirements.
My proposed solution would have the data (as much as is selected) be stored in a read-only store/cache where it can be assigned a row number within the result set and allow pagination to occur on this data snapshot. I have would have infrastructure to store snapshots (servers, external caches, memcached or ehcache - this must scale quite large). The result of such a query would be a snapshot ID and clients could retrieve the data from the snapshot using a snapshot API (web services) and the snapshot ID. Results would be processed in a read-only, forward only manner for x records at a time where x was something reasonable.
Competing thoughts and ideas, criticisms or accolades would be greatly appreciated.
Paginated results in a Web Service is actually quite easy to achieve.
All you have to do is add two parameters to the web service call: Page Size, Page Number.
Page Size is the number of results to include in a page. Page Number is the number of the page of results you are looking for.
Your web service then goes back to the database (or cache), retreives the results, figures out which results fit on the requested page, and return only those results.
The client then has to make a single request per page of results they want from the service.
What you propose with memcached will also work with a caching table. The first service call would (1) INSERT results INTO the caching table with a snapshot ID (2) return the first page from the caching table and the snapshot ID. Subsequent calls would return pages based on page size and page number by querying the caching table using the snapshot ID.
I should think this could also be optimized by using an in-memory caching table, but that depends on whether your database supports INSERT-INTO from a disk table to an in-memory table. That might get complicated in a clustered environment though.
Such a cache is stateful by its very nature if you are retaining a client-specific copy between requests, whether storage is in a session object, database table or memcached data store. Given the requirements though, you have no choice but to cache results in some form or another, except you risk the chance of returning deleted or no-longer-relevant records as legitimate results.
SOA is not meant for such low level functionality.
SOA is meant to glue together business areas, not frontends to backends. Not because your application talks to the back end using webservices you have a "SOA" application. This is non sense since SOA is meaningless in the context of 1 isolated system.
From that point of view, it is then clear that, in SOA, the caller should not have known about the SQL table you are paginating, that’s an implementation detail that SOA should hide. In the other hand the server should not know about the client's state, because it should be agnostic to the details of the clients, to be really open.
So, just understand that pagination is not SOA. Do as you wish, just understand that the webservice you are using to paginate is an internal artifact of your application, not to be used for external clients in a SOA bus. Also remember that it can not be transaction consistent with out state in the server. Probably the problem is that you have only one service layer for the application's UI and the SOA bus, you need to separate them.
Using this webservice in a SOA bus would be bad. I can not be consistent as the user paginates and as other applications hang to it they become tied to the specific SQL.
... then you might as well have granted direct SQL access to the table for all that matters.
SOA is for business messages between systems, not to glue an application's frontend to the backend.
Same problem, resolved using the Navision approach.
$ws->getList($first_record_id, $limit)
This return a page of $limit element that start from the the passed id
select * from collection where collection.id > $first_record_id ASC limit $limit
ordered by id ASC
Navision use Key (each element has a key) but in MySQL an autoincrement id is better.
In this case pagination is intended for handle large result sets and not for a frontend pagination...
I am not sure if SOA is of concern here. The problem you have seems to be with paginating your API's. I will point you to how twitter handles their pagination dev.twitter.com/rest/public/timelines