Can I set a dynamic database as a source for a dataset? - apache-superset

I have multiple databases which have the same exact schema (different data on each one, database per customer of a SaaS platform).
I would like to create a dashboard (with charts and datasets) whose data is determined by the permissions of the logged-in user.
This means the dashboard will query the data from a specified source database, instead of a pre-defined one.
The premise is basically to de-couple a chart / dataset from a database and allow it to be parametrised.

This is a case that is not really supported by Superset, but there's one workaround that I can think of that might work: you can define a DB_CONNECTION_MUTATOR in your superset_config.py that routes to a different database depending on the user.
In your superset_config.py, add this function:
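def DB_CONNECTION_MUTATOR(uri, params, username, security_manager, source):
    # Look up the Superset user making the request.
    user = security_manager.find_user(username=username)
    # Only reroute connections to the placeholder database, and only for matching users.
    if uri.database == "db_name" and user and user.email.endswith("@examplea.com"):
        # On SQLAlchemy 1.4+ URL objects are immutable; use uri = uri.set(host=...) there.
        uri.host = "host-for-examplea.com"
    return uri, params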
In the function above we're changing the host of the SQLAlchemy URL to host-for-examplea.com if the user has an email ending in @examplea.com.
To make it work, create a default database (which we called db_name in this example), and create all the charts/dashboards based on it. Then, users should be redirected to specific databases by the DB_CONNECTION_MUTATOR.
One serious problem that might happen is with caching, though. You should make sure that all caches are disabled to prevent users from seeing data from other databases.
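With more than one customer, the same idea can be driven by a lookup table instead of a chain of conditions. The following is only a sketch under assumptions: the TENANT_HOSTS mapping, its host names, and the host_for_email helper are hypothetical, and it assumes the URL object exposes SQLAlchemy 1.4's URL.set(), which returns a modified copy rather than mutating in place.

```python
# Hypothetical mapping from e-mail domain to database host (placeholder names).
TENANT_HOSTS = {
    "examplea.com": "host-for-examplea.com",
    "exampleb.com": "host-for-exampleb.com",
}
# The placeholder database that all charts and dashboards are built on.
DEFAULT_DATABASE = "db_name"


def host_for_email(email):
    """Return the tenant host for an e-mail address, or None if it is unmapped."""
    domain = email.rsplit("@", 1)[-1].lower()
    return TENANT_HOSTS.get(domain)


def DB_CONNECTION_MUTATOR(uri, params, username, security_manager, source):
    user = security_manager.find_user(username=username)
    if uri.database == DEFAULT_DATABASE and user and user.email:
        host = host_for_email(user.email)
        if host:
            # Assumes SQLAlchemy 1.4+: URL.set() returns a new URL object.
            uri = uri.set(host=host)
    # Unmapped users fall through to the default database unchanged.
    return uri, params
```

Users whose domain is not in the table keep the default connection here; whether that is acceptable, or whether such users should be refused instead, depends on what the default database contains.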

Related

How can I approach data split across multiple databases?

I'm putting together a proposal for the development of a web application.
The app is to be launched in multiple countries, and some of the client's partners and (allegedly; I'm no lawyer) some of the countries involved have rules about where personal data can be stored. The upshot is that there is a hard requirement that particular data about certain countries' users is stored on servers in that country. (It sounds like they're OK with me caching data in any country, though -- so I intend to have a Redis in-memory store in the main data centre.) Some of the data (credit card details, for example) will additionally be encrypted, but this seems to make no difference to them in terms of where it can be stored.
With the current set of requirements, users from one country won't actually ever interact with users from another country, so one obvious option is to run different instances of the application in each country, entirely self-contained. This is simpler from an architectural point of view, but harder to manage, and would have overall higher server costs. It might get complicated if for example the client wants reports on all users across all countries, or eventually they want to merge the databases, and users' primary keys have to change. Not impossible, but it'd likely be a pain.
Probably better would be to have a central database with all information the client deems it acceptable to host in a single spot (North America somewhere), and then satellite databases in each country holding the information the client needs to be kept "at home".
So the main database would have the main users table, consisting of only a PK and a country code, and would have lots of other tables. Each local database would have a "user details" table, with a foreign key (to the main users table on the main database) and a bunch of other columns of personally identifiable information, as well as username, email address, password, etc.
The client may then push to have other data stored in the satellite locations, some of which may be one-to-many with a user or many-to-many with a user.
My questions:
How can this be handled with Django? Can it, or should I look at other frameworks?
Can the built-in User model be edited to look in all the satellite databases for the matching User model on log in, and when logged in to retrieve the user data from those databases without too much trouble?
Are there any guidelines you can give me to make sure code stays simple and things stay efficient?
Will this be significantly easier if the satellite database only has one-to-one data with the main User table? I imagine that having one-to-many or many-to-many data in those satellite databases would be a major pain (or at least inefficient), or am I wrong?
To answer your questions accordingly:
Looks like something that you could do in Django (I like Django, so I may not be the best person to give an opinion here) - maybe the following will convince you (or not).
A microservice approach? Multiple instances of the "user" resource microservice, each with its own database (I heard you about the costs, but maybe?).
You can do plenty with Django authentication backends (including writing your own) - there is a "remote" auth backend you could use as an example. Read about stateless authentication (JWT).
Look at points 1 and 2.
Consider not using the built-in Django user model if it doesn't suit you.
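For the satellite-database layout described in the question, one possible starting point is a Django database router, which is a plain class with a few well-known methods. The sketch below rests on assumptions: the "userdetails" app label, the country_code attribute on instances, and the "satellite_<cc>" aliases (which would have to exist in settings.DATABASES) are all hypothetical names.

```python
class SatelliteRouter:
    """Send models from a hypothetical 'userdetails' app to a per-country database."""

    satellite_app = "userdetails"  # assumed app label holding the personal data

    def _alias(self, model, hints):
        if model._meta.app_label != self.satellite_app:
            return None  # no opinion: Django tries the next router, then 'default'
        # Django passes the object being saved or traversed as the 'instance' hint.
        instance = hints.get("instance")
        country = getattr(instance, "country_code", None)
        return f"satellite_{country.lower()}" if country else None

    def db_for_read(self, model, **hints):
        return self._alias(model, hints)

    def db_for_write(self, model, **hints):
        return self._alias(model, hints)

    def allow_relation(self, obj1, obj2, **hints):
        return None  # leave the decision to Django's default behaviour
```

A plain read such as a filter() call carries no instance hint, so the router returns None there and the caller would need to pick the database explicitly with .using(alias). Django also does not support foreign keys or many-to-many relations that span databases, so the link back to the central users table would have to be stored as a plain ID column, which is one reason one-to-one data in the satellites is the easier case.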

How to restrict certain rows in a Django model to a department?

This looks like it should be easy but I just can't find it.
I'm creating an application where I want to give admin site access to people from different departments. Those people will read and write the same tables, BUT they must only access rows belonging to their department! I.e. they must not see any records produced by the other departments and should be able to modify only the records from their own department. If they create a record, it should automatically "belong" to the department of the user which created it (they will create records only from the admin site).
I've found django-guardian, but it looks like overkill - I don't really want to have arbitrary per-record permissions.
Also, the number of records will potentially be large, so any kind of front-end permission checking on a per-record basis is not suitable - it must be done by DB-side filtering. Other than that, I'm not really particular how it will be done. E.g. I'm perfectly fine with mapping departments to auth groups.
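One way this kind of requirement is commonly approached is by overriding two ModelAdmin hooks, get_queryset and save_model, so that the filtering ends up in the SQL WHERE clause. The sketch below is written as a mixin and relies on assumptions: a department field on the model and a department attribute on the user (for example from a custom user model) are hypothetical names.

```python
class DepartmentScopedAdminMixin:
    """Mix into a ModelAdmin, listed before admin.ModelAdmin in the bases.

    Hypothetical usage:
        class RecordAdmin(DepartmentScopedAdminMixin, admin.ModelAdmin):
            exclude = ("department",)
    """

    def get_queryset(self, request):
        qs = super().get_queryset(request)
        if request.user.is_superuser:
            return qs  # superusers keep seeing every department
        # Becomes a WHERE clause, so filtering happens in the database.
        return qs.filter(department=request.user.department)

    def save_model(self, request, obj, form, change):
        if not change:
            # New records automatically belong to the creator's department.
            obj.department = request.user.department
        super().save_model(request, obj, form, change)
```

Because the admin resolves change and delete views through get_queryset as well, rows from other departments should not be reachable by guessing their URLs. If departments are mapped to auth groups instead, the filter value would come from the user's groups rather than a department attribute.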

Template version Lotus Notes

I need to create a view in a database (a kind of admin database) that shows me the template version of all the other databases.
Can anyone help me with this, please?
You don't need to create a database for this purpose; there is already one. It is called "catalog.nsf" and contains the information you want. You just need to create a view and modify the selection formula slightly:
Original:
SELECT @IsAvailable(ReplicaID) & @IsUnavailable(RepositoryType) & !(DBListInCatalog = "0")
New:
SELECT @IsAvailable(ReplicaID) & @IsUnavailable(RepositoryType)
That way you see all databases, even the ones that normally are not visible in the catalog.
The information you are looking for is in the "DbInheritTemplateName" field.
If you want to code this yourself, you can either run through all documents in catalog.nsf and read it from there, or use a NotesDBDirectory, run through it and read the "DesignTemplateName" property of the NotesDatabase class.
Example code for catalog:
Dim dbCatalog As NotesDatabase
Dim dc As NotesDocumentCollection
Dim doc As NotesDocument
Dim strTemplate As String
Set dbCatalog = New NotesDatabase( "YourServerName" , "catalog.nsf" )
Set dc = dbCatalog.Search( "@IsAvailable(ReplicaID) & @IsUnavailable(RepositoryType)", Nothing, 0 )
Set doc = dc.GetFirstDocument()
While Not doc Is Nothing
    strTemplate = doc.GetItemValue( "DBInheritTemplateName" )(0)
    '- do whatever you want: create a document in your database, create a list...
    Set doc = dc.GetNextDocument(doc)
Wend
Example code for NotesDBDirectory
Dim dbDirectory As New NotesDBDirectory( "YourServerName" )
Dim db As NotesDatabase
Dim strTemplate As String
Set db = dbDirectory.GetFirstDatabase( DATABASE )
While Not db Is Nothing
    strTemplate = db.DesignTemplateName
    '- do whatever you want: create a document in your database, create a list...
    Set db = dbDirectory.GetNextDatabase
Wend
As Panu stated, a database catalog provides a list of all databases on a server. You use the server Catalog task to create a database catalog. The Catalog task bases the catalog file (CATALOG.NSF) on the CATALOG.NTF template and adds the appropriate entries to the catalog's ACL. All databases on a server are included in the catalog when the Catalog task runs.
To help users locate databases across an organization, or to keep track of all the replicas for each database, you must set up a Domain Catalog -- a catalog that combines the information from the database catalogs of multiple servers -- on one of your servers. You can set up a Domain Catalog regardless of whether you plan to implement Domino's Domain Search capability.
Besides allowing users to see what databases are on a particular server, catalogs provide useful information about databases. For each database in a view, a Database Entry document provides information such as file name, replica ID, design template, database activity, replication, full-text index, and ACL, as well as buttons that let users browse the database or add it to their bookmarks. In addition, the document displays a link to the database's Policy (About This Database) document, which, for databases users are not authorized to access, they can view by sending an e-mail request to the database manager.
See the Domino Admin Help for more details.

Making sharding simple with Django

I have a Django project based on multiple PostgreSQL servers.
I want users to be sharded across those database servers using the same sharding logic used by Instagram:
User ID => logical shard ID => physical shard ID => database server => schema => user table
The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
Finally, the user table in the appropriate schema is queried.
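The lookup chain above can be sketched in plain Python. This is an illustration under assumptions, not Instagram's actual code: the bit position of the shard ID (10 low sequence bits below the 13 shard bits), the number of physical shards, and the server names are all hypothetical.

```python
SEQUENCE_BITS = 10  # assumed: low bits of the ID hold a per-shard sequence
SHARD_BITS = 13     # from the description: 13 bits encode the logical shard
LOGICAL_SHARDS = 1 << SHARD_BITS  # 8192 logical shards

# Hypothetical hard-coded mappings; real values would live in configuration.
PHYSICAL_SHARDS = 4
DATABASE_SERVERS = {
    0: "pg-a.internal",
    1: "pg-b.internal",
    2: "pg-c.internal",
    3: "pg-d.internal",
}


def logical_shard(user_id):
    """Extract the logical shard ID embedded in the user ID."""
    return (user_id >> SEQUENCE_BITS) & (LOGICAL_SHARDS - 1)


def physical_shard(logical):
    """Map contiguous ranges of logical shards onto physical shards."""
    return logical * PHYSICAL_SHARDS // LOGICAL_SHARDS


def locate(user_id):
    """Return (database server, schema name) for a user ID."""
    logical = logical_shard(user_id)
    server = DATABASE_SERVERS[physical_shard(logical)]
    return server, f"shard{logical:04d}"
```

Because the logical shard is baked into the ID, moving a logical shard to another server only requires changing the logical-to-physical mapping, not the IDs themselves.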
How can this be achieved as simply as possible in Django?
Ideally, I would love to be able to write Django code such as:
Fetching an instance
# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)
Fetching related objects
# this gets the user's posted articles, located in the same logical shard:
articles = user.articles
Creating an instance
# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()
Searching users by name
# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
# relevant database servers.
users = User.objects.filter(title = "King")
To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.
Django provides support for automatic database routing which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk = user_id) because the router does not have access to the query parameters, so it does not know what the user ID is, it just knows that the code is trying to read the User model.
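Since the router never sees the primary key, one workaround is to bypass it for ID-based lookups and choose the database alias explicitly. The sketch below assumes hypothetical aliases of the form "shardN_primary" and "shardN_replicaM" defined in settings.DATABASES, and an assumed ID layout with the 13 shard bits sitting above 10 sequence bits.

```python
import random

SEQUENCE_BITS = 10       # assumed position of the shard bits inside the ID
SHARD_BITS = 13          # from the question
PHYSICAL_SHARDS = 4      # hypothetical number of primary servers
REPLICAS_PER_SHARD = 2   # hypothetical number of read replicas per primary


def alias_for(user_id, for_write=False):
    """Pick a Django database alias for a user ID.

    Writes go to the shard's primary; reads go to a randomly chosen replica.
    """
    logical = (user_id >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)
    physical = logical * PHYSICAL_SHARDS // (1 << SHARD_BITS)
    if for_write:
        return f"shard{physical}_primary"
    return f"shard{physical}_replica{random.randrange(REPLICAS_PER_SHARD)}"


# Hypothetical usage inside a Django project:
#   user = User.objects.using(alias_for(user_id)).get(pk=user_id)
#   user.save(using=alias_for(user.pk, for_write=True))
```

This only selects the server. Selecting the per-shard PostgreSQL schema would still need something extra, such as setting the search_path on the connection, which is not covered here.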
I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.
In short: how can I shard data in Django, as simply as possible?
Thanks a lot for your kind help.
Note
There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).
Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.
Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Whereas the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.
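To make the table-name idea concrete, here is a minimal consistent-hashing sketch. It is not Disqus's implementation, which the quote does not describe in detail: the HashRing class, the number of ring points per shard, and the choice of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """A small consistent-hash ring mapping keys to shard IDs."""

    def __init__(self, shard_ids, points_per_shard=64):
        # Place several points per shard on the ring to even out the spread.
        self._ring = sorted(
            (self._hash(f"{shard}:{i}"), shard)
            for shard in shard_ids
            for i in range(points_per_shard)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(str(key))) % len(self._ring)
        return self._ring[idx][1]


def sharded_table(base_name, key, ring):
    """Return the rewritten table name, e.g. comments_post_<shard id>."""
    return f"{base_name}_{ring.shard_for(key)}"
```

The point of a ring over a plain modulo is that adding a shard only moves the keys that land on the new shard's points, while every other key keeps its table.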
In addition, they released some code as part of a sample application demonstrating how they shard.
You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.
If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.
You might find SkyTools useful - read up on PL/Proxy. It's how Skype shard their databases.
It is better to use dedicated sharding middleware, for example Apache ShardingSphere.
The project contains two products: ShardingSphere-JDBC, a Java driver, and ShardingSphere-Proxy, which works with any programming language, so it can support Python and Django as well.

Getting list of MySQL Databases without logging in

I'm working on a Qt/C++ open source project that uses MySQL databases. One class will be used during initial configuration (first run) where the user will be able to select a database. Is there a way to provide a list of all databases on a host without logging in and executing a SHOW DATABASES; transaction? I want to get a list of all databases on the host, not just those owned by a particular user. The only way I know of to do this is to execute SHOW DATABASES; as root on a specific host, but I don't want to require the user to have root access except in certain situations where it is absolutely necessary.
The idea is to have a dialog that lets the user select the default database they want to use during subsequent sessions and provide the user/pass that goes with it. Bonus points if I can get the owner of each database too. (for instance, have the program display that database foo is owned by johndoe while database foo2 is owned by janesmith) Once the user has made a choice, the dialog will then write this info to that user's program configuration file which gets read on normal startup.
Can this be done or will I have to find some workaround such as making the user provide a login/password first and showing a list of databases owned by that account? That would be relatively cumbersome but easy.
You can't execute MySQL queries without logging in. That said, it is possible to create a user that has very minimal privileges.
You can create a user with just enough privileges to show the list of databases and run the query as that user, then when the user has logged in change the connection string.
There is a SHOW DATABASES grant which allows just that: http://dev.mysql.com/doc/refman/5.0/en/privileges-provided.html#priv_show-databases
Normally you define a user for the application with read-only privileges and after fetching the information needed you present it to the user and then ask for his credentials. I'm just oversimplifying and not going over the specifics of how this is done.