How are cross database joins performed in superset? - apache-superset

How are cross-database joins performed in Superset? For example, are the two data sources pulled into a pandas DataFrame, or into a SQLite / Postgres database, and then joined in memory? Or do you have to provide a database instance for Superset to perform operations like these?

Superset lets you create Virtual Datasets from custom SQL queries, so your data sources need to be tables in a single database for you to join them and build charts on the resulting Virtual Dataset.

If I understand the question correctly, I believe Superset is an open-source equivalent to the ADO.NET DataSet from Microsoft. If so, the selected data from both databases is pulled into memory (data tables) over separate connections (because each connection string is different), and the operations are then performed on the fly, in memory.
In that scenario, no external database would be required.
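To make the in-memory idea in this answer concrete, here is a minimal sketch of joining rows pulled over two separate connections. It only illustrates the approach the answer describes and is not a statement of how Superset works internally. Two in-memory SQLite databases stand in for the two sources, and the table names, column names and rows are all made up.

```python
import sqlite3

# Two independent in-memory SQLite databases stand in for the two data sources.
# Table names, column names and rows are hypothetical.
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.5)],
)

customers_db = sqlite3.connect(":memory:")
customers_db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
customers_db.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(10, "Ada"), (11, "Grace")],
)


def join_in_memory(orders_conn, customers_conn):
    """Pull rows over separate connections and hash-join them on customer_id."""
    # Build a lookup table from one side of the join.
    names = dict(customers_conn.execute("SELECT customer_id, name FROM customers"))
    joined = []
    for order_id, customer_id, amount in orders_conn.execute(
        "SELECT order_id, customer_id, amount FROM orders ORDER BY order_id"
    ):
        if customer_id in names:  # inner-join semantics
            joined.append((order_id, names[customer_id], amount))
    return joined


rows = join_in_memory(orders_db, customers_db)
```

Neither database ever sees the other's data here. The cost is that both result sets have to fit in the memory of the process doing the join.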

Related

What is the approach to merge data from multiple databases (same schema) using Power BI?

I have 3 OLTP databases, all using the same database schema. Each db represents one department.
I am exploring Power BI as a solution for reporting at the company level, so all departments combined.
What is the approach to combine data from multiple dbs into a data warehouse? For example - do I need SSIS to combine the 3 dbs into 1 data warehouse?
Another option could be to have 1 shared dataset per db, and then the final report can connect and combine multiple live datasets? Or is there another way with Power BI like combining multiple live datasets?
Is there a reference link from someone who has done this?
Or is there another way with Power BI
Yes. Simply create a single import model and load data from all three databases into it. For each table in your Power BI model you would have three Power Queries set to not load into the model, and you would append them in a fourth query that is used to load your model. See e.g.: https://learn.microsoft.com/en-us/power-query/append-queries
Best practice would be to:
Extract the data into a single database (DWH or reporting schema)
Build the necessary items there for your data model, be it reporting schema, or star/snowflake schemas
Connect Power BI to that schema.
Combining datasets is going to be tricky, because you may have the same measures in each of the datasets. Combining in the database, with an added column to indicate the department, is the best option in terms of supporting updating/adding/removing items. For example, if the schema changes in the DBs, you handle it in one place, not in three datasets. The toolset in DB/SSIS is also better suited to the heavy lifting of moving the data to one location.
You would use SSIS to extract the data if it is on-prem, or Azure Data Factory for Azure DBs. Extract to a staging schema, then convert/transform the data into its final form under a new schema that says what it is: facts/dimensions, or another schema name such as reporting, depending on the data model you wish to build. Most of this is covered by the standard ETL pattern of OLTP to an OLAP database.
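As a rough sketch of the "combine in the database, with an added department column" idea, the following loads three same-schema sources into one reporting table. In-memory SQLite databases stand in for the OLTP sources and the warehouse, and the department names, table names and rows are invented for illustration. A real pipeline would use SSIS or Data Factory as described above.

```python
import sqlite3


def make_source(rows):
    """Create one stand-in OLTP database with the shared (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (invoice_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO invoices VALUES (?, ?)", rows)
    return conn


# One source per department; names and data are made up.
sources = {
    "sales": make_source([(1, 100.0), (2, 50.0)]),
    "support": make_source([(1, 20.0)]),
    "finance": make_source([(1, 75.0), (2, 5.0)]),
}

# The reporting database. invoice_id alone is not unique across departments,
# so the department column becomes part of the key.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_invoices ("
    "department TEXT, invoice_id INTEGER, total REAL, "
    "PRIMARY KEY (department, invoice_id))"
)


def load_all(source_conns, warehouse_conn):
    """Extract every source and load it with its department tag attached."""
    for department, conn in source_conns.items():
        rows = conn.execute("SELECT invoice_id, total FROM invoices").fetchall()
        warehouse_conn.executemany(
            "INSERT INTO fact_invoices VALUES (?, ?, ?)",
            [(department, invoice_id, total) for invoice_id, total in rows],
        )
    warehouse_conn.commit()


load_all(sources, warehouse)
totals = dict(
    warehouse.execute(
        "SELECT department, SUM(total) FROM fact_invoices GROUP BY department"
    )
)
```

Company-level reports then query `fact_invoices` alone, and a schema change only has to be handled in the load step.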

PostgreSQL: How to store and fetch historical SQL data from Azure Data Lakes (ADLS)

I have one single Django web application deployed on Azure with a transactional SQL DB i.e. PostgreSQL.
Within the Django application, every day this historical data needs to be accessed (eg: to show the pattern over a period of years, months etc.) from the ADLS.
However, ADLS will only return one or more files, and my application needs an intermediate layer such as Azure Synapse to convert this unstructured data into a structured DB, so that it can run queries on the historical data and show the results within the web application.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Question. B) Since Django is inherently tied to ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
This entire exercise is being undertaken in order to store older historical data in a large repository and also access/query data from that ADLS repository whenever required.
Please guide what Azure alternatives may work in this case.
You need to break down your problem. For each piece you have multiple choices with different cost implications and complexity of implementation and amount of control/flexibility you get.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Synapse Serverless SQL Pool lets you query JSON files in the data lake without a physical DB. It is compute only, with no storage.
This suits infrequent access to large datasets, because every query goes and parses the data in the data lake.
If you want, you can also COPY INTO some_table all the data from the files and then run queries more efficiently on some_table (which is stored in a DB, with indices, partitions, ...) using a dedicated Synapse SQL Pool.
For example, the following JSON:
{
    "_id": "ahokw88",
    "type": "Book",
    "title": "The AWK Programming Language",
    "year": "1988",
    "publisher": "Addison-Wesley",
    "authors": [
        "Alfred V. Aho",
        "Brian W. Kernighan",
        "Peter J. Weinberger"
    ],
    "source": "DBLP"
}
can be queried with the following SQL:
SELECT
    JSON_VALUE(jsonContent, '$.title') AS title,
    JSON_VALUE(jsonContent, '$.publisher') AS publisher,
    jsonContent
FROM OPENROWSET(
    BULK 'json/books/*.json',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
)
WITH (jsonContent varchar(8000)) AS [r]
WHERE
    JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
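For intuition about what the query above does, here is a small Python sketch of the same select-and-filter over raw JSON documents. It is not Synapse code: the strings stand in for files in the data lake, the second document is invented, and `select_books` is a hypothetical helper name.

```python
import json

# Each string stands in for the content of one JSON file in the data lake.
# The second document is invented for illustration.
documents = [
    '{"_id": "ahokw88", "type": "Book", "title": "The AWK Programming Language", '
    '"year": "1988", "publisher": "Addison-Wesley"}',
    '{"_id": "example2", "type": "Book", "title": "Another Title", '
    '"year": "2001", "publisher": "Example Press"}',
]


def select_books(raw_documents, title):
    """Return (title, publisher) for documents whose title matches exactly."""
    results = []
    for raw in raw_documents:
        doc = json.loads(raw)  # every call re-parses every document
        if doc.get("title") == title:
            results.append((doc.get("title"), doc.get("publisher")))
    return results


matches = select_books(documents, "The AWK Programming Language")
```

Each call parses every document again, which mirrors why the serverless option fits infrequent access better than repeated querying.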
Question. B) Since Django is inherently tied to ORM (Object-Relational Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that the underlying data source (Synapse) is built for MPP (massively parallel processing), not transactional processing. Inserting 1000 rows in a for loop using INSERT INTO ... would take about 1000 seconds, but querying 10 million rows with a SELECT ... statement would probably take less than 100 seconds. So know what you are doing with it.
Does Synapse have to be configured with both the App DB and ADLS in a pipeline system through Azure Data Factory? And is this achievable for a PostgreSQL DB? I could not find Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago
You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
The only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new data set, write the output to some output data sink like PostgreSQL. Your Django code then would be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory

How to run dynamic queries in Informatica cloud mapping task?

I am new to Informatica Cloud. I have a list of queries stored in a table.
I want to take the queries from this table one at a time, use each as the source query, and load whatever results it returns into the target. All the tables have already been created in the source and the target.
I just need to copy the data based on the dynamic queries kept in one of my SQL tables.
If anyone has any ideas, please share your thoughts with me. It would be a great help.
The source connection will be the connector to your source database and the Source Type will be query. From there it depends how you are managing your variables. See thread on Informatica Network for links to multiple examples.
Read the table as you normally would in the cloud. Then pass each record into the SQL transformation for execution. Configure where the SQL transformation has to execute, and it will run the queries in the database you want.
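Outside Informatica, the pattern described here (read the query table, run each row as a source query, load the result into a target) looks roughly like the sketch below. This is plain Python with SQLite, not Informatica configuration. The control table `query_list`, its columns and the sample tables are all hypothetical, and a single connection plays both source and target for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src_customers (id INTEGER, name TEXT, active INTEGER);
    CREATE TABLE tgt_customers (id INTEGER, name TEXT);
    CREATE TABLE src_orders (id INTEGER, amount REAL);
    CREATE TABLE tgt_orders (id INTEGER, amount REAL);

    INSERT INTO src_customers VALUES (1, 'Ada', 1), (2, 'Bob', 0);
    INSERT INTO src_orders VALUES (10, 5.0), (11, 250.0);

    -- Control table: one row per source query and the target it feeds.
    CREATE TABLE query_list (seq INTEGER, target_table TEXT, query_text TEXT);
    INSERT INTO query_list VALUES
        (1, 'tgt_customers', 'SELECT id, name FROM src_customers WHERE active = 1'),
        (2, 'tgt_orders', 'SELECT id, amount FROM src_orders WHERE amount > 100');
    """
)

# Table names cannot be bound as SQL parameters, so check them against a whitelist.
ALLOWED_TARGETS = {"tgt_customers", "tgt_orders"}


def run_query_list(connection):
    """Run each stored query in order and load its result into its target table."""
    loaded = {}
    jobs = connection.execute(
        "SELECT target_table, query_text FROM query_list ORDER BY seq"
    ).fetchall()
    for target_table, query_text in jobs:
        if target_table not in ALLOWED_TARGETS:
            raise ValueError(f"unexpected target table: {target_table}")
        rows = connection.execute(query_text).fetchall()
        if rows:
            placeholders = ", ".join("?" for _ in rows[0])
            connection.executemany(
                f"INSERT INTO {target_table} VALUES ({placeholders})", rows
            )
        loaded[target_table] = len(rows)
    connection.commit()
    return loaded


loaded = run_query_list(conn)
```

Executing SQL text read from a table is only safe if that table is trusted, which is why the sketch at least restricts the target names.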
You can use a SQL task to run dynamic SQL queries.
Link to the SQL task approach: https://www.datastackpros.com/2019/12/informatica-cloud-incremental-load_14.html

Why do well-designed DynamoDB applications require only one table?

On Amazon DynamoDB help center I've read that
You should maintain as few tables as possible in a DynamoDB
application. Most well designed applications require only one table.
Sorry guys, but what does it mean? Should I design a database with just ONE table, or should my application (say, a PHP one) work with just one table while the database may contain several?
Thank you!
I think the One Table concept means this: if you draw a relational database diagram of your models and associations, all the tables that are connected by associations should be merged, or designed, into one single NoSQL table. If the same relational database has two sets of tables with no association between them, you can group them into two separate NoSQL tables.
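One way to picture the merge described above is a composite key: a partition key groups associated items, and a sort key tells them apart. The sketch below is plain in-memory Python, not the DynamoDB API. The `pk`/`sk` attribute names, the `CUSTOMER#`/`ORDER#` prefixes and the data are illustrative assumptions.

```python
# One "table": every item carries a partition key (pk) and a sort key (sk).
# What would be a customers table and an orders table in a relational design
# live side by side here. Key formats and data are invented.
table = [
    {"pk": "CUSTOMER#1", "sk": "PROFILE", "name": "Ada"},
    {"pk": "CUSTOMER#1", "sk": "ORDER#2023-01-05", "total": 30},
    {"pk": "CUSTOMER#1", "sk": "ORDER#2023-02-11", "total": 45},
    {"pk": "CUSTOMER#2", "sk": "PROFILE", "name": "Grace"},
]


def query(items, pk, sk_prefix=""):
    """Return one partition's items whose sort key starts with sk_prefix, ordered by sk."""
    matches = [i for i in items if i["pk"] == pk and i["sk"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["sk"])


# One lookup returns a customer together with their orders, with no join.
customer_with_orders = query(table, "CUSTOMER#1")
# Narrowing by sort-key prefix returns only the orders.
orders_only = query(table, "CUSTOMER#1", "ORDER#")
```

The point of the design is that related items come back from a single keyed lookup instead of a join across tables.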

Push from one sql server to another autonomously

I have an application that requires me to pull certain information from DB#1 and push it to DB#2 every time a certain entry in a table from DB#1 is updated. The polling rate doesn't need to be extremely fast, but it probably shouldn't be any slower than 1 second.
I was planning on writing a small service using the C++ Connector library, but I am worried about putting too much load on DB#1. Is there a more efficient way of doing this, such as built-in functionality within an SQL script?
There are many methods to accomplish this, so it may be other factors you prefer that drive the approach.
If the SQL Server databases are on the same server instance:
Trigger on the DB1 tables that push to the DB2 tables
Stored procedure (in DB1 or DB2) that uses MERGE to identify changes and sync them to DB2, then use SQL job to call the procedure on your schedule
Enable Change Tracking on database and desired tables, then use stored proc + SQL job to send changes without any queries on source tables
If on different instances or servers (can also work if on same instance though):
SSIS Package to identify changes and push to DB2 (bonus can work with change data capture)
Merge Replication to synchronize changes
AlwaysOn Availability Groups to synchronize entire dbs
Microsoft Sync Framework
Knowing nothing about your preferences or comfort levels, I would probably start with Merge Replication. It can be a bit tricky and tedious to set up, but it performs very well.
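To show why change-based syncing keeps the load on DB#1 low, here is a simplified sketch of polling with a high-water mark. It is a hand-rolled stand-in, not SQL Server's Change Tracking feature. The `change_seq` column, the `readings` table and the helper names are assumptions, and SQLite plays both databases.

```python
import sqlite3

# Stand-ins for DB#1 (source) and DB#2 (target); the schema is hypothetical.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL, change_seq INTEGER)"
)
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")


def write_source(conn, row_id, value):
    """Insert or update a row and stamp it with the next change sequence number."""
    next_seq = conn.execute(
        "SELECT COALESCE(MAX(change_seq), 0) + 1 FROM readings"
    ).fetchone()[0]
    conn.execute(
        "INSERT OR REPLACE INTO readings (id, value, change_seq) VALUES (?, ?, ?)",
        (row_id, value, next_seq),
    )
    conn.commit()


def sync_once(source_conn, target_conn, last_seq):
    """Copy rows changed since last_seq and return the new high-water mark."""
    changed = source_conn.execute(
        "SELECT id, value, change_seq FROM readings WHERE change_seq > ? "
        "ORDER BY change_seq",
        (last_seq,),
    ).fetchall()
    for row_id, value, change_seq in changed:
        target_conn.execute(
            "INSERT OR REPLACE INTO readings (id, value) VALUES (?, ?)",
            (row_id, value),
        )
        last_seq = change_seq
    target_conn.commit()
    return last_seq


write_source(source, 1, 20.5)
write_source(source, 2, 21.0)
mark = sync_once(source, target, 0)      # copies both rows
write_source(source, 1, 22.0)            # update an existing row
mark = sync_once(source, target, mark)   # copies only the changed row
```

Each poll reads only the rows past the last mark, so an idle source costs one cheap query per interval. Deletes are not handled in this sketch.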
You can create a trigger in DB1 and a database link between DB1 and DB2 (a linked server, in SQL Server terms). The trigger then fires natively within DB1 and transfers the data directly to DB2.