concatenation of data into superset - apache-superset

There are two tables: one collects facts on a daily basis, the other on a monthly basis, with the same set of attributes (for example, region, city, technology).
I need to calculate the following formula in Superset:
SUM(t1.count_exp) / SUM(t2.count_base)
so that it is visualized correctly when aggregating by region, by city, or by region + city + technology per month.
In other BI systems, the GROUP BY is performed first, then the join is executed and the formula above is calculated, which gives the desired result. How can I achieve a similar result in Superset?

Assuming both tables are in the same database, you can write your own query joining the two tables in SQL Lab and then visualize the query results using the 'Explore' option available there.
Once you click 'Explore' from SQL Lab, Superset creates a virtual dataset (table) inside Superset from the results of the SQL query. Any filters, GROUP BY, or LIMIT applied to this virtual table from a visualization will query over this query.
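For example, a minimal sketch of such a SQL Lab query, assuming the tables are named daily_facts and monthly_facts and share the region/city/technology/month grain (adjust the names, join keys, and date function to your actual schema and database dialect):

-- Pre-aggregate the daily facts to the monthly grain first, then join,
-- so the ratio can be computed at any coarser grouping in Explore.
SELECT
    t1.region,
    t1.city,
    t1.technology,
    t1.month,
    t1.count_exp,
    t2.count_base
FROM (
    SELECT
        region,
        city,
        technology,
        DATE_TRUNC('month', fact_date) AS month,  -- date function varies by database
        SUM(count_exp) AS count_exp
    FROM daily_facts
    GROUP BY region, city, technology, DATE_TRUNC('month', fact_date)
) AS t1
JOIN monthly_facts AS t2
  ON  t2.region = t1.region
  AND t2.city = t1.city
  AND t2.technology = t1.technology
  AND t2.month = t1.month

In Explore you can then define the metric SUM(count_exp) / SUM(count_base) on this virtual dataset; Superset's own GROUP BY (by region, city, and so on) is applied on top, which reproduces the group-then-join behaviour described in the question.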
https://superset.apache.org/docs/frequently-asked-questions
A view is a simple logical layer that abstracts arbitrary SQL queries as a virtual table. This can allow you to join and union multiple tables, and to apply some transformation using arbitrary SQL expressions. The limitation there is your database performance, as Superset will effectively run a query on top of your query (the view). A good practice may be to limit yourself to joining your main large table to one or many small tables only, and to avoid using GROUP BY where possible, as Superset will do its own GROUP BY and doing the work twice might slow down performance.

Related

How would I add a WHERE clause to a SQL or Access Data Source in Power BI?

I am evaluating Power BI as a possible tool for publication-quality reports that will be distributed to clients. My source databases are in Microsoft SQL Server and Access.
I am somewhat confused by the Power Query Editor.
As a part of this, we will need to be able to specify the value of a Client Id field and apply a WHERE clause to the SQL and Access data source.
I see that I can filter data on one or more columns. This would be cumbersome when generating reports for a set of individual clients.
I see a Manage Parameters feature on the Home Tab of the Power Query Editor. Can these parameters be compared to values in database tables?
Are there examples of using M or DAX (or anything else) to implement an equivalent WHERE clause?
Do I have to run stored procedures, populate temporary tables and then run Power BI reports?
Here are a few options:
1. Connect to your database using a Native Database Query. (Related post.)
2. Connect to a view you create on your database that includes the WHERE clause.
3. Filter the table in the Query Editor after connecting to it.
4. Import or DirectQuery the whole table and filter at run time.
In #3, basic filtering usually gets folded into the under-the-hood query that Power BI sends to the database, so this looks similar to #1 as far as your database is concerned.
With #4, it's possible to apply row-level security so that different people have access to different subsets of the data.
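As an illustration of option 2, a minimal sketch of a per-client view that bakes the WHERE clause into the source database (the view, table, and column names are hypothetical):

-- Hypothetical per-client view; point Power BI at this view instead of
-- the raw table so the Client Id filter is applied on the database side.
CREATE VIEW dbo.ReportData_Client42 AS
SELECT *
FROM dbo.ReportData
WHERE ClientId = 42;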

Power BI - Merging multiple customer tables

Connection type: DirectQuery to multiple sources, so limited DAX is available, especially at the Power Query load stage.
Data model question: The data model is not a perfect star schema, but there is an attempt to separate tables into business processes and lookup tables. There are probably a few issues to discuss with the current data model, but I only have one question at this time.
My current goal is to generate a single summarised customer table to replace the current two tables, with some measures I need, such as the number of app customers, the total number of customers, the date a customer first accessed the app, etc.
So I cannot merge the two customer tables and add calculated columns and measures at the import stage, as Power BI does not support or allow it, and SQL is out since I am using DirectQuery. My plan is to create a summarised customer table using the DAX SUMMARIZE function on the front-end visual page, containing only the app customers, and then add measures like the total number of customers, etc. Is this best practice, or is there a better way of approaching this? I understand you would ideally do this in SQL or Power Query, but in these circumstances I think this is the best way; I just wanted a second view.
Is there a reason to use DirectQuery over Import? If you are in Import mode, you can easily Append the two client tables together in Power Query.
Treb Gatte, Power BI MVP

Problems loading data in to Analysis Services Model

I’m building a model in Azure Analysis Services. The model should contain only data for the last 3 months and is processed every day.
I have a separate date dimension that has a relation to a fact table using a date key. I’m using a Power Query to load only the last 3 months into the date dimension. In the Power Query that loads the fact table, I used Table.NestedJoin to load only the rows that have a matching value in the date table.
When I do this, the processing of the model takes forever. After some troubleshooting I saw that the query Analysis Services uses to retrieve data from the SQL database retrieves all rows. So, am I correct in saying that AS loads all the data before it merges the rows? Is there a way to change this? Or is there a better way to achieve my solution?
Kind regards,
Joins are super slow in Power Query. You should avoid them if you can do the join in the data source, or use normal relationships in the data model instead.
Also, you can set up the date dimension in DAX and dynamically populate it to contain only the dates present in the fact table.
As for the load of all the data, it could be because the data is fetched as-is, and only then does Power Query apply the transformations (the join).
You can modify the query in the Power Query Editor / Advanced Editor to add a WHERE clause directly in the source query.
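For example, a minimal sketch of the kind of filtered source query you could push down to the database instead of joining in Power Query (the table and column names are hypothetical, and SQL Server syntax is assumed):

-- Fetch only the last 3 months of facts from the source database,
-- so Analysis Services never has to read the older rows.
SELECT f.*
FROM dbo.FactSales AS f
WHERE f.FactDate >= DATEADD(MONTH, -3, CAST(GETDATE() AS date));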

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that there are customers with 5M rows and others with 200K, and as BQ does not have indexes, queries are always processing other customers' data as well (and the costs are rising).
I intend to create a timestamp field where each customer will have a different timestamp value that is repeated for every insert into every customer-sensitive table, so that I can filter by this timestamp as if it were a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but as it always does a full table scan, I think I'd have only benefits and no downsides.
The solution for your problem is to add a clustering field to your table, which is equivalent to an index in other databases.
This link provides the basics of how to use a clustering field:
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a clustering field, BigQuery's dry run doesn't show the cost improvement, which can only be seen post-execution.
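A minimal sketch of such a table in BigQuery standard SQL (the project, dataset, table, and column names other than idClient are assumptions):

-- Partition by day and cluster by idClient, so per-customer queries
-- only scan the blocks containing that customer's rows.
CREATE TABLE `my_project.my_dataset.customer_facts`
(
  idClient INT64,
  event_ts TIMESTAMP,
  payload  STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY idClient;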

BigQuery - querying only a subset of keys in a table with key value schema

So I have a table with the following schema:
timestamp: TIMESTAMP
key: STRING
value: FLOAT
There are around 200 unique keys. I am partitioning the dataset by date.
I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.
The issue is that because of this key-value format, and BigQuery being a columnar database, each query scans the whole day's data, despite each query actually using a maximum of 4 keys. What is the best way to optimize this?
I am thinking the best way I can go about it right now is to create separate temp tables for each key as a daily batch process, run my queries on them, and then delete them.
The ideal way I would want to go about this is partitioning by key, but I am not sure there is any such provision?
You can try using the recently introduced clustering on partitioned tables.
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Update (moved from comments)
Also keep in mind the comparison below:
Feature           Partitioning    Clustering
---------------   -------------   -------------
Cardinality       Less than 10k   Unlimited
Dry Run Pricing   Available       Not available
Query Pricing     Exact           Best Effort
Pay special attention to Dry Run Pricing: unfortunately, clustered tables do not support a dry run (validation) based on clustering keys, and rather show only validation based on partitions. But if you set up your clustering properly, the actual run will end up with a lower cost. You should try with smaller data to get comfortable with this.
See more at Clustering partitioned tables
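A minimal sketch of what that could look like for the schema in the question (the project and dataset names are placeholders):

-- Partition by day and cluster by key, matching the question's schema.
CREATE TABLE `my_project.my_dataset.metrics`
(
  timestamp TIMESTAMP,
  key       STRING,
  value     FLOAT64
)
PARTITION BY DATE(timestamp)
CLUSTER BY key;

-- A daily query that filters on the clustering column scans only the
-- blocks containing those keys, not the whole day's partition.
SELECT key, AVG(value) AS avg_value
FROM `my_project.my_dataset.metrics`
WHERE DATE(timestamp) = CURRENT_DATE()
  AND key IN ('key_a', 'key_b', 'key_c', 'key_d')
GROUP BY key;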