I expect to have 10-25 TBs of data in my MPP DW in nearest few years. The size of one dataset could be up to 500GB csv. I want to do interactive querying against that data in anaytical tools (Power BI).
I'm looking for a way to achieve interactive querying with reasonable billing.
I know Azure Analytics Services (AAS) Multidimensional Model can be used for that data volumes. But it will give me less performance as tabular model. In other hand, even with 10x compression rate I can't keep everything in RAM simultaneously due to AAS pricing.
So I'm wondering if there is an possibility to keep all tabular models inside AAS in detached state (disk only) within minimal AAS cluster size (minimal billing), while on request do scale out (increase number of nodes) and attach specific dataset (load from disk to RAM)? Is there any other option to use AAS tabular model and not keeping all 10-25 TBs in RAM simultaneously?
I assume this option with small amount of concurrent users will have better performance that multidimensional model, while not require keep everything in RAM (less expensive).
Related
I've read almost all the threads about how to improve BigQuery performance, to retrieve data in milliseconds or at least under a second.
I decided to use BI Engine for the purpose because it has seamless integration without code changes, it supports partitioning, smart offloading, real-time data, built-in compression, low latency, etc.
Unfortunately for the same query, I got a slower response time with the BI engine enabled, than just the query with cache enabled.
BigQuery with cache hit
Average 691ms response time from BigQuery API
https://gist.github.com/bgizdov/b96c6c3d795f5f14e5e9a3e9d7091d85
BigQuery + BiEngine
Average 1605ms response time from BigQuery API.
finalExecutionDurationMs is about 200-300ms, but the total time to retrieve the data (just 8 rows) is 5-6 times more.
BigQuery UI: Elapsed 766ms, the actual time for their call to REST entity service is 1.50s. This explains why I get similar results.
https://gist.github.com/bgizdov/fcabcbce9f96cf7dc618298b2d69575d
I am using Quarkus with BigQuery integration and measuring the time for the query with Stopwatch by Guava.
The table is about 350MB, the BI reservation is 1GB.
The returned rows are 8, aggregated from 300 rows. This is a very small data size with a simple query.
I know BigQuery does not perform well with small data sizes, or it doesn't matter, but I want to get data for under a second, that's why I tried BI, and it will not improve with big datasets.
Could you please share job id?
BI Engine enables a number of optimizations, and for vast majority of queries they allow significantly faster and efficient processing.
However, there are corner cases when BI Engine optimizations are not as effective. One issue is initial loading of the data - we fetch data into RAM using optimal encoding, whereas BigQuery processes data directly. Subsequent queries should be faster. Another is - some operators are very easy to optimize to maximize CPU utilization (e.g. aggregations/filtering/compute), while others may be more tricky.
In my scenario, Databricks is performing read and writing transformations in Delta tables. We have PBI connected to the Databricks cluster that needs to be running most of the time, which is expensive.
Knowing that delta tables are in a container, what would be the best way in terms of cost x performance to feed PBI from delta tables?
If your set size is under max allowed size in PowerBI (100 GB I guess) and daily refresh is enough you can just load everything to your PowerBI model.
https://blog.gbrueckl.at/2021/01/reading-delta-lake-tables-natively-in-powerbi/
If you want to save the costs maybe you don't need transactions and can save it in csv in data lake, than loading everything to PowerBI and refresh daily is really easy.
If you want to save the costs and query new incoming data all the time using DirectQuery consider using Azure SQL. It has really competitive prices starting from 5 eur/usd. Integration with databricks is also perfect write in append mode do all magic.
Another option to consider is to create an Azure Synapse workspace and use serverless SQL compute to query the delta lake files. This is a pay-per-the-TB consumed pricing model so you don’t have to have your Databricks cluster running all the time. It’s a great way to load Power BI import models.
Looking at this page: Power BI features comparison I see that a dataset can be 10gb and storage is limited to 100tb. Can I take this to mean there is a limit of 10,000 10gb apps?
Also is there a limit on the number of users? It implies no with the statement "Licensed by dedicated cloud compute and storage resources", but I wanted to be sure.
I assume I am paying for compute so the real limits are based on what compute resources I purchase? Are there any limits on this?
Thanks.
Yes you can have 10,000 10GB datasets, to use up the total volume of 100TB, however storage will also be used for Excel Workbooks, Dataflows storage, as well as Excel ranges pinned to a dashboard, and other uploaded images.
There is no limit on the total number of users, however there is a limit based on 'peak renders per hour', which means how often users interact with the report. PBI Premium does expect you you have a mix of frequent and infrequent users, so for Premium P1 nodes, the peak renders per hour is 1 to 2400. Anything over that, you may experience performance degradation on that node is for example you had 3500 renders of a report in an hour, but it will depend on the type of report, queries etc. You can scale up to quite a number of nodes if you need to, Power BI Premium Gen 2 does allow auto scale.
Cost Effective way to connect Google BigQuery with Power BI. What intermediate layer is required in between GCP and Power BI?
You can access BigQuery directly from DataStudio using a custom query or loading the whole table. Technically nothing is necessary between BigQuery and DataStudio.
Regarding best practices, if your dashboard reads a lot of data and its constantly used it can lead to a high cost. In this case a "layer" makes sense.
If this is your case you could pre-aggregate your data in BigQuery to avoid a big amount of data to be read many times by DataStudio. My suggestion is:
Create a process (could be a scheduled query) that periodically aggregate your data and then save it in another table
In DataStudio read your data from the aggregated table
These steps can help you reducing costs and also can make your dashboards loading faster. The negative point is that if you are working with streaming data this approach in general will not let you see the most recent registries unless you run the aggregation process very constantly.
I'm using AWS Redshift as a back-end to my tableau desktop. AWS cluster is running with two dc1.large nodes and database table which I'm analyzing is of 30GB (with redshift compression enabled), I chose Redshift over tableau extract for performance issue but seems like Redshift live connection is much slower than extract. Any suggestions where shall I look into?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to it's user in <5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), then you still are just barely reaching your target performance. Now, if just one of those queries take 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
Few pointers as below
1) Use vacuum & Analyze once your ETL completes
2) Have you created the Table with proper Dist key and Sort Key
3) Aggregation if it's ok from the point of Data Granularity, requirement etc
1.Remove cursor, tableau access data from redshift leader node using cursor. Cursor works iteratively. Thus, impacting the performance.
2. Perform manual analyze on the table, after running heavy load operations. https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3.Check the dist key distribution to avoid data skewness and improve performance.