BigQuery BI Engine: how to choose a good reservation size? - google-cloud-platform

I am enabling BI Engine to speed up my queries and to save cost for my project in the EU region.
What would be a good choice for setting the reservation size? 1GB, 2GB, 4GB?
How do I make that decision?

Below is a SQL script that groups queries by the amount of GB processed per query, so the first row covers queries that processed 0 to 1 GB, the second row 1 to 2 GB, and so on.
For every bucket it then shows the amount of GB processed, the amount billed, the associated cost and the cost saved.
This should help you see where your costs lie, how many queries you have of a certain size, and whether you should increase or lower the size of your reservation.
Please note that BI Engine can only speed up certain SELECT queries, not MERGE, INSERT, CREATE etc. statements, and there are further exceptions. So for a fair comparison I'm excluding those types of queries, to get a better view of the potential savings.
See also: https://cloud.google.com/bigquery/docs/bi-engine-intro#bi-engine-use-cases
DECLARE QUERY_COST_PER_TB NUMERIC DEFAULT 5.00; -- current cost in dollars of processing 1 TB of data in BQ
with possible_bi_engine_jobs_incl_parent_jobs as (
select
creation_time,
bi_engine_statistics,
cache_hit,
total_bytes_processed / power(1024, 3) GB_processed,
floor(total_bytes_processed / power(1024, 3)) GB_processed_floor,
total_bytes_billed / power(1024, 3) GB_billed,
total_bytes_processed / power(1024, 4) * QUERY_COST_PER_TB expected_cost_in_euros,
total_bytes_billed / power(1024, 4) * QUERY_COST_PER_TB actual_cost_in_euros,
query,
job_id,
parent_job_id,
user_email,
job_type,
statement_type,
from `my_project_id.region-eu.INFORMATION_SCHEMA.JOBS`
where 1=1
and creation_time >= '2022-12-08'
and creation_time < '2022-12-09'
and cache_hit = false -- bi engine will not be improving on queries that are already cache hits
and total_bytes_processed is not null -- if there's no bytes processed, then ignore the job
and statement_type = 'SELECT' -- statement types such as MERGE, CREATE, UPDATE cannot be run by bi engine, only SELECT statements
and job_type = 'QUERY' -- LOAD jobs etc. cannot be run by bi engine, only QUERY jobs
and upper(query) like '%FROM%' -- query should contain FROM, otherwise it will not be run by bi engine
and upper(query) not like '%INFORMATION_SCHEMA.%' -- metadata queries can not be run by bi engine
),
-- to prevent double counting of total_bytes_processed and total_bytes_billed
parent_job_ids_to_ignore as (
select distinct parent_job_id
from possible_bi_engine_jobs_incl_parent_jobs
where parent_job_id is not null
),
possible_bi_engine_jobs_excl_parent_jobs as (
select *
from possible_bi_engine_jobs_incl_parent_jobs
where job_id not in (select parent_job_id from parent_job_ids_to_ignore) -- to prevent double counting of total_bytes_processed and total_bytes_billed
)
select
GB_processed_floor, -- bucket: queries that processed at least this many GB but less than one GB more
count(1) query_count,
sum(case when bi_engine_statistics.bi_engine_mode in ('FULL', 'PARTIAL') then 1 else 0 end) bi_engine_enabled,
sum(case when bi_engine_statistics.bi_engine_mode in ('DISABLED') or bi_engine_statistics.bi_engine_mode IS NULL then 1 else 0 end) bi_engine_disabled,
round(sum(GB_processed), 1) GB_processed,
round(sum(GB_billed), 1) GB_billed,
round(sum(expected_cost_in_euros), 2) expected_cost_in_euros,
round(sum(actual_cost_in_euros), 2) actual_cost_in_euros,
round(sum(expected_cost_in_euros) - sum(actual_cost_in_euros), 2) saved_cost
from possible_bi_engine_jobs_excl_parent_jobs
group by GB_processed_floor
order by GB_processed_floor
;
This results in a cost savings table grouped by the size of the queries.
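Once you have picked a size based on that table, the reservation itself can be created or resized with BigQuery DDL. A minimal sketch, assuming the EU multi-region and the default capacity name (the project id and the size_gb value are placeholders to adjust to your own findings):
-- Create or resize the BI Engine reservation for the EU region.
-- size_gb is the total reservation size; setting it to 0 removes the reservation.
ALTER BI_CAPACITY `my_project_id.region-eu.default`
SET OPTIONS (size_gb = 2);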
Other useful links on BI Engine savings:
https://medium.com/google-cloud/ensure-the-right-bigquery-bi-engine-capacity-with-cloud-workflows-orchestration-lowering-your-9e2634c84a82
https://lakshmanok.medium.com/speeding-up-small-queries-in-bigquery-with-bi-engine-4ac8420a2ef0
BigQuery - BI Engine measuring savings

Related

In PowerBI what options are there to hide/redact/mask values in visuals to anonymise data

What options are there in Power BI to suppress, redact or hide values to anonymise data in reports and visuals without losing detail, and to have that restriction apply to multiple pages in a report? For example, given this table:
Cat          Count   %
Category 1   23      10
Category 2   2       0.9%
Category 3   4       1.7%
So that it's possible to keep the rows but end up with a placeholder where Count is < 4, and where % is greater than 1% but less than 2%:
Cat          Count   %
Category 1   23      10
Category 2   *       0.9%
Category 3   4       *
So far my experience has been that a measure with a filter applied will hide rows, but you can't apply a measure filter to an entire page or to all report pages.
I've seen mention of conditional formatting to hide the value by making the font and background the same colour, but that seems error-prone and labour-intensive.
I also want it to be clear when a value has been suppressed or masked.
I suspect there is a better way, but I haven't been able to figure out where to even start.
OK, I have something working but you will need Tabular Editor to create a calculation group. Here are the steps.
I'm using the following table (named "Table") as the source data.
Add two measures (calculation groups only work on measures) as follows.
% Measure = SUM('Table'[%])
Count Measure = SUM('Table'[Count ])
Open Tabular Editor and create a new calculation group named "CG" with a calculation item named "Mask". Paste the following code into the calculation item.
IF (
    ( SELECTEDMEASURENAME() = "% Measure" && SELECTEDMEASURE() > 1 && SELECTEDMEASURE() < 2 )
        ||
    ( SELECTEDMEASURENAME() = "Count Measure" && SELECTEDMEASURE() < 4 ),
    "*",
    SELECTEDMEASURE()
)
Save the calculation group and, back in Power BI, drag the calculation group's Name column onto the "Filters on all pages" pane, ensuring the "Mask" item is selected.
The masking will now happen automatically across all report pages; the same table placed on two different pages is masked consistently.
It depends on your data connection type as to whether this is available, but a calculated column (instead of a measure) can be used as a filter at the "this page" or "all pages" level.
If this option is available, then you can find it right next to the "New Measure" field.
Using this and your sample data above, I created a couple of calculated columns and show the resulting table. You can then display these columns and use them as filters throughout the report. Your DAX may be slightly different depending on how the actual data is formatted and such.
Count Calculated Column
Masked Count =
IF(
'Table'[Count] < 4,
"*",
CONVERT('Table'[Count], STRING)
)
% Calculated Column
Masked % =
IF(
'Table'[%] > .01 && 'Table'[%] < .02,
"*",
CONVERT('Table'[%] * 100, STRING) & "%"
)
Resulting Table
Example of how the filter can be used
The values of these columns will update as your data source is refreshed in Power BI. However, calculated columns aren't available for Live Connection, in which case you would have to do this kind of logic at a lower level (in Analysis Services for example).
Additionally, you could potentially use Power Query Editor to accomplish this kind of thing.

Calculating the percentage of grand total by category with DAX for one category only

I need help calculating the percentage of each category in a column (based on the grand total) in DAX but for one specific category.
This is how the data is structured. Each row is an individual transaction with an ID column and item column.
I need to get the % of transactions that are for bagels only. This is the SQL code I currently use:
select 100 - CAST(count([Items])
         - count(case [Items] when 'Bagel' then 1 else null end) AS FLOAT)
         / count([Items]) * 100 as Percent_Bagel
from Items_Table
where Items != 'Coffee'
  and Items != 'Muffin'
I need this to be converted to a DAX formula to use in a Power BI measure if possible, but I am new to DAX and don't know where to begin.
Thank you.
The "right" implementation for you always depends on the context. You can achieve your goal through different approaches.
I have used the following:
Measure =
DIVIDE(
-- Numerator: Filter all Bagel's transaction and count them
CALCULATE(
COUNT('Table'[Transaction ID]),
FILTER('Table', 'Table'[Item] = "Bagel")
),
-- Denominator: remove any filter on Item so that all transactions are counted
CALCULATE(
COUNT('Table'[Transaction ID]),
ALL('Table'[Item])
),
-- If something goes wrong with the DIVIDE, go for 0
0
)
You can also add a visual-level filter on the measure (show items when the value is not blank) to rule out the other categories.
Without measure filter
With measure filter (Other categories are gone)
Hope this is what you are looking for!

BigQuery - Join take few hours in 3TB table

I have 2 tables in BQ.
Table ID: blog
Table size: 3.07 TB
Table ID: llog
Table size: 259.82 GB
I'm running the query below and it took a few hours (it never finished; I killed it, so I was not able to capture the query plan).
select bl.row.iddi as bl_iddi, count(lg.row.id) as count_location_ping
from `poc_dataset.blog` bl
left outer join `poc_dataset.llog` lg
  on bl.row.iddi = lg.row.iddi
where bl.row.iddi = "124623832432"
group by bl.row.iddi
I'm not sure how to optimize this. The blog table has trillions of rows.
Unless some extra details are missing from your question, the query below should give you the expected result. Because you filter on a single iddi and only count rows from llog, the 3 TB blog table does not need to be scanned or joined at all; the original LEFT JOIN produces a row for every pair of matching blog and llog rows for that iddi, which is what makes it so slow.
#standardSQL
SELECT row.iddi AS iddi, COUNT(row.id) AS count_location_ping
FROM `poc_dataset.llog`
WHERE row.iddi= '124623832432'
GROUP BY row.iddi

How to calculate gap between 2 timestamps (edited for AWS Athena )

I have many IoT devices that send data to my Amazon Athena setup. I created a table to store the data; it contains two columns: LocalTime indicates the time the IoT device captured its status, and ServerTime indicates the time the data arrived at the server (sometimes the IoT device doesn't have a network connection).
I would like to count the "gaps" in blocks of hours (let's say 1 hour) in order to know the deviation of the arriving data. In other words, for each record I want to calculate how many hours passed between ServerTime and LocalTime, and then count how many records fall in each hour bucket. So the first entry (ServerTime 1.1.2019 12:15, LocalTime 1.1.2019 10:25) falls in the 1-2 hours bucket.
Thanks
If MS SQL Server is your database, you can try the script below to get your desired output:
SELECT
CAST(DATEDIFF(HH,localTime,serverTime)-1 AS VARCHAR) +'-'+
CAST(DATEDIFF(HH,localTime,serverTime) AS VARCHAR) [Hours],
COUNT(*) [Count]
FROM your_table
GROUP BY CAST(DATEDIFF(HH,localTime,serverTime)-1 AS VARCHAR) +'-'+
CAST(DATEDIFF(HH,localTime,serverTime) AS VARCHAR)
Oracle
If you are using an Oracle database, you can use this statement:
select CONCAT(CONCAT(diff_hours, '-'), diff_hours + 1) as Hours, count(diff_hours) as "Count"
from (
  -- date subtraction gives days, so multiply by 24 for hours; trunc() puts each row into a whole-hour bucket
  select trunc(24 * (to_date(ServerTime, 'YYYY-MM-DD hh24:mi') - to_date(LocalTime, 'YYYY-MM-DD hh24:mi'))) diff_hours
  from T_TIMETABLE
)
group by diff_hours
order by diff_hours;
Note: This will not display the empty intervals.
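The question itself is about AWS Athena; a minimal sketch of the same bucketing in Athena (Presto SQL), assuming the two columns are stored as timestamps and using an illustrative table name iot_log:
-- Bucket each record by the whole hours of delay between capture and arrival,
-- then count how many records fall into each bucket.
-- LOCALTIME is a reserved word in Presto, so the column names are quoted.
SELECT
    CAST(diff_hours AS varchar) || '-' || CAST(diff_hours + 1 AS varchar) AS hours_bucket,
    COUNT(*) AS cnt
FROM (
    SELECT date_diff('hour', "localtime", "servertime") AS diff_hours
    FROM iot_log
) t
GROUP BY diff_hours
ORDER BY diff_hours;
As with the other variants, empty hour buckets will not appear in the output.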

Using TABLESAMPLE with PERCENT returns all the records from table

I have a small test table with two fields - id and name, 19 records total. When I try to get 10 percent of the records from this table using the following query, I get ALL the records. I tried this on a large table too, but the result is the same - all records are returned. The query:
select * from test tablesample (10 percent) s;
If I use ROWS instead of PERCENT (i.e. select * from test tablesample (10 rows) s;), it works fine and only 10 records are returned. How can I get just the necessary percentage of records?
You can refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
You must be using CombinedHiveOutputFormat, which does not work well with the ORC format, so you will not be able to save the output of the PERCENT query to a table.
To my knowledge the best way to do this is using the rand() function. You should not combine it with an ORDER BY clause, as that will hurt performance. Here is my sample query, which is time-efficient:
SELECT * FROM table_name
WHERE rand() <= 0.0001
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 5000;
I tested this on a 900M-row table and the query executed in 2 minutes.
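The sampling manual linked above also documents bucket sampling on rand(), which returns roughly 1/y of the rows even on a non-bucketed table (at the cost of a full scan); a sketch against the question's test table:
-- Hash each row on rand() into 10 buckets and keep one of them,
-- giving an approximate 10% row sample.
SELECT * FROM test TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand()) s;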
Hope this helps.
You can use PERCENT with TABLESAMPLE. For example:
SELECT * FROM TABLE_NAME
TABLESAMPLE(1 PERCENT) T;
This will select 1% of the data size of the input and not necessarily 1% of the number of rows. More details can be found in the Hive sampling documentation linked above.
But if you are really looking for a method to select a percentage of the number of rows, then you may have to use the LIMIT clause with the number of records you need to retrieve.
For example, if your table has 1000 records, then you can select a random 10% of the records as:
select * from table_name order by rand() limit 100;
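If you don't want to hard-code the limit, one sketch that derives it from the row count in a single pass (assuming a Hive version with window functions; id and name are the question's columns):
-- Number the rows in random order, count the total in the same pass,
-- and keep roughly the first 10% of them.
SELECT id, name
FROM (
    SELECT id, name,
           ROW_NUMBER() OVER (ORDER BY r) AS rn,
           COUNT(*)     OVER ()           AS total
    FROM (SELECT id, name, rand() AS r FROM test) x
) t
WHERE rn <= CEIL(total * 0.10);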