Working with large offsets in BigQuery - google-cloud-platform

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases, until it hits a ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.

OFFSET in systems like BigQuery works by reading and discarding all results up to the offset.
You'll need to use a column as a lower bound so the engine can start directly from that part of the key range; you can't have the engine efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
If you run this with OFFSET 100000, it takes 4s and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
If, instead of the offset, I use those date and rate values, the query takes only 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
So what does this mean? Rather than letting the user specify result-number ranges (e.g. rows starting at 100000), have them specify it in a more natural form (e.g. rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to let the user specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: query everything once and remember which row number falls at the start of each hour of the year (8760 values), or even each minute (525600 values). You can then use this to pick a better starting point: look up the closest day/minute for a given row range (e.g. in Cloud Datastore), then convert the user's query into the more efficient version above.
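A minimal sketch of that one-time precomputation in Standard SQL (the hourly granularity and the ordering columns are assumptions taken from the example above; on a very large table the single ROW_NUMBER ordering may itself hit resource limits, in which case the partitioned approach in the next answer applies):
-- One-time precomputation (sketch): record the first overall row number of each
-- hour so a requested row offset can be mapped to a key-range starting point.
SELECT
  TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour_start,
  MIN(rn) AS first_row_number
FROM (
  SELECT
    pickup_datetime,
    ROW_NUMBER() OVER (ORDER BY rate_code, pickup_datetime, dropoff_datetime) AS rn
  FROM `nyc-tlc.green.trips_2014`
)
GROUP BY hour_start
ORDER BY hour_start
The result (at most 8760 rows) is small enough to cache anywhere, e.g. in Cloud Datastore, and look up before issuing the real query.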

As Dan already mentioned, you need to introduce a row number. The problem is that ROW_NUMBER() OVER () exceeds resources on a table this size, which basically means you have to split up the work of counting rows:
decide on a few partitions that are as evenly distributed as possible
count the rows of each partition
take a cumulative sum of the partition sizes, so you know later where each partition's row numbering starts
split up the work of counting rows per partition
save a new table with the row-count column for later use
As the partition key I used EXTRACT(month FROM pickup_datetime), since it distributes nicely:
WITH
temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged, 0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cumulative
  FROM (
    -- lag partition sizes to next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) lagged
    FROM (
      -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) month,
        COUNT(1) qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1)) )
SELECT
  -- cumulative sum = last row of former partition, add to new row count
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) row,
  *
FROM
  `nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
  temp
ON
  (month = EXTRACT(month FROM pickup_datetime))
Once you've saved it as a new table, you can use the new row column to query without losing performance:
SELECT *
FROM `project.dataset.your_new_table`
WHERE row BETWEEN 10000001 AND 10000100
Quite a hassle, but does the trick.

Why not export the resulting table into GCS?
It will automatically split the table into multiple files if you use wildcards, and this export only has to be done once, instead of querying (and paying for all the processing) every single time.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
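One way to do that one-time export is BigQuery's EXPORT DATA statement (an extract job via bq extract or the API works as well); the bucket path and table name below are placeholders:
-- Sketch: one-time export of the materialized result to GCS. The wildcard in
-- the URI lets BigQuery shard the output into multiple files.
EXPORT DATA OPTIONS(
  uri = 'gs://your-bucket/trips/rows-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `project.dataset.your_new_table`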

Related

What does this EXPLAIN query plan output mean for my Redshift query?

I ran this:
EXPLAIN select id, birth_date, ROW_NUMBER() OVER (ORDER BY 1) AS load_id from user_profile;
and I see this:
WindowAgg (cost=0.00..133833424.40 rows=30901176 width=36)
-> Seq Scan on user_profile (cost=0.00..133369906.76 rows=30901176 width=28)
What does this query plan mean?
The query plan is the execution plan that the PostgreSQL planner (Amazon Redshift is based on PostgreSQL) has generated for your SQL statement.
The first node is a window aggregation (WindowAgg) over the data as you're using the OVER window function to calculate a row number.
The second node is a sequential scan (Seq Scan) on the user_profile table, as you're doing a full select of the table without any filtering.
A sequential scan reads the entire table as stored on disk, since your query requires a full traversal of it. Even if there were a multi-column index on id & birth_date, the query engine would almost always choose a sequential scan here because you need every row (depending on the random_page_cost & enable_seqscan parameters in PostgreSQL).
The cost number is in arbitrary units, but conventionally approximates the number of disk page fetches; it's split into two values separated by the ".." delimiter.
The first value shows the startup cost - this is the estimated cost to return the first row. The second value shows the total cost - this is the estimated cost to return all rows.
For example, for the Seq Scan, the startup cost is 0 and the total cost is estimated to be 133369906.76.
For sequential scans, the startup cost is usually 0: there's nothing to do other than start returning data right away. The total cost of a node also includes the cost of all its child nodes; in this case, the final total cost of 133833424.40 covers both the scan and the aggregation on top of it (the aggregation itself adds about 133833424.40 - 133369906.76 = 463517.64).
The rows value shows the estimated number of rows that will be returned. In this case, both operations have the same value, as the aggregation applies to all rows and no filtering is carried out that would reduce the number of final rows.
The width value shows the estimated size in bytes of each returned row, i.e. each row will most likely be 28 bytes long before the aggregation and 36 bytes after it.
Putting that all together, you could read the query plan as follows:
Sequential Scan on table user_profile
  - will most likely start returning rows immediately
  - estimated disk page fetch count of 133369906.76
  - estimated 30,901,176 rows to be returned
  - estimated row size of 28 bytes
Window Aggregation on data from the above operation
  - will most likely start returning rows immediately
  - estimated disk page fetch count of 133833424.40
  - estimated 30,901,176 rows to be returned
  - estimated row size of 36 bytes

Best way to select random rows in redshift without order by

I have to select a set of rows (say 200 unique rows) from 200 million rows at once, without ORDER BY, and it must be efficient.
As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want the 200 selected rows to be random; otherwise the sort order of the base tables and the order of the replies from the Redshift slices will meaningfully skew your sample.
You can get around this by first sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200 rows. This still sorts rows, but over a significantly smaller set, which speeds things up considerably.
select a, b
from (
  select a, b, random() as ranno
  from test_table) t
where ranno < .005
order by ranno
limit 200;
You start with 200M rows, select 0.5% of them in the WHERE clause (roughly 1M rows), and then order only that much smaller set before picking the final 200. This should speed things up while maintaining the randomness of the selection.
Sampling your data down to a reasonable percentage (10%, 5%, 1%, etc.) should bring the volume to a manageable size. Then you can order by the sample value and choose the number of rows you need.
select * from (
  select *, random() as sample
  from "table") t
where sample < .01
order by sample
limit 200
The following is an expansion on the question which I found useful; others might find it helpful as well. In my case, I had a huge table which I could split by a key field value into smaller subsets, but even after splitting, the volume per individual subset stayed very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on the subset I created with a WITH statement, but it turned out that this was not the case. I afterwards compared the distribution of the sample across all the meaningful keys between the full subset (20 million rows) and the sample (30K) and got almost exactly the same distribution, which worked great. Sample code below:
with subset as (
  select * from "table" where Key_field = 'XYZ')
select * from (
  select *, random() as sample
  from subset) s
where s.sample < .01
order by s.sample
limit 200

How to create a measure that goes first through equipment and then sums it all

I have a database with values such as "Date", "StoppedTime", "PlannedProductionQtt" and "PlannedProductionTime". These values are broken down by equipment, as in the small example below.
What I need to do is divide PlannedProductionQtt by PlannedProductionTime and then multiply by StoppedTime. After this, I want to make a graph that shows it day by day.
At first I thought it was easy: I made a new measure PlannedProductionQtt/PlannedProductionTime = SUM(PlannedProductionQtt)/SUM(PlannedProductionTime) (assume it works without the table name).
And then I did another measure Impact = SUM(StoppedTime)*PlannedProductionQtt/PlannedProductionTime.
When I plotted a clustered column chart with this measure as the values and the day on the axis, at first I thought I had nailed it, but no: Power BI summed all of PlannedProductionQtt for the day, divided by the sum of all PlannedProductionTime for the day, and multiplied by the sum of the StoppedTime of that day.
Unfortunately, this gives me wrong results. What I need is a measure (or several measures) that does the calculation equipment by equipment and then sums it by day.
I don't want to create new tables or columns for these calculations, because I actually have 32 items of equipment, 3+ years of data, more than one classification of StoppedTime, and the PlannedProduction tables use more than one row per day per equipment.
To make it clear, I added an Impact column to show the difference.
So, if I sum the Impact column per day, I get 110725, 61273 and 220833 for days 1, 2 and 3.
However, if I first sum all the PlannedProductionQtt for day 1, divide it by the sum of PlannedProductionTime for day 1, and multiply it by the sum of StoppedTime for day 1 (which is how Power BI is calculating it), I get 146497.
I inserted the difference in the table below to make the discrepancy clear:
As Jon suggested in a comment, here is what solved my needs:
measure_name = SUMX( source_table , DIVIDE ( source_table[PlannedProductionQtt] , source_table[PlannedProductionTime] , 0 ) ) * SUM( source_table[StoppedTime] )
You have two different data types that you want to divide there, time and int, so you would probably need to unify them first. The easiest way to do that is from the Transform data panel: select the column and change its format.
The division itself is done fairly easily; try creating a new measure as follows:
measure_name = CALCULATE(
    DIVIDE(
        <source_table>[PlannedProductionQtt],
        <source_table>[PlannedProductionTime],
        0)
    * <source_table>[StoppedTime]
)
Then it's only a matter of using it as the values in a chart with the 'Date' column on the x-axis.

Redshift: Aggregate data on large number of dimensions is slow

I have an Amazon Redshift table with about 400M records and 100 columns: 80 dimensions and 20 metrics.
The table is distributed by one of the high-cardinality dimension columns and includes a couple of high-cardinality columns in the sort key.
A simple aggregate query:
Select dim1, dim2...dim60, sum(met1),...sum(met15)
From my table
Group by dim1...dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each), it is faster to group by the dimension keys only, and if you really need all the dimension attributes, join the aggregated result back to the dimension tables to get them, like this:
with
groups as (
  select dim1_id, dim2_id, ..., dim20_id, sum(met1), sum(met2)
  from my_table
  group by 1, 2, ..., 20
)
select *
from groups
join dim1_table using (dim1_id)
join dim2_table using (dim2_id)
...
join dim20_table using (dim20_id)
If you don't want to normalize your table and you like having all the information in a single row, it's fine to keep it as is, since in a columnar database unused columns don't slow queries down. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical, you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by 12 attributes; all you need to do is group by the city ID and then join the city attributes, country region and country tables onto the city ID of each group.
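A rough sketch of that pattern (all table and column names here are assumptions, since the real schema isn't shown):
-- Sketch: group only by the lowest level of the hierarchy (city), then join
-- the higher-level attributes back onto the much smaller aggregated result.
-- Assumes city_table carries region_id and region_table carries country_id.
with city_groups as (
  select city_id, sum(met1) as met1, sum(met2) as met2
  from my_table
  group by 1
)
select g.met1, g.met2,
       c.city_name, r.region_name, co.country_name
from city_groups g
join city_table c using (city_id)
join region_table r using (region_id)
join country_table co using (country_id)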
3) You can store the combination of dimension IDs, concatenated with some delimiter like "-", in a separate varchar column and use that as the sort key.
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.
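If you go the precompute route, a minimal sketch (table and column names assumed from the question) is to materialize the aggregate once, e.g. nightly, and query the much smaller result table instead:
-- Sketch: materialize the heavy 60-column aggregate once, then serve queries
-- from the (much smaller) result table rather than re-aggregating 400M rows.
create table my_table_agg as
select dim1, dim2, /* ..., */ dim60,
       sum(met1) as met1, /* ..., */ sum(met15) as met15
from my_table
group by dim1, dim2, /* ..., */ dim60;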

What is the output of this aggregator

Quick question, here is my data
Data_field: 100|address|place|456|687
This column comes from an Expression transformation, is passed to an Aggregator, and is marked as group by in the Aggregator.
What will be the output of this Aggregator? Also, can you tell me briefly about the Aggregator transformation?
Thanks,
Teja
First, say for example your data consists of 3 records:
Data                        amount
100|address|place|456|687   10
100|address|place|456|687   20
100|address|place|456|687   30
In Informatica, if you group on Data and take SUM(amount), the output will be:
100|address|place|456|687 60
Now say, for example, there is no amount column, as below:
100|address|place|456|687
100|address|place|456|687
100|address|place|456|687
In Informatica, if you group on Data, then your output is as below:
100|address|place|456|687 (only one record)
One important note about the Aggregator in this scenario: even if you have not checked the group-by option, Informatica by default returns the last record.
The Aggregator in Informatica is similar to using aggregate functions like MAX, MIN, COUNT, etc. on a group in SQL.
Example: say you want to know the max salary in each department.
SQL:
select dept, max(salary) from employee group by dept;
Informatica:
You can enable the group-by option on dept and then create a port with MAX(salary). This will give output similar to the SQL above.
Things to take care of in the Aggregator for better performance:
1) Use a Sorter transformation before the Aggregator.
2) Use numeric columns in the group by whenever possible (try to avoid date and string columns).
3) If the source has a huge number of records, it's better to group the records in the SQL override itself, because the Aggregator builds a cache (see the sketch after this list).
4) Add a filter if required to avoid unnecessary aggregation.
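For tip 3), a rough sketch of what the Source Qualifier SQL override could contain, reusing the employee/dept/salary names from the earlier example:
-- Pre-aggregate in the database so the Aggregator's cache stays small
-- (or the Aggregator becomes unnecessary altogether).
select dept, max(salary) as max_salary
from employee
group by dept;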
Hope this helps
Regards
Raj
The Aggregator transformation can be used for multiple aggregation operations such as AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM and VARIANCE. The GroupBy option can be checked to calculate the aggregates of a column according to your grouping condition.
For example, consider a source with HEIGHT, WEIGHT and POSITION columns.
The aggregation functions are defined so that the average of HEIGHT and the maximum of WEIGHT are calculated by grouping on the POSITION column, and the target is obtained accordingly.
As the POSITION column is grouped, the average of HEIGHT and the maximum of WEIGHT are populated for each value available in the POSITION column.
The Aggregator transformation is essentially the same as SQL aggregate functions combined with the SQL GROUP BY clause.
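For comparison, a rough SQL equivalent of the HEIGHT/WEIGHT example above (the source table name is an assumption):
-- SQL analogue of the described Aggregator: group by POSITION,
-- averaging HEIGHT and taking the maximum WEIGHT.
select position,
       avg(height) as avg_height,
       max(weight) as max_weight
from source_table
group by position;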