CREATE TABLE using WITH clause in redshift not working [duplicate] - amazon-web-services

I need to create an empty time table series for a report so I can left join activity from several tables to it. Every hour of the day does not necessarily have data, but I want it to show null or zero for inactivity instead of omitting that hour of the day.
In later versions of Postgres (post 8.0.2), this is easy in several ways:
SELECT unnest(array[0,1,2,3,4...]) as numbers
OR
CROSS JOIN (select generate_series as hours
from generate_series(now()::timestamp,
now()::timestamp + interval '1 day',
'1 hour'::interval
)) date_series
Redshift can run some of these commands on their own, but it throws an error when you try to run them in conjunction with any of your tables.
WHAT I NEED:
A reliable way to generate a series of numbers (e.g. 0-23) as a subquery that will run on Redshift (which is based on PostgreSQL 8.0.2).

As long as you have a table that has more rows than your required series has numbers, this is what has worked for me in the past:
select
(row_number() over (order by 1)) - 1 as hour
from
large_table
limit 24
;
Which returns numbers 0-23.

Unfortunately, Amazon Redshift does not allow use of generate_series() for table functions. The workaround seems to be creating a table of numbers.
See also:
Using sql function generate_series() in redshift
Generate Series in Redshift and MySQL, which does not seem correct but does introduce some interesting ideas

Recursive CTE support was released for Redshift in April 2021. Now that recursion is possible in Redshift, you can generate a series of numbers (or even a table) with the code below:
with recursive numbers(NUMBER) as
(
select 1 UNION ALL
select NUMBER + 1 from numbers where NUMBER < 28
)
select NUMBER from numbers;
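Building on that, here is a hedged sketch (not from the original answer) that turns the series into the hourly timestamps the question asks for; the CTE name and the 24-hour window are assumptions:
-- A sketch, assuming Redshift's recursive CTE support (April 2021+):
-- 24 hour-start timestamps for the current day, ready to LEFT JOIN activity onto.
with recursive hours(n) as (
    select 0
    union all
    select n + 1 from hours where n < 23
)
select dateadd(hour, n, date_trunc('day', getdate())) as hour_start
from hours
order by hour_start;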

I'm not a big fan of querying a system table just to get a list of row numbers. If it's something constant and small enough like hours of a day, I would go with plain old UNION ALL:
WITH
hours_in_day AS (
SELECT 0 AS hour
UNION ALL SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
...
UNION ALL SELECT 23
)
And then join hours_in_day to whatever you want.
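To close the loop on the original question, a minimal sketch of that left join; the activity table and its event_id and event_time columns are made up for illustration:
-- Hypothetical activity table; COUNT over the NULLs produced by the LEFT JOIN
-- yields 0 for hours with no activity instead of dropping them.
WITH hours_in_day AS (
    SELECT 0 AS hour
    UNION ALL SELECT 1
    UNION ALL SELECT 2
    -- ... continue through 23
    UNION ALL SELECT 23
)
SELECT
    h.hour,
    COUNT(a.event_id) AS events
FROM hours_in_day h
LEFT JOIN my_activity a
    ON EXTRACT(hour FROM a.event_time) = h.hour
GROUP BY h.hour
ORDER BY h.hour;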

Related

Why is Decimal to Whole Number preventing Power Query from showing the native query?

I have a large dataset with 50M+ rows. The database is on SQL Server 2019.
In Power Query, all but the last step shows the native query. The last step converts the value (for some reason Power Query picks up the number as Decimal) to Whole Number. When I right-click on this step, the native query option is disabled.
Why is the Decimal to Whole Number conversion preventing Power Query from showing the native query? What is the way to get a native query in this situation?
My intention is to configure incremental load on this table.
Assuming your data source is SQL (?) you could
SELECT CAST(myObscureNumber AS int)
Changing data types frequently breaks query folding, and this is mentioned in the docs. Here is a good read: https://blog.crossjoin.co.uk/2021/06/29/data-type-conversions-for-sql-server-sources-and-query-folding-in-power-query/
Connectors are regularly updated to include more native DB features and enable better folding.
EDIT - Don't do this - see David's comment for why this is not efficient
If PQ folds the conversion to, say, select cast(dt as date) d, ... from t and the user then filters on the projected column d, it will result in a SQL query like select ... where cast(dt as date) > '2022-01-01', which can't use indexes or partitions, and will have to convert the dt column for each row to compare it with the filter value. – David Browne - Microsoft
According to this, you could try decimal to text and then extract the text before the decimal point to avoid breaking the query fold.
https://en.brunner.bi/post/changing-data-types-that-do-not-break-query-folding-in-power-query-power-bi-1

Performance impact of table join based on column data type

I want to ensure that I don't negatively impact query performance based on schema design for BigQuery. I have two tables which I need to perform a join. The column that I will use to join the tables could be of type INTEGER or STRING. STRING would be easier in my case as it wouldn't require any new validation within our code base to ensure all values are of type INTEGER. But I don't want to join on a type STRING, if query performance will be significantly worse than running the join on an INTEGER type column.
Is there a large performance difference in BigQuery when the join is on type STRING vs type INTEGER?
---Update 10/16---
I ran some basic analysis to test this, here are the results:
Using public dataset, users table has 10M rows and posts table has 31M rows
Join on Integer: 2.78 sec elapsed, 318.1 MB processed (avg over 10 runs)
Join on String: 6.77 sec elapsed, 137 MB processed (avg over 10 runs)
-- Join on Integer Query
SELECT count(*)
FROM `bigquery-public-data.stackoverflow.users` u
JOIN `bigquery-public-data.stackoverflow.stackoverflow_posts` p
on u.id = p.owner_user_id
WHERE RAND() < 2
(Where clause added to avoid cache)
-- Join on String
SELECT count(*)
FROM `bigquery-public-data.stackoverflow.users` u
JOIN `bigquery-public-data.stackoverflow.stackoverflow_posts` p
on u.display_name = p.owner_display_name
WHERE RAND() < 2
(Where clause added to avoid cache)
Surprisingly, JOIN on STRING appears to perform worse than INTEGER.
No, you won't see any significant difference. Go with the schema that is more natural for your use case.
I am super late to the party here. Came across this question while trying to optimise a join in BigQuery.
The docs suggest that using INT64 will indeed be quicker:
https://cloud.google.com/bigquery/docs/best-practices-performance-compute#use_int64_data_types_in_joins_to_reduce_cost_and_improve_comparison_performance
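If the key currently arrives as STRING but holds numeric ids, one hedged option (table names invented, not part of the original answers) is to materialize an INT64 key once and join on that; SAFE_CAST returns NULL for values that are not valid integers, so malformed ids simply fail to join:
-- A sketch with made-up table names: add an INT64 key column once...
CREATE TABLE `project.dataset.posts_keyed` AS
SELECT
  SAFE_CAST(owner_user_id AS INT64) AS owner_user_id_int,  -- NULL if not numeric
  *
FROM `project.dataset.posts`;

-- ...then join on the INT64 column.
SELECT COUNT(*)
FROM `project.dataset.users` u
JOIN `project.dataset.posts_keyed` p
  ON p.owner_user_id_int = u.id;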

Working with large offsets in BigQuery

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases, until it hits a ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.
OFFSET in systems like BigQuery works by reading and discarding all rows up to the offset.
You'll need to use a column as a lower bound so the engine can start directly from that part of the key range; you can't have the engine efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
If you did this via OFFSET 100000, it takes 4s and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
If, instead of the offset, I had used those date and rate values, the query takes only 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
limit 100
So what does this mean? Rather than allowing the user to specify ranges by result number (e.g., new rows starting at 100000), have them specify the range in a more natural form (e.g., rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to allow the user to specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: query everything once and remember which row number falls at the start of each hour for every day (8760 values), or even every minute (525600 values). You could then use this to make a better guess at an efficient starting point. Do a look-up for the closest day/minute for a given row range (e.g., in Cloud Datastore), then convert that user's query into the more efficient version above.
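For illustration, a hedged sketch of that look-up table, assuming a row-numbered copy of the trips table (like the one built in the next answer) already exists:
-- Hypothetical: the first row number at each hour, computed once and stored
-- (e.g. in Cloud Datastore or a small BigQuery table) for fast offset guesses.
SELECT
  TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour_start,
  MIN(row) AS first_row
FROM `project.dataset.your_new_table`
GROUP BY 1;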
As already mentioned by Dan, you need to introduce a row number. But ROW_NUMBER() OVER () across the whole table exceeds resources. This basically means you have to split up the work of counting rows:
decide on a few partitions, distributed as evenly as possible
count the rows of each partition
take a cumulative sum of the partition sizes so you know later where to start counting rows in each partition
split up the work of counting rows accordingly
save a new table with the row count column for later use
As the partition key I used EXTRACT(month FROM pickup_datetime), as it distributes nicely:
WITH temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged, 0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cumulative
  FROM (
    -- lag partition sizes to next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) lagged
    FROM (
      -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) month,
        COUNT(1) qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1
    )
  )
)
SELECT
  -- cumulative sum = last row of former partition, add to new row count
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) row,
  *
FROM
  `nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
  temp
ON
  (month = EXTRACT(month FROM pickup_datetime))
Once you have saved it as a new table, you can use the new row column to query without losing performance:
SELECT
*
FROM
`project.dataset.your_new_table`
WHERE
row BETWEEN 10000001
AND 10000100
Quite a hassle, but does the trick.
Why not export the resulting table into GCS?
It will automatically split tables into files if you use wildcards, and this export only has to be done one time, instead of querying every single time and paying for all the processing.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
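As a hedged illustration of that export (bucket, dataset and table names are invented), BigQuery's EXPORT DATA statement with a wildcard URI shards the result into multiple files:
-- A sketch with made-up names; the * in the URI lets BigQuery split the
-- export into as many files as it needs.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/trips/part-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `project.dataset.your_new_table`;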

Creating pairwise combination of ids of a very large table in bigquery

I have a very large table of ids (string) that has 424,970 rows and only a single column.
I am trying to create the combination of those ids in a new table. The motivation for creation of that table can be found in this question.
I tried the following query to create the pairwise combination table:
#standardSQL
SELECT
t1.id AS id_1,
t2.id AS id_2
FROM
`project.dataset.id_vectors` t1
INNER JOIN
`project.dataset.id_vectors` t2
ON
t1.id < t2.id
But the query fails after 15 minutes, with the following error message:
Query exceeded resource limits. 602467.2409093559 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)
Is there any workaround to run the query and get the desired output table with all combination of ids?
You can try splitting your table T into 2 smaller tables T1 and T2, then performing 4 joins across the smaller tables (T1:T1, T1:T2, T2:T1, T2:T2), and then unioning the results. This will be equivalent to joining T with itself. If it still fails, try breaking it down into even smaller tables.
Alternatively set maximumBillingTier to a higher value https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs.
configuration.query.maximumBillingTier - Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default.
If using Java, it can be set in JobQueryConfiguration. This configuration property is not supported in the UI console at the moment.
In order to split a table, you can use the FARM_FINGERPRINT function in BigQuery. E.g. the 1st part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) < 5
And the 2nd part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) >= 5
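Putting the pieces together, a hedged sketch of one of the four smaller jobs (the T1:T2 combination); run the other three fingerprint combinations the same way and append the results into a single destination table:
#standardSQL
-- T1:T2 piece: left side from the first half of the fingerprint space,
-- right side from the second half. Repeat for T1:T1, T2:T1 and T2:T2.
SELECT
  a.id AS id_1,
  b.id AS id_2
FROM `project.dataset.id_vectors` a
JOIN `project.dataset.id_vectors` b
  ON a.id < b.id
WHERE MOD(ABS(FARM_FINGERPRINT(a.id)), 10) < 5
  AND MOD(ABS(FARM_FINGERPRINT(b.id)), 10) >= 5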