Creating pairwise combinations of ids of a very large table in BigQuery

I have a very large table of ids (string) that has 424,970 rows and only a single column.
I am trying to create all pairwise combinations of those ids in a new table. The motivation for creating that table can be found in this question.
I tried the following query to create the pairwise combination table:
#standardSQL
SELECT
t1.id AS id_1,
t2.id AS id_2
FROM
`project.dataset.id_vectors` t1
INNER JOIN
`project.dataset.id_vectors` t2
ON
t1.id < t2.id
But the query fails after 15 minutes, with the following error message:
Query exceeded resource limits. 602467.2409093559 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)
Is there any workaround to run the query and get the desired output table with all combination of ids?

You can try splitting your table T into 2 smaller tables T1 and T2, then performing 4 joins for the smaller tables, T1:T1, T1:T2, T2:T1, T2:T2, and unioning the results. Together these are equivalent to joining T with itself, but each join is a smaller, cheaper query (see the sketch after the FARM_FINGERPRINT filters below). If it still fails, try breaking the table down into even smaller pieces.
Alternatively, set maximumBillingTier to a higher value (see https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs):
configuration.query.maximumBillingTier - Limits the billing tier for
this job. Queries that have resource usage beyond this tier will fail
(without incurring a charge). If unspecified, this will be set to your
project default.
If using Java, it can be set in JobQueryConfiguration. This configuration property is not supported in the UI console at the moment.
In order to split a table you can use FARM_FINGERPRINT function in BigQuery. E.g. the 1st part will have a filter:
where mod(abs(farm_fingerprint(id)), 10) < 5
And the 2nd part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) >= 5
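Putting the two together, here is a minimal sketch of one of the four jobs (table and column names are taken from the question). Run each of the four as a separate query, appending the results to the same destination table, so that no single job hits the CPU limit:
#standardSQL
-- Job 1 of 4: T1 x T1. The other three jobs swap the two
-- fingerprint predicates between < 5 and >= 5.
SELECT
  a.id AS id_1,
  b.id AS id_2
FROM `project.dataset.id_vectors` a
JOIN `project.dataset.id_vectors` b
  ON a.id < b.id
WHERE MOD(ABS(FARM_FINGERPRINT(a.id)), 10) < 5
  AND MOD(ABS(FARM_FINGERPRINT(b.id)), 10) < 5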

Performance impact of table join base on column data type

I want to ensure that I don't negatively impact query performance based on schema design for BigQuery. I have two tables which I need to perform a join. The column that I will use to join the tables could be of type INTEGER or STRING. STRING would be easier in my case as it wouldn't require any new validation within our code base to ensure all values are of type INTEGER. But I don't want to join on a type STRING, if query performance will be significantly worse than running the join on an INTEGER type column.
Is there a large performance difference in BigQuery when the join is on type STRING vs type INTEGER?
---Update 10/16---
I ran some basic analysis to test this, here are the results:
Using a public dataset: the users table has 10M rows and the posts table has 31M rows.
Join on Integer: 2.78 sec elapsed, 318.1 MB processed (avg over 10 runs)
Join on String: 6.77 sec elapsed, 137 MB processed (avg over 10 runs)
-- Join on Integer Query
SELECT count(*)
FROM `bigquery-public-data.stackoverflow.users` u
JOIN `bigquery-public-data.stackoverflow.stackoverflow_posts` p
on u.id = p.owner_user_id
WHERE RAND() < 2
(Where clause added to avoid cache)
-- Join on String
SELECT count(*)
FROM `bigquery-public-data.stackoverflow.users` u
JOIN `bigquery-public-data.stackoverflow.stackoverflow_posts` p
on u.display_name = p.owner_display_name
WHERE RAND() < 2
(Where clause added to avoid cache)
Surprisingly, the JOIN on STRING performs clearly worse than the JOIN on INTEGER, despite processing less data.
No, you won't see any significant difference. Go with the schema that is more natural for your use case.
I am super late to the party here. Came across this question while trying to optimise a join in BigQuery.
The docs suggest that using INT64 will indeed be quicker:
https://cloud.google.com/bigquery/docs/best-practices-performance-compute#use_int64_data_types_in_joins_to_reduce_cost_and_improve_comparison_performance

Redshift: Aggregate data on large number of dimensions is slow

I have an Amazon Redshift table with about 400M records and 100 columns (80 dimensions and 20 metrics).
The table is distributed by one of the high-cardinality dimension columns and includes a couple of high-cardinality columns in the sort key.
A simple aggregate query:
SELECT dim1, dim2, ..., dim60, SUM(met1), ..., SUM(met15)
FROM my_table
GROUP BY dim1, ..., dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each), it is faster to group by the dimension keys only; if you really need all the dimension attributes, join the aggregated result back to the dimension tables to get them, like this:
with
groups as (
select dim1_id,dim2_id,...,dim20_id,sum(met1),sum(met2)
from my_table
group by 1,2,...,20
)
select *
from groups
join dim1_table
using (dim1_id)
join dim2_table
using (dim2_id)
...
join dim20_table
using (dim20_id)
If you don't want to normalize your table and you like having all the pieces of information in a single row, that's fine to keep as is: in a columnar database, columns you don't use won't slow queries down. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical, you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there is no need to group by 12 attributes: group by the city ID only, then join the city, country region and country tables to the city ID of each group, as sketched below.
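A minimal sketch of that pattern, assuming hypothetical dimension tables cities(city_id, region_id, ...), regions(region_id, country_id, ...) and countries(country_id, ...):
with city_groups as (
    -- aggregate once, at the lowest level of the hierarchy
    select city_id, sum(met1) as met1, sum(met2) as met2
    from my_table
    group by city_id
)
select co.*, r.*, ci.*, g.met1, g.met2
from city_groups g
join cities ci using (city_id)
join regions r using (region_id)
join countries co using (country_id);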
3) You can store the combination of dimension IDs, joined with some delimiter like "-", in a separate varchar column and use that as a sort key.
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
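As a sketch, this is where those choices appear in the table DDL (hypothetical column names and encodings):
create table my_table (
    dim1 varchar(64) encode zstd,  -- encode = per-column compression
    dim2 varchar(64) encode zstd,
    met1 bigint      encode az64
)
distkey (dim1)            -- the most-joined/grouped column
compound sortkey (dim2);  -- the most-filtered column(s)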
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.
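For the worst case, a sketch of such a nightly rollup (my_table_summary is a hypothetical name; extend the column lists to the real dimensions and metrics):
-- materialize the aggregate once, then query the much smaller summary table
create table my_table_summary as
select dim1, dim2, sum(met1) as met1
from my_table
group by dim1, dim2;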

REDSHIFT: How can I generate a series of numbers without creating a table called "numbers" in redshift (Postgres 8.0.2)?

I need to create an empty time table series for a report so I can left join activity from several tables to it. Every hour of the day does not necessarily have data, but I want it to show null or zero for inactivity instead of omitting that hour of the day.
In later versions of Postgres (post 8.0.2), this is easy in several ways:
SELECT unnest(array[0,1,2,3,4...]) as numbers
OR
CROSS JOIN (select generate_series as hours
from generate_series(now()::timestamp,
now()::timestamp + interval '1 day',
'1 hour'::interval
)) date_series
Redshift can run some of these commands, but throws an error when you attempt to run them in conjunction with any of the tables.
WHAT I NEED:
A reliable way to generate a series of numbers (e.g. 0-23) as a subquery that will run on redshift (uses postgres 8.0.2).
As long as you have a table that has more rows than your required series has numbers, this is what has worked for me in the past:
select
(row_number() over (order by 1)) - 1 as hour
from
large_table
limit 24
;
Which returns numbers 0-23.
Unfortunately, Amazon Redshift does not allow use of generate_series() for table functions. The workaround seems to be creating a table of numbers.
See also:
Using sql function generate_series() in redshift
Generate Series in Redshift and MySQL, which does not seem correct but does introduce some interesting ideas
Recursive CTEs were released for Redshift in April 2021. Now that recursion is possible in Redshift, you can generate a series of numbers (or even a table) with the code below:
with recursive numbers(NUMBER) as
(
    select 1
    union all
    select NUMBER + 1 from numbers where NUMBER < 28
)
select NUMBER from numbers;
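For the hours-of-day case in the question, the same pattern adapts directly (a sketch, starting at 0 and capping at 23):
with recursive hours(hour) as
(
    select 0
    union all
    select hour + 1 from hours where hour < 23
)
select hour from hours;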
I'm not a big fan of querying a system table just to get a list of row numbers. If it's something constant and small enough like hours of a day, I would go with plain old UNION ALL:
WITH
hours_in_day AS (
SELECT 0 AS hour
UNION ALL SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
...
UNION ALL SELECT 23
)
And then join hours_in_day to whatever you want.
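For example, the left join the question asks for might look like this (activity and created_at are hypothetical names):
select h.hour, count(a.id) as events
from hours_in_day h
left join activity a
    on date_part(hour, a.created_at) = h.hour
group by h.hour
order by h.hour;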

Azure SQL DW CTAS of over 102,400 rows to one distribution doesn't automatically compress

I thought the way columnstores worked was that if you bulk load over 102,400 rows into one distribution of a columnstore, it would automatically compress it. I'm not observing that in Azure SQL DW.
I'm doing the following CTAS statement:
create table ColumnstoreDemoCTAS
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 102401 cast(1 as int) as Column1, f.*
from FactInternetSales f
cross join sys.objects o1
cross join sys.objects o2
Now I check the status of the columnstore row groups:
select t.name
,NI.distribution_id
,CSRowGroups.state_description
,CSRowGroups.total_rows
,CSRowGroups.deleted_rows
FROM sys.tables AS t
JOIN sys.indexes AS i
ON t.object_id = i.object_id
JOIN sys.pdw_index_mappings AS IndexMap
ON i.object_id = IndexMap.object_id
AND i.index_id = IndexMap.index_id
JOIN sys.pdw_nodes_indexes AS NI
ON IndexMap.physical_name = NI.name
AND IndexMap.index_id = NI.index_id
LEFT JOIN sys.pdw_nodes_column_store_row_groups AS CSRowGroups
ON CSRowGroups.object_id = NI.object_id
AND CSRowGroups.pdw_node_id = NI.pdw_node_id
AND CSRowGroups.distribution_id = NI.distribution_id
AND CSRowGroups.index_id = NI.index_id
WHERE t.name = 'ColumnstoreDemoCTAS'
ORDER BY 1,2,3,4 desc;
I end up with one OPEN rowgroup with 102401 rows. Did I misunderstand this behavior of columnstores? Is Azure SQL DW different?
I see the same behavior if I do a bulk insert from SSIS of the same number of rows, all as one buffer.
I tried Drew's suggestion of inserting over 6.5 million rows and I still end up with all OPEN row groups:
create table ColumnstoreDemoWide
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 7000000 ROW_NUMBER() OVER (ORDER BY f.ProductKey) as Column1, f.*
from FactInternetSales f
cross join sys.objects o
cross join sys.objects o2
cross join sys.objects o3
Placing your data in a clustered columnstore will not decrease the number of rows returned. Instead, it compresses the stored data so that it takes up less space on disk. This means less data is moved for queries and you are charged less for storage, but your results stay the same. That said, your data is currently residing in a deltastore, so you will not see any compression. Due to SQL DW's architecture, we separate the data into a number of groups under the covers. This allows us to more easily parallelize computations and scale, but it also means that each group has its own columnstore/deltastore, so you will need to load more rows to get the compression benefits.
In addition to the distribution structure, there is a difference in thresholds between SQL Server and SQL Data Warehouse. For DW the threshold was 1,048,576 until a defect was resolved, as #JRJ describes. Now Azure SQL DW's threshold is 102,400, like the rest of the SQL family. Once the number of rows in a distribution exceeds this, you should see your rows compressed.
You can find a bit more information on loading into a columnstore here: https://msdn.microsoft.com/en-US/library/dn935008.aspx
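If you need the rows you have already loaded compressed without waiting for more data, rebuilding the clustered columnstore index should close and compress the OPEN (deltastore) rowgroups (a sketch; substitute your own table name):
-- force OPEN rowgroups into compressed columnstore segments
alter index all on ColumnstoreDemoCTAS rebuild;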
This was a defect in the service. The fix is currently being rolled out. If you try this out on Japan West for example you will see that the behaviour is as you would expect.