Google Cloud clustered table not working? - google-cloud-platform

On Google Cloud, I've created an ingestion-time partitioned table clustered on the columns Hour, Minute and Second. From my understanding of clustered tables, this means that my rows are distributed in clusters organized by hour, each hour cluster contains minute clusters, and each minute cluster contains second clusters.
So I would expect that when I query data from 13:10:00 to 13:10:30, the query should touch only the rows inside the cluster for hour 13, minute 10 and seconds 0 to 30. Am I wrong?
I'm asking because clustering doesn't actually seem to be working on my project: I have a test table of 140 MB, but when I add a WHERE condition on my clustered columns, BigQuery still says the query will scan the whole table, whereas I would expect that filtering on the clustered columns would reduce the amount of data queried. Any help? Thank you.
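For reference, a minimal sketch of the setup being described (dataset, table and column definitions here are assumptions for illustration, not the actual schema):

-- Assumed DDL: ingestion-time partitioned table clustered on Hour, Minute, Second
CREATE TABLE IF NOT EXISTS mydataset.events
(
  Hour INT64,
  Minute INT64,
  Second INT64,
  payload STRING
)
PARTITION BY _PARTITIONDATE
CLUSTER BY Hour, Minute, Second;

-- Query filtering on the clustering columns (13:10:00 to 13:10:30)
SELECT payload
FROM mydataset.events
WHERE Hour = 13
  AND Minute = 10
  AND Second BETWEEN 0 AND 30;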

Related

Redshift spectrum struggles with huge joins

I am trying to find out whether I misconfigured something or whether I am hitting the limits of a single-node Redshift cluster.
I am using:
single node ra3 instance,
spectrum layer for files in s3,
the files I am using are partitioned in S3, stored in Parquet format and compressed with Snappy,
the data I am trying to join into is loaded into my Redshift cluster (the 16m rows I mention later are in my cluster),
the data in the external tables has the numRows property set, according to the documentation.
I am trying to perform a spatial join of 16m rows against 10m rows using ST_Contains() and it just never finishes. I know the query is correct because it is able to join the 16m rows with 2m rows in 6 seconds.
(The same query in Athena on the same data completes in 2 minutes.)
The case with 10m rows has been running for 60 minutes now and it seems like it will just never finish. Any thoughts?
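For context, a rough sketch of the shape of the join being described; the schema, table and column names are placeholders, and constructing the point geometry from lon/lat columns is also an assumption:

-- Hypothetical spatial join between a local table and a Spectrum external table
SELECT r.region_id, COUNT(*) AS points_in_region
FROM local_schema.regions r                       -- rows loaded into the cluster
JOIN spectrum_schema.points p                     -- external table over Parquet files in S3
  ON ST_Contains(r.geom, ST_Point(p.lon, p.lat))  -- point-in-polygon test
GROUP BY r.region_id;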

BQ google update columns - Exceeded rate limits

The scenario is to update column descriptions in tables (about 1,500 columns across 50 tables). Due to multiple restrictions I have been asked to use the bq query command to execute the ALTER TABLE SQL that updates the column descriptions, through the Cloud CLI. The query:
bq query --nouse_legacy_sql \
  'ALTER TABLE `<Table>` ALTER COLUMN <columnname> SET OPTIONS(DESCRIPTION="<Updated Description>")'
The issue is that if I bunch the bq queries together for the 1,500 columns, that is 1,500 SQL statements.
This causes the standard Exceeded rate limits: too many table update operations for this table error.
Any suggestions on how to execute this better?
You are hitting the rate limit:
Maximum rate of table metadata update operations per table: 5 operations per 10 seconds
You will need to stagger the updates so that they stay within batches of 5 operations per 10 seconds per table. You could also try to alter all of a table's columns in a single statement to reduce the number of calls required.
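If multiple comma-separated ALTER COLUMN clauses are accepted (which I have not verified; check the current ALTER TABLE DDL reference before relying on it), the second suggestion could look roughly like the sketch below, with placeholder table and column names:

-- Hypothetical: combine several description updates into one statement per table;
-- assumes comma-separated ALTER COLUMN clauses are supported, otherwise fall back
-- to one statement per column, staggered within the rate limit
ALTER TABLE `<Table>`
  ALTER COLUMN <column1> SET OPTIONS(description = "<Updated Description 1>"),
  ALTER COLUMN <column2> SET OPTIONS(description = "<Updated Description 2>");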

Redshift slow select query

All,
I have a fact table in Redshift with around 90 million rows; most columns are integers, and the table has an AUTO sort key and EVEN dist key on a 2-node cluster. Running a simple select statement takes forever and then aborts. Any help?
select * from facts_invoice
Basically I am trying to feed this data to Power BI, and the slowness seems to be coming from Redshift itself. In Snowflake, I have run select * over 200 billion rows before and it never took more than 10-15 minutes.

Big Query table partitions count limit

I am currently working with BigQuery and understand that there is a limit of 4,000 partitions per table.
Does anyone know whether this limit applies to the Active Storage tier only, or to both the Active and Long Term Storage tiers?
The reason for asking is that I have a table partitioned by hour that has been in use for more than 6 months already, yet we don't get any error about exceeding the 4,000-partition limit when we insert new data.
I did a count of the number of partitions: the total is 6,401, and we are still able to insert new data.
At the same time, we also created a new partitioned table and tried moving data into it, but we encountered an error saying we had exceeded the limit of 4,000 partitions.
In addition, I also tried to insert the data incrementally, but I still get the same error.
Steps to reproduce error:
Create a partitioned table (partition by hour)
Start moving data by month from another table
My finding:
The partition limit mentioned above is only applicable to the Active Storage tier.
Can anyone help confirm this?
As I understand the limitation, you can't modify more than 4,000 partitions in one job. The jobs you describe first are presumably working because they modify only a few partitions.
When you try to move more than 4000 partitions in one go, you will hit the limitation as you described.
I noticed I was hitting this limitation on both Active Storage and Long Term Storage. This is a BigQuery-wide limitation.
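As an illustration, a chunked copy in which each job touches far fewer than 4,000 partitions might look like the sketch below; it assumes both tables are partitioned by hour on a timestamp column, and the table name, column name and dates are all hypothetical:

-- Hypothetical chunked move: each INSERT job touches at most 24 hourly partitions
INSERT INTO mydataset.new_table
SELECT *
FROM mydataset.old_table
WHERE event_ts >= TIMESTAMP('2022-01-01')
  AND event_ts < TIMESTAMP('2022-01-02');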

BigQuery: Running query against large dataset

I have approximately 100 TB of data that I need to backfill by running a query against it to transform fields and then writing the transformed rows to another table. The table is partitioned by ingestion timestamp. I have both actions as part of a single query, as you can see below. I am planning to run this query multiple times in smaller chunks, manually, by ingestion timestamp ranges.
Is there a better way to handle this process than running the query in manual chunks? For example, maybe using Dataflow or another framework.
CREATE TABLE IF NOT EXISTS dataset.table
PARTITION BY DATE(timestamp) AS
WITH load AS (
  SELECT *, _TABLE_SUFFIX AS tableId
  FROM `project.dataset.table_*`
  WHERE _TABLE_SUFFIX BETWEEN '1' AND '1531835999999'
), ................
...................
You need to carefully pace the queries you run, as the quota enforcement is very strict.
Partitioned tables
Maximum number of partitions per partitioned table — 4,000
Maximum number of partitions modified by a single job — 2,000
Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. Google BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.
Maximum rate of partition operations — 50 partition operations every 10 seconds
Most of the time you hit the second limitation (no more than 2,000 partitions modified by a single job), and if you parallelise further you hit the last one (50 partition operations every 10 seconds).
On the other hand, the DML MERGE syntax could help here.
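For instance, a MERGE-based version of the backfill might look roughly like the sketch below; the key column, the transformation and the suffix bounds are placeholders, not taken from the question:

-- Hypothetical MERGE backfill: upserts one transformed chunk into the partitioned target
MERGE dataset.table AS target
USING (
  SELECT id, UPPER(field) AS field, timestamp      -- placeholder transformation
  FROM `project.dataset.table_*`
  WHERE _TABLE_SUFFIX BETWEEN '<start>' AND '<end>'
) AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET field = source.field
WHEN NOT MATCHED THEN
  INSERT (id, field, timestamp) VALUES (source.id, source.field, source.timestamp);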
If you have a sales representative, reach out to the BQ team; if they can increase some of your quotas, they will respond positively.
Also, I've seen people use multiple projects to run jobs beyond the quotas.