BQ google update columns - Exceeded rate limits - google-cloud-platform

The scenario is to update column descriptions in tables (about 1,500 columns across 50 tables). Due to multiple restrictions I have been asked to use the bq query command, through the Cloud CLI, to execute the ALTER TABLE SQL that updates the column descriptions. Query:
bq query --nouse_legacy_sql \
  'ALTER TABLE `<Table>` ALTER COLUMN <columnname> SET OPTIONS(DESCRIPTION="<Updated Description>");'
The issue is that if I bunch the bq queries together for 1,500 columns, that is 1,500 SQL statements.
This causes the standard Exceeded rate limits: too many table update operations for this table error.
Any suggestions on how to execute this better?

You are hitting the rate limit:
Maximum rate of table metadata update operations per table: 5 operations per 10 seconds
You will need to stagger the updates so that they happen in batches of no more than 5 operations per 10 seconds. You could also alter all the columns of a single table in one statement to reduce the number of calls required, as in the sketch below.
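For example, one combined statement per table might look roughly like this (hypothetical table and column names; this assumes ALTER TABLE accepts several comma-separated ALTER COLUMN actions, which is worth confirming against the current BigQuery DDL reference):

ALTER TABLE `myproject.mydataset.orders`
  ALTER COLUMN order_id SET OPTIONS (description = "Unique order identifier"),
  ALTER COLUMN created_at SET OPTIONS (description = "Order creation timestamp, UTC"),
  ALTER COLUMN status SET OPTIONS (description = "Current order status");

That turns roughly 1,500 statements into about 50 (one per table), which stays well under the 5-operations-per-10-seconds limit as long as you still pause briefly between tables.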

Related

Billions of rows table as input to Quicksight dataset

There are two Redshift tables named A & B and a QuickSight dashboard that runs A MINUS B as the query to display content for a visual. If we use the direct query option, it times out because the query does not complete within 2 minutes (QuickSight has a hard limit of 2 minutes per query). Is there a way to use such large datasets as input to a QuickSight dashboard visual?
We can't use the SPICE engine because it has a 1B-row / 1TB size limit. Also, it has a 15-minute delay to refresh data.
You will likely need to provide more information to fully resolve this. MINUS can be a very expensive operation, especially if you haven't optimized the tables for it. Can you provide information about your table setup and the EXPLAIN plan of the query you are running?
Barring improving the query, one way to work around a poorly performing query behind QuickSight is to move the query into a materialized view. That way the result of the query is stored for later retrieval, but it needs to be refreshed when the source data changes. It sounds like your data only changes every 15 min (did I get this right?); if so, this may be an option. A sketch follows below.
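A minimal sketch of that approach in Redshift SQL, with hypothetical table, column, and view names (MINUS is Redshift's synonym for EXCEPT; note that set operations typically rule out incremental refresh, so check the materialized view limitations against your actual query):

CREATE MATERIALIZED VIEW mv_a_minus_b AS
SELECT id, payload FROM table_a
MINUS
SELECT id, payload FROM table_b;

-- Re-run on whatever schedule matches how often the source data changes
REFRESH MATERIALIZED VIEW mv_a_minus_b;

The QuickSight visual would then point at mv_a_minus_b instead of running the MINUS query directly.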

Google Cloud clustered table not working?

On Google Cloud, I've created an ingestion-time partitioned table clustered on the columns Hour, Minute and Second. From my understanding of clustered tables, this means my rows are distributed in clusters organized by hour, each hour cluster contains minute clusters, and each minute cluster should contain second clusters.
So I would expect that when I query data from 13:10:00 to 13:10:30, the query should touch only the rows inside the cluster for hour 13, minute 10 and seconds 0 to 30. Am I wrong?
I'm asking this because clustering does not actually seem to be working on my project: I have a test table of 140 MB, but when I add a WHERE condition on my clustered columns, BigQuery still estimates that the query will scan the whole table, whereas I would expect that filtering on clustered columns would reduce the amount of data queried. Any help? Thank you.
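For reference, a sketch of the kind of setup being described, with hypothetical column names and types:

-- Ingestion-time partitioned table, clustered on Hour, Minute, Second
CREATE TABLE `myproject.mydataset.events`
(
  Hour INT64,
  Minute INT64,
  Second INT64,
  payload STRING
)
PARTITION BY _PARTITIONDATE
CLUSTER BY Hour, Minute, Second;

-- Query filtering on the clustering columns
SELECT *
FROM `myproject.mydataset.events`
WHERE Hour = 13 AND Minute = 10 AND Second BETWEEN 0 AND 30;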

Execute a set of queries as a batch in AWS Athena

I'm trying to execute AWS Athena queries as a batch using aws-java-sdk-athena. I'm able to establish the connection and run the queries individually, but I have no idea how to run 3 queries as a batch. Any help appreciated.
Query
1. select * from table1 limit 2
2. select * from table2 limit 2
3. select * from table3 limit 2
You can run multiple queries in parallel in Athena; they will be executed in the background. So if you start your queries using e.g.
StartQueryExecutionResult startQueryExecutionResult = client.startQueryExecution(startQueryExecutionRequest);
you will get an execution ID from getQueryExecutionId. This can then be used to check whether the running queries have finished: poll the execution status with getQueryExecution, or with batchGetQueryExecution for several IDs at once.
Limits
There are some limits in Athena. You can run up to 20 SELECT queries in parallel.
See documentation:
20 DDL queries at the same time. DDL queries include CREATE TABLE and CREATE TABLE ADD PARTITION queries.
20 DML queries at the same time. DML queries include SELECT and CREATE TABLE AS (CTAS) queries.

BigQuery: Running query against large dataset

I have approximately 100 TB of data that I need to backfill by running a query against it to transform fields, then writing the transformation to another table. The table is partitioned by ingestion timestamp. I have both actions as part of a single query, as you can see below. I am planning to run this query multiple times in smaller chunks, manually, by ingestion-timestamp ranges.
Is there a better way to handle this process than running the query in manual chunks? For example, maybe using Dataflow or another framework.
CREATE TABLE IF NOT EXISTS dataset.table
PARTITION BY DATE(timestamp) AS
with load as (SELECT *, _TABLE_SUFFIX as tableId
FROM `project.dataset.table_*`
WHERE _TABLE_SUFFIX BETWEEN '1' AND '1531835999999'
),................
...................
You need to carefully pace the queries you run, as the quota enforcement is quite strict.
Partitioned tables
Maximum number of partitions per partitioned table — 4,000
Maximum number of partitions modified by a single job — 2,000
Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. Google BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.
Maximum rate of partition operations — 50 partition operations every 10 seconds
Most of the time you hit the second limitation (no more than 2,000 partitions modified by a single job), and if you parallelise further you hit the last one (50 partition operations every 10 seconds).
On the other hand, the DML MERGE statement could help here.
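A rough sketch of how a MERGE could backfill one ingestion-timestamp chunk at a time into the target partitioned table (table names, column names and the transformation are hypothetical):

MERGE `project.dataset.target` T
USING (
  -- hypothetical field transformation, limited to one date chunk
  SELECT id, UPPER(field) AS field, timestamp
  FROM `project.dataset.table_source`
  WHERE DATE(timestamp) BETWEEN DATE '2018-07-01' AND DATE '2018-07-31'
) S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET field = S.field
WHEN NOT MATCHED THEN
  INSERT (id, field, timestamp) VALUES (S.id, S.field, S.timestamp);

Each such statement still counts against the partition quotas above, so keep each chunk's date range well under the 2,000-partitions-per-job limit.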
If you have a sales representative, reach out to the BigQuery team and ask whether they can increase some of your quotas; they tend to respond positively.
Also, I've seen people use multiple projects to run jobs beyond the per-project quotas.

Cassandra get more than 10k rows

I am getting stuck with the Cassandra all() query.
I am using the Django platform. My goal is to get all rows from a Cassandra table, but CQL has a limit of 10k rows at a time.
Previously I had fewer than 10k rows in the Cassandra table, but now the count has increased to about 12k.
How do I get the all() query to return all 12k rows?
CQL has a default limit of 10k rows. That means there's an implicit LIMIT of 10k when you perform any SELECT. If you want, you can override that by specifying a larger LIMIT value, e.g.:
SELECT * FROM mytable LIMIT 500000;