Execute a set of queries as a batch in AWS Athena - amazon-web-services

I'm trying to execute AWS Athena queries as a batch using aws-java-sdk-athena. I'm able to establish the connection and run the queries individually, but I have no idea how to run 3 queries as a batch. Any help appreciated.
Queries:
1. select * from table1 limit 2
2. select * from table2 limit 2
3. select * from table3 limit 2

You can run multiple queries in parallel in Athena; they are executed in the background. So if you start each query using e.g.
StartQueryExecutionResult startQueryExecutionResult = client.startQueryExecution(startQueryExecutionRequest);
you will get back a query execution ID. This can then be used to poll the status of the running queries and check whether they have finished. You can get the execution status of a single query using getQueryExecution, or of several at once using batchGetQueryExecution.
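A minimal sketch of that pattern with the v1 aws-java-sdk-athena, starting the three example queries and polling them together with batchGetQueryExecution (the database name and S3 output location are placeholders you would replace):

import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AthenaBatchRunner {
    public static void main(String[] args) throws InterruptedException {
        AmazonAthena client = AmazonAthenaClientBuilder.defaultClient();
        List<String> queries = Arrays.asList(
                "select * from table1 limit 2",
                "select * from table2 limit 2",
                "select * from table3 limit 2");

        // Start all queries; each call returns immediately with an execution ID.
        List<String> executionIds = new ArrayList<>();
        for (String sql : queries) {
            StartQueryExecutionRequest request = new StartQueryExecutionRequest()
                    .withQueryString(sql)
                    .withQueryExecutionContext(new QueryExecutionContext().withDatabase("my_database"))
                    .withResultConfiguration(new ResultConfiguration().withOutputLocation("s3://my-athena-results/"));
            executionIds.add(client.startQueryExecution(request).getQueryExecutionId());
        }

        // Poll all executions together until none of them is still queued or running.
        boolean anyRunning = true;
        while (anyRunning) {
            Thread.sleep(1000);
            anyRunning = false;
            BatchGetQueryExecutionResult result = client.batchGetQueryExecution(
                    new BatchGetQueryExecutionRequest().withQueryExecutionIds(executionIds));
            for (QueryExecution execution : result.getQueryExecutions()) {
                String state = execution.getStatus().getState();
                System.out.println(execution.getQueryExecutionId() + ": " + state);
                if ("QUEUED".equals(state) || "RUNNING".equals(state)) {
                    anyRunning = true;
                }
            }
        }
    }
}

Once a query reaches SUCCEEDED you can fetch its output with getQueryResults; FAILED and CANCELLED states should be handled as well.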
Limits
There are some limits in Athena: by default you can run up to 20 SELECT queries in parallel. From the documentation:
20 DDL queries at the same time. DDL queries include CREATE TABLE and CREATE TABLE ADD PARTITION queries.
20 DML queries at the same time. DML queries include SELECT and CREATE TABLE AS (CTAS) queries.

Related

Redshift Spectrum struggles with huge joins

I am trying to find out whether I misconfigured something or whether I am hitting the limits of a single-node Redshift cluster.
I am using:
a single-node ra3 instance,
the Spectrum layer for files in S3,
files partitioned in S3 in Parquet format and compressed with Snappy,
the data I am joining against is loaded into my Redshift cluster (the 16m rows mentioned below live in my cluster),
the external tables have the numRows property set, as the documentation recommends
I am trying to perform a spatial join of 16m rows against 10m rows using ST_Contains() and it just never finishes. I know the query is correct because it can join the 16m rows with 2m rows in 6 seconds.
(The same query in Athena on the same data completes in 2 minutes.)
The 10m-row case has been running for 60 minutes now and it seems like it will never finish. Any thoughts?

BQ google update columns - Exceeded rate limits

The scenario is updating column descriptions in tables (about 1500 columns across 50 tables). Due to multiple restrictions I have been asked to use the bq query command, through the Cloud CLI, to execute the ALTER TABLE SQL that updates the column descriptions. Query:
bq query --nouse_legacy_sql \
  'ALTER TABLE `<Table>` ALTER COLUMN <columnname> SET OPTIONS(DESCRIPTION="<Updated Description>")'
The issue is that if I bunch the bq queries together for 1500 columns, that is 1500 SQL statements.
This causes the standard "Exceeded rate limits: too many table update operations for this table" error.
Any suggestions on how to execute this better?
You are hitting the rate limit:
Maximum rate of table metadata update operations per table: 5 operations per 10 seconds
You will need to stagger the updates so that they happen in batches of at most 5 operations per table per 10 seconds. You could also try to alter all the columns of a single table in a single statement to reduce the number of calls required. A rough sketch of the staggering idea follows.
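As an illustration of that pacing, here is a minimal sketch using the google-cloud-bigquery Java client (the statement list, dataset, table and column names are hypothetical placeholders; the same batch-and-sleep idea can be applied to a shell loop around bq query if the CLI is a hard requirement):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import java.util.Arrays;
import java.util.List;

public class StaggeredColumnUpdates {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical list of ALTER TABLE statements, one per column description to update.
        List<String> statements = Arrays.asList(
                "ALTER TABLE `my_dataset.my_table` ALTER COLUMN col1 SET OPTIONS(description=\"Updated description\")",
                "ALTER TABLE `my_dataset.my_table` ALTER COLUMN col2 SET OPTIONS(description=\"Another description\")");

        int executed = 0;
        for (String sql : statements) {
            bigquery.query(QueryJobConfiguration.of(sql)); // run the DDL statement and wait for it
            executed++;
            if (executed % 5 == 0) {
                Thread.sleep(10_000); // pause so no table sees more than 5 metadata updates per 10 seconds
            }
        }
    }
}

Sleeping after every 5 statements is conservative when consecutive statements hit the same table, which is exactly the situation the rate-limit error describes.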

How to stop AWS Glue from creating 1 individual table instead of multiple tables

I have the following folder structure in S3
Data
table1/output/table1.csv
table2/output/table2.csv
table3/output/table3.csv
My goal is to have a Glue crawler create 3 respective tables. Instead, what is created is 1 table called data with partitions table1, table2, table3 and output. I have tried various combinations on the configuration page but still no luck. Any recommendations?

Is there any way I can delete more than 20k mutations in Google Cloud Spanner?

I have millions of records in a Spanner table and I would like to delete rows using some query condition, e.g. delete from spanner table where id > 2000. I'm not able to run this query in the Spanner UI because of Spanner's 20k mutation limit for a single operation. Is there any way I could delete these records from the Spanner table by doing some tweaks at the API level, or is there a workaround for this type of use case?
You can use the gcloud command line as follows:
gcloud spanner databases execute-sql <database_id> --instance=<instance_id> --enable-partitioned-dml --sql="delete from YourTable where id > 2000"
NOTE: the SQL statement must be fully partitionable and idempotent.
According to the official documentation, Deleting rows in a table, I think you should consider the Partitioned DML execution model:
If you want to delete a large amount of data, you should use Partitioned DML, because Partitioned DML handles transaction limits and is optimized to handle large-scale deletions.
Partitioned DML enables large-scale, database-wide operations with minimal impact on concurrent transaction processing by partitioning the key space and running the statement over partitions in separate, smaller-scoped transactions.
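Since the question also asks about an API-level approach, here is a minimal sketch using the Google Cloud Spanner Java client's partitioned DML support (project, instance, database and table names are placeholders):

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;

public class PartitionedDelete {
    public static void main(String[] args) {
        Spanner spanner = SpannerOptions.newBuilder().build().getService();
        try {
            DatabaseClient client = spanner.getDatabaseClient(
                    DatabaseId.of("my-project", "my-instance", "my-database"));
            // Partitioned DML runs the statement over partitions in many smaller transactions,
            // so it is not subject to the 20k-mutation limit of a single transaction.
            long deleted = client.executePartitionedUpdate(
                    Statement.of("DELETE FROM MyTable WHERE id > 2000"));
            System.out.println("Rows deleted (lower bound): " + deleted);
        } finally {
            spanner.close();
        }
    }
}

As with the gcloud flag, the statement must be fully partitionable and idempotent, because partitions may be retried.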

How to find average time to load data from S3 into Redshift

I have more than 8 schemas and 200+ tables, and data is loaded from CSV files into the different schemas.
I want to know the SQL script for finding the average time to load data from S3 into Redshift for all 200+ tables.
You can examine the STL System Tables for Logging to discover how long queries took to run.
You'd probably need to parse the Query text to discover which tables were loaded, but you could use the historical load times to calculate a typical load time for each table.
Some particularly useful tables are:
STL_QUERY_METRICS: Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).
STL_QUERY: Returns execution information about a database query.
STL_LOAD_COMMITS: This table records the progress of each data file as it is loaded into a database table.
Run this query to find out how fast your COPY queries are working.
select q.starttime, s.query, substring(q.querytxt, 1, 120) as querytxt,
       s.n_files, s.size_mb, s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query,
             count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_mb,
             (max(end_time) - min(start_time)) / 1000000 as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET'
        and query > 0
        and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;
Once you find out how many MB/s you're pushing through from S3, you can roughly estimate how long each file will take based on its size. For example, at 50 MB/s a 6 GB (6144 MB) file takes roughly 6144 / 50 ≈ 123 seconds.
There's a smart way to do it if you already have an ETL script that migrates data from S3 to Redshift.
Assuming it is a shell script, just capture a timestamp before the ETL logic for that table starts (let's call that start), capture another timestamp after the ETL logic for that table ends (let's call that end), and take the difference towards the end of the script:
#!/bin/sh
.
.
.
start=$(date +%s)  # capture start time
# ETL logic:
#   find the right csv on S3
#   check for duplicates, whether the file has already been loaded etc.
#   run your ETL logic, logging to make sure the file has been processed on S3
#   copy that table to Redshift, log again to make sure the table has been copied
#   error logging, trigger emails, SMS, Slack alerts etc.
#   ...
end=$(date +%s)  # capture end time
duration=$((end-start))  # difference (time taken by the ETL block to execute)
echo "duration is $duration"
PS: The duration will be in seconds, and you can maintain a log file, an entry in a DB table, etc. The timestamps will be in epoch format, and you can use functions (depending on where you're logging) like:
SEC_TO_TIME(duration) -- for MySQL
SELECT (TIMESTAMP 'epoch' + 1511680982 * INTERVAL '1 second') AS mytimestamp -- for Amazon Redshift (and then take the difference of the two epoch values).