In Redshift query plan Network is taking 3 hours

In Redshift, one of my queries is taking 3 hours to execute. While analyzing its query plan, it seems the network step is taking all the time. How can I troubleshoot and resolve this problem?
Below is my query execution plan:
QUERY PLAN: Filter: ((to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) <= '2022-01-31'::date) AND (to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) >= '2020-01-01'::date) AND ((mstrclientid)::text = 'GSKUS'::text) AND (quantityrequested >= 0))
QUERY PLAN: -> XN Seq Scan on brsit_sample_transparency (cost=0.00..0.30 rows=1 width=4980)
QUERY PLAN: Filter: ((to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) <= '2022-01-31'::date) AND (to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) >= '2020-01-01'::date) AND ((mstrclientid)::text = 'GSKUS'::text) AND (quantityrequested >= 0))
QUERY PLAN: -> XN Seq Scan on verri_sample_transparency (cost=0.00..0.30 rows=1 width=4980)
QUERY PLAN: Filter: ((to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) <= '2022-01-31'::date) AND (to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) >= '2020-01-01'::date) AND (quantityrequested >= 0) AND ((mstrclientid)::text = 'GSKUS'::text))
QUERY PLAN: -> XN Seq Scan on gskus_sample_transparency (cost=0.00..33348.33 rows=5558 width=993)
QUERY PLAN: -> XN Multi Scan (cost=0.00..33404.53 rows=5560 width=4980)
QUERY PLAN: -> XN Subquery Scan bi_sample_transparency_view (cost=0.00..33460.13 rows=5560 width=1488)
QUERY PLAN: Sort Key: productndc10
QUERY PLAN: -> XN Sort (cost=1000000033805.99..1000000033819.89 rows=5560 width=1488)
QUERY PLAN: Send to leader
QUERY PLAN: -> XN Network (cost=1000000033805.99..1000000033819.89 rows=5560 width=1488)
QUERY PLAN: Merge Key: productndc10
QUERY PLAN: XN Merge (cost=1000000033805.99..1000000033819.89 rows=5560 width=1488)

As you say, this is the problematic step in the plan (the network transfer before the SORT, which isn't really a plan step but an activity that has to be performed). With only 5,560 rows reported, it doesn't seem like this should be a ton of data, but your column count is high and I don't know the sizes of those columns. It could be that a lot of data is moving even for this limited number of rows. Or it could be that the reported row count is not indicative of the number of rows actually moved during the network activity; this can happen, but it would need to be a huge difference. You can look at STL_DIST for this query to see exactly how much data (in bytes) is being moved.
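For example, something along these lines summarizes the network traffic per plan segment for the query in question (a minimal sketch; 123456 is a placeholder for your query id, and the column list should be checked against the STL_DIST documentation for your cluster version):
-- Bytes and rows moved over the network for one query, by plan segment and step
SELECT segment,
       step,
       SUM(rows)                  AS rows_moved,
       SUM(bytes)                 AS bytes_moved,
       SUM(bytes) / (1024 * 1024) AS mb_moved
FROM stl_dist
WHERE query = 123456   -- query id of the slow run
GROUP BY segment, step
ORDER BY bytes_moved DESC;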
Another possibility is that your query was a victim, not the culprit. Redshift is a cluster, and clusters are connected by networks; these networks are shared infrastructure for all queries running on the cluster. If a really bad query was running during this window and browned out the internode network (a bandwidth hog), then your query was caught up in that traffic jam. Does your query normally run fine and only went slow this time? What was the cluster activity like at the time? Were other queries impacted? I've debugged plenty of "slow" queries that turned out to be victims. That said, in a clustered database like Redshift it is always good to avoid moving excessive amounts of data across the network, precisely because of that shared infrastructure.
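If you suspect the victim scenario, a quick check is to list everything else that was running during the same window, for example (a sketch against STL_QUERY; the timestamps are placeholders for when your query ran):
-- Other queries whose execution overlapped the slow window
SELECT query,
       userid,
       starttime,
       endtime,
       DATEDIFF(second, starttime, endtime) AS duration_s,
       TRIM(SUBSTRING(querytxt, 1, 80))     AS querytxt
FROM stl_query
WHERE endtime   > '2022-02-01 10:00:00'   -- start of the slow window (placeholder)
  AND starttime < '2022-02-01 13:00:00'   -- end of the slow window (placeholder)
ORDER BY duration_s DESC
LIMIT 20;
Long-running, high-volume queries in that list are the usual suspects for internode network contention.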
If you want to debug this query further (i.e., it is the culprit), then the query text, the STL_DIST information, and the full explain plan could shed some more light on the situation.

Related

AWS Logs Insights - percentage of failed DNS queries?

I am currently learning about AWS Logs Insights, and I was wondering if the following is possible.
Let's say I gather logs from Route53, so I have an event for each query that reaches the AWS DNS servers. Now, I know I can count the number of queries per resolverIp, for example, as such:
stats count(*) by resolverIp
I also know that I can count the number of queries, per resolverIp, that returned the NXDOMAIN responseCode, as such:
filter responseCode="NXDOMAIN" | stats count(*) by resolverIp
My question is, is there a way to get the percentage of the latter (number of queries that returned NXDOMAIN, per resolverIp) out of the former (number of queries per resolverIp)?
This query gives you the percentage per resolverIp:
stats sum(strcontains(responseCode, "NXDOMAIN")) / count(*) * 100 by resolverIp

BigQuery: updating column descriptions - Exceeded rate limits

The scenario is to update column descriptions in tables (about 1,500 columns across 50 tables). Due to multiple restrictions, I have been asked to use the bq query command to execute the ALTER TABLE SQL for updating column descriptions through the Cloud CLI. The query:
bq query --nouse_legacy_sql 'ALTER TABLE `<Table>` ALTER COLUMN <columnname> SET OPTIONS(DESCRIPTION="<Updated Description>");'
The issue is that if I bunch the bq queries together for 1,500 columns, that is 1,500 SQL statements.
This causes the standard "Exceeded rate limits: too many table update operations for this table" error.
Any suggestions on how to execute this better?
You are hitting the rate limit:
Maximum rate of table metadata update operations per table: 5 operations per 10 seconds
You will need to stagger the updates so that they stay within the limit of 5 operations per 10 seconds per table. You could also try altering all of the columns of a table in a single statement to reduce the number of calls required, as sketched below.
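If BigQuery's DDL accepts multiple ALTER COLUMN clauses in one ALTER TABLE statement (worth confirming against the current DDL reference before relying on it), a single statement per table counts as one metadata update instead of thirty. A rough sketch, with placeholder project, dataset, table, column, and description names:
-- One metadata update for the whole table instead of one per column
ALTER TABLE `myproject.mydataset.mytable`
  ALTER COLUMN order_id   SET OPTIONS (description = 'Unique order identifier'),
  ALTER COLUMN order_date SET OPTIONS (description = 'Date the order was placed'),
  ALTER COLUMN amount     SET OPTIONS (description = 'Order total in USD');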

Getting a "Disk Full" error from Redshift Spectrum

I am facing frequent Disk Full errors with Redshift Spectrum and, as a result, I have to repeatedly scale up the cluster. It seems that the cache gets deleted when I do.
Ideally, I would like scaling up to keep the cache, and I would like a way to know how much disk space a query will need.
Is there any documentation about Redshift Spectrum's caching, or does it use the same mechanism as Redshift?
EDIT: As requested by Jon Scott, I am updating my question
SELECT p.postcode,
SUM(p.like_count),
COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
ON l.postcode = p.postcode
GROUP BY 1;
The total zipped data on S3 is about 1.8 TB. Athena took 10 minutes, scanned 700 GB, and told me "Query exhausted resources at this scale factor".
EDIT 2: I used a 16 TB SSD cluster.
You did not mention the size of the Redshift cluster you are using but the simple answer is to use a larger Redshift cluster (more nodes) or use a larger node type (more disk per node).
The issue is occurring because Redshift Spectrum is not able to push the full join execution down to the Spectrum layer. A majority of the data is being returned to the Redshift cluster simply to execute the join.
You could also restructure the query so that more work can be pushed down to Spectrum, in this case by doing the grouping and counting before joining. This is most effective when the number of rows output by each subquery is significantly smaller than the number of rows that would otherwise have to be returned for the join.
SELECT p.postcode
     , p.like_count
     , l.like_ids
FROM (--Summarize post data
      SELECT p.postcode
           , SUM(p.like_count) AS like_count
      FROM post AS p
      GROUP BY 1
     ) AS p
INNER JOIN (--Summarize likes data
      SELECT l.postcode
           , COUNT(l.id) AS like_ids
      FROM likes AS l
      GROUP BY 1
     ) AS l
-- Join pre-summarized data only
ON l.postcode = p.postcode
;

How to find average time to load data from S3 into Redshift

I have more than 8 schemas and 200+ tables, and data is loaded from CSV files into the different schemas.
I want to know the SQL script to find the average time to load data from S3 into Redshift for all 200+ tables.
You can examine the STL System Tables for Logging to discover how long queries took to run.
You'd probably need to parse the Query text to discover which tables were loaded, but you could use the historical load times to calculate a typical load time for each table.
Some particularly useful tables are:
STL_QUERY_METRICS: Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).
STL_QUERY: Returns execution information about a database query.
STL_LOAD_COMMITS: This table records the progress of each data file as it is loaded into a database table.
Run this query to find out how fast your COPY queries are working.
select q.starttime, s.query, substring(q.querytxt,1,120) as querytxt,
       s.n_files, s.size_mb, s.time_seconds,
       s.size_mb/decode(s.time_seconds,0,1,s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size/(1024*1024)) as size_mb,
             (max(end_time) - min(start_time))/(1000000) as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET' and query > 0 and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;
Once you know how many MB/s you're pushing through from S3, you can roughly estimate how long each file will take based on its size; for example, at 50 MB/s a 1 GB file should take roughly 20 seconds.
There's a smart way to do it. You ought to have an ETL script that migrates data from S3 to Redshift.
Assuming you have a shell script, just capture a timestamp before the ETL logic for that table starts (let's call that start), capture another timestamp after the ETL logic for that table ends (let's call that end), and take the difference towards the end of the script:
#!/bin/sh
.
.
.
start=$(date +%s) #capture start time
#ETL Logic
[find the right csv on S3]
[check for duplicates, whether the file has already been loaded etc]
[run your ETL logic, logging to make sure that file has been processes on s3]
[copy that table to Redshift, log again to make sure that table has been copied]
[error logging, trigger emails, SMS, slack alerts etc]
[ ... ]
end=$(date +%s) #Capture end time
duration=$((end-start)) #Difference (time taken by the script to execute)
echo "duration is $duration"
PS: The duration will be in seconds, and you can maintain a log file, write entries to a DB table, etc. The timestamps are in epoch seconds, and you can use functions (depending on where you're logging) like:
sec_to_time($duration) -- for MySQL
SELECT (TIMESTAMP 'epoch' + 1511680982 * INTERVAL '1 second') AS mytimestamp -- for Amazon Redshift (then take the difference of the two epoch values)

Cloud Spanner: select count(*) takes over a minute

I created a test table in Cloud Spanner and populated it with 120 million rows. I have created a composite primary key for the table.
When I run a simple "select count(*) from <table>" query, it takes approximately a minute for the Cloud Spanner web UI to return results.
Is anyone else facing a similar problem?
Cloud Spanner does not materialize counts, so a query like "select count(*) ..." will scan the entire table to return the count of rows, hence the long execution time.
If you require faster counts, I recommend keeping a sharded counter updated transactionally with changes to the table.
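As a rough illustration of the sharded-counter approach (the table, column, and counter names here are made up for the example): keep N pre-created shard rows per logical counter, bump one randomly chosen shard in the same read-write transaction that modifies the base table, and sum the shards when you need the count.
-- Counter table holding N pre-created shard rows per logical counter
CREATE TABLE row_counters (
  counter_name STRING(64) NOT NULL,
  shard_id     INT64      NOT NULL,
  cnt          INT64      NOT NULL
) PRIMARY KEY (counter_name, shard_id);

-- In the same read-write transaction as the write to the base table,
-- increment one shard chosen by the client (e.g. a random shard_id in 0..9):
UPDATE row_counters
SET cnt = cnt + 1
WHERE counter_name = 'my_table_rows' AND shard_id = 7;

-- Reading the count touches only the N shard rows instead of scanning the table:
SELECT SUM(cnt) AS row_count
FROM row_counters
WHERE counter_name = 'my_table_rows';
The number of shards trades write contention against read fan-out: more shards means less contention on hot counters, at the cost of summing a few more rows on read.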
@samiz - your answer recommends "keeping a sharded counter updated transactionally with changes to the table".
How can we determine how many counter shards are needed for the table? There is no retry on the transaction...
Thank you