All,
I have a fact table in Redshift with around 90 million rows. Most columns are integers, and the table uses an AUTO sort key and EVEN distribution style. The cluster has 2 nodes. A simple select statement runs forever and then aborts. Any help?
select * from facts_invoice
Basically, I am trying to feed this data to Power BI, and the slowness seems to come from Redshift itself. In Snowflake, I have run select * over 200 billion rows before, and it never took more than 10-15 minutes.
I am trying to find out whether I misconfigured something or am hitting the limits of a single-node Redshift cluster.
I am using:
single node ra3 instance,
spectrum layer for files in s3,
the files I am using are partitioned in S3, in Parquet format, compressed with Snappy,
the data I am trying to join them to is loaded into my Redshift cluster (the 16m rows I mention later are in my cluster),
the external tables have the numRows property set, as the documentation recommends
I am trying to perform a spatial join of 16m rows against 10m rows using ST_Contains(), and it just never finishes. I know the query is correct because it joins 16m rows with 2m rows in 6 seconds.
(The same query in Athena on the same data completes in 2 minutes.)
The case with 10m rows has been running for 60 minutes now and seems like it will never finish. Any thoughts?
There are two Redshift tables, A and B, and a QuickSight dashboard that uses A MINUS B as the query behind a visual. With the DIRECT query option the query times out because it does not complete within 2 minutes (QuickSight has a hard 2-minute limit on queries). Is there a way to use such large datasets as input for a QuickSight dashboard visual?
We can't use the SPICE engine because it has a limit of 1B rows or 1 TB in size. It also has a 15-minute delay to refresh data.
You will likely need to provide more information to fully resolve this. MINUS can be a very expensive operation, especially if you haven't optimized the tables for it. Can you provide information about your table setup and the EXPLAIN plan of the query you are running?
Barring improving the query, one way to work around a poorly performing query behind QuickSight is to move the query into a materialized view. The result of the query is then stored for later retrieval, but it needs to be refreshed when the source data changes. If your data only changes every 15 minutes (did I get this right?), then this may be an option.
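To make the trade-off concrete, here is a minimal sketch of the "store the result, refresh on change" idea. It uses Python's built-in sqlite3 purely as a stand-in: SQLite has no materialized views, so a plain table rebuilt by a hypothetical refresh_a_minus_b() helper plays that role, and SQLite's EXCEPT stands in for MINUS. The tables a and b and their single id column are made up for illustration. In Redshift the equivalent statements would be CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER);
INSERT INTO a VALUES (1), (2), (3), (4);
INSERT INTO b VALUES (2), (4);
""")


def refresh_a_minus_b(conn):
    """Recompute the stored A-minus-B result (stand-in for a view refresh)."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS a_minus_b")
        conn.execute(
            "CREATE TABLE a_minus_b AS SELECT id FROM a EXCEPT SELECT id FROM b"
        )


def dashboard_query(conn):
    """The cheap query the dashboard would run: read the stored result only."""
    return conn.execute("SELECT id FROM a_minus_b ORDER BY id").fetchall()


refresh_a_minus_b(conn)
print(dashboard_query(conn))
```

The dashboard then only reads the small stored result, but it sees stale rows until the next refresh, so the refresh schedule has to match how often A and B change.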
I keep hitting Disk Full errors on Redshift Spectrum, and as a result I have to scale up the cluster repeatedly. It seems that the cache is deleted each time.
Ideally, I would like scaling up to keep the cache, and I would like a way to know how much disk space a query will need.
Is there any documentation that covers caching in Redshift Spectrum, or does it use the same mechanism as Redshift?
EDIT: As requested by Jon Scott, I am updating my question
SELECT p.postcode,
SUM(p.like_count),
COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
ON l.postcode = p.postcode
GROUP BY 1;
The total compressed data on S3 is about 1.8 TB. Athena ran for 10 minutes, scanned 700 GB, and then reported "Query exhausted resources at this scale factor".
EDIT 2: I used a 16 TB SSD cluster.
You did not mention the size of the Redshift cluster you are using but the simple answer is to use a larger Redshift cluster (more nodes) or use a larger node type (more disk per node).
The issue is occurring because Redshift Spectrum is not able to push the full join execution down to the Spectrum layer. A majority of the data is being returned to the Redshift cluster simply to execute the join.
You could also restructure the query so that more work can be pushed down to Spectrum, in this case by doing the grouping and counting before joining. This will be most effective if the total number of rows output from each subquery is significantly fewer than the rows that would be returned for the join otherwise.
SELECT p.postcode
, p.like_count
, l.like_ids
FROM (--Summarize post data
SELECT p.postcode
, SUM(p.like_count) AS like_count
FROM post AS p
GROUP BY 1
) AS p
INNER JOIN (--Summarize likes data
SELECT l.postcode
, COUNT(l.id) like_ids
FROM likes AS l
GROUP BY 1
) AS l
-- Join pre-summarized data only
ON l.postcode = p.postcode
;
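One caveat worth checking before swapping the queries: aggregating each side first is not always equivalent to joining first. When a postcode has several rows on both sides, the join repeats rows and inflates both SUM and COUNT. The sketch below uses Python's built-in sqlite3 as a stand-in for Redshift, with a few made-up rows and the same table and column names as the question, and runs both forms side by side (the inner SUM is aliased as like_count so the outer select can reference it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (postcode TEXT, like_count INTEGER);
CREATE TABLE likes (id INTEGER, postcode TEXT);
-- Made-up sample rows: postcode A has 2 posts and 3 likes, B has 1 and 1.
INSERT INTO post VALUES ('A', 2), ('A', 3), ('B', 5);
INSERT INTO likes VALUES (1, 'A'), (2, 'A'), (3, 'A'), (4, 'B');
""")

# Original shape: join first, then aggregate over the joined rows.
join_then_group = conn.execute("""
SELECT p.postcode, SUM(p.like_count), COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l ON l.postcode = p.postcode
GROUP BY 1
ORDER BY 1
""").fetchall()

# Rewritten shape: aggregate each table first, then join one row per postcode.
group_then_join = conn.execute("""
SELECT p.postcode, p.like_count, l.like_ids
FROM (SELECT postcode, SUM(like_count) AS like_count
      FROM post GROUP BY 1) AS p
INNER JOIN (SELECT postcode, COUNT(id) AS like_ids
            FROM likes GROUP BY 1) AS l
ON l.postcode = p.postcode
ORDER BY 1
""").fetchall()

print(join_then_group)
print(group_then_join)
```

For postcode B (one row on each side) both forms agree. For postcode A the join-first form reports larger numbers because every post row is repeated once per matching like. Which result is the intended one depends on the data model, so it is worth confirming on a small sample before relying on the rewritten query.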
On Google Cloud, I've created an ingestion-time partitioned table clustered on the columns Hour, Minute and Second. As I understand clustered tables, this means that my rows are distributed in clusters organized by hour, each hour cluster contains minute clusters, and each minute cluster contains second clusters.
So I would expect that when I query data from 13:10:00 to 13:10:30, the query should touch only rows inside the cluster for hour 13, minute 10 and seconds from 0 to 30. Am I wrong?
I'm asking because clustering does not seem to be working in my project. I have a test table of 140 MB, but when I add a WHERE condition on my clustered columns, BigQuery still says the query will process the whole table, while I would expect filtering on clustered columns to reduce the amount of data queried. Any help? Thank you.
I created a test table in Cloud Spanner and populated it with 120 million rows. I created a composite primary key for the table.
When I run a simple "select count(*) from " query, it takes approximately a minute for the Cloud Spanner web UI to return results.
Is anyone else facing a similar problem?
Cloud Spanner does not materialize counts, so queries like "select count(*) ...." will scan the entire table to return the count of rows, hence the longer execution time.
If you require faster counts, I recommend keeping a sharded counter that is updated transactionally with changes to the table.
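As a rough illustration of the sharded-counter idea (not Cloud Spanner client code), the sketch below uses Python's built-in sqlite3 as a stand-in database. The table names items and item_counter, the helper functions, and NUM_SHARDS = 8 are all assumptions made for the example. Each insert bumps one randomly chosen counter shard in the same transaction as the row insert, and the fast count is just the sum over the few shard rows.

```python
import random
import sqlite3

NUM_SHARDS = 8  # hypothetical value; more shards spread write contention

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "CREATE TABLE item_counter (shard_id INTEGER PRIMARY KEY, cnt INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO item_counter VALUES (?, 0)", [(i,) for i in range(NUM_SHARDS)]
)
conn.commit()


def insert_item(item_id, payload):
    """Insert a row and bump one random counter shard in a single transaction."""
    shard = random.randrange(NUM_SHARDS)
    with conn:  # commits on success, rolls back both statements on error
        conn.execute("INSERT INTO items VALUES (?, ?)", (item_id, payload))
        conn.execute(
            "UPDATE item_counter SET cnt = cnt + 1 WHERE shard_id = ?", (shard,)
        )


def fast_count():
    """Sum the shard rows instead of scanning the whole items table."""
    return conn.execute("SELECT SUM(cnt) FROM item_counter").fetchone()[0]


for i in range(1000):
    insert_item(i, "row")

print(fast_count())
```

Deletes would decrement a shard the same way. Spreading updates over several shard rows is meant to keep concurrent writers from all contending on one counter row; the right number of shards depends on the write rate and is something to measure rather than assume.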
#samiz - you answered "recommend keeping a sharded counter updated transactionally with changes to the table".
How can we determine how many counter shards the table needs? There is no retry in the transaction...
Thank you.