Getting a "Disk Full" error from Redshift Spectrum - amazon-web-services

I am running into frequent "Disk Full" errors on Redshift Spectrum, and as a result I have to repeatedly scale up the cluster. It seems that the cache gets deleted when I do.
Ideally, I would like scaling up to preserve the cache, and I would like a way to know how much disk space a query will need.
Is there any documentation that covers caching in Redshift Spectrum, or does it use the same mechanism as Redshift?
EDIT: As requested by Jon Scott, I am updating my question
SELECT p.postcode,
SUM(p.like_count),
COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
ON l.postcode = p.postcode
GROUP BY 1;
The total size of the zipped data on S3 is about 1.8 TB. Athena ran for 10 minutes, scanned about 700 GB, and then told me "Query exhausted resources at this scale factor".
EDIT 2: I used a 16 TB SSD cluster.

You did not mention the size of the Redshift cluster you are using, but the simple answer is to use a larger Redshift cluster (more nodes) or a larger node type (more disk per node).
The issue occurs because Redshift Spectrum cannot push the full join execution down to the Spectrum layer; the majority of the data is returned to the Redshift cluster just to execute the join.
You could also restructure the query so that more work can be pushed down to Spectrum, in this case by doing the grouping and counting before joining. This is most effective when the total number of rows output by each subquery is significantly smaller than the number of rows the join would otherwise have to process.
SELECT p.postcode
, p.like_count
, l.like_ids
FROM (--Summarize post data
SELECT p.postcode
, SUM(p.like_count) AS like_count
FROM post AS p
GROUP BY 1
) AS p
INNER JOIN (--Summarize likes data
SELECT l.postcode
, COUNT(l.id) like_ids
FROM likes AS l
GROUP BY 1
) AS l
-- Join pre-summarized data only
ON l.postcode = p.postcode
;

Related

Redshift spectrum struggles with huge joins

I am trying to find out whether I misconfigured something or whether I am hitting the limits of a single-node Redshift cluster.
I am using:
a single-node ra3 instance,
the Spectrum layer for files in S3,
files that are partitioned in S3, stored in Parquet format and compressed with Snappy,
data I am joining against that is loaded into my Redshift cluster (the 16M rows mentioned below are in my cluster),
external tables that have the numRows property set, per the documentation.
I am trying to perform a spatial join of 16M rows against 10M rows using ST_Contains(), and it just never finishes. I know the query is correct, because it can join the 16M rows with 2M rows in 6 seconds.
(The same query in Athena on the same data completes in 2 minutes.)
The 10M-row case has been running for 60 minutes now and looks like it will never finish. Any thoughts?
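For reference, a hedged sketch of the join shape being described; the schema, table, and column names are hypothetical stand-ins for the external (Spectrum) table and the table loaded in the cluster.
SELECT pts.id,
       polys.region_id
FROM spectrum_ext.polygons AS polys   -- external table queried through Spectrum
INNER JOIN points_local AS pts        -- table loaded into the Redshift cluster
        ON ST_Contains(polys.geom, pts.geom);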

Max function in RedShift

In Snowflake, MAX is a metadata query, which means it does not use any compute; the result comes from the service-layer metadata store. Does it work in a similar way in Redshift, or does Redshift have to scan all the rows of the table (which may be distributed across nodes) to get the MAX value?
Thanks
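One hedged way to check is to look at the plan Redshift produces; the table and column names below are placeholders.
EXPLAIN
SELECT MAX(sale_ts) FROM sales;
-- The typical plan is an XN Aggregate on top of an XN Seq Scan of the
-- table, i.e. the value is computed by scanning the column rather than
-- being answered purely from metadata.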

Big Query table partitions count limit

I am currently working with BigQuery and understand that there is a limit of up to 4,000 partitions per partitioned table.
Does anyone know if this limit applies to the Active Storage tier only, or to both the Active and Long Term Storage tiers?
I am asking because I have a table partitioned by hour that I have been using for more than 6 months, and we do not get any error about exceeding the 4,000-partition limit when we insert new data.
I did a count of the number of partitions (attached image below):
As you can see, the total is 6,401 partitions and we are still able to insert new data.
At the same time, we also created a new partitioned table and tried moving data into it, but we encountered an error saying we had exceeded the limit of 4,000.
In addition, I also tried to insert the data incrementally, but I still get an error, as follows:
Steps to reproduce the error:
Create a partitioned table (partitioned by hour).
Start moving data by month from another table.
My finding:
The partition limit mentioned seems to apply only to the Active Storage tier.
Can anyone help confirm this?
As I understand the limitation, you can't modify more than 4,000 partitions in one job. The jobs you describe first are presumably working because they modify only a few partitions each.
When you try to move more than 4,000 partitions in one go, you hit the limitation, as you described.
I have hit this limitation on both Active Storage and Long Term Storage; it is a BigQuery-wide limitation.
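If it helps to double-check the counts, BigQuery's INFORMATION_SCHEMA.PARTITIONS view can report them directly; the project, dataset, and table names below are placeholders.
SELECT COUNT(DISTINCT partition_id) AS partition_count
FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'my_hourly_table';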

BigQuery: Running query against large dataset

I have approximately 100 TB of data that I need to backfill by running a query against it to transform some fields and then writing the transformed rows to another table. The table is partitioned by an ingestion-time timestamp. Both actions are part of a single query, as you can see below. I am planning to run this query multiple times in smaller chunks, manually, by ingestion-timestamp ranges.
Is there a better way to handle this process than running the query in manual chunks? For example, maybe using Dataflow or another framework.
CREATE TABLE IF NOT EXISTS dataset.table
PARTITION BY DATE(timestamp) AS
with load as (SELECT *, _TABLE_SUFFIX as tableId
FROM `project.dataset.table_*`
WHERE _TABLE_SUFFIX BETWEEN '1' AND '1531835999999'
),................
...................
You need to pace the queries you run carefully, because the quota enforcement is very restrictive.
Partitioned tables
Maximum number of partitions per partitioned table — 4,000
Maximum number of partitions modified by a single job — 2,000
Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. Google BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.
Maximum rate of partition operations — 50 partition operations every 10 seconds
Most of the time you hit the second limitation (a single job can modify no more than 2,000 partitions), and if you parallelise further you hit the last one (50 partition operations every 10 seconds).
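A hedged sketch of what one of those manual chunks could look like; the table and column names are placeholders and UPPER() stands in for the real field transformation. The three-month date range touches roughly 90 daily partitions, well under the per-job limit.
INSERT INTO `project.dataset.target_table` (id, name, timestamp)
SELECT id, UPPER(name) AS name, timestamp
FROM `project.dataset.source_table`
WHERE DATE(timestamp) BETWEEN DATE '2018-01-01' AND DATE '2018-03-31';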
On the other hand, the DML MERGE syntax could help here.
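A hedged sketch of the MERGE shape, in case it fits the backfill better; the object names, the join key, and the UPPER() transformation are placeholders.
MERGE `project.dataset.target_table` AS t
USING (
  SELECT id, UPPER(name) AS name, timestamp
  FROM `project.dataset.source_table`
  WHERE DATE(timestamp) BETWEEN DATE '2018-04-01' AND DATE '2018-06-30'
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name
WHEN NOT MATCHED THEN
  INSERT (id, name, timestamp) VALUES (s.id, s.name, s.timestamp);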
If you have a sales representative, reach out to the BigQuery team; if they can increase some of your quotas, they will respond positively.
I have also seen people use multiple projects to run jobs beyond the quotas.

Pivoting 1,620 columns to rows in 360gb text file in aws

I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the aws environment? The data is in a single file, but I'm planning on splitting the data into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load it into Redshift.
In more detail, the steps are:
Run a Spark job (using EMR or possibly AWS Glue) which reads your source S3 data and writes a pivoted version out to a different S3 folder. By this I mean that if a row has 800 name/value pairs, you would write out 800 rows (see the sketch below). At the same time, you can split the output into multiple files to enable parallel loading.
COPY the pivoted data into Redshift.
What I have learnt from AWS, most of the time, is that if you are hitting a limit, you are doing it the wrong way or not in a scalable way; most of these services are designed with scalability and performance in mind.
We had a similar problem, with 2,000 columns. Here is how we solved it:
Split the file across 20 different tables, each with 100 columns plus the primary key.
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all 1,600 columns in a single select, then the business users are probably looking at the wrong columns for their analysis or even for machine learning.
To load 10 TB+ of data, we split the data into multiple files and loaded them in parallel; that way loading was faster (see the COPY sketch below).
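For the loading step, a single COPY pointed at an S3 prefix loads all the files under it in parallel across node slices; the bucket, prefix, table name, and IAM role below are placeholders.
COPY pivoted_properties
FROM 's3://my-bucket/pivoted/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
GZIP
DELIMITER '|';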
Between Athena and Redshift, performance is the only difference; the rest is the same. Redshift performs better than Athena: Athena's initial load time and scan time are higher than Redshift's.
Hope it helps.