I am trying to find out whether I misconfigured something or whether I am hitting the limits of a single-node Redshift cluster.
I am using:
single node ra3 instance,
spectrum layer for files in s3,
files I am using are partitioned in S3 in parquet format and archived using snappy,
the data I am trying to join into is loaded into my Redshift cluster (the 16m rows I mention later are in my cluster),
the data in the external tables has the numRows property set, per the documentation
I am trying to perform a spatial join of 16m rows against 10m rows using ST_Contains() and it just never finishes. I know the query is correct because it is able to join the 16m rows with 2m rows in 6 seconds.
(query in Athena on same data completes in 2 minutes)
Case with 10m rows has been running for 60 minutes now and seems like it will just never finish. Any thoughts?
What I have
2 datasets on HDFS in Parquet format:
1.6T (2.8T after Parquet decompression), 31 columns; I assume there is no data skew and the data is distributed evenly across HDFS
200G (360G after decompression), 5 columns; no data skew, data distributed evenly
I use AWS EMR cluster to run PySpark job.
What I need to do
Since experimenting isn't cheap, I want to calculate PySpark job configuration based on the input configuration and my assumptions before I even run it on the cluster.
Here are some details. I need to join the datasets by one id column to enrich the first dataset (1.6T) with data (only 3 columns: string, string, struct<int, string, string>) from the second dataset (200G).
Question
How can I decide the number of executors, CPU cores, memory, and [disk?] I need to request for my PySpark job?
(And is there any general formula for that?)
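There is no exact formula, but a widely used rule of thumb can be sketched in a few lines. The function below is illustrative only (the 5-cores-per-executor cap, the 1 core / 1 GB reserved per node for OS and Hadoop daemons, and the ~10% YARN memory overhead are common community guidance, not values taken from this question):

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, overhead_frac=0.10):
    """Rule-of-thumb PySpark sizing: reserve 1 core and 1 GB per node
    for OS/Hadoop daemons, cap executors at ~5 cores each (HDFS
    throughput), and leave ~10% of executor memory for YARN overhead."""
    usable_cores = cores_per_node - 1
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node - 1   # minus 1 for the driver
    mem_per_executor_gb = (mem_per_node_gb - 1) / executors_per_node
    heap_gb = int(mem_per_executor_gb * (1 - overhead_frac))
    return total_executors, cores_per_executor, heap_gb

# Hypothetical cluster: 10 nodes with 16 vCPUs and 128 GB RAM each
print(size_executors(nodes=10, cores_per_node=16, mem_per_node_gb=128))
# → (29, 5, 38): 29 executors, 5 cores each, ~38 GB heap apiece
```

The returned numbers would map onto `--num-executors`, `--executor-cores`, and `--executor-memory`; disk is governed separately by shuffle spill, which for a 1.6T-by-200G join is worth provisioning generously.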
I am facing frequent Disk Full errors on Redshift Spectrum, and as a result I have to repeatedly scale up the cluster. It seems that the cache gets deleted when I do.
Ideally, I would like scaling up to preserve the cache, and I would like a way to know how much disk space a given query will need.
Is there any documentation that covers Redshift Spectrum's caching, or does it use the same mechanism as Redshift?
EDIT: As requested by Jon Scott, I am updating my question
SELECT p.postcode,
SUM(p.like_count),
COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
ON l.postcode = p.postcode
GROUP BY 1;
The total zipped data on S3 is about 1.8 TB. Athena took 10 minutes, scanned 700 GB, and told me "Query exhausted resources at this scale factor".
EDIT 2: I used a 16 TB SSD cluster.
You did not mention the size of the Redshift cluster you are using, but the simple answer is to use a larger Redshift cluster (more nodes) or a larger node type (more disk per node).
The issue is occurring because Redshift Spectrum is not able to push the full join execution down to the Spectrum layer. A majority of the data is being returned to the Redshift cluster simply to execute the join.
You could also restructure the query so that more work can be pushed down to Spectrum, in this case by doing the grouping and counting before joining. This will be most effective if the total number of rows output from each subquery is significantly fewer than the rows that would be returned for the join otherwise.
SELECT p.postcode
     , p.like_count
     , l.like_ids
FROM (--Summarize post data
      SELECT p.postcode
           , SUM(p.like_count) AS like_count
      FROM post AS p
      GROUP BY 1
     ) AS p
INNER JOIN (--Summarize likes data
      SELECT l.postcode
           , COUNT(l.id) AS like_ids
      FROM likes AS l
      GROUP BY 1
     ) AS l
-- Join pre-summarized data only
ON l.postcode = p.postcode
;
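The pre-summarize-then-join pattern above can be illustrated with a toy in-memory example. This is purely a sketch with made-up data (the tiny `post`/`likes` rows are invented), using sqlite3 only to show the shape of the result; it is not how Spectrum itself executes:

```python
import sqlite3

# Invented toy data: two postcodes, a handful of rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE post  (postcode TEXT, like_count INT);
CREATE TABLE likes (postcode TEXT, id INT);
INSERT INTO post  VALUES ('AB1', 10), ('AB1', 5), ('CD2', 7);
INSERT INTO likes VALUES ('AB1', 1), ('AB1', 2), ('CD2', 3);
""")

# Summarize each side first, then join only the small aggregated sets.
rows = con.execute("""
SELECT p.postcode, p.like_count, l.like_ids
FROM (SELECT postcode, SUM(like_count) AS like_count
      FROM post GROUP BY 1) AS p
INNER JOIN (SELECT postcode, COUNT(id) AS like_ids
            FROM likes GROUP BY 1) AS l
  ON l.postcode = p.postcode
ORDER BY p.postcode
""").fetchall()
print(rows)  # [('AB1', 15, 2), ('CD2', 7, 1)]
```

Only one aggregated row per postcode crosses the join, instead of every matching post/like pair; that row reduction is what makes the push-down version cheaper.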
I have approximately 100 TB of data that I need to backfill by running a query against it to transform fields, then writing the transformed rows to another table. The table is partitioned by an ingestion-time timestamp. Both actions are part of a single query, as you can see below. I am planning to run this query multiple times in smaller chunks, manually, by ingestion-timestamp range.
Is there a better way to handle this process than running the query in manual chunks? For example, maybe using Dataflow or another framework.
CREATE TABLE IF NOT EXISTS dataset.table
PARTITION BY DATE(timestamp) AS
with load as (SELECT *, _TABLE_SUFFIX as tableId
FROM `project.dataset.table_*`
WHERE _TABLE_SUFFIX BETWEEN '1' AND '1531835999999'
),................
...................
You need to carefully meter the queries you run, as the quota enforcement is very strict.
Partitioned tables
Maximum number of partitions per partitioned table — 4,000
Maximum number of partitions modified by a single job — 2,000
Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. Google BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.
Maximum rate of partition operations — 50 partition operations every 10 seconds
Most of the time you hit the second limit (no more than 2,000 partitions modified by a single job), and if you parallelize further you hit the last one (50 partition operations every 10 seconds).
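Staying under the 2,000-partitions-per-job limit can be automated. The sketch below assumes one partition per day (as with an ingestion-date-partitioned table); the function name and the example date range are illustrative, not from the question:

```python
from datetime import date, timedelta

def chunk_date_ranges(start, end, max_partitions=2000):
    """Split the inclusive date range [start, end] into consecutive
    sub-ranges, each touching at most max_partitions daily partitions,
    so that each backfill job stays under BigQuery's per-job limit."""
    ranges = []
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=max_partitions - 1), end)
        ranges.append((cur, stop))
        cur = stop + timedelta(days=1)
    return ranges

# A hypothetical 7-year backfill (~2,558 daily partitions) needs two jobs:
print(chunk_date_ranges(date(2011, 7, 17), date(2018, 7, 17)))
```

Each returned range would become one query job's `WHERE` clause on the ingestion timestamp; pacing the jobs also keeps you under the 50-operations-per-10-seconds rate limit.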
On the other hand, the DML MERGE syntax could help here.
If you have a sales representative, have them reach out to the BigQuery team; if they can increase some of your quotas, they usually respond positively.
I've also seen people use multiple projects to run jobs beyond the quotas.
I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
I created a Hive external table on top of this. I am executing this query:
select count(*) from dataset where `date` ='2018-04-02'
This partition has two parquet files like this,
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
Each file is 297 MB, so these are not big files, and there are not many files to scan.
The query returns 12,201,724 records, but it takes 3.5 minutes to do so. Since one partition alone takes this long, running even a count query on the whole dataset (7 years of data) takes hours to return results. Is there any way I can speed this up?
Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data scanned, so it runs very efficiently when using partitions and Parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the aws environment? The data is in a single file, but I'm planning on splitting the data into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load to Redshift.
In more detail, the steps are:
Run a Spark job (using EMR or possibly AWS Glue) which reads in your source S3 data and writes out a pivoted version to a different S3 folder. By this I mean that if a row has 800 name/value pairs, you would write out 800 rows. At the same time, you can split the output into multiple files to enable parallel loading.
COPY this pivoted data into Redshift.
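The core transform of that Spark job could look like the per-row function below. This is a minimal sketch (the function name and the sample line are illustrative); in PySpark you would apply it to the text RDD with `flatMap` so each wide row fans out into many narrow rows:

```python
def pivot_row(line, delimiter="|"):
    """Unpivot one wide row (key|name1|value1|name2|value2|...) into
    (key, name, value) triples -- the long format Redshift can hold."""
    fields = line.rstrip("\n").split(delimiter)
    key, pairs = fields[0], fields[1:]
    # Walk the name/value pairs two fields at a time.
    return [(key, pairs[i], pairs[i + 1])
            for i in range(0, len(pairs) - 1, 2)]

print(pivot_row("12345|is_male|1|is_college_educated|1"))
# [('12345', 'is_male', '1'), ('12345', 'is_college_educated', '1')]
```

With ~280 million input rows and up to 800 pairs each, the pivoted output is on the order of 200 billion rows, which is exactly why splitting into many files for a parallel COPY matters.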
What I have learnt from AWS is that if you are hitting a limit, you are doing it the wrong way or not in a scalable way. Most of the time, the architecture was designed with scalability and performance in mind.
We had a similar problem, with 2,000 columns. Here is how we solved it:
Split the file across 20 different tables, each with 100 columns plus the primary key.
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all 1,600 columns in a select, then the business user is looking at the wrong columns for their analysis, or even for machine learning.
To load 10 TB+ of data, we split the data into multiple files and loaded them in parallel; that way loading was faster.
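Assigning columns to those table groups is mechanical. The helper below is a hypothetical sketch of how we carved up the schema (the `primary_key`/`colN` names are made up); each returned group becomes one table's column list:

```python
def split_columns(columns, key="primary_key", group_size=100):
    """Assign the non-key columns to groups of group_size, each group
    becoming one table of group_size + 1 (the key) columns."""
    rest = [c for c in columns if c != key]
    return [[key] + rest[i:i + group_size]
            for i in range(0, len(rest), group_size)]

# 2,000 property columns -> 20 tables of 101 columns each
groups = split_columns(["primary_key"] + [f"col{i}" for i in range(2000)])
print(len(groups), len(groups[0]))  # 20 101
```

Repeating the key in every table is what makes the single reassembling join across all 20 tables possible.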
Between Athena and Redshift, performance is the main difference; otherwise they are much the same. Redshift performs better than Athena: Athena's initial load time and scan time are higher than Redshift's.
Hope it helps.