How to configure PySpark job based on Parquet input?

How to configure PySpark job based on Parquet input? - amazon-web-services

What I have
2 datasets on HDFS in Parquet format:
1.6T (after parquet uncompressing it's going to be 2.8T), 31 columns, I assume there is no data skew and all data distributed across HDFS evenly
200G (after parquet uncompressing it will be 360G), 5 columns, no data skew and data distributed evenly
I use AWS EMR cluster to run PySpark job.
What I need to do
Since experimenting isn't cheap, I want to calculate PySpark job configuration based on the input configuration and my assumptions before I even run it on the cluster.
Here are some details. I need to join datasets by one id column to enrich first dataset (1.6T) with data (only 3 columns: string, string, struct<int, string, string>) from the second dataset (200G).
Question
How can I decide what are the number of executors, cpu cores, memory, and [disk?] I need to request for my PySpark job?
(And is there any general formula for that?)

Related

Redshift spectrum struggles with huge joins

I am trying to find out if I misconfigured something or am I hitting limits of single node redshift cluster?
I am using:
single node ra3 instance,
spectrum layer for files in s3,
files I am using are partitioned in S3 in parquet format and archived using snappy,
data I am trying to join it into is loaded into my redshift cluster (16m rows I will mention later is in my cluster),
data in external tables has numRows property according to the documentation
I am trying to perform a spatial join 16m rows into 10m rows using ST_Contains() and it just never finishes. I know that query is correct because it is able to join 16m rows with 2m rows in 6 seconds.
(query in Athena on same data completes in 2 minutes)
Case with 10m rows has been running for 60 minutes now and seems like it will just never finish. Any thoughts?

AWS Athena Query Partitioning

I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:
Data is pumped into a Kinesis Firehose as JSON events.
The Firehose converts the data to parquet using a table in AWS Glue and writes to S3 either every 15 mins or when the stream reaches 128 MB (max supported values).
When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/...
An AWS Glue crawler update a table with the latest partition data every 24 hours and makes it available for queries.
The basic flow works. However, there are a couple of problems with this...
The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account and I don't want to be scanning all account data for every query. I need to find a scalable way query only the relevant data. I did look into trying to us Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.
The second problem is file size! AWS recommend that files not be < 128 MB as this has a detrimental effect on query times as the execution engine might be spending additional time with the overhead of opening Amazon S3 files. Given the nature of the Firehose I can only ever reach a maximum size of 128 MB per file.

With that many accounts you probably don't want to use account_id as partition key for many reasons. I think you're fine limits-wise, the partition limit per table is 1M, but that doesn't mean it's a good idea.
You can decrease the amount of data scanned significantly by partitioning on parts of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric partitioning on the first digit would decrease the amount of data each query would scan by 90%, and with two digits 99% – while still keeping the number of partitions at very reasonable levels.
Unfortunately I don't know either how to do that with Glue. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operation for adding the data produced by a CTAS operation as a partition in an existing table.
If you figure out a way to extract the account ID you can also experiment with separate tables per account, you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but could be faster depending on how Athena determines which partitions to query.
Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having few large files – but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance, and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena was reading the right data first, and about ideal file sizes later.
If you tell me more about the amount of data per account and expected life time of accounts I can give more detailed suggestions on what to aim for.
Update: Given that Firehose doesn't let you change the directory structure of the input data, and that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:
Create an Athena table with columns for all properties in the data, and date as partition key. This is your input table, only ETL queries will be run against this table. Don't worry that the input data has separate directories for year, month, and date, you only need one partition key. It just complicates things to have these as separate partition keys, and having one means that it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.
Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID – you'll have to test what works best. You'll also have to decide whether to partition on date or a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).
Create a Step Functions state machine that does the following (in Lambda functions):
Add new partitions to the input table. If you schedule your state machine to run once per day it can just add the partition that correspond to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work, you can run a GetTable call to get it, though. Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29) LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29' – but doing it through Glue is faster than running queries in Athena and more suitable for Lambda.
Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) or the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation, this table will be dropped in a later step. Use Parquet as the format.
Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
When the CTAS operation has completed list the partitions created in the temporary table created with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartitions.
Finally delete all files that belong to the partitions of the query table you deleted and drop the temporary table created by the CTAS operation.
If you decide on a partitioning on something longer than date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data (e.g. with partitioning by month, which I would recommend you try, every day you would create new files for the whole month, which means that the old files need to be removed). If you want to update your query table multiple times per day it would be the same.
This looks like a lot, and looks like what Glue Crawlers and Glue ETL does – but in my experience they don't make it this easy.
In your case the data is partitioned using Hive style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive style partitions but just Y/M/D (and I didn't actually know that Firehose could deliver data this way, I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step-deal.
Glue ETL is also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is to convert your raw data form JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could make the conversion operation with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table – because that's something that Glue ETL and Spark simply doesn't support.
Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing (i.e. continuously update and potentially replace partitions in a table based on the data in another table without rebuilding the whole table every time).
What you get with this ETL process is that your ingestion doesn't have to worry about partitioning more than by time, but you still get tables that are optimised for querying.

Loading parquet file from S3 to DynamoDB

I have been looking at options to load (basically empty and restore) Parquet file from S3 to DynamoDB. Parquet file itself is created via spark job that runs on EMR cluster. Here are few things to keep in mind,
I cannot use AWS Data pipeline
File is going to contain millions of rows (say 10 million), so would need an efficient solution. I believe boto API (even with batch write) might not be that efficient ?
Are there any other alternatives ?

Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries to dynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse in different rows, it some bit of scala to take a row, build an entry for dynamo and PUT that should be enough.
BTW: Use DynamoDB on demand here, as it handles peak loads well without you having to commit to some SLA.

Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create desired dataframe
Use .withColumn to create new column and use psf.collect_list to convert to desired collection/json format, in the new column in the
same dataframe.
Drop all un-necessary (tabular) columns and keep only the JSON format Dataframe columns in Spark.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDD. RDD interface even in Scala is 2-3 times slower than Dataframe API of any language.
Dataframe API's performance is programming language agnostic, as long as you dont use UDF.

Hive query on s3 partition is too slow

I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
Created hive external table on top of this. I am executing this query,
select count(*) from dataset where `date` ='2018-04-02'
This partition has two parquet files like this,
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
each file size is 297MB. , So not a big file and not many files to scan.
And the query is returning 12201724 records. However it takes 3.5 mins to return this, since one partition itself is taking this time, running even the count query on whole dataset ( 7 years ) of data takes hours to return the results. Is there anyway, I can speed up this ?

Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data read from disk, so it runs very efficiently when using partitions and parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog

Pivoting 1,620 columns to rows in 360gb text file in aws

I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the aws environment? The data is in a single file, but I'm planning on splitting the data into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.

First, pivot your data, then load to Redshift.
In more detail, the steps are:
Run a spark job (using EMR or possibly AWS Glue) which reads in your
source S3 data and writes out (to a different s3 folder) a pivoted
version. by this i mean if you have 800 value pairs, then you would
write out 800 rows. At the same time, you can split the file into multiple parts to enable parallel load.
"COPY" this pivoted data into Redshift

What I learnt from most of the time from AWS is, if you are reaching a limit, you are doing it in a wrong way or not in a scalable way. Most of the time architects designed with scalability, performance in mind.
We had similar problems, having 2000 columns. Here is how we solved it.
Split the file across 20 different tables, 100+1 (primary key) column each.
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all the 1600 columns in a select, then the business user is looking at wrong columns for their analysis or even for machine learning.
To load 10TB+ of data we had split the data into multiple files and load them in parallel, that way loading was faster.
Between Athena and Redshift, performance is the only difference. Rest of them are same. Redshift performs better than Athena. Initial Load time and Scan Time is higher than Redshift.
Hope it helps.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js