Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket into another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).
Here is my EMR configuration:
1 master m5.xlarge
3 cores m5.xlarge
release label 5.29.0
The command:
s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy='.*input/(entry).*(.json.gz)' --targetSize=128
Am I missing something? I have read that S3DistCp can transfer a lot of files in a blink, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.
Thank you.

Here are some recommendations:
Use R-type instances. They provide more memory than M-type instances.
Use coalesce to merge the files at the source, since you have many small files.
Check the number of mapper tasks. The more tasks there are, the lower the performance.
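To see how the --groupBy pattern in the command above groups files, here is a minimal Python sketch. It assumes that S3DistCp requires the pattern to match the whole path and builds each group name by concatenating the capture groups. The object keys are hypothetical, and Python's re module stands in for the Java regex engine, which behaves the same for this pattern.

```python
import re
from collections import defaultdict

# Pattern copied from the --groupBy option in the question's command.
GROUP_BY = r".*input/(entry).*(.json.gz)"


def group_key(path, pattern=GROUP_BY):
    """Return the concatenated capture groups, or None if the path does not match.

    Assumption: the pattern must match the whole path, and the group name is
    the concatenation of all capture groups.
    """
    match = re.fullmatch(pattern, path)
    if match is None:
        return None
    return "".join(match.groups())


def group_files(paths, pattern=GROUP_BY):
    """Map each group key to the list of input paths that fall into it."""
    groups = defaultdict(list)
    for path in paths:
        key = group_key(path, pattern)
        if key is not None:
            groups[key].append(path)
    return dict(groups)


# Hypothetical object keys, for illustration only.
sample = [
    "s3://my-bucket/input/entry-0001.json.gz",
    "s3://my-bucket/input/entry-0002.json.gz",
    "s3://my-bucket/input/entry_b.json.gz",
]
groups = group_files(sample)
for key, members in groups.items():
    print(key, len(members))
```

Both capture groups match the same text for every file, so under these assumptions all inputs share a single group key. If that single group turns out to be the bottleneck, a capture group that varies between files (for example a date or shard component, if the names contain one) may spread the work over more groups.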

Related

S3 write concurrency using AWS Glue

I have a suspicion we are hitting an S3 write concurrency issue with an AWS Glue job. I am testing 10 DPUs writing 10k objects, 1 MB each (~10 GB total), and it is taking 2+ hours for just the write stage of the job. It seems that across 10 DPUs I should be able to distribute the work well enough to get much better throughput. I am writing to several different bucket prefixes and do not think I'm being throttled by S3.
I see that my job is using EMRFS (the default S3 FileSystem API implementation for Glue), so that is good for write throughput from my understanding. I found some suggestions that say to adjust fs.s3.maxConnections, hive.mv.files.threads and set hive.blobstore.use.blobstore.as.scratchdir = false.
Where can I see what the current settings for these are in my Glue jobs and how can I configure them? While I see many settings and configurations in the Spark UI logs I can generate from my jobs, I'm not finding these settings.
How can I see what actual S3 write concurrency I'm getting in each worker in the job? Is this something I can see in the Spark UI logs or is there another metric somewhere that would show this?
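For scale, the figures in the question can be turned into an effective write rate. This is only arithmetic on the numbers quoted above (10k objects, 1 MB each, 10 DPUs, a 2-hour write stage). It says nothing about where the time actually goes, and because the stage took "2+ hours" the real rates are at most these values.

```python
# Figures quoted in the question.
OBJECTS = 10_000
OBJECT_SIZE_MB = 1
DPUS = 10
WRITE_SECONDS = 2 * 60 * 60  # lower bound on the write stage duration


def write_rates(objects, size_mb, seconds, workers):
    """Return aggregate and per-worker write rates for the given totals."""
    total_mb = objects * size_mb
    return {
        "mb_per_s": total_mb / seconds,
        "objects_per_s": objects / seconds,
        "objects_per_s_per_worker": objects / seconds / workers,
    }


rates = write_rates(OBJECTS, OBJECT_SIZE_MB, WRITE_SECONDS, DPUS)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

That is roughly 1.4 MB/s in aggregate, or about one object every 7 seconds per DPU.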

How to speed up download of millions of files from AWS S3

I've been trying to download these files all summer from the IRS AWS bucket, but it is excruciatingly slow. Despite having a decent internet connection, the files start downloading at about 60 kbps and get progressively slower over time. That being said, there are literally millions of files, but each file is very small, approximately 10-50 KB.
The code I use to download the bucket is:
aws s3 sync s3://irs-form-990/ ./ --exclude "*" --include "2018*" --include "2019*"
Is there a better way to do this?
Here is also a link to the bucket itself.
My first attempt would be to provision an instance in us-east-1 with an io-type EBS volume of the required size. From what I see, there is about 14 GB of data from 2018 and 15 GB from 2019, so an instance with 40-50 GB should be enough. Or, as pointed out in the comments, you can have two instances, one for the 2018 files and a second for the 2019 files. This way you can download the two sets in parallel.
Then you attach an IAM role to the instance that allows S3 access. With this, you execute your aws s3 sync command on the instance. The traffic between S3 and your instance should be much faster than to your local workstation.
Once you have all the files, you zip them and then download the zip file. Zip should help a lot as the IRS files are txt-based XMLs. Alternatively, maybe you could just process the files on the instance itself, without the need to download them to your local workstation.
General recommendations on speeding up transfers between S3 and instances are listed in the AWS blog:
How can I improve the transfer speeds for copying data between my S3 bucket and EC2 instance?
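As a variation on the two-instance idea, the two yearly syncs could also run side by side on one instance. The sketch below builds one aws s3 sync command per year prefix, using the same bucket and flags as the command in the question. The helper names sync_command and sync_all are made up for illustration, and sync_all is left uncalled here because it needs the AWS CLI and network access.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

BUCKET = "s3://irs-form-990/"


def sync_command(prefix, dest="./"):
    """Build the `aws s3 sync` argument list for one key prefix, e.g. "2018"."""
    return [
        "aws", "s3", "sync", BUCKET, dest,
        "--exclude", "*",
        "--include", f"{prefix}*",
    ]


def sync_all(prefixes, dest="./"):
    """Run one sync per prefix concurrently and return the exit codes."""
    commands = [sync_command(prefix, dest) for prefix in prefixes]
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        results = list(pool.map(subprocess.run, commands))
    return [result.returncode for result in results]


# Only print the commands here; call sync_all(["2018", "2019"]) to run them.
for year in ["2018", "2019"]:
    print(" ".join(sync_command(year)))
```

Because the include patterns are disjoint, both syncs can write into the same destination directory.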

Spark and continuous processing of data

I am new to Spark but I am reading up as much as I can. I have a small project where multiple data files (in gzip) are going to land continuously in an S3 bucket every hour. I need to be able to open and read these gzip files and consolidate or aggregate data across them, so I need to look at them holistically. Which techniques and tools from AWS can be used for this? Do I create interim files in an S3 folder, hold DataFrames in memory, or use some database and blow away the data after each hour? I am looking for ideas more than a piece of code.
So far, in AWS, I have written a PySpark script that reads 1 file at a time and creates an output file in an output S3 folder. But that leaves me with multiple output files for each hour. It would be nice if there were 1 file for a given hour.
From a technology perspective, I am using an EMR cluster with just 1 master and 1 core node, PySpark and S3.
Thanks
You could use an AWS Glue ETL job written in PySpark. Glue jobs can be scheduled to run every hour.
I suggest reading the entire dataset, performing your operations, and then moving the data to another long-term storage location.
If you are working on a few GB of data, a PySpark job should complete within minutes. There's no need to keep an EMR cluster running for an hour if you'll only need it for 10 minutes. Consider using short-lived EMR clusters or a Glue ETL job.
Athena supports querying GZipped data. If you're performing some sort of analysis, maybe executing an Athena query with a time range will work?
You could also use a CTAS (Create Table As Select) statement in Athena to copy data to a new location and perform basic ETL on it at the same time.
What exactly does your PySpark code do?
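The "read the whole hour, then write one file" idea can be sketched in PySpark as follows. The landing/YYYY/MM/DD/HH key layout, the bucket name, and the helper names are all hypothetical, and the files are treated as plain text lines because the question does not say what format is inside the gzip files. Any aggregation would go between the read and the write.

```python
from datetime import datetime


def hour_paths(bucket, hour):
    """Return (input glob, output prefix) for one hour.

    The landing/ and hourly/ prefixes and the YYYY/MM/DD/HH layout are
    hypothetical; adjust them to the real bucket structure.
    """
    part = hour.strftime("%Y/%m/%d/%H")
    return (
        f"s3://{bucket}/landing/{part}/*.gz",
        f"s3://{bucket}/hourly/{part}/",
    )


def consolidate_hour(spark, bucket, hour):
    """Read every gzip file for the hour and write it back as one part file.

    `spark` is an existing SparkSession (for example on EMR or in a Glue job).
    """
    src, dest = hour_paths(bucket, hour)
    lines = spark.read.text(src)  # Spark decompresses .gz input transparently
    (lines.coalesce(1)            # one partition, hence one output part file
          .write.mode("overwrite")
          .option("compression", "gzip")
          .text(dest))
    return dest


# Show the paths that would be used for one example hour.
src, dest = hour_paths("my-bucket", datetime(2020, 1, 15, 9))
print(src)
print(dest)
```

Note that coalesce(1) funnels the whole hour through a single task, which is reasonable for a few GB but will not scale to much larger volumes.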

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR in order to choose between them.
I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with ETL Job running for 10 minutes for 30 days. Expected crawler requests is assumed to be 1 million above free tier and is calculated at $1 for the 1 million additional requests.
On EMR I have considered m3.xlarge for both EC2 & EMR (pricing at $0.266 & $0.070 respectively) with 6 nodes, running for 10 minutes for 30 days.
Calculating for a month, I see that AWS Glue works out to around $14.64, whereas EMR works out to around $10.08. I have not taken into account additional expenses such as S3, RDS, Redshift, etc., or the Dev endpoint, which is optional, since my objective is to compare the price of the ETL jobs.
It looks like EMR is cheaper than AWS Glue. Is the EMR pricing correct? Can someone please tell me if anything is missing? I have tried the AWS price calculator for EMR, but I found it confusing, and it is not clear to me whether normalized hours are billed.
Regards
Yuva
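The monthly totals in the question can be reproduced with a few lines of arithmetic. This sketch assumes the Glue rate of $0.44 per DPU-hour mentioned later in this thread and ignores any minimum billing duration. The instance rates and the $1 crawler charge are the ones quoted in the question.

```python
GLUE_RATE = 0.44     # USD per DPU-hour (figure quoted later in this thread)
EC2_RATE = 0.266     # USD per m3.xlarge instance-hour (from the question)
EMR_RATE = 0.070     # USD per m3.xlarge EMR surcharge-hour (from the question)
CRAWLER_COST = 1.00  # USD for 1 million crawler requests (from the question)


def monthly_cost(units, hourly_rate, minutes_per_day, days):
    """Cost of running `units` DPUs or nodes for a few minutes every day."""
    return units * hourly_rate * (minutes_per_day / 60) * days


glue_30 = monthly_cost(6, GLUE_RATE, 10, 30) + CRAWLER_COST
glue_31 = monthly_cost(6, GLUE_RATE, 10, 31) + CRAWLER_COST
emr_30 = monthly_cost(6, EC2_RATE + EMR_RATE, 10, 30)
emr_31 = monthly_cost(6, EC2_RATE + EMR_RATE, 10, 31)

print(f"Glue, 30 days: ${glue_30:.2f}")
print(f"Glue, 31 days: ${glue_31:.2f}")
print(f"EMR,  30 days: ${emr_30:.2f}")
print(f"EMR,  31 days: ${emr_31:.2f}")
```

Under these assumptions the quoted $14.64 for Glue matches a 31-day month, while the quoted $10.08 for EMR matches a 30-day month. The two totals may therefore have been computed over different month lengths, although EMR comes out cheaper either way.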
Yes, EMR does work out to be cheaper than Glue. Glue is meant to be serverless and fully managed by AWS, so the user doesn't have to worry about the infrastructure running behind the scenes, whereas EMR requires a fair amount of configuration to set up. So it's a trade-off between user friendliness and cost, and for more technical users EMR can be the better option.
@user2889316 - Did you check my question, in which I provided comparison numbers?
Also, please note that Glue costs roughly $0.44 per DPU-hour for a job. I don't think you will have any AWS Glue job that is expected to run throughout the day. Are you talking about the Glue Dev endpoint or the job?
An AWS Glue job requires a minimum of 2 DPUs to run, which means $0.88 per hour, or roughly $21 per day. This is only for the Glue job; there are additional charges such as S3, database, connection and crawler charges.
The corresponding instance for EMR is m3.xlarge, priced at $0.266 (EC2) and $0.070 (EMR) per hour. That comes to approximately $16 per day for 2 instances, plus other S3 and database charges, etc. I am considering 2 EMR instances against the default DPUs for an AWS Glue job.
Hope this would give you an idea.
Thanks
If your infrastructure doesn't need drastic scaling (and mostly runs with a fixed configuration), use EMR. If it does, Glue is the better choice because it is serverless: you scale your infrastructure just by changing the number of DPUs. In EMR, however, you have to decide on the cluster type, the number of nodes and the auto-scaling rules. For each change, you need to change the cluster creation script, test it and deploy it, which adds the overhead of a standard release cycle. With a change in infrastructure configuration, you may also want to change the Spark configuration to optimize jobs accordingly, so releasing a new version takes longer. If you start with a large configuration, it will cost more. If you start with a small configuration, you will need frequent changes to the script.
Having said that, AWS Glue has a fixed infrastructure configuration for each DPU, e.g. 16 GB of memory per DPU. If your ETL demands more memory per core, you may have to shift to EMR. However, if your ETL is designed in such a way that it will not exceed 11 GB of driver memory with 1 executor or 5.5 GB with 2 executors (e.g. by taking additional data volume in parallel on a new core, or by dividing the volume into 5 GB/11 GB batches and running them in a for loop on the same core), Glue is the right choice.
If your ETL is complex and the jobs are going to keep the cluster busy throughout the day, I would recommend going with EMR, with a dedicated DevOps team to manage the EMR infrastructure.
If you use Spot instances for EMR instead of On-Demand, it can cost around a third of the On-Demand price and will turn out to be much cheaper. AWS Glue doesn't have that pricing benefit.

What controls the number of partitions when reading Parquet files?

My setup:
Two Spark clusters. One on EC2 and one on Amazon EMR. Both with Spark 1.3.1.
The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.
The code:
Read a folder containing 12 Parquet files and count the number of partitions
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length
Observations:
On EC2 this code gives me 12 partitions (one per file, makes sense).
On EMR this code gives me 138 (!) partitions.
Question:
What controls the number of partitions when reading Parquet files?
I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings which control how partitioning happens. Does anyone have more info on this?
Insights would be greatly appreciated.
Thanks.
UPDATE:
It seems that the many partitions are created by EMR's S3 file system implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).
When removing
<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>
from core-site.xml (hereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.
When running with EmrFileSystem, it seems that the number of partitions can be controlled with:
<property><name>fs.s3n.block.size</name><value>xxx</value></property>
Could there be a cleaner way of controlling the # of partitions when using EmrFileSystem?
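To get a feel for why the reported block size changes the partition count, here is a rough model in Python. It follows the splitting rule used by Hadoop's FileInputFormat (keep cutting splits while the remaining bytes exceed 1.1 times the split size) and assumes the split size equals the block size reported by the file system. Whether the Parquet reader in Spark 1.3.1 behaves exactly this way is an assumption, and the file sizes are hypothetical.

```python
MIB = 1024 * 1024
SPLIT_SLOP = 1.1  # tolerance used by Hadoop's FileInputFormat


def count_splits(file_size, split_size, slop=SPLIT_SLOP):
    """Estimate the number of input splits for one file."""
    remaining = file_size
    splits = 0
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the final, possibly shorter, split
    return splits


def estimate_partitions(file_sizes, block_size):
    """Sum the per-file split estimates for a folder of files."""
    return sum(count_splits(size, block_size) for size in file_sizes)


# Hypothetical folder: 12 files of 700 MiB each.
files = [700 * MIB] * 12
for block_mib in (64, 256, 1024):
    print(block_mib, estimate_partitions(files, block_mib * MIB))
```

In this model a block size at least as large as each file gives one partition per file, which is consistent with the 12 partitions seen on EC2. Smaller block sizes multiply the count.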