Slow performance with pyspark on Hadoop

Slow performance with pyspark on Hadoop - python-2.7

I am running parser jobs to parse json files and load data from them to HIVE tables, I am using python (pySpark) to create first create DataFrames, collect data from json files, to load one bulk load data to HIVE tables.
There is no issue when I am processing with (300 to 500) json files which is approx. loading 2 to 4 millions records to HIVE tables with process time of approx. 24 mins to 34 mins.
When we increate number to 1000 json files it start taking 3 hours to process files and load data (~9 millions ) to HIVE tables, as we increase json files number system slow down dramatically up to 22 hours for (8000 to 9000) json files, may be 84 millions in volume ...but job fails with error of
: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 60561 for dhpxxxx) can't be found in cache
Here are the different parameter values during execution.
--deploy-mode client
--driver-memory 50g
--conf spark.driver.maxResultSize=12g
--executor-cores 4
--executor-memory 25g
--num-executors 100
This is how I am submitting my code.
/xxx/xxx/current_loaction/spark2-client/bin/spark-submit --master yarn --deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --files /xxx/xxx/current_location/spark2-client/yyyy/hive-site.xml --executor-cores 4 --executor-memory 25g --num-executors 100 process_multi_files.py
Is there a way to increase the performance of current parameters, remined that other users are also running there jobs on Hadoop cluster.
Total number of active nodes are 27, Memory Total is ~ 4.50TB

Since the performance seems directly correlated to the number of files. I would try to reduce the number of json files by introducing a pre-processing step to merge json files before I read them to be transformed into hive tables. Compressing the files to gz might also help. (How to read gz compressed file by pyspark)

Related

Data ingestion configuration for spark in aws

I am working on batch files and which we receive 1GB csv input file at a time to EMR. What is the ideal configuration for
Master and Core for 1 GB data and how do you arrive at that conclusion and is there a standard procedure? I am using the below I want to downgrade the instances 1 core instance. My concern is if more data comes in how can I upgrade my configuration?
1 instance - Master- 4 VCore, 16GiB memory, EBS-64GB
2 instances - Core- 4 VCore, 16GiB memory, EBS-64GB
The ingestion code has simple transformation and converting to parquet.

Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same glue job run with three different configurations in terms of how we write to S3:
We used a dynamic frame to write to S3
We used a pure spark frame to write to S3
Same as 1 but reducing the number of worker nodes from 80 to 60
All things equal, the dynamic frame took 75 minutes to do the job, regular Spark took 10 minutes. The output were 100 GB of data.
The dynamic frame is super sensitive to the number of worker nodes, failing due to memory issues after 2 hours of processing when slightly reducing the number of worker nodes. This is surpraising as we would expect Glue, being an AWS service, to handle better the S3 writing operations.
The code difference was this:
if dynamic:
df_final_dyn = DynamicFrame.fromDF(df_final, glueContext, "df_final")
glueContext.write_dynamic_frame.from_options(
frame=df_final_dyn, connection_type="s3", format="glueparquet", transformation_ctx="DataSink0",
connection_options={"path": "s3://...",
"partitionKeys": ["year", "month", "day"]})
else:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df_final.write.mode("overwrite").format("parquet").partitionBy("year", "month", "day")\
.save("s3://.../")
Why such an inefficiency?

How does AWS Athena manage to load 10GB/s from s3? I've managed 230 mb/s from a c6gn.16xlarge

When running this query on AWS Athena, it manages to query a 63GB Traders.csv file
SELECT * FROM Trades WHERE TraderID = 1234567
Tt takes 6.81 seconds, scanning 63.82GB in so doing (almost exactly the size of the Trades.csv file, so is doing a full table scan).
What I'm shocked at is the unbelievable speed of data drawn from s3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and incredible s3 loading ability to get around the lack of indexing (although on a standard SQL DB you would have an index on TraderID and load millions times less data).
But in my experiments I only managed to get these data reads from S3 (which are still impressive):
InstanceType
Mb/s
Network Card Gigabits
t2.2xlarge
113
low
t3.2xlarge
140
up to 5
c5n.2xlarge
160
up to 25
c6gn.16xlarge
230
100
(that's megabytes rather than megabits)
I'm using an internal VPC Endpoint for the s3 on eu-west-1. Anyone got any tricks/tips for getting s3 to load fast? Has anyone got over 1GB/s read speeds from s3? Is this even possible?

It seems like AWS Athena's strategy is to use an unbelievably massive
box with a ton of RAM
No, it's more like many small boxes, not a single massive box. Athena is running your query in parallel, on multiple servers at once. The exact details of that are not published anywhere as far as I am aware, but they make very clear in the documentation that your queries run in parallel.

AWS EMR | Total numbers of Mappers when pointing to AWS S3

I am bit curious to know how EMR cluster will decide total number of mappers, if we are triggering Hive workloads pointing to S3 location. In S3 data is not stored in form of blocks, so which component will create Input splits and assigns mapper to it?

There are two ways to find the number of mappers needed to process your input data files:
The number of mappers depends on the number of Hadoop splits. If your files are smaller than HDFS or Amazon S3 split size, the number of mappers is equal to the number of files. If some or all of your files are larger than HDFS or Amazon S3 split size (fs.s3.block.size) the number of mappers is equal to the sum of each file divided by the HDFS/Amazon S3 block size.
The examples below assume 64 MB of block size (S3 or HDFS).
Example 1: You have 100 files of 60 MB each on HDFS = 100 mappers. Since each file is less than the block size, the number of mappers equals the number of files.
Example 2: You have 100 files of 80 MB each on Amazon S3 = 200 mappers. Each data file is larger than our block size, which means each file requires two mappers to process the file.
100 files * 2 mappers each = 200 mappers
Example 3: You have two 60 MB, one 120 MB, and two 10 MB files = 6 mappers. The 60 MB files require two mappers, 120 MB file requires two mappers, and two 10 MB files require a single mapper each.
An easy way to estimate the number of mappers needed is to run your job on any Amazon EMR cluster and note the number of mappers calculated by Hadoop for your job. You can see this total by looking at JobTracker GUI or at the output of your job. Here is a sample of job output with the number of mappers highlighted:
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms) =0 13/01/13 01:12:30 INFO mapred.JobClient: Rack-local map tasks=20 13/01/13 01:12:30 INFO mapred.JobClient:
Launched map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=2329458
Reference: Amazon EMR Best Practices

Best AWS Instance for Partitioning Big Data

the problem that I am having right now is trying to find the best AWS instance for partitioning large data (scaling to greater than 1TB).
The data that I am receiving is structured data, and am hoping to partition it by either /year/month/day/ or /year/month/day/hour of the created at time. So far I have tried using EMR with the following configurations to partition 260GB of parquet data in /year/month/day (spark.dynamicAllocation.enabled == true):
3 r5.2xlarge (8 vCPU, 64GB) --> > 1 hour to just write to HDFS
2 c5.4xlarge (16 vCPU, 32GB) --> >> 1 hour to just write to HDFS (was 28% slower than the 3 r5.2xlarge)
2 r5d.4xlarge (16 vCPU, 128GB) --> 54 minutes to just write to HDFS (note, HDFS is on NVMe SSD)
This is a graph of what the 3 r5.2xlarge is producing:
This is a graph of what the 2 c5.4xlarge is producing (note, the two peaks are due to running the job twice):
This is a graph of what the 2 r5d.4xlarge is producing:
Is it possible for me to reach ~10 minutes? If so, would that mean adding more nodes or a different instance type?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js