AWS EMR | Total number of mappers when pointing to AWS S3

I am a bit curious to know how an EMR cluster decides the total number of mappers when we trigger Hive workloads pointing to an S3 location. In S3, data is not stored in the form of blocks, so which component creates the input splits and assigns mappers to them?

There are two ways to find the number of mappers needed to process your input data files:
The number of mappers depends on the number of Hadoop splits. If your files are smaller than the HDFS or Amazon S3 split size, the number of mappers is equal to the number of files. If some or all of your files are larger than the HDFS or Amazon S3 split size (fs.s3.block.size), the number of mappers is the sum, over all files, of each file's size divided by the HDFS/Amazon S3 block size, rounded up per file.
The examples below assume a block size of 64 MB (S3 or HDFS).
Example 1: You have 100 files of 60 MB each on HDFS = 100 mappers. Since each file is less than the block size, the number of mappers equals the number of files.
Example 2: You have 100 files of 80 MB each on Amazon S3 = 200 mappers. Each data file is larger than our block size, which means each file requires two mappers to process the file.
100 files * 2 mappers each = 200 mappers
Example 3: You have two 60 MB files, one 120 MB file, and two 10 MB files = 6 mappers. The two 60 MB files require one mapper each, the 120 MB file requires two mappers, and the two 10 MB files require one mapper each.
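As a rough illustration of the rule above, here is a minimal Python sketch that reproduces the calculation; the 64 MB block size and the file sizes are just the example values from above, not something read from a cluster:

import math

def estimate_mappers(file_sizes_mb, block_size_mb=64):
    # One mapper per split: each file contributes ceil(size / block size)
    # mappers, with a minimum of one mapper per file.
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

print(estimate_mappers([60] * 100))              # Example 1: 100 mappers
print(estimate_mappers([80] * 100))              # Example 2: 200 mappers
print(estimate_mappers([60, 60, 120, 10, 10]))   # Example 3: 6 mappers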
An easy way to estimate the number of mappers needed is to run your job on any Amazon EMR cluster and note the number of mappers calculated by Hadoop for your job. You can see this total by looking at the JobTracker GUI or at the output of your job. Here is a sample of job output showing the number of mappers (Launched map tasks):
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/01/13 01:12:30 INFO mapred.JobClient: Rack-local map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: Launched map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=2329458
Reference: Amazon EMR Best Practices

Related

Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same Glue job run with three different configurations in terms of how we write to S3:
1. We used a dynamic frame to write to S3
2. We used a pure Spark data frame to write to S3
3. Same as 1 but reducing the number of worker nodes from 80 to 60
All things being equal, the dynamic frame took 75 minutes to do the job while regular Spark took 10 minutes. The output was 100 GB of data.
The dynamic frame is super sensitive to the number of worker nodes, failing due to memory issues after 2 hours of processing when the number of worker nodes is slightly reduced. This is surprising, as we would expect Glue, being an AWS service, to handle S3 write operations better.
The code difference was this:
if dynamic:
    df_final_dyn = DynamicFrame.fromDF(df_final, glueContext, "df_final")
    glueContext.write_dynamic_frame.from_options(
        frame=df_final_dyn,
        connection_type="s3",
        format="glueparquet",
        transformation_ctx="DataSink0",
        connection_options={"path": "s3://...",
                            "partitionKeys": ["year", "month", "day"]})
else:
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df_final.write.mode("overwrite").format("parquet").partitionBy("year", "month", "day")\
        .save("s3://.../")
Why such an inefficiency?

Best AWS Instance for Partitioning Big Data

The problem that I am having right now is trying to find the best AWS instance for partitioning large data (scaling to greater than 1 TB).
The data that I am receiving is structured, and I am hoping to partition it by either /year/month/day/ or /year/month/day/hour/ of the created-at time. So far I have tried using EMR with the following configurations to partition 260 GB of Parquet data by /year/month/day (spark.dynamicAllocation.enabled == true):
3 r5.2xlarge (8 vCPU, 64GB) --> > 1 hour to just write to HDFS
2 c5.4xlarge (16 vCPU, 32GB) --> >> 1 hour to just write to HDFS (was 28% slower than the 3 r5.2xlarge)
2 r5d.4xlarge (16 vCPU, 128GB) --> 54 minutes to just write to HDFS (note, HDFS is on NVMe SSD)
[Graphs of cluster metrics for the 3 r5.2xlarge, 2 c5.4xlarge, and 2 r5d.4xlarge runs were attached to the original post; in the 2 c5.4xlarge graph the two peaks are due to running the job twice.]
Is it possible for me to reach ~10 minutes? If so, would that mean adding more nodes or a different instance type?
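For reference, a minimal PySpark sketch of the write pattern described in the question; the DataFrame name df, the created_at column name, and the output path are assumptions for illustration:

from pyspark.sql import functions as F

df_partitioned = (df
    .withColumn("year", F.year("created_at"))
    .withColumn("month", F.month("created_at"))
    .withColumn("day", F.dayofmonth("created_at")))

# Write Parquet partitioned by /year/month/day/
(df_partitioned.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("hdfs:///tmp/partitioned_output/"))   # placeholder path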

Slow performance with pyspark on Hadoop

I am running parser jobs to parse JSON files and load data from them into Hive tables. I am using Python (PySpark) to first create DataFrames that collect the data from the JSON files, and then do one bulk load into the Hive tables.
There is no issue when I am processing 300 to 500 JSON files, which loads approximately 2 to 4 million records into the Hive tables with a processing time of roughly 24 to 34 minutes.
When we increase the number to 1000 JSON files, it starts taking 3 hours to process the files and load the data (~9 million records) into the Hive tables. As we increase the number of JSON files further, the system slows down dramatically, up to 22 hours for 8000 to 9000 files (maybe 84 million records in volume), and then the job fails with the error:
: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 60561 for dhpxxxx) can't be found in cache
Here are the different parameter values during execution.
--deploy-mode client
--driver-memory 50g
--conf spark.driver.maxResultSize=12g
--executor-cores 4
--executor-memory 25g
--num-executors 100
This is how I am submitting my code.
/xxx/xxx/current_loaction/spark2-client/bin/spark-submit --master yarn --deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --files /xxx/xxx/current_location/spark2-client/yyyy/hive-site.xml --executor-cores 4 --executor-memory 25g --num-executors 100 process_multi_files.py
Is there a way to improve the performance with the current parameters, keeping in mind that other users are also running their jobs on the Hadoop cluster?
The total number of active nodes is 27, and total memory is ~4.50 TB.
Since the performance seems directly correlated with the number of files, I would try to reduce the number of JSON files by introducing a pre-processing step that merges the JSON files before they are read and transformed into Hive tables. Compressing the files to gz might also help. (How to read gz compressed file by pyspark)
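A minimal sketch of such a pre-processing step in PySpark, assuming an existing SparkSession named spark; the paths and the target file count are placeholders:

# Read the many small JSON files and rewrite them as a smaller number
# of gzip-compressed files before running the main parsing job.
raw_df = spark.read.json("hdfs:///data/raw_json/")        # placeholder input path
(raw_df
    .coalesce(50)                                          # assumed target number of output files
    .write
    .mode("overwrite")
    .option("compression", "gzip")
    .json("hdfs:///data/merged_json/"))                    # placeholder output path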

Spark - filter v. large data (400 GB) in small cluster (16) takes too long to save to S3

I am doing a very simple job at a very large scale.
I have 480 GB in JSON files in an S3 bucket.
// Read the raw text lines, keep only the matching ones, and write the result back to S3
val events = spark.sparkContext.textFile("s3a://input/")
val filteredEvents = events.filter(_.contains("..."))
filteredEvents.saveAsTextFile("s3a://output/")
After doing a lot of work for ~5 minutes, there is a last task that takes forever. I can see a lot of partial files on the S3 bucket but the job is not finished yet; there is a temporary folder and no success message. I waited for ~20 minutes and no change. Just this one last task that shows a huge scheduler delay.
I suppose this might be the workers sending data back to the scheduler; can't each worker write directly to S3?
My cluster has 16 m3.2xlarge nodes.
Am I trying a job too big for a small cluster?

HDFS replication and data distribution

I have a Hadoop cluster with 4 DataNodes. I am confused about two issues: data replication and data distribution.
Suppose that I have a 2 GB file, my replication factor is 2, and the block size is 128 MB. When I put this file into HDFS, I see that 2 copies of each 128 MB block are created and they are placed on datanode3 and datanode4, but datanode1 and datanode2 are not used. The data is replicated because of the replication factor, but I expected to see some data blocks on datanode1 and datanode2. Is something wrong?
Let's say that I have 20 DataNodes and the replication factor is 2. If I put a file (2 GB) on HDFS, I again expect to see two copies of each 128 MB block, but I also expect these 128 MB blocks to be distributed across the 20 DataNodes.
Ideally, the 2GB file should get distributed among all the available DataNodes.
File Size: 2GB = 2048MB
Block Size: 128MB
Replication Factor: 2
With the above configuration you should have 2048 / 128 * 2 = 32 blocks, i.e. 16 unique blocks with 2 replicas each. These blocks should get distributed almost equally across all DataNodes. Considering you have 4 DataNodes, each of them should have around 8 blocks.
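The same arithmetic as a quick Python sketch, using the example values above:

import math

file_size_mb  = 2048   # 2 GB file
block_size_mb = 128
replication   = 2
datanodes     = 4

blocks = math.ceil(file_size_mb / block_size_mb) * replication
print(blocks)              # 32 blocks (16 unique blocks x 2 replicas)
print(blocks / datanodes)  # ~8 blocks per DataNode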
The reason I can think of for not seeing the above situation is that some DataNodes are down. Check whether all the DataNodes are up: sudo -u hdfs hdfs dfsadmin -report