Best AWS Instance for Partitioning Big Data - amazon-web-services

The problem I am having right now is finding the best AWS instance for partitioning large data (scaling to greater than 1 TB).
The data I am receiving is structured, and I am hoping to partition it by either /year/month/day/ or /year/month/day/hour of the created-at time. So far I have tried using EMR with the following configurations to partition 260 GB of Parquet data by /year/month/day (spark.dynamicAllocation.enabled == true):
3 r5.2xlarge (8 vCPU, 64 GB) -> over 1 hour just to write to HDFS
2 c5.4xlarge (16 vCPU, 32 GB) -> well over 1 hour just to write to HDFS (this was 28% slower than the 3 r5.2xlarge)
2 r5d.4xlarge (16 vCPU, 128 GB) -> 54 minutes just to write to HDFS (note: HDFS is on NVMe SSD)
(Graphs of the cluster metrics produced by each of the three configurations were attached here; in the 2 c5.4xlarge graph, the two peaks are due to running the job twice.)
Is it possible for me to reach ~10 minutes? If so, would that mean adding more nodes or switching to a different instance type?
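For reference, the partitioned write in question would look roughly like this; a minimal sketch, assuming a created_at timestamp column and placeholder input/output paths:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-by-created-at").getOrCreate()

# Read the ~260 GB of source Parquet data (path is a placeholder)
df = spark.read.parquet("s3://my-bucket/raw/")

# Derive partition columns from the created_at timestamp
df = (df
      .withColumn("year", F.year("created_at"))
      .withColumn("month", F.month("created_at"))
      .withColumn("day", F.dayofmonth("created_at")))

# Write partitioned by /year=.../month=.../day=.../ to HDFS
(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("hdfs:///partitioned/"))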

Related

Data ingestion configuration for spark in aws

I am working on batch files; we receive a 1 GB CSV input file at a time on EMR. What is the ideal configuration of
master and core instances for 1 GB of data, how do you arrive at that conclusion, and is there a standard procedure? I am using the configuration below and want to downgrade to one core instance. My concern is: if more data comes in, how can I upgrade my configuration?
1 instance - Master- 4 VCore, 16GiB memory, EBS-64GB
2 instances - Core- 4 VCore, 16GiB memory, EBS-64GB
The ingestion code performs a simple transformation and converts the data to Parquet.
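A minimal sketch of what such an ingestion job looks like, assuming placeholder paths and schema inference (the transformation step is elided):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet-ingest").getOrCreate()

# Read the incoming 1 GB CSV batch (path and options are assumptions)
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/incoming/"))

# ... simple transformation would go here ...

# Write the result out as Parquet
df.write.mode("append").parquet("s3://my-bucket/curated/")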

AWS Glue Crawler - how to get proper data types

I have a Glue Crawler that reads data from S3 and auto-assigns data types.
The data type for the first column should be a number (integer), but it is showing as decimal(38,10).
Here is how Athena is showing data for first 10 records (after Glue Crawler ran):
1 3568813.0000000000
2 3568814.0000000000
3 3570809.0000000000
4 3570810.0000000000
5 3573970.0000000000
6 3573971.0000000000
7 3573972.0000000000
8 3573973.0000000000
9 3573974.0000000000
10 3573975.0000000000
(Ideally, we should see clean numbers without trailing zeros.)
Is there a way to 'force' proper data type inference with the Glue Crawler? If not, how can this be fixed after the crawling operation?
Thanks.
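One way to handle it after the crawl, if the crawler cannot be forced to infer an integer, is to cast the column explicitly when the data is next read or rewritten; a minimal PySpark sketch, where the column name and paths are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fix-decimal-column").getOrCreate()

# Read the crawled dataset (path is a placeholder)
df = spark.read.parquet("s3://my-bucket/crawled/")

# Cast the decimal(38,10) column down to a whole-number type
df = df.withColumn("id", F.col("id").cast("bigint"))

# Rewrite so downstream tools see a clean integer column
df.write.mode("overwrite").parquet("s3://my-bucket/crawled-fixed/")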

AWS EMR | Total number of Mappers when pointing to AWS S3

I am a bit curious to know how an EMR cluster decides the total number of mappers when we trigger Hive workloads pointing to an S3 location. In S3, data is not stored in the form of blocks, so which component creates the input splits and assigns a mapper to each one?
There are two ways to find the number of mappers needed to process your input data files:
The number of mappers depends on the number of Hadoop splits. If your files are smaller than the HDFS or Amazon S3 split size, the number of mappers is equal to the number of files. If some or all of your files are larger than the HDFS or Amazon S3 split size (fs.s3.block.size), the number of mappers is equal to the sum, over all files, of each file's size divided by the HDFS/Amazon S3 block size, rounded up per file.
The examples below assume 64 MB of block size (S3 or HDFS).
Example 1: You have 100 files of 60 MB each on HDFS = 100 mappers. Since each file is less than the block size, the number of mappers equals the number of files.
Example 2: You have 100 files of 80 MB each on Amazon S3 = 200 mappers. Each data file is larger than our block size, which means each file requires two mappers to process the file.
100 files * 2 mappers each = 200 mappers
Example 3: You have two 60 MB files, one 120 MB file, and two 10 MB files = 6 mappers. The two 60 MB files require one mapper each, the 120 MB file requires two mappers, and the two 10 MB files require a single mapper each.
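That rule of thumb can be written as a small helper; a sketch using the 64 MB block size from the examples:

import math

def estimate_mappers(file_sizes_mb, block_size_mb=64):
    # Each file needs at least one mapper; files larger than the block
    # size need ceil(file size / block size) mappers
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

# Example 3 above: two 60 MB, one 120 MB, and two 10 MB files
print(estimate_mappers([60, 60, 120, 10, 10]))  # 6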
An easy way to estimate the number of mappers needed is to run your job on any Amazon EMR cluster and note the number of mappers calculated by Hadoop for your job. You can see this total by looking at JobTracker GUI or at the output of your job. Here is a sample of job output with the number of mappers highlighted:
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/01/13 01:12:30 INFO mapred.JobClient: Rack-local map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: Launched map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=2329458
Reference: Amazon EMR Best Practices

Slow performance with pyspark on Hadoop

I am running parser jobs that parse JSON files and load their data into Hive tables. I am using Python (PySpark) to first create DataFrames from the JSON files and then bulk-load the data into the Hive tables.
There is no issue when I process 300 to 500 JSON files, which loads roughly 2 to 4 million records into the Hive tables with a processing time of about 24 to 34 minutes.
When we increase the number to 1,000 JSON files, it starts taking 3 hours to process the files and load the data (~9 million records) into the Hive tables. As we increase the number of JSON files further, the system slows down dramatically, up to 22 hours for 8,000 to 9,000 files (maybe 84 million records in volume), and the job then fails with the error:
: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 60561 for dhpxxxx) can't be found in cache
Here are the different parameter values during execution.
--deploy-mode client
--driver-memory 50g
--conf spark.driver.maxResultSize=12g
--executor-cores 4
--executor-memory 25g
--num-executors 100
This is how I am submitting my code.
/xxx/xxx/current_loaction/spark2-client/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 50g \
  --conf spark.driver.maxResultSize=12g \
  --files /xxx/xxx/current_location/spark2-client/yyyy/hive-site.xml \
  --executor-cores 4 \
  --executor-memory 25g \
  --num-executors 100 \
  process_multi_files.py
Is there a way to increase performance with the current parameters, keeping in mind that other users are also running their jobs on the Hadoop cluster?
The total number of active nodes is 27 and total memory is ~4.50 TB.
Since the performance seems directly correlated with the number of files, I would try to reduce the number of JSON files by introducing a pre-processing step that merges them before they are read and transformed into Hive tables. Compressing the files to gz might also help. (How to read gz compressed file by pyspark)
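A rough sketch of such a merge step, assuming the small JSON files share a schema and using placeholder paths and a placeholder target file count:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-small-json").getOrCreate()

# Read the many small JSON files in one pass
df = spark.read.json("hdfs:///landing/json/")

# Coalesce into a small number of larger files so the main parser job
# no longer has to open thousands of tiny inputs
df.coalesce(50).write.mode("overwrite").json("hdfs:///landing/json_merged/")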

HDFS replication and data distribution

I have a Hadoop cluster with 4 DataNodes. I am confused about two issues: data replication and data distribution.
Suppose that I have a 2 GB file, my replication factor is 2, and the block size is 128 MB. When I put this file into HDFS, I see that 2 copies of each 128 MB block are created and placed on datanode3 and datanode4, but datanode1 and datanode2 are not used. The data is replicated because of the replication factor, but I expected to see some data blocks on datanode1 and datanode2. Is something wrong?
Let's say that I have 20 DataNodes and the replication factor is 2. If I put a 2 GB file on HDFS, I again expect to see two copies of each 128 MB block, but I also expect those blocks to be distributed across the 20 DataNodes.
Ideally, the 2GB file should get distributed among all the available DataNodes.
File Size: 2GB = 2048MB
Block Size: 128MB
Replication Factor: 2
With the above configuration you should have 2048 / 128 * 2 = 32 block replicas, and these should be distributed almost equally among all DataNodes. Considering you have 4 DataNodes, each of them should hold around 8 blocks.
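The same arithmetic as a quick sketch, using the values from the question:

import math

file_size_mb = 2048       # 2 GB file
block_size_mb = 128
replication_factor = 2
datanodes = 4

block_replicas = math.ceil(file_size_mb / block_size_mb) * replication_factor
print(block_replicas)              # 32
print(block_replicas / datanodes)  # 8.0 blocks per DataNode on average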
The only reason I can think of for this not happening is that some DataNodes are down. Check whether all the DataNodes are up: sudo -u hdfs hdfs dfsadmin -report