distcp causing skewness in HDFS - hdfs

I have a folder (around 2 TB in size) in HDFS, which was created using the save method from Apache Spark. It is almost evenly distributed across nodes (I checked this using hdfs fsck).
When I distcp this folder (intra-cluster) and run hdfs fsck on the destination folder, it turns out to be highly skewed: a few nodes hold a lot of blocks while others hold very few. This skewness in HDFS is causing performance issues.
We tried moving the data with mv from source to destination (intra-cluster), and this time the destination was fine, that is, the data was evenly distributed.
Is there any way to reduce the skewness in HDFS when using distcp?

The number of mappers in the distcp job was equal to the number of nodes that ended up heavily loaded.
So I increased the number of mappers with the -m option to the number of machines in the cluster, and the output was much less skewed.
An added benefit: the distcp job also completed much faster than before.
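Since each distcp map task writes its copy from the node it runs on (and HDFS places the first replica on the writer's local DataNode), using fewer mappers than nodes concentrates blocks on those few nodes. A minimal sketch of the adjusted command (the paths and the node count of 20 are placeholders, not from my actual setup):
# -m caps the number of simultaneous map tasks; set it to the number of nodes in the cluster
hadoop distcp -m 20 hdfs:///source/folder hdfs:///destination/folder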

Related

What is the best way to copy 100GB of data between two AWS volumes?

I have two volumes attached to the same instance, and it is taking 5 hours to transfer 100GB from one to the other using Linux mv.
The c5.large instance supposedly uses an enhanced network architecture and has a network speed of .74 Gigabits/s = .0925 Gigabytes per second. So I was expecting .74/8*60*60 = 333 GB per hour, but 100 GB in 5 hours is only 20 GB per hour, about 16 times slower.
Where did I go wrong? Is there a better solution?
I use c*.large instances, and while the nominal speed is up to .74 Gigabits/s, in practice, e.g. when downloading from S3 buckets, it is about .45 MBits/s, which is more than an order of magnitude less than that nominal value (for a c4.xlarge node).
I suggest you chop your data into 1GB packages and use the following script to download them onto the attached storage option of your choice.
mkdir -p /tmp/data                # make sure the target directory exists
for i in part{001..100}           # brace expansion yields part001 ... part100
do
  echo "$(date) $i Download"
  fnam=$i.csv.bz2
  wget -O /tmp/data/$fnam http://address/to/the/data/$fnam
  echo "$(date) $i Unzip"
  bunzip2 /tmp/data/$fnam
done

Difference in default partitioning by instance type

My understanding was that Spark chooses the 'default' number of partitions based solely on the size of the file or, if it's a union of many parquet files, the number of parts.
However, when reading in a set of large parquet files, I see that the default # of partitions for an EMR cluster with a single d2.2xlarge is ~1200, whereas on a cluster of 2 r3.8xlarge I'm getting ~4700 default partitions.
What metrics does Spark use to determine the default partitions?
EMR 5.5.0
spark.default.parallelism - Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
On EMR, this defaults to 2X the number of CPU cores available to YARN containers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
Looks like it matches non-EMR/AWS Spark as well.
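If you don't want to rely on the instance-type-derived default, you can pin the value explicitly at submit time; a minimal sketch, assuming a PySpark job called my_job.py (the job name and the value 200 are placeholders):
# pin spark.default.parallelism instead of relying on EMR's 2x-cores default
spark-submit --conf spark.default.parallelism=200 my_job.py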
I think there was some transient issue, because after I restarted that EMR cluster with the d2.2xlarge it gave me the number of partitions I expected, which matched the r3.8xlarge and was equal to the number of files on S3.
If anyone knows why this kind of thing happens though, I'll gladly mark yours as the answer.

How to use S3 and EBS in tandem for cost effective analytics on AWS?

I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.
I might need to work with the files in the same way as I increase the number of features for future improved models.
Because S3 stores data as objects, every time I make a change I have to read and write 5TB of data.
What is the best approach I can take to process this data cost-effectively and promptly?
1. Store the 5TB file on S3 as an object; every time, read the object, do the processing and save the result back to S3.
2. Store the 5TB on S3 as an object; read the object, chunk it into smaller objects and save them back to S3 as multiple objects, so in future I only work with the chunks I am interested in.
3. Save everything on EBS from the start, mount it to an EC2 instance and do the processing there.
Thank you
First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.
The smarter way of processing this amount of data is to do it in parallel and preferably in multiple, smaller files rather than a single 5TB file.
Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.
If you are sticking with your current method of processing the data, I would recommend:
If your application can read directly from S3, use that as the source. Otherwise, copy the file(s) to EBS.
Process the data
Store the output locally in EBS, preferably in smaller files (GBs rather than TBs)
Copy the files to S3 (or keep them on EBS if that meets your needs)
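A rough sketch of that flow with the AWS CLI (the bucket names, file names and mount point are placeholders, and the split command merely stands in for your real processing that adds columns):
# copy the input from S3 to an EBS volume mounted at /mnt/ebs
aws s3 cp s3://my-input-bucket/clients.csv /mnt/ebs/clients.csv
# chunk into smaller files (stand-in for the actual processing step)
mkdir -p /mnt/ebs/out
split -l 10000000 /mnt/ebs/clients.csv /mnt/ebs/out/part_
# upload the smaller results back to S3
aws s3 cp /mnt/ebs/out/ s3://my-output-bucket/processed/ --recursive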

How can I merge HDFS edit log files?

I have 20005 edit log files on the NameNode, which seems like a large number to me. Is there a way I can merge them into the fsimage? I have restarted the NameNode, but it did not help.
If you do not have HA enabled for NN, then you need to have a Secondary NameNode that does this.
If you have HA enabled, then your Standby NN does this.
If you have one of those, check its logs to see what happens and why checkpointing fails. It is possible that you do not have enough RAM and need to increase the heap size of these roles, but verify that against the logs first.
If you do not have one of those beside the NN, then fix that and checkpointing will happen automatically. The relevant configs that affect checkpoint timing are:
dfs.namenode.checkpoint.period (default: 3600s)
dfs.namenode.checkpoint.txns (default: 1 million txn)
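To see the values currently in effect on your cluster (a quick check; getconf reads them from the client-side configuration):
hdfs getconf -confKey dfs.namenode.checkpoint.period
hdfs getconf -confKey dfs.namenode.checkpoint.txns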
You can run the following commands as well, but this is a temporary fix:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -rollEdits
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
Note: after entering safemode, HDFS is read-only until you leave safemode.

When is data deleted from data nodes in case of hdfs dfs -rmr on a folder?

We know that as we run the rmr command, an edit log entry is created. Do the data nodes wait for updates to the FsImage before purging the data, or does that happen concurrently? Is there any pre-condition around acknowledgement of the transaction from the Journal nodes? I am just trying to understand how HDFS edits work when there is a massive change in disk usage. How long will it take before 'hdfs dfs -du -s -h /folder' and 'hdfs dfsadmin -report' reflect the decrease in size? We tried deleting 2TB of data, and after 1 hour the data nodes' local folder (/data/yarn/datanode) still had not shrunk by 2TB.
After you delete data from HDFS, Hadoop keeps that data in the trash folder, and you need to run the command below to free the disk space:
hadoop fs -expunge
Then the space will be released by HDFS.
Or you can run the command below when deleting the data, to skip the trash:
hadoop fs -rmr -skipTrash /folder
This will not move the data into the trash.
Note: A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.
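That retention time is controlled by fs.trash.interval (in minutes); a quick way to check the value in effect on your cluster:
hdfs getconf -confKey fs.trash.interval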