What are the steps to transfer a 20 GB file to an HDFS cluster?

I have a 20 GB file. As I understand it, an HDFS cluster is nothing but a set of well-coordinated machines. If I want to transfer a 20 GB file, how can I transfer it to HDFS, and what happens internally when the file is transferred?

File size doesn't matter.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
hadoop fs -put /path/to/file hdfs://namenode.address:port/path/in/hdfs
what happens internally when we transfer it to HDFS
The file is split into HDFS blocks, which are distributed across the datanodes.
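To make the "split into blocks" step concrete, here is a small back-of-the-envelope calculation, assuming the HDFS defaults of a 128 MB block size and a replication factor of 3 (both are configurable, via dfs.blocksize and dfs.replication):

```python
import math

# Hypothetical figures: a 20 GiB file on a cluster with default settings.
file_size = 20 * 1024**3          # 20 GiB in bytes
block_size = 128 * 1024**2        # dfs.blocksize default: 128 MiB
replication = 3                   # dfs.replication default

blocks = math.ceil(file_size / block_size)   # logical blocks written
replicas = blocks * replication              # physical copies on datanodes

print(blocks)    # 160 blocks
print(replicas)  # 480 block replicas spread over the cluster
```

During the `-put`, the client streams each block to one datanode, which pipelines it on to the next replica; the namenode only records the block-to-datanode mapping, it never sees the data itself.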

Related

Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster in order to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It is working, but it is extremely slow (around 600 MB transferred after more than 20 minutes).
Here is my EMR configuration:
1 master node: m5.xlarge
3 core nodes: m5.xlarge
release label 5.29.0
The command:
s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
Am I missing something? I have read that S3DistCp can transfer a lot of files in a blink, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.
Thank you.
Here are the recommendations:
Use R-type instances. They provide more memory than M-type instances.
Use coalesce to merge the files at the source, since you have many small files.
Check the number of mapper tasks. The more tasks there are, the lower the performance.
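As a rough sanity check on what the `--groupBy`/`--targetSize=128` combination should produce here (hypothetical arithmetic, based on the 3.4 GB total and 200K file count from the question):

```python
import math

total_bytes = 3.4 * 10**9        # ~3.4 GB of input, per the question
n_input_files = 200_000          # ~200K small files
target = 128 * 1024**2           # --targetSize=128 (MiB)

avg_input = total_bytes / n_input_files          # average input file size
n_output_files = math.ceil(total_bytes / target) # files after grouping

print(round(avg_input / 1024))  # ~17 KB per input file
print(n_output_files)           # ~26 grouped output files
```

The output side is tiny; the cost is on the input side, where the job must issue on the order of 200K individual S3 requests. That is why merging small files at the source helps far more than adding cluster memory.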

Why do we need the distcp command to copy data from HDFS to S3, when we can write the data directly to the S3 location?

Please help me understand the use of distcp. We are using S3, and in some scripts I can see data being written directly to S3, while in many other cases the data is written to HDFS and then copied to S3 with distcp.
So when should we use distcp, and when can we write to the cloud directly?
First of all, you need to be very clear about why to use distcp.
Distcp is mainly used to transfer data across Hadoop clusters. Let's say you have two remote Hadoop clusters, one in California and the other in Arizona, where cluster1 is your primary and cluster2 is your secondary; that is, you do all the processing on cluster1 and dump the new data to cluster2 once the processing on cluster1 is completed.
In this scenario you would distcp (copy) your data from cluster1 to cluster2, because the two clusters are separate, and distcp copies the data very fast by copying in parallel using mappers. So you can think of distcp as roughly analogous to FTP, which is used for copying data between different servers.
In your case, I think the HDFS you mentioned belongs to another Hadoop cluster, from which you are copying your data to AWS S3, or vice versa.
Hope it clears your doubt
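The "copy in parallel using mappers" idea can be sketched in miniature. This is a toy local-filesystem sketch, not distcp's actual internals: real distcp launches a MapReduce job in which each mapper copies its slice of the file listing between clusters.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_copy(files, dest_dir, workers=4):
    """Split a file listing among workers; each copies its files independently."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        target = dest / Path(src).name
        shutil.copyfile(src, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, files))

# Demo with local temp files standing in for files on a cluster.
src_dir = Path(tempfile.mkdtemp())
for i in range(8):
    (src_dir / f"part-{i:05d}").write_text("data")

copied = parallel_copy(sorted(src_dir.iterdir()), tempfile.mkdtemp())
print(len(copied))  # 8 files copied concurrently
```

The speedup comes from the same place as in distcp: each worker transfers a disjoint subset of files, so the copies overlap instead of running one after another.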

Uploading 100 GiB of files to an AWS EC2 instance

We have n files with a total size of around 100 GiB. We need to upload all the files to an EC2 Linux instance hosted in AWS (US region).
My office (in India) internet connection is a 4 Mbps dedicated leased line. It's taking more than 45 minutes to upload a 500 MB file to the EC2 instance, which is too slow.
How do we perform this kind of bulk upload in the minimum time?
If it were hundreds of TB we could go with Snowball import/export, but this is only 100 GiB.
At 4 Mbps, a 500 MB upload should take roughly 17 minutes, so it should be about 3x faster than what you are experiencing.
If there are many small files, you can zip them up to send fewer, larger files.
Also make sure you don't bottleneck the Linux server by encrypting the data (SSH/SFTP); plain FTP may be your fastest option.
But 100 GiB will always take at least around 57 hours at your maximum line speed.
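The arithmetic behind those estimates, assuming decimal megabits (4,000,000 bits/s; the ~57-hour figure in the answer corresponds to counting a megabit as 2^20 bits):

```python
GIB = 2**30
total_bits = 100 * GIB * 8            # 100 GiB to upload
link = 4_000_000                      # 4 Mbps leased line, decimal megabits

hours = total_bits / link / 3600
print(round(hours, 1))                # ~59.7 hours at the full line rate

# Cross-check against the observed benchmark: 500 MB took ~45 minutes.
ideal_minutes = 500 * 10**6 * 8 / link / 60
print(round(45 / ideal_minutes, 1))   # observed upload is ~2.7x slower than the line rate
```

So the link itself is the hard floor: no tooling change gets 100 GiB through a 4 Mbps line in less than about two and a half days.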

What controls the number of partitions when reading Parquet files?

My setup:
Two Spark clusters. One on EC2 and one on Amazon EMR. Both with Spark 1.3.1.
The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.
The code:
Read a folder containing 12 Parquet files and count the number of partitions
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length
Observations:
On EC2 this code gives me 12 partitions (one per file, makes sense).
On EMR this code gives me 138 (!) partitions.
Question:
What controls the number of partitions when reading Parquet files?
I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings which control how partitioning happens. Does anyone have more info on this?
Insights would be greatly appreciated.
Thanks.
UPDATE:
It seems that the many partitions are created by EMR's S3 file system implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).
When removing
<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>
from core-site.xml (hereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.
When running with EmrFileSystem, it seems that the number of partitions can be controlled with:
<property><name>fs.s3n.block.size</name><value>xxx</value></property>
Could there be a cleaner way of controlling the number of partitions when using EmrFileSystem?
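The split arithmetic implied by the UPDATE can be sketched as follows. Hadoop's FileInputFormat creates roughly ceil(file_size / block_size) input splits per file, and fs.s3n.block.size is what the S3 filesystem reports as the block size, so a smaller reported block size yields more partitions. The file sizes below are hypothetical:

```python
import math

def num_splits(file_sizes, block_size):
    """Approximate Hadoop input splits: one split per block per file."""
    return sum(math.ceil(s / block_size) for s in file_sizes)

# 12 Parquet files of a hypothetical ~64 MiB each
files = [64 * 1024**2] * 12

print(num_splits(files, block_size=64 * 1024**2))  # 12 -> one partition per file
print(num_splits(files, block_size=32 * 1024**2))  # 24 -> halving the reported block size doubles the partitions
```

This matches the observation: raising fs.s3n.block.size (or using a filesystem that reports files as a single block) collapses the count back to one partition per file.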

Using Elastic MapReduce on s3

I plan to run a MapReduce job on data stored in S3. The data size is around 1 PB. Will EMR copy the entire 1 PB of data to the spawned VMs with replication factor 3 (if my rf = 3)? If yes, does Amazon charge for copying data from S3 to the MapReduce cluster?
Also, is it possible to use EMR for data not residing in S3?
Amazon Elastic MapReduce accesses data directly from Amazon S3. It does not copy the data to HDFS. (It might use some local temp storage; I'm not 100% sure.)
However, it certainly won't trigger your HDFS replication factor, since the data is not stored in HDFS. For example, Task Nodes that don't have HDFS can still access data in S3.
There is no Data Transfer charge for data movements between Amazon S3 and Amazon EMR within the same Region, but it will count towards the S3 Request count.
Amazon Elastic MapReduce can certainly be used on data not residing in Amazon S3 -- it's just a matter of loading the data from your data source, such as using scp to copy the data into HDFS. Please note that the contents of HDFS will disappear when your cluster terminates. That's why S3 is a good place to store data for EMR -- it is persistent, and there is no limit on the amount of data that can be stored.