Does copying a local file to HDFS require being on the HDFS cluster? - hdfs

I understand that the copyFromLocal or put command is used to copy local files to HDFS. My question is: does one have to be on the cluster to run the command that copies files to HDFS?
Suppose I have a 3-node cluster a1, a2, and a3, where a1 is the master node and a2 and a3 are the data nodes.
1. To copy any files to the data nodes, do I need to log in to one of the nodes (a1, a2, or a3)?
2. If the files are on some machine x1 outside the cluster, how can I copy them from x1 to the cluster?
thanks
-Brijesh

You can upload your files using ssh:
cat your_local_file_to_upload | ssh username@YOUR_HADOOP_GATEWAY "hdfs dfs -put - hadoopFolderName/file_name_in_hdfs"
Here, YOUR_HADOOP_GATEWAY is the IP of one of the nodes, or of a machine that is configured to act as a gateway to your Hadoop cluster.
It works for binary files too.
If you want to download files, you can similarly do the following:
ssh username@YOUR_HADOOP_GATEWAY "hdfs dfs -cat src_path_in_HDFS" > local_dst_path

Also, take a look at WebHDFS, a REST API for interacting with the cluster; it usually runs on the same host as the NameNode.
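For the machine-outside-the-cluster case (x1), here is a minimal sketch of an upload through WebHDFS using Python's requests library. This is an illustration only: it assumes WebHDFS is enabled and reachable from x1, and the host, port (typically 50070 on Hadoop 2.x, 9870 on 3.x), user, and paths below are placeholders.

import requests

namenode = "http://YOUR_NAMENODE_HOST:9870"   # assumption: your NameNode's WebHDFS endpoint
user = "your_hdfs_user"                       # assumption: simple (user.name) authentication
local_path = "your_local_file_to_upload"
hdfs_path = "/user/your_hdfs_user/file_name_in_hdfs"

# Step 1: ask the NameNode to create the file; it redirects to a DataNode.
resp = requests.put(
    f"{namenode}/webhdfs/v1{hdfs_path}",
    params={"op": "CREATE", "user.name": user, "overwrite": "true"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# Step 2: stream the file contents to the DataNode URL returned above.
with open(local_path, "rb") as f:
    requests.put(datanode_url, data=f)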

Related

Transferring PDF files from a local folder to AWS

I have a monthly activity where I get hundreds of PDF files in a folder and need to transfer them to an AWS server. Currently I do this manually, but I need to automate the transfer of all PDF files from my local folder to a specific folder on AWS.
This process also takes a lot of time (approx. 5 hours for 500 PDF files). Is there a way to speed it up?
While copying from local to AWS you are presumably using a tool like WinSCP or another SSH client, so you could automate the same thing with a script:
scp [-r] /your/pdf/dir youruser@awshost:/home/user/path/
If you want more speed, you could run multiple scp commands in parallel in several terminals, and perhaps split the files into logically grouped directories as they are created.
You can also zip the files before transferring them and unzip them after the transfer.
Or else write a program that iterates over all files in your folder and uploads them to S3 using the S3 API methods.
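A minimal sketch of that last approach using the boto3 SDK; the bucket name, key prefix, and local folder are placeholders, and AWS credentials are assumed to be configured:

import pathlib
import boto3

s3 = boto3.client("s3")
bucket = "REPLACE_WITH_YOUR_BUCKET"     # assumption: your target bucket
prefix = "monthly-uploads/"             # assumption: destination key prefix ("folder")

for pdf in pathlib.Path("/path/to/local/pdf/folder").glob("*.pdf"):
    # upload_file streams the file and switches to multipart uploads for large objects
    s3.upload_file(str(pdf), bucket, prefix + pdf.name)
    print(f"uploaded {pdf.name}")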

Working with AWS S3 Large Public Data Set

AWS has several public "big data" data sets available. Some are hosted for free on EBS, and others, like the NASA NEX climate data, are hosted on S3. I have found more discussion on how to work with the data sets hosted on EBS, but I have been unable to access an S3 data set from an EC2 instance with enough speed to actually work with the data.
So my issue is getting the public big data sets (~256T) "into" an EC2 instance. One approach I tried was to mount the public S3 bucket on my EC2 instance, as in this tutorial. However, when attempting to use Python to evaluate the mounted data, the processing times were very, very slow.
I am starting to think that using the AWS CLI (cp or sync) may be the correct approach, but I am still having difficulty finding documentation on this with respect to large, public S3 data sets.
In short, is mounting the best way to work with AWS's public big data sets on S3, is the CLI better, is this an EMR problem, or does the issue lie entirely in instance size and/or bandwidth?
Very large data sets are typically analysed with the help of distributed processing tools such as Apache Hadoop (which is available as part of the Amazon EMR service). Hadoop can split processing between multiple servers (nodes), achieving much better speed and throughput by working in parallel.
I took a look at one of the data set directories and found these files:
$ aws s3 ls s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/
2013-09-29 17:58:42 1344734800 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
2013-10-09 05:08:17 83 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc.md5
2013-09-29 18:18:00 1344715511 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc
2013-10-09 05:14:49 83 tasmax_ens-avg_amon_rcp26_CONUS_201101-201512.nc.md5
2013-09-29 18:15:33 1344778298 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc
2013-10-09 05:17:37 83 tasmax_ens-avg_amon_rcp26_CONUS_201601-202012.nc.md5
2013-09-29 18:20:42 1344775120 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc
2013-10-09 05:07:30 83 tasmax_ens-avg_amon_rcp26_CONUS_202101-202512.nc.md5
...
Each data file in this directory is about 1.3 GB (and is accompanied by an MD5 file for verifying its contents via a checksum).
I downloaded one of these files:
$ aws s3 cp s3://nasanex/NEX-DCP30/NEX-quartile/rcp26/mon/atmos/tasmax/r1i1p1/v1.0/CONUS/tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc .
Completed 160 of 160 part(s) with 1 file(s) remaining
The aws s3 cp command used a multi-part download to retrieve the file. It still took some time, because 1.3 GB is a fair amount of data.
The result is a local file that can be accessed via Python:
$ ls -l
total 1313244
-rw-rw-r-- 1 ec2-user ec2-user 1344734800 Sep 29 2013 tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc
It is in .nc format, which I believe is NetCDF.
I recommend downloading and processing one file at a time, since the full data set (~256 TB) far exceeds the 16 TiB maximum size of an EBS volume.
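For example, a quick way to inspect such a file in Python is with the netCDF4 package. This is only a sketch; the variable name tasmax is an assumption based on the file name:

from netCDF4 import Dataset

nc = Dataset("tasmax_ens-avg_amon_rcp26_CONUS_200601-201012.nc", "r")
print(nc.dimensions.keys())        # e.g. time, lat, lon
print(nc.variables.keys())         # coordinate variables plus the data variable
tasmax = nc.variables["tasmax"]    # assumption: the data variable matches the file name
print(tasmax.shape)
nc.close()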

Load Data using Apache-Spark on AWS

I am using Apache Spark on Amazon Web Services (AWS) EC2 to load and process data. I've created one master and two slave nodes. On the master node I have a directory data containing all the data files, in CSV format, to be processed.
Now, before we submit the driver program (my Python code) to run, we need to copy the data directory data from the master to all slave nodes. My understanding is that this is because each slave node needs to know the data file location in its own local file system so it can load the data files. For example,
from pyspark import SparkConf, SparkContext
### Initialize the SparkContext
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf = conf)
### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data") ### Read data directory
### Collect files from the RDD
datafile.collect()
When each slave node runs a task, it loads the data file from its local file system.
However, before we submit my application to run, we also have to put the directory data into the Hadoop Distributed File System (HDFS) using $ ./ephemeral-hdfs/bin/hadoop fs -put /root/data/ ~.
Now I am confused about this process. Does each slave node load the data files from its own local file system or from HDFS? If it loads data from the local file system, why do we need to put the data into HDFS? I would appreciate it if anyone could help me.
Just to clarify for others who may come across this post:
I believe your confusion is due to not providing a protocol in the file location. When you run the following line:
### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data") ### Read data directory
Spark assumes the file path /root/data is in HDFS. In other words, it looks for the files at hdfs:///root/data.
You only need the files in one location: either locally on every node (not the most efficient in terms of storage) or in HDFS, which is distributed across the nodes.
If you wish to read files from the local file system, use file:///path/to/local/file. If you wish to use HDFS, use hdfs:///path/to/hdfs/file.
Hope this helps.
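For illustration, a minimal PySpark sketch of the two prefixes, reusing the asker's /root/data path (the paths are placeholders and the data must actually exist at those locations):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("protocol-example")
sc = SparkContext(conf=conf)

# Read from the local file system; the path must exist on every worker node.
local_files = sc.wholeTextFiles("file:///root/data")

# Read from HDFS; the path is resolved against the cluster's distributed file system.
hdfs_files = sc.wholeTextFiles("hdfs:///root/data")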
One quick suggestion is to load the CSV files from S3 instead of keeping them locally.
Here is a sample Scala snippet which can be used to load a bucket from S3:
val csvs3Path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"
val dataframe = sqlContext.
  read.
  format("com.databricks.spark.csv").
  option("header", "true").
  load(csvs3Path)
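Since the question itself uses PySpark, a rough Python equivalent of the same idea is sketched below; it assumes the com.databricks:spark-csv package is available (Spark 1.x) and uses the same placeholder credentials and bucket:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-from-s3")
sqlContext = SQLContext(sc)

csv_s3_path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"

dataframe = (sqlContext.read
             .format("com.databricks.spark.csv")
             .option("header", "true")
             .load(csv_s3_path))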

Moving a changing file to a new server using Gzip

I have a file in AWS S3 that is updated every second (it is continuously collecting new data). I want to move the collected file to my local server periodically. Here are a few things I am considering:
1. The transfer needs to be compressed somehow to reduce the network burden, since the S3 cost is based on network load.
2. After moving the data out of AWS S3, the data on S3 needs to be deleted. In other words, the data on my server plus the data on AWS should form the complete dataset, with no intersection between the two. Otherwise, the next time we move data, there will be duplicates in the dataset on my server.
3. The dataset on S3 grows all the time, and new data is appended to the file via standard input. A cron job runs to collect the data.
Here is pseudo code that shows how the file is built on S3:
* * * * * nohup python collectData.py >> data.txt
This means the data transfer cannot break the pipeline; otherwise, new data will be lost.
One option is to mount the S3 bucket as a local directory (for example, using the RioFS project) and use standard shell tools (rm, cp, mv, ...) to remove an old file and upload a new file to Amazon S3.
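Alternatively (not part of the answer above, just a sketch), the move-then-delete step could be scripted with boto3. Note that this does not compress the data in flight, so to save transfer the collector would need to write the object gzip-compressed in the first place, and it ignores the race with the ongoing appends that the question highlights; the bucket, key, and local path are placeholders:

import boto3

s3 = boto3.client("s3")
bucket = "REPLACE_WITH_YOUR_BUCKET"   # assumption: the collector's bucket
key = "data.txt"                      # assumption: the collected object's key
local_path = "/srv/archive/data.txt"  # assumption: destination on the local server

s3.download_file(bucket, key, local_path)   # copy the object to the local server
s3.delete_object(Bucket=bucket, Key=key)    # then remove it from S3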

zookeeper jar not found in HBase MR job

I have a web UI that tries to spawn an MR job on an HBase table. I keep getting this error, though:
java.io.FileNotFoundException: File /opt/hadoop/mapreduce/system/job_201205251929_0007/libjars/zookeeper-3.3.2.jar does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
I am running HBase 0.90.4. HBase manages its own ZooKeeper, and I have confirmed that /opt/hadoop/mapreduce/system/job_201205251929_0007/libjars/zookeeper-3.3.2.jar exists in my HDFS. Is it looking in the local FS?
I found that I did not have core-site.xml on my classpath, so fs.default.name fell back to the local FS instead of HDFS. The jar existed in HDFS, but the job client was looking in the local FS.
Any jar files accessed in the mapper or reducer need to be in the local filesystem on all the nodes in the cluster. Check your local FS.