Copying data from a server to HDFS - hdfs

My requirement is to check a remote server for files. When the required files arrive, I need to transfer them to my HDFS.
What would be a good solution for this? Can I use Oozie to do it, or do I need some other tool?

Related

Push into S3 or pull into S3: which is faster for small files?

So I have a use case where I need to put files from an on-prem FTP server into S3.
The size of each file (XML) is 5 KB max.
The number of files is about 100 per minute.
The use case is such that as soon as files arrive at the FTP location, I need to put them into the S3 bucket immediately.
What would be the best way to achieve that?
Here are my options:
Using the AWS CLI at my FTP location (push mechanism).
Using Lambda (pull mechanism).
Writing a Java application to put the files into S3 from FTP.
Or is there anything built in that I can leverage?
Basically, I need to put each file into S3 as soon as possible, because a UI is built on top of S3, and if a file does not arrive immediately I might be in trouble.
The easiest would be to use the AWS Command-Line Interface (CLI), or an API call if you wish to do it from application code.
It doesn't really make sense to do it via Lambda, because Lambda would need to somehow retrieve the file from FTP and then copy it to S3 (doing double work).
You can certainly write a Java application to do it, or simply call the AWS CLI (written in Python) since it will work out-of-the-box.
You could either use aws s3 sync to copy all new/updated files, or copy specific files with aws s3 cp. With this many files, it's probably best to specify the files explicitly; otherwise sync will waste time scanning many historical files that don't need to be copied.
The ultimate best case would be for the files to be sent to S3 directly, without involving FTP at all!
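A minimal push-side sketch of the CLI option above, driving the AWS CLI from Python on the FTP host. The bucket name, prefix, and drop directory are placeholders, and the `seen` set is a simple stand-in for real dedup/bookkeeping:

```python
import subprocess
from pathlib import Path

def build_cp_command(local_file: Path, bucket: str, prefix: str) -> list[str]:
    """Build the `aws s3 cp` command for one file (push mechanism)."""
    return ["aws", "s3", "cp", str(local_file),
            f"s3://{bucket}/{prefix}/{local_file.name}"]

def push_new_files(drop_dir: str, bucket: str, prefix: str, seen: set) -> list[str]:
    """Upload every XML file in drop_dir that has not been pushed yet."""
    uploaded = []
    for f in sorted(Path(drop_dir).glob("*.xml")):
        if f.name in seen:
            continue
        subprocess.run(build_cp_command(f, bucket, prefix), check=True)
        seen.add(f.name)
        uploaded.append(f.name)
    return uploaded
```

Run `push_new_files` from a short-interval scheduler (cron, or a loop with a small sleep) to approximate "as soon as the file arrives."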

How to pull data from API and store it in HDFS

I am aware of Flume and Kafka, but those are event-driven tools. I don't need event-driven or real-time ingestion; scheduling the import once a day would be enough.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase either, just HDFS and Hive.
I have used the R language for this for quite a while, but I am looking for a more robust solution, ideally one native to the Hadoop environment.
Look into using Scala or Python for this. There are a couple of ways to approach pulling from an API into HDFS. The first approach is to write a script which runs on your edge node (essentially just a Linux server), pulls data from the API, and lands it in a directory on the Linux file system. The script can then use HDFS file system commands to put the data into HDFS.
The second approach is to use Scala or Python with Spark to call the API and load the data directly into HDFS via a spark-submit job. Again, this script would run from an edge node; it just uses Spark to bypass landing the data on the local file system.
The first option is easier to implement. The second option is worth looking into if you have huge data volumes, or an API that can be parallelized by making calls for multiple IDs/accounts at once.
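A rough sketch of the first approach (edge-node script), using only the Python standard library; the API URL, local staging file, and HDFS base path are hypothetical, and a daily cron entry would provide the once-a-day schedule:

```python
import subprocess
import urllib.request
from datetime import date

def hdfs_target_path(base: str, day: date) -> str:
    """Daily-partitioned HDFS directory, e.g. /data/api/dt=2024-01-15."""
    return f"{base}/dt={day.isoformat()}"

def ingest(api_url: str, local_file: str, hdfs_base: str) -> None:
    """Pull the API response, land it locally, then push it into HDFS."""
    with urllib.request.urlopen(api_url) as resp:
        payload = resp.read()
    with open(local_file, "wb") as f:
        f.write(payload)
    target = hdfs_target_path(hdfs_base, date.today())
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, target], check=True)
```

The date-partitioned directory layout makes it straightforward to point a Hive external table at the data afterwards.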

EMR and Pig running two steps, will common files from S3 be cached for reuse?

I want to run an EMR Pig job which is currently logically separated into two scripts (and therefore two steps). Some of the data files are common between these two scripts. My question is: will Pig recognize this when running the second step (second script) and reuse the files already read from S3, or will it clear everything and read them again from scratch?
If your EMR cluster is reading input data from S3, it doesn't copy the data to HDFS at all.
Amazon EMR does not copy the data to local disk; instead the mappers open multithreaded HTTP connections to Amazon S3, pull the data, and process them in streams.
However, it is not "caching" these streams for multiple passes:
For iterative data processing jobs where data needs processing multiple times with multiple passes, this is not an efficient architecture. That’s because the data is pulled from Amazon S3 over the network multiple times.
For this scenario, it is probably better to copy the common data to HDFS first with S3DistCp. If some of the artifacts you produce in your first step are useful in the second step, you could write those to HDFS for the second step to pick up in either case.
Reference: AWS EMR Best Practices Guide
You can use s3-dist-cp to download the files into HDFS on EMR, then have both Pig scripts read from HDFS. That way the files won't get downloaded from S3 each time.
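As a sketch, the S3DistCp copy can be submitted as its own EMR step ahead of the two Pig steps. This builds a step definition of the shape accepted by boto3's `add_job_flow_steps`; the bucket and paths are placeholders:

```python
def s3distcp_step(src: str, dest: str) -> dict:
    """EMR step definition that runs s3-dist-cp via command-runner.jar."""
    return {
        "Name": "Copy shared input from S3 to HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }

# Hypothetical paths; both Pig scripts would then read hdfs:///common-input/.
step = s3distcp_step("s3://my-bucket/common-input/", "hdfs:///common-input/")
```

With the copy done once up front, both Pig steps read the shared files from HDFS instead of pulling them from S3 twice.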

How data gets into the HDFS file system

I am trying to understand how data from multiple sources and systems gets into HDFS. I want to push web server log files from 30+ systems; these logs are sitting on 18 different servers.
Thx
Veer
You can create a map-reduce job. The input for your mapper would be a file sitting on a server, and your reducer would determine which path to put the file at in HDFS. You can either aggregate all of your files in your reducer, or simply write each file as-is at the given path.
You can use Oozie to schedule the job, or you can run it ad hoc by submitting the map-reduce job on the server which hosts the JobTracker service.
You could also create a Java application that uses the HDFS API. The FileSystem object can be used to do standard file system operations, like writing a file to a given path.
Either way, the write has to go through the HDFS API, because the NameNode is responsible for splitting the file into blocks and writing them to the distributed servers.
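A simpler alternative to a full map-reduce job: an edge-node script that pulls each server's log over scp and puts it into HDFS. The hostnames, log path, and directories below are made up for illustration:

```python
import subprocess

def fetch_cmd(host: str, remote_log: str, local_dir: str) -> list[str]:
    """scp command to copy one server's log to the edge node."""
    return ["scp", f"{host}:{remote_log}", f"{local_dir}/{host}.log"]

def put_cmd(local_dir: str, host: str, hdfs_dir: str) -> list[str]:
    """hdfs dfs -put command to move the fetched log into HDFS."""
    return ["hdfs", "dfs", "-put", "-f",
            f"{local_dir}/{host}.log", f"{hdfs_dir}/{host}.log"]

def collect(hosts: list[str], remote_log: str, local_dir: str, hdfs_dir: str) -> None:
    """Fetch every host's log and stage it into one HDFS directory."""
    for host in hosts:
        subprocess.run(fetch_cmd(host, remote_log, local_dir), check=True)
        subprocess.run(put_cmd(local_dir, host, hdfs_dir), check=True)
```

Naming each HDFS file after its source host keeps the 18 servers' logs distinguishable once they land in one directory.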

How to put input file automatically in hdfs?

In Hadoop we always put the input file in manually with the -put command. Is there any way to automate this process?
There is no built-in automated process for loading files into the Hadoop file system. However, it is possible to -put or -get multiple files with one command.
Here is the website for the Hadoop shell commands
http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
I am not sure how many files you are dropping into HDFS, but one solution for watching for files and then dropping them in is Apache Flume. These slides provide a decent intro.
You can think of automating this process with the Fabric library and Python. Write the hdfs put command in a function; you can then call it for multiple files and perform the same operations on multiple hosts in the network. Fabric should be really helpful for automating your scenario.
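Here is a local, stdlib-only sketch of that idea: build one hdfs dfs -put command per file in a staging directory and run them. Wrapping `automate_put` in a Fabric task would extend the same logic across multiple hosts; the directory and HDFS path are placeholders:

```python
import subprocess
from pathlib import Path

def put_commands(input_dir: str, hdfs_dir: str) -> list[list[str]]:
    """One `hdfs dfs -put` command per file in input_dir."""
    return [["hdfs", "dfs", "-put", str(f), hdfs_dir]
            for f in sorted(Path(input_dir).glob("*"))]

def automate_put(input_dir: str, hdfs_dir: str) -> None:
    """Run every generated put command against the cluster."""
    for cmd in put_commands(input_dir, hdfs_dir):
        subprocess.run(cmd, check=True)
```

Scheduled via cron (or Fabric across hosts), this replaces the manual -put step entirely.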