How to put an input file automatically in HDFS?

In Hadoop we always put input files into HDFS manually with the -put command. Is there any way we can automate this process?

There is no built-in automated process for loading files into the Hadoop filesystem. However, it is possible to -put or -get multiple files with a single command.
Here is the documentation for the Hadoop shell commands:
http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html

I am not sure how many files you are dropping into HDFS, but one solution for watching a directory for new files and then loading them into HDFS is Apache Flume. These slides provide a decent intro.

You can think about automating this process with the Fabric library and Python. Write the hdfs put command in a function and you can call it for multiple files, and perform the same operations across multiple hosts in the network. Fabric should be really helpful for automating your scenario.
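For illustration, here is a minimal sketch assuming Fabric 2.x (the fabric.Connection API) and passwordless SSH to the target hosts; the host names, file paths, and HDFS directory below are placeholders, not anything from the question.

from fabric import Connection

HOSTS = ["edge-node-1", "edge-node-2"]            # hypothetical hosts
LOCAL_FILES = ["/data/export/part-0001.csv"]      # hypothetical local files

def put_to_hdfs(conn, local_path, hdfs_dir="/user/hadoop/input"):
    # Copy the file to the remote host, then push it into HDFS from there.
    remote_tmp = "/tmp/" + local_path.split("/")[-1]
    conn.put(local_path, remote=remote_tmp)                   # scp to the host
    conn.run("hdfs dfs -put -f %s %s/" % (remote_tmp, hdfs_dir))

if __name__ == "__main__":
    for host in HOSTS:
        with Connection(host) as conn:
            for path in LOCAL_FILES:
                put_to_hdfs(conn, path)

Run from cron or any scheduler, a script like this turns the manual -put step into an automated one.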

Related

Where are the files stored in Hue in Amazon EMR

If I go to the hue link here at http://ec2-****:8888/hue/home/ I can access the hue dashboard and create and save files etc. However, I'm not able to see those files while browsing through the system using SSH. Where are these files stored in the system?
This is not how it works, Alex: you cannot see those files in your local filesystem.
Hue is giving you a view of the underlying Hadoop Distributed File System (HDFS).
The information in this filesystem is spread across several nodes in your Hadoop cluster.
If you need to find something in that filesystem, you cannot use the typical file manipulation tools provided by the operating system, but their Hadoop counterparts.
For your use case, Hadoop provides the hdfs dfs command, or equivalently, hadoop fs.
Let's say you want to find test1.sql in the Hadoop filesystem. You can issue the following command once you SSH into your node:
hadoop fs -ls -R / | grep test1.sql
Or:
hadoop fs -find / -name test1.sql
Please see the complete reference for the available options.
Once you have located the file with the previous commands, you can retrieve it to your local filesystem by issuing the following one:
hadoop fs -get /path/to/test1.sql test1.sql
This operation could be also achieved from the Hue File Browser.
In the specific case of Amazon EMR, this distributed filesystem can be backed by different storage systems: basically HDFS, for ephemeral workloads, and EMRFS, an implementation of the Hadoop file system over S3:
EMRFS is an implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.

How to write Parquet files on HDFS using C++?

I need to write in-memory data records to an HDFS file in Parquet format using the C++ language. I know there is a parquet-cpp library on GitHub but I can't find example code.
Could anybody share a copy of, or a link to, some example code if you have any? Thanks.
There are examples for parquet-cpp in the github repo in the examples directory. They just deal with Parquet though, and do not involve HDFS access.
For HDFS access from C++, you will need libhdfs from Apache Hadoop. Or you may use Apache Arrow, which has HDFS integration, as described here.

How to pull data from API and store it in HDFS

I am aware of Flume and Kafka, but these are event-driven tools. I don't need the ingestion to be event-driven or real-time; it may be enough to just schedule the import once a day.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase either, just HDFS and Hive.
I have used the R language for that for quite some time, but I am looking for a more robust solution, maybe one native to the Hadoop environment.
Look into using Scala or Python for this. There are a couple of ways to approach pulling from an API into HDFS. The first approach would be to write a script which runs on your edge node (essentially just a Linux server), pulls data from the API, and lands it in a directory on the Linux file system. The script can then use the HDFS file system commands to put the data into HDFS.
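As a rough sketch of this first approach, the following Python script could run from cron on the edge node; the API URL, local staging path, and HDFS directory are placeholders, not anything from the original question.

import json
import subprocess
from datetime import date
import requests   # assumes the requests package is available on the edge node

API_URL = "https://api.example.com/v1/records"       # placeholder endpoint
LOCAL_PATH = "/data/staging/records_%s.json" % date.today()
HDFS_DIR = "/user/hadoop/landing/records"            # placeholder HDFS directory

# 1. Pull from the API and land the payload on the local Linux file system.
resp = requests.get(API_URL, timeout=60)
resp.raise_for_status()
with open(LOCAL_PATH, "w") as f:
    json.dump(resp.json(), f)

# 2. Push the landed file into HDFS with the ordinary shell commands.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_PATH, HDFS_DIR], check=True)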
The second approach would be to use Scala or Python with Spark to call the API and load the data directly into HDFS with a spark-submit job. Again, this script would be run from an edge node; it just uses Spark to avoid having to land the data on the local file system first.
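A comparable sketch of the second approach with PySpark, again with placeholder names; the API call happens on the driver and the result is written straight to HDFS, where Hive can read it as an external table.

import requests
from pyspark.sql import SparkSession

API_URL = "https://api.example.com/v1/records"        # placeholder endpoint
HDFS_PATH = "hdfs:///user/hadoop/landing/records"     # placeholder target

spark = SparkSession.builder.appName("api-ingest").getOrCreate()

# Call the API on the driver; this assumes it returns a JSON list of records.
records = requests.get(API_URL, timeout=60).json()
df = spark.createDataFrame(records)

# Write directly to HDFS, bypassing the local file system.
df.write.mode("overwrite").parquet(HDFS_PATH)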
The first option is easier to implement. The second option is worth looking into if you have huge data volumes or an API that can be parallelized by making calls for multiple IDs/accounts at once.

EMR and Pig running two steps, will common files from S3 be cached for reuse?

I want to run an EMR Pig job which at the moment is logically separated into two scripts (and therefore two steps); some of the data files are common between these two scripts. My question is: will Pig recognize this when running the second step (second script) and reuse the files already read from S3, or will it clear everything and read them from scratch?
If your EMR cluster is reading input data from S3, it doesn't copy the data to HDFS at all.
Amazon EMR does not copy the data to local disk; instead the mappers
open multithreaded HTTP connections to Amazon S3, pull the data, and
process them in streams
However, it is not "caching" these streams for multiple passes:
For iterative data processing jobs where data needs processing
multiple times with multiple passes, this is not an efficient
architecture. That’s because the data is pulled from Amazon S3 over
the network multiple times.
For this scenario, it is probably better to copy the common data to HDFS first with S3DistCp. If some of the artifacts you produce in your first step are useful in the second step, you could write those to HDFS for the second step to pick up in either case.
Reference: AWS EMR Best Practices Guide
You can use s3-dist-cp to download the files into HDFS on EMR, and then use the files from HDFS for both Pig scripts, so that the files won't get downloaded from S3 each time.
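For illustration, a small Python wrapper run on the EMR master node could do that copy once before the Pig steps; the bucket and paths are placeholders, and it assumes s3-dist-cp is on the PATH (it ships with EMR).

import subprocess

S3_SRC = "s3://my-bucket/common-input/"      # placeholder S3 prefix
HDFS_DEST = "hdfs:///data/common-input/"     # placeholder HDFS directory

# Copy the shared input into HDFS once, so both Pig scripts read it locally
# instead of pulling it from S3 over the network twice.
subprocess.run(["s3-dist-cp", "--src", S3_SRC, "--dest", HDFS_DEST], check=True)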

How data gets into the HDFS file system

I am trying to understand how data from multiple sources and systems gets into HDFS. I want to push web server log files from 30+ systems; these logs are sitting on 18 different servers.
Thanks,
Veer
You can create a map-reduce job. The input for your mapper would be a file sitting on a server, and your reducer would determine which HDFS path to put the file under. You can either aggregate all of your files in the reducer, or simply write each file as-is to the given path.
You can use Oozie to schedule the job, or you can run it sporadically by submitting the map-reduce job on the server which hosts the job tracker service.
You could also create a Java application that uses the HDFS API. The FileSystem object can be used to do standard file system operations, like writing a file to a given path.
Either way, the file creation has to go through the HDFS API, because the NameNode coordinates how the file is split into blocks and which DataNodes those blocks are written to.
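The answer above refers to the Java FileSystem API; purely as an illustration of the same idea from Python, here is a minimal sketch using Apache Arrow's libhdfs-backed client (pyarrow), which requires a local Hadoop client install; the NameNode host, port, and paths are placeholders.

from pyarrow import fs

# Connect to the NameNode (placeholder host/port); libhdfs must be available.
hdfs = fs.HadoopFileSystem("namenode-host", port=8020)

# Stream a local web server log into a file in HDFS.
with open("/var/log/httpd/access.log", "rb") as local_file, \
        hdfs.open_output_stream("/logs/web/server01/access.log") as out:
    out.write(local_file.read())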