How data gets into the HDFS file system - hdfs

I am trying to understand how data from multiple sources and systems gets into HDFS. I want to push web server log files from 30+ systems. These logs are sitting on 18 different servers.
Thx
Veer

You can create a map-reduce job. The input for your mapper would be a file sitting on a server, and your reducer would determine the path at which to put the file in HDFS. You can either aggregate all of your files in your reducer, or simply write the file as-is at the given path.
You can use Oozie to schedule the job, or you can run it ad hoc by submitting the map-reduce job on the server that hosts the JobTracker service.
You could also create a Java application that uses the HDFS API. The FileSystem object can be used to do standard file system operations, like writing a file to a given path.
Either way, you need to request the file creation through the HDFS API, because the NameNode coordinates how the file is split into blocks and which DataNodes end up storing them.
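
If you go the Java/HDFS API route, a minimal sketch might look like the following. The NameNode URI, local log path, and target HDFS path are placeholders for illustration; in practice fs.defaultFS would normally come from the core-site.xml on the classpath rather than being set in code.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsLogCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical local log file and HDFS destination.
        InputStream in = new BufferedInputStream(
                new FileInputStream("/var/log/webserver/access.log"));
        FSDataOutputStream out = fs.create(new Path("/logs/webserver01/access.log"));

        // The create() call goes through the NameNode, which decides where the
        // blocks live; the bytes themselves stream to the DataNodes as we copy.
        IOUtils.copyBytes(in, out, 4096, true); // true = close both streams when done
    }
}

A small wrapper like this could be dropped on each log server (or on one box that can reach all 18 servers) and run from cron.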

Related

Reliable File transfer from Django to external FTP

I have a Django application that needs to upload files generated upon the update of model objects. More concretely, once a database entry is modified, a function has to be fired to generate a tiny CSV file and upload it to an (S)FTP server. My question is how to make the uploads reliable and performant (possibly a couple of thousand transactions a day). That is, make sure the files are always uploaded, with as few duplicates as possible, without overloading the Django instance.
One option I explored was to send these tiny CSV files to a cloud queue (e.g. AWS SNS/SQS) and have another scheduled job (e.g. an AWS Lambda) fetch from the queue and upload the files.
Any ideas for the architecture?

How to pull data from API and store it in HDFS

I am aware of Flume and Kafka, but these are event-driven tools. I don't need the ingestion to be event-driven or real-time; it would be enough to just schedule the import once a day.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase either, just HDFS and Hive.
I have used the R language for that for quite a while, but I am looking for a more robust solution, ideally one native to the Hadoop environment.
Look into using Scala or Python for this. There are a couple of ways to approach pulling from an API into HDFS. The first approach would be to write a script which runs on your edge node (essentially just a Linux server), pulls data from the API, and lands it in a directory on the Linux file system. Then your script can use HDFS file system commands to put the data into HDFS.
The second approach would be to use Scala or Python with Spark to call the API and directly load the data into HDFS via a spark-submit job. Again, this script would be run from an edge node; it just uses Spark to avoid having to land the data on the local file system first.
The first option is easier to implement. The second option is worth looking into if you have huge data volumes or an API that can be parallelized by making calls for multiple IDs/accounts at once.
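
The answer suggests Scala or Python; since this thread leans on the Java HDFS API elsewhere, here is a rough Java sketch of the first approach (pull from the API, stream straight into HDFS). The endpoint URL and HDFS path are made-up placeholders.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ApiToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical REST endpoint; add whatever auth headers your API requires.
        URL api = new URL("https://api.example.com/export?date=2017-01-01");
        HttpURLConnection conn = (HttpURLConnection) api.openConnection();
        conn.setRequestMethod("GET");

        // On an edge node the Configuration picks up core-site.xml automatically.
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/data/raw/api_export/2017-01-01.json");

        try (InputStream in = conn.getInputStream();
             FSDataOutputStream out = fs.create(target, true)) {
            // Stream the HTTP response directly into HDFS, no local landing file.
            IOUtils.copyBytes(in, out, 8192, false);
        }
    }
}

Scheduling it once a day is then just a cron entry or an Oozie coordinator on the edge node.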

Best way to automate a process to be run from command line (via AWS)

I am working on a web application to provide a software as a web-based service using AWS, but I'm stuck on the implementation.
I will be using a Content Management System (probably Joomla) to manage user logins and front-end tasks such as receiving file uploads as the input. The program that provides the service needs to be run from the command line. However, I am not sure what the best way would be to automate this process (starting the program once the input file has been received). It is an intensive program that will take at least an hour for each run, and runs should be processed sequentially if there is more than one input at any one time, so there needs to be a queue where each element records the file path of the input file, the file path of the output folder, and ideally the email to send a notification to when the job is done.
I have looked into AWS Data Pipeline, the Simple Workflow Service, the Simple Queue Service, and the Simple Notification Service, but I'm still not sure how exactly these could be used to trigger the start of the process, starting from the input file being uploaded.
Any help would be greatly appreciated!
There are a number of ways to architect this type of process; here is one approach that would work:
On the upload, put the file into an S3 bucket, so that it can be accessed by any instance later.
Within the upload process, send a message to an SQS queue which includes the bucket/key of the uploaded file and the email of the user that uploaded it (a minimal sketch of this upload-and-enqueue step follows below).
Either with Lambda, or with a cron process on a purpose-built instance (or instances), check the SQS queue and process each request.
At the end of the processing phase, send the email notification to the user when the job is complete.
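
As a minimal sketch of the upload-and-enqueue step, using the AWS SDK for Java (v1-style client builders); the bucket name, queue URL, file path, and email below are all placeholders:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

public class UploadAndEnqueue {
    public static void main(String[] args) {
        // Hypothetical bucket, key, and queue.
        String bucket = "my-app-uploads";
        String key = "inputs/job-123.dat";
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/job-queue";

        // Store the uploaded input file in S3 so any worker can fetch it later.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject(bucket, key, new File("/tmp/job-123.dat"));

        // The queue message records where the input lives and who to notify.
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        sqs.sendMessage(queueUrl,
                "{\"bucket\":\"" + bucket + "\",\"key\":\"" + key
                + "\",\"email\":\"user@example.com\"}");
    }
}

The worker (a Lambda or a cron job on an instance) then polls the same queue, downloads the object, runs the hour-long job, and sends the notification email when it finishes.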
You can absolutely use AWS Data Pipeline to automate this process.
Take a look at managed preconditions and the following samples.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-preconditions.html
https://github.com/awslabs/data-pipeline-samples/tree/master/samples

Using java code to count the number of lines in a file on S3

Using Java code, is it possible to count the number of lines in a file on AWS S3 without downloading it to the local machine?
It depends on what you mean by "download".
There is no remote processing in S3 - you can't upload code that will execute inside the S3 service. Possible alternatives:
If the issue is that the file is too big to store in memory or on your local disk, you can still download the file in chunks and process each chunk separately. You just use the Java InputStream (or whatever other API you are using), download a chunk, say 4 KB, process it (scan for line endings), and continue without storing anything to disk. The downside here is that you are still doing all the I/O to pull the file from S3 to your machine. (A rough Java sketch of this option appears at the end of this answer.)
Use AWS Lambda - create a Lambda function that does the processing for you. This code runs in the Amazon cloud, so there is no I/O to your machine, only inside the cloud. The function would do the same as the previous option, just running remotely.
Use EC2 - if you need more control over your code, custom operating systems, etc., you can have a dedicated VM on EC2 that handles this.
Given the information in your question, I would say that the Lambda function is probably the best option.
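
For the first (chunked streaming) option, a rough sketch with the AWS SDK for Java looks like this; the bucket and key are placeholders, and the SDK streams the object so only a small buffer is held in memory at any time:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class S3LineCount {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical bucket and key.
        try (S3Object object = s3.getObject("my-bucket", "logs/big-file.log");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8))) {
            long lines = 0;
            // Read line by line off the S3 stream; nothing is written to local disk,
            // but every byte still travels from S3 to wherever this code runs.
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("Line count: " + lines);
        }
    }
}

The same code, packaged as a Lambda handler, gives you the second option with the I/O staying inside AWS.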

How to put an input file automatically into HDFS?

In Hadoop we are always putting the input file in manually through the -put command. Is there any way we can automate this process?
There is no built-in automated process for loading a file into the Hadoop file system. However, it is possible to -put or -get multiple files with one command.
Here is the website for the Hadoop shell commands
http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
I am not sure how many files you are dropping into HDFS, but one solution for watching for files and then dropping them in is Apache Flume. These slides provide a decent intro.
You can think about automating this process with the Fabric library and Python. Write the hdfs put command in a function and you can call it for multiple files and perform the same operations on multiple hosts in the network. Fabric should be really helpful for automating your scenario.
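
If you would rather stay on the JVM instead of using Fabric/Python, the same idea can be sketched in Java with the HDFS FileSystem API, where copyFromLocalFile is the programmatic equivalent of -put; the local landing directory and HDFS target below are hypothetical:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AutoPut {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml from the classpath on a cluster edge node.
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical local landing directory and HDFS target directory.
        File landing = new File("/data/incoming");
        Path hdfsDir = new Path("/user/ingest/incoming");

        File[] files = landing.listFiles();
        if (files == null) return;
        for (File f : files) {
            if (f.isFile()) {
                // Equivalent of `hdfs dfs -put <local> <hdfs>` for each file found.
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                        new Path(hdfsDir, f.getName()));
            }
        }
    }
}

Run it from cron (or invoke it per host) and the put step is automated.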