Programmatically write files into HDFS

I am looking at options for Java programs that can write files into HDFS with the following requirements.
1) Transaction support: each file is either written fully and successfully, or fails entirely without any partial file blocks being left behind.
2) Compression support/file formats: the compression type or file format can be specified when writing the contents.
I know how to write data into a file on HDFS by opening a FSDataOutputStream shown here. Just wondering if there are some libraries or out-of-the-box solutions that provide the support I mentioned above.
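For reference, here is a minimal sketch of that FSDataOutputStream approach, using a gzip codec and a write-to-temporary-path-then-rename pattern to approximate all-or-nothing semantics. This is not a library, just the plain Hadoop FileSystem API; the paths and payload are made up for illustration.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class HdfsCompressedWriter {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path tmp = new Path("/data/incoming/_tmp/events.gz");   // hypothetical staging path
            Path dst = new Path("/data/incoming/events.gz");        // final, published path

            CompressionCodec codec = new CompressionCodecFactory(conf)
                    .getCodecByClassName(GzipCodec.class.getName());

            try (FSDataOutputStream raw = fs.create(tmp, true);
                 OutputStream out = codec.createOutputStream(raw)) {
                out.write("some payload\n".getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                fs.delete(tmp, false);   // clean up the partial staging file
                throw e;
            }

            // rename is a metadata operation: either the whole file appears or nothing does
            if (!fs.rename(tmp, dst)) {
                throw new IOException("Could not publish " + dst);
            }
        }
    }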
I stumbled upon Flume, which provides an HDFS sink that supports transactions, compression, file rotation, etc. But it doesn't seem to provide an API to be used as a library. The features Flume provides are tightly coupled to Flume's architectural components (sources, channels, and sinks) and don't seem to be usable independently. All I need is the HDFS-loading part.
Does anyone have some good suggestions?

I think using Flume as a "gateway" to HDFS would be a good solution. Your program sends data to Flume (using one of the interfaces provided by its sources), and Flume writes to HDFS; a sketch of the client side follows below.
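As an illustration, a minimal sketch of pushing events to a Flume agent through the Flume client SDK. It assumes an agent that is already configured with an Avro source and an HDFS sink; the host and port are placeholders.

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeGatewayExample {
        public static void main(String[] args) throws EventDeliveryException {
            // Connects to a Flume agent whose Avro source listens on this host/port
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
            try {
                Event event = EventBuilder.withBody("one record", StandardCharsets.UTF_8);
                client.append(event);   // delivered to the agent's channel, then to the HDFS sink
            } finally {
                client.close();
            }
        }
    }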
This way you don't need to maintain a bunch of custom code for interacting with HDFS. On the other hand, you do need to install and configure Flume, but in my experience that is much easier (see this comment for installation recommendations).
Finally, the Flume HDFS sink is an open-source component, so you are free to reuse its code under the terms of the Apache license. Get the sources here: https://git-wip-us.apache.org/repos/asf?p=flume.git;a=tree;f=flume-ng-sinks/flume-hdfs-sink;h=b9414a2ebc976240005895e3eafe37b12fad4716;hb=trunk

Related

How to read and write Parquet files to/from ADLS using the Arrow/Parquet C++ libraries?

I have a need to access Parquet-formatted data on Azure (ADLS). We are using the C++ libraries that are available for both Apache Arrow and Parquet. Reading/writing to local disk is relatively straightforward using the Parquet C++ library. However, doing the same with ADLS appears to be more complicated. Through research I was able to use a new Arrow filesystem class to read/write from/to GCS, but there is no such class for ADLS. There appears to be one for S3, but Azure is our focus at present.
Does anyone have an idea of how to do this? Have you done it successfully? Thanks!

Avro message for Google Cloud Pub-Sub?

What is the best data format for publishing and consuming to/from Pub/Sub? I am looking at the Avro message format due to its binary encoding.
The use case is real-time microservice applications publishing Avro messages to Pub/Sub. Given that Avro is best suited to batching up messages (with a schema attached to the binary payload) and then publishing them, would it be a suitable format for this microservice use case?
The Google Cloud documentation contains some JSON examples, but when looking for efficiency the main suggestion is to use the available client libraries, unless your needs are not met by what the client libraries offer, or you are running on the Google App Engine standard environment, in which case the use of the two APIs is suggested.
In fact, the most important factor for efficiency is using the gRPC API instead of the REST API (which is what the client libraries' calls do by default). As mentioned here:
There are two major factors at work here: more efficient data encoding and HTTP/2. gRPC keeps data in binary both in client memory and on the wire by building on HTTP/2 and Protocol Buffers. This eliminates processing and space required for string encoding schemes such as Base64 or JSON. In addition, HTTP/2 itself makes things go faster with multiplexed requests over a single connection and header compression.
I did not find any explicit mention of data formats anywhere. I suggest using your preferred language for the message, for example Python. The client library description is here and sample code is here.
Based on this StackOverflow post, you can optimize your Pub/Sub system efficiently by:
Making sure you are using gRPC
Batching where possible, to reduce the number of calls and eliminate latency.
Only compressing when needed and after benchmarking (this implies extra logic in your application)
Finally, if you intend to deploy a robust Pub/Sub system, have a look at this Anusha Ramesh post. She is now a Project Manager at Google and elaborates on three tips:
Don't underestimate the importance of capacity planning.
Make sure your pub/sub system is fault-tolerant.
NSM: Never Stop Monitoring.
There isn't going to be one correct answer for the best format to use for the messages for all use cases. Avro is certainly a popular choice. Protocol buffers would be another possibility, as would Thrift. For Pub/Sub, the data is all just bytes and it is up to the publisher and the subscriber to determine the interpretation of this data. People have run comparisons on the different data formats, so you may want to make the decision based on your needs in terms of performance and message sizes.
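For illustration only, here is a minimal sketch of producing the bytes for such a message with the Apache Avro Java library; the schema, record name, and field names are invented, and the resulting byte array is simply what you would hand to the publisher as the message payload.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroToBytes {
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
          + "{\"name\":\"user\",\"type\":\"string\"},"
          + "{\"name\":\"ts\",\"type\":\"long\"}]}");

        public static byte[] serialize(String user, long ts) throws IOException {
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("user", user);
            record.put("ts", ts);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
            encoder.flush();
            return out.toByteArray();   // these bytes become the Pub/Sub message payload
        }
    }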
Pub/Sub itself uses Protocol Buffers for defining its data types. With regard to batching, the Cloud Pub/Sub client libraries do batching themselves for publish, so you don't necessarily have to worry about that on your own. You can control the batch settings to optimize throughput and latency based on your use case by calling, for example, setBatchingSettings on the Publisher.Builder for Java (other languages have an equivalent as well). You may decide to do your own batching if you want to associate some metadata with a set of messages instead of with each individual message, or if you have very specific needs in terms of how messages are batched together. Otherwise, relying on the client library to do the batching is probably the correct decision.
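A hedged sketch of what that batching configuration can look like with the Java client library; the project, topic, and thresholds are placeholders, and exact builder names may differ slightly between client library versions.

    import com.google.api.gax.batching.BatchingSettings;
    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;
    import org.threeten.bp.Duration;

    public class BatchedPublisherExample {
        public static void main(String[] args) throws Exception {
            BatchingSettings batching = BatchingSettings.newBuilder()
                .setElementCountThreshold(100L)              // send after 100 messages...
                .setRequestByteThreshold(1024L * 1024L)      // ...or 1 MiB of payload...
                .setDelayThreshold(Duration.ofMillis(50))    // ...or 50 ms, whichever comes first
                .build();

            Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic"))
                .setBatchingSettings(batching)
                .build();
            try {
                PubsubMessage msg = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8("payload bytes, e.g. an Avro record"))
                    .build();
                publisher.publish(msg);   // returns an ApiFuture with the message id
            } finally {
                publisher.shutdown();
            }
        }
    }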

Service in front of Hadoop

I would like to expose a web service in front of Hadoop that is used to forward data into the Hadoop ecosystem. I have two branches in Hadoop: a slower one that works on the whole dataset periodically, and a fast one that does some computation on every input and stores the data for the periodic job. The user only sees the fast branch and has the impression that only the fast job runs, not knowing about the slower job that runs on data aggregated over time.
How should I best organize my architecture? I am new to Hadoop architecture; I have read about Oozie and have a feeling it can help me to some extent. But I don't know how to connect the service with Hadoop, or how to pass data through the service, since Hadoop works primarily on files and is a distributed system.
Data should get into the system in a streaming fashion. There should be a "real-time" branch that works on individual values as they arrive, and those values would also be accumulated for periodic batch processing.
Any help would be great, thanks.
You might want to look into Hue. It provides a set of web front-ends: there's one for HDFS (the filesystem) where you can upload files, and there are means to track jobs too.
If you are aiming for more regular and automated loading of files into HDFS, please elaborate on your question further: where and what is the data initially (logs? a DB? a bunch of gzipped CSVs?), and what should trigger retrieval?
One can also use APIs to deal with the filesystem and to track jobs.
As far as Oozie is concerned, it is more of an orchestration tool; use it to organize related jobs into workflows. A minimal example of driving it programmatically follows below.
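For example, a service can submit a pre-deployed workflow through the Oozie Java client; a minimal sketch, where the Oozie URL, cluster addresses, and workflow path are all placeholders.

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.OozieClientException;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws OozieClientException {
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            Properties props = oozie.createConfiguration();
            props.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/daily-batch"); // workflow.xml location
            props.setProperty("nameNode", "hdfs://namenode:8020");
            props.setProperty("jobTracker", "resourcemanager:8032");

            String jobId = oozie.run(props);              // submit and start the workflow
            WorkflowJob job = oozie.getJobInfo(jobId);    // poll for status
            System.out.println(jobId + " -> " + job.getStatus());
        }
    }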

Why HDFS is write once and read multiple times?

I am a new learner of Hadoop.
While reading about Apache HDFS I learned that HDFS is a write-once file system. Some other distributions (e.g., Cloudera) provide an append feature. It would be good to know the rationale behind this design decision. In my humble opinion, this design places a lot of limitations on Hadoop and makes it suitable only for a limited set of problems (problems similar to log analytics).
Comments from experts will help me understand HDFS better.
HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support modification too)! At present, HDFS strictly has one writer at any time. This assumption enables high-throughput data access and also simplifies data coherency issues. A web crawler or a MapReduce application is best suited for HDFS.
Because HDFS works on the principle of 'write once, read many', streaming data access is extremely important: HDFS is designed for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete dataset is more important than the time taken to fetch a single record from it. HDFS relaxes a few POSIX requirements in order to implement streaming data access.
http://www.edureka.co/blog/introduction-to-apache-hadoop-hdfs/
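To make the model concrete, here is a minimal Java sketch of the only two write operations HDFS really offers: creating a file once and, on clusters/versions where it is enabled, appending to it. The path is made up; rewriting bytes in the middle of an existing file is not supported.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceVsAppend {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/logs/2016-01-01/events.log");   // hypothetical path

            // The normal pattern: create the file once, write it, close it.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("first batch of records\n".getBytes(StandardCharsets.UTF_8));
            }

            // Append is the only mutation offered, and only where the cluster enables it.
            try (FSDataOutputStream out = fs.append(path)) {
                out.write("records added later\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }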
There are three major reasons that HDFS has the design it has:
HDFS was designed by slavishly copying the design of Google's GFS, which was intended to support batch computations only
HDFS was not originally intended for anything but batch computation
Designing a real distributed file system that can support high-performance batch operations as well as real-time file modifications is difficult, and it was beyond the budget and experience level of the original implementors of HDFS.
There is no inherent reason that Hadoop couldn't have been built as a fully read/write file system. MapR FS is proof of that. But implementing such a thing was far outside of the scope and capabilities of the original Hadoop project and the architectural decisions in the original design of HDFS essentially preclude changing this limitation. A key factor is the presence of the NameNode since HDFS requires that all meta-data operations such as file creation, deletion or file length extensions round-trip through the NameNode. MapR FS avoids this by completely eliminating the NameNode and distributing meta-data throughout the cluster.
Over time, not having a real mutable file system has become more and more annoying as the workloads for Hadoop-related systems such as Spark and Flink have moved more and more toward operational, near real-time or real-time operation. The responses to this problem have included:
MapR FS. As mentioned, MapR implemented a fully functional, high-performance re-implementation of HDFS that includes POSIX functionality as well as NoSQL table and streaming APIs. This system has been in production for years at some of the largest big data systems around.
Kudu. Cloudera essentially gave up on implementing viable mutation on top of HDFS and has announced Kudu with no timeline for general availability. Kudu implements table-like structures rather than fully general mutable files.
Apache NiFi and the commercial version HDF. Hortonworks has also largely given up on HDFS and announced their strategy as forking applications into batch (supported by HDFS) and streaming (supported by HDF) silos.
Isilon. EMC implemented the HDFS wire protocol as part of their Isilon product line. This allows Hadoop clusters to have two storage silos, one for large-scale, high-performance, cost-effective batch based on HDFS and one for medium-scale mutable file access via Isilon.
Other. There are a number of essentially defunct efforts to remedy the write-once nature of HDFS. These include KFS (the Kosmos File System) and others. None of these has significant adoption.
An advantage of this technique is that you don't have to bother with synchronization. Since you write once, your readers are guaranteed that the data will not be modified while they read it.
Though this design decision does impose restrictions, HDFS was built keeping in mind efficient streaming data access.
Quoting from Hadoop - The Definitive Guide:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Hadoop and Django, is it possible?

From what I understood, Hadoop is a distributed storage system thingy. However, what I don't really get is: can we replace a normal RDBMS (MySQL, PostgreSQL, Oracle) with Hadoop? Or is Hadoop just another type of filesystem on which we CAN run an RDBMS?
Also, can Django be integrated with Hadoop? Usually, how do web frameworks (ASP.NET, PHP, Java (JSP, JSF, etc.)) integrate themselves with Hadoop?
I am a bit confused with the Hadoop vs RDBMS and I would appreciate any explanation.
(Sorry, I read the documentation many times, but maybe due to my lack of knowledge in English, I find the documentation is a bit confusing most of the time)
What is Hadoop?
Imagine the following challenge: you have a lot of data, and by a lot I mean at least terabytes. You want to transform this data, or extract some information, and process it into a format that is indexed, compressed, or "digested" in a way that lets you work with it.
Hadoop is able to parallelize such a processing job and, here comes the best part, takes care of things like redundant storage of the files, distribution of the tasks over different machines in the cluster, etc. (Yes, you need a cluster; otherwise Hadoop cannot compensate for the performance overhead of the framework.)
If you take a first look at the Hadoop ecosystem you will find three big terms: HDFS (the Hadoop filesystem), Hadoop itself (with MapReduce), and HBase (the "database", sometimes a column store, which does not fit that label exactly).
HDFS is the filesystem used by both Hadoop and HBase. It is an extra layer on top of the regular filesystem on your hosts. HDFS slices uploaded files into chunks (usually 64 MB), keeps them available across the cluster, and takes care of their replication.
When Hadoop gets a task to execute, it gets the path of the input files on HDFS, the desired output path, a Mapper class, and a Reducer class. The Mapper and Reducer are usually Java classes passed in a JAR file. (With Hadoop Streaming you can use any command-line tool you want.) The mapper is called to process every entry of the input files (usually line by line, e.g. "return 1 if the line contains a bad F* word"), and its output is passed to the reducer, which merges the individual outputs into another desired format (e.g. a sum of numbers). This is an easy way to get a "bad word" counter; a sketch of such a job is shown below.
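A minimal sketch of that "bad word" counter with the Hadoop MapReduce Java API; the class names and the word being matched are made up, and the driver/job setup is omitted.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit ("bad", 1) for every input line that contains the offending word.
    class BadWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("bad");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().toLowerCase().contains("badword")) {
                context.write(KEY, ONE);
            }
        }
    }

    // Reducer: sum the 1s produced by all mappers into a single count.
    class BadWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }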
The cool thing: the mapping computation is done on the node that holds the data. You process the chunks locally and move just the semi-digested (usually smaller) data over the network to the reducers.
And if one of the nodes dies: there is another one with the same data.
HBase takes advantage of the distributed storage of the files and stores its tables, split up into chunks, across the cluster. Contrary to Hadoop, HBase gives random access to the data.
As you see, HBase and Hadoop are quite different from an RDBMS. HBase also lacks a lot of RDBMS concepts: modeling data with triggers, prepared statements, foreign keys, etc. is not what HBase was meant to do (I'm not 100% sure about this, so correct me ;-) )
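To illustrate that random, row-key-based access, a minimal sketch using the HBase Java client; the table, column family, and row key are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

                // Write one cell for a specific row key...
                Put put = new Put(Bytes.toBytes("user#42"));
                put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                table.put(put);

                // ...and read it back directly, without scanning the whole dataset.
                Result result = table.get(new Get(Bytes.toBytes("user#42")));
                byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }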
Can Django be integrated with Hadoop?
For Java it's easy: Hadoop is written in Java and all the APIs are there, ready to use.
For Python/Django I don't know (yet), but I'm sure you can do something with Hadoop Streaming/Jython as a last resort.
I've found the following: Hadoopy and Python in Mappers and Reducers.
Hue, The Web UI for Hadoop is based on Django!
Django can connect to most RDBMSs, so you can use it with a Hadoop-based solution.
Keep in mind that Hadoop is many things; if you specifically want something with low latency, use something such as HBase, and don't try to get it from Hive or Impala.
Python has a Thrift-based binding, happybase, that lets you query HBase.
Basic (!) example of Django integration with Hadoop
[REMOVED LINK]
I use the Oozie REST API for job execution, and 'hadoop cat' for grabbing job results (due to the distributed nature of HDFS). The better approach is to use something like Hoop for getting HDFS data. Anyway, this is not a simple solution.
P.S. I've refactored this code and placed it into https://github.com/Obie-Wan/django_hadoop.
Now it's a separate Django app.