Is HDFS required to use MapReduce?

We're exploring the use of MR to parallelize long-running processes. All of our data currently resides in an RDBMS. We understand that HDFS is the underlying file-based storage for MR, but we're not sure about the following:
1. Do we have to move all RDBMS data to HDFS to use MR?
2. Is such a move permanent, or temporary only for the life of the MR process?
3. Can we use MR for its parallel features while jobs still access data from traditional sources (not HDFS)?

I don't think you have to move all RDBMS data to HDFS to use MR. Consider how Sqoop loads data from an RDBMS into HBase/HDFS:
Sqoop loads data via MapReduce with the help of DBInputFormat, a connector that allows Hadoop MapReduce programs to read rows from SQL databases.
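To illustrate, here is a minimal sketch of a job setup using DBInputFormat, written in Scala against the Hadoop MapReduce API; the JDBC driver, connection URL, credentials, and the orders table with its order_id/total columns are all assumptions for illustration:

    import java.sql.{PreparedStatement, ResultSet}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.Writable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat, DBWritable}

    // Hypothetical record type: one row of an "orders" table.
    class OrderRecord extends Writable with DBWritable {
      var orderId: Long = 0L
      var total: Double = 0.0
      // DBWritable: map a JDBC row to fields and back.
      override def readFields(rs: ResultSet): Unit = {
        orderId = rs.getLong("order_id"); total = rs.getDouble("total")
      }
      override def write(ps: PreparedStatement): Unit = {
        ps.setLong(1, orderId); ps.setDouble(2, total)
      }
      // Writable: Hadoop's own serialization, needed for the shuffle.
      override def readFields(in: java.io.DataInput): Unit = {
        orderId = in.readLong(); total = in.readDouble()
      }
      override def write(out: java.io.DataOutput): Unit = {
        out.writeLong(orderId); out.writeDouble(total)
      }
    }

    object RdbmsInputJob {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Assumed MySQL connection details; substitute your own.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
          "jdbc:mysql://dbhost:3306/mydb", "user", "password")
        val job = Job.getInstance(conf, "read-from-rdbms")
        job.setInputFormatClass(classOf[DBInputFormat[OrderRecord]])
        // Each map task reads one slice of the table, split on the ORDER BY column.
        DBInputFormat.setInput(job, classOf[OrderRecord],
          "orders", null, "order_id", "order_id", "total")
        // ...set mapper/reducer and output as usual, then:
        // job.waitForCompletion(true)
      }
    }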

1. If performance and scalability are your first priority, then yes, you have to move all your data from the RDBMS to HDFS for efficient processing.
2. MR jobs read their input from HDFS and write their output back to HDFS. After the data is processed, you can export it from HDFS to other sources, either with another MR job or directly through the HDFS APIs (see the sketch below).
3. No, you cannot use MR for its parallel features while jobs still access data from traditional sources. MR splits the input data and passes the splits to the various map tasks, which is not possible with traditional sources.
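As a hedged illustration of point 2, here is one way to pull a processed result back out of HDFS with the plain FileSystem client API (the paths are assumptions):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ExportResult {
      def main(args: Array[String]): Unit = {
        // Connects to the cluster configured in core-site.xml.
        val fs = FileSystem.get(new Configuration())
        // Copy one MR output partition to local disk; from there any
        // traditional loader (e.g. a bulk INSERT) can push it into the RDBMS.
        fs.copyToLocalFile(new Path("/output/part-r-00000"),
                           new Path("file:///tmp/result.txt"))
        fs.close()
      }
    }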

Does Dask communicate with HDFS to optimize for data locality?

In the Dask distributed documentation, they have the following information:
For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.
However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask with regard to HDFS? Is it sending computation to nodes where data is local? Is the scheduler optimized to take data locality on HDFS into account?
Quite right: with the appearance of Arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since Arrow's implementation doesn't include the get_block_locations() method.
However, we already wanted to remove the somewhat convoluted code that made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that data locality made little practical difference in most workloads. The extra constraints on the size of the blocks versus the size of the partitions you would like in memory added a further layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (S3, GCS, Azure), where it doesn't matter which worker accesses which part of the data.
In short, yes, the docs should be updated.

Handling very large volumes (500TB) of data using Spark

I have a large volume of data, nearly 500TB, on which I have to do some ETL.
The data is in AWS S3, so I am planning to use an AWS EMR setup to process it, but I am not sure what configuration I should select.
What kind of cluster do I need (master, and how many slaves)?
Do I need to process the data chunk by chunk (10GB at a time), or can I process it all at once?
What should the master and slave (executor) memory be, both RAM and storage?
What kind of processor (speed) do I need?
Based on this, I want to estimate the cost of AWS EMR and start processing the data.
Based on your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works; plan on spending three months to get to a starter level.
You have a lot of choices to make, some of which are fundamental to a project's success. For example: which language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.)? What format is your data in (CSV, XML, JSON, Parquet, etc.)? Do you only need batch processing, or do you also require near-real-time analysis?
You may find other AWS services more applicable, such as Athena or Redshift, depending on what format your data is in and what information you are trying to extract or process.
With 500TB in AWS, open a ticket with support. Explain what you have, what you want, and your time frame. A solutions architect will be available to direct you onto a path.

When should we go for Apache Spark

Would it be wise to replace MR completely with Spark? Here are the areas where we still use MR; we need your input to go ahead with the Apache Spark option:
ETL: Data validation and transformation; Sqoop and custom MR programs using the MR API.
Machine Learning: Mahout algorithms to arrive at recommendations, classification and clustering.
NoSQL Integration: Interfacing with NoSQL databases using the MR API.
Stream Processing: We are using Apache Storm for doing stream processing in batches.
Hive Query: We are already using the Tez engine to speed up Hive queries, and see a 10x performance improvement compared with the MR engine.
ETL - Spark needs much less boilerplate code than MR, and you can code in Scala, Java and Python (not to mention R, but probably not for ETL). Scala especially makes ETL easy to implement: there is less code to write, as in the sketch below.
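A minimal sketch, assuming a two-column CSV layout (id, amount) and illustrative HDFS paths; the same validate-and-reshape step in MR would need a Mapper class, a driver class, and job wiring:

    import org.apache.spark.{SparkConf, SparkContext}

    object EtlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("etl-sketch"))
        sc.textFile("hdfs:///data/raw/*.csv")
          .map(_.split(","))
          .filter(f => f.length == 2 && f(1).forall(_.isDigit)) // validation
          .map(f => s"${f(0)}\t${f(1)}")                        // transformation
          .saveAsTextFile("hdfs:///data/clean")
        sc.stop()
      }
    }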
Machine Learning - ML is one of the reasons Spark came about. With MapReduce, the HDFS interaction makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can run in-memory, so you can have programs build ML models with different parameters and run them iteratively against a dataset held in memory, with no file-system interaction except for the initial load.
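A toy sketch of that in-memory iteration pattern (the path, the parsing, and the update rule are placeholder assumptions): the dataset is cached once, and every later pass reads from memory instead of HDFS.

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))
        // Load and parse once, then keep the dataset in memory across iterations.
        val points = sc.textFile("hdfs:///data/points")
          .map(_.split(",").map(_.toDouble))
          .cache()
        var weight = 0.0
        for (_ <- 1 to 10) {
          // Each pass is served from the in-memory cache, not the file system.
          weight += points.map(p => p(0) * p(1)).sum() * 0.01
        }
        println(s"toy model parameter: $weight")
        sc.stop()
      }
    }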
NoSQL - There are many NoSQL data sources that can easily be plugged into Spark using SparkSQL. Just google the one you are interested in; it's probably very easy to connect.
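For instance, with the third-party DataStax Spark Cassandra Connector on the classpath, a Cassandra table can be queried through SparkSQL; this is a hedged sketch, and the keyspace, table, and column names are assumptions:

    import org.apache.spark.sql.SparkSession

    object NoSqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("nosql-sketch").getOrCreate()
        // "org.apache.spark.sql.cassandra" is the connector's data source name.
        val orders = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "shop", "table" -> "orders"))
          .load()
        orders.createOrReplaceTempView("orders")
        spark.sql("SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id").show()
        spark.stop()
      }
    }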
Stream Processing - Spark Streaming works in micro-batches, and one of the main selling points of Storm over Spark Streaming is that Storm does true streaming rather than micro-batches. Since you are already processing in batches, Spark Streaming should be a good fit.
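A minimal micro-batch sketch: each five-second window below is processed as one small batch job. The socket source, host, and port are assumptions for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stream-sketch")
        // Every 5 seconds, the records received so far form one micro-batch RDD.
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .print()
        ssc.start()
        ssc.awaitTermination()
      }
    }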
Hive Query - There is an ongoing Hive on Spark project; check its current status. It will allow Hive to execute queries via your Spark cluster and should be comparable to Hive on Tez.

Using Apache Spark to serve real time web services queries

We have a use case where we download large volumes (on the order of 100 gigabytes per day) of data from hundreds of data sources, massage and process this data, and then expose it to our customers via a RESTful API. Today the base data size is ca. 20TB, and it is expected to grow heavily in the future.
For the massaging/processing part, we believe Spark can be a very good choice for us. For exposing the processed/massaged data through an API, one option is to store the processed data in a read-only database like ElephantDB and make the web services talk to ElephantDB (at least this is what Nathan Marz proposes in his Big Data book). I was wondering what the implications would be of making the web-service implementation use SparkSQL to access the processed data in Spark. What could be the architecture/design dangers in this case?
Everybody is talking about how fast Spark is, and about using SparkSQL for interactive queries. But is it already at a stage where it can serve a large volume of web-service queries via SparkSQL, where we have very strict latency SLAs and must serve hundreds or thousands of requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system like ElephantDB or Cassandra.
Would like to hear from the experts on this board.
If the results are stored in files, you have no indexes, and SparkSQL doesn't create indexes either. The only things that can be somewhat fast are reading columns from Parquet files and caching tables, as sketched below.
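To make that concrete, here is a minimal sketch of those two paths with the DataFrame API; the Parquet path, view name, and columns are assumptions:

    import org.apache.spark.sql.SparkSession

    object ServeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("serve-sketch").getOrCreate()
        // Columnar Parquet reads only touch the columns a query needs.
        val results = spark.read.parquet("hdfs:///warehouse/results")
        results.createOrReplaceTempView("results")
        // Explicitly cache the table in memory for repeated queries.
        spark.catalog.cacheTable("results")
        spark.sql("SELECT user_id, score FROM results WHERE score > 0.9").show()
        spark.stop()
      }
    }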
But in general it's not a good use case to use SparkSQL to serve web requests simply because Spark wasn't made for that.
So you're batch processing the raw data, yes?
The ideal way would be to store the outcome in a key-value format, as you mention with ElephantDB; Project Voldemort has also been shown to be a good fit as read-only storage.
I recommend that you read this article by Nathan Marz on combining batch and realtime layers: How to beat the CAP theorem.
It has, however, been questioned by Jay Kreps in his article Questioning the Lambda Architecture. The main concern with the lambda architecture is that it is problematic to maintain the "same" system logic in two different distributed systems and have them produce the same result.
But since you are using Spark, you can reuse the same logic with Spark Streaming, which was not yet "on the market" when Nathan Marz and Jay Kreps wrote their articles.
You can still use SparkSQL to query the raw data interactively, but since Spark was first designed for scheduled batch jobs, this is not its perfect use case. As you've probably noticed, it takes some time to submit Spark jobs, and this overhead "kills" the idea of fast queries.
Please look into github.com/spark-jobserver/spark-jobserver: the job server supports sub-second low-latency jobs via long-running job contexts, and it can share Spark RDDs between different jobs, which can prove very efficient for running different interactive logic over the same dataset, e.g. combining machine-learning results and ad-hoc (SparkSQL) queries via HTTP requests. There are several talks about spark-jobserver from various Spark Summits available online.

Why is HDFS write once and read multiple times?

I am new to Hadoop.
While reading about Apache HDFS, I learned that HDFS is a write-once file system. Some other distributions (Cloudera) provide an append feature. It would be good to know the rationale behind this design decision. In my humble opinion, this design creates a lot of limitations for Hadoop and makes it suitable only for a limited set of problems (problems similar to log analytics).
Expert comments will help me understand HDFS better.
HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support modification too). At present, HDFS strictly has one writer at any time. This assumption enables high-throughput data access and also simplifies data-coherency issues. A web crawler or a MapReduce application is best suited for HDFS.
Because HDFS works on the principle of "write once, read many", streaming data access is extremely important in HDFS. HDFS is designed more for batch processing than for interactive use, so the emphasis is on high throughput of data access rather than low latency. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete dataset is more important than the time taken to fetch a single record from it. HDFS overlooks a few POSIX requirements in order to implement streaming data access.
http://www.edureka.co/blog/introduction-to-apache-hadoop-hdfs/
There are three major reasons that HDFS has the design it has:
HDFS was designed by slavishly copying the design of Google's GFS, which was intended to support batch computations only.
HDFS was not originally intended for anything but batch computation.
Designing a real distributed file system that can support high-performance batch operations as well as real-time file modification is difficult, and it was beyond the budget and experience level of the original implementors of HDFS.
There is no inherent reason that Hadoop couldn't have been built as a fully read/write file system. MapR FS is proof of that. But implementing such a thing was far outside of the scope and capabilities of the original Hadoop project and the architectural decisions in the original design of HDFS essentially preclude changing this limitation. A key factor is the presence of the NameNode since HDFS requires that all meta-data operations such as file creation, deletion or file length extensions round-trip through the NameNode. MapR FS avoids this by completely eliminating the NameNode and distributing meta-data throughout the cluster.
Over time, not having a real mutable file system has become more and more annoying as the workloads for Hadoop-related systems such as Spark and Flink have moved more and more toward operational, near-real-time or real-time operation. The responses to this problem have included:
MapR FS. As mentioned, MapR implemented a fully functional, high-performance re-implementation of HDFS that includes POSIX functionality as well as NoSQL table and streaming APIs. This system has been in production for years at some of the largest big-data systems around.
Kudu. Cloudera essentially gave up on implementing viable mutation on top of HDFS and announced Kudu, with no timeline yet for general availability. Kudu implements table-like structures rather than fully general mutable files.
Apache NiFi and the commercial version HDF. Hortonworks has also largely given up on HDFS and announced a strategy of forking applications into batch (supported by HDFS) and streaming (supported by HDF) silos.
Isilon. EMC implemented the HDFS wire protocol as part of their Isilon product line. This allows Hadoop clusters to have two storage silos, one for large-scale, high-performance, cost-effective batch based on HDFS and one for medium-scale mutable file access via Isilon.
Other. There are a number of essentially defunct efforts to remedy the write-once nature of HDFS. These include KFS (the Kosmos file system) and others. None of these has significant adoption.
An advantage of this technique is that you don't have to bother with synchronization: since you write once, your readers are guaranteed that the data will not be modified while they read.
Though this design decision does impose restrictions, HDFS was built keeping in mind efficient streaming data access.
Quoting from Hadoop: The Definitive Guide:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.