I'm a bit confused about how new MapReduce2 applications should be developed to work with YARN, and what happens to the old ones.
I currently have MapReduce1 applications which basically consist of:
Drivers which configure the jobs to be submitted to the cluster (previously to the JobTracker, now to the ResourceManager).
Mappers + Reducers
On one side I see that applications coded for MapReduce1 are compatible with MapReduce2 / YARN, with a few caveats, just by recompiling against the new CDH5 libraries (I work with the Cloudera distribution).
But on the other side I see information about writing YARN applications in a different way than MapReduce ones (using YarnClient, an ApplicationMaster, etc.):
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
But to me, YARN is just the architecture and the way the cluster manages your MR app.
My questions are:
Do YARN applications include MapReduce applications?
Should I write my code like a YARN application, forgetting drivers and creating Yarn clients, ApplicationMasters and so on?
Can I still develop the client classes with drivers + job settings?
Are MapReduce1 (recompiled with MR2 libraries) jobs managed by YARN in the same way as YARN applications?
What are the differences between MapReduce1 applications and YARN applications regarding the way in which YARN will manage them internally?
Thanks in advance
Hadoop v1
The JobTracker is responsible for resource management (managing the slave nodes); its major functions involve:
tracking resource consumption/availability
job life-cycle management: scheduling the individual tasks of the job, tracking progress, and providing fault tolerance for tasks.
Issues with Hadoop v1
The JobTracker is responsible for all spawned MR applications, so it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes a performance bottleneck. To address these issues of scalability and job management, Hadoop v2 was released.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, that is, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic operating system for managing applications in a distributed manner.
A Hadoop YARN MapReduce application (MRv2) was developed to interact with the new resource management and scheduling layer; MRv2 changes nothing in the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2; MRv2 is fully backward compatible. Yes, an MR application (.jar) can be run on both frameworks without any change in code.
Hadoop 2.x already contains the code for the MR client and the MapReduce ApplicationMaster; the programmer just needs to focus on their MapReduce application.
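For example, the classic WordCount job from the Hadoop MapReduce tutorial runs unchanged on YARN: the driver, mapper and reducer below are written against the standard org.apache.hadoop.mapreduce API, and the framework supplies the MR client and ApplicationMaster when the job is submitted to the ResourceManager.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Standard mapper: emit (word, 1) for every token in the input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Standard reducer: sum the counts for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // The driver: identical for MRv1 and MRv2; under YARN the job goes to the
      // ResourceManager and the framework's own MR ApplicationMaster runs it.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }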
MapReduce was previously integrated into Hadoop Core and was the only API for processing data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks (e.g. MPI) to process HDFS data.
Refer to the Apache documentation page on YARN architecture and the related SE posts:
Hadoop gen1 vs Hadoop gen2
Do YARN applications include MapReduce applications?
Yes. YARN supports MapReduce applications. It also runs Spark jobs, unlike Hadoop 1.x.
Should I write my code like a YARN application, forgetting drivers and creating Yarn clients, ApplicationMasters and so on?
Yes. You should forget about all these application components and write your application. Have a look at the sample code.
Can I still develop the client classes with drivers + job settings? Are MapReduce1 (recompiled with MR2 libraries) jobs managed by YARN in the same way as YARN applications?
Yes, you can. But have a look at this compatibility article.
What are the differences between MapReduce1 applications and YARN applications regarding the way in which YARN will manage them internally?
Refer to this SE post:
What additional benefit does Yarn bring to the existing map reduce?
YARN is just a cluster manager.
First, the applications have to be developed for YARN (if not already implemented). Here are a few of the applications which are supported on YARN. If you want a new application to run on YARN, this is the guide (see the sketch below).
Then the same MR/Spark/Hama programs can be run on YARN.
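For a custom (non-MapReduce) framework, the client side roughly follows the pattern in the WritingYarnApplications guide linked in the question. The sketch below is trimmed and not a complete implementation: com.example.MyApplicationMaster, the memory/cores and the queue name are placeholders, and a real client would also ship the AM jar as a local resource.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class MyYarnClient {
      public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-yarn-app");

        // Describe how to launch the ApplicationMaster container.
        // com.example.MyApplicationMaster is a placeholder for your own AM class.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
            "$JAVA_HOME/bin/java com.example.MyApplicationMaster"));

        // Resources requested for the ApplicationMaster (placeholder values).
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(512);
        capability.setVirtualCores(1);

        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(capability);
        appContext.setQueue("default");

        // Submit; YARN launches the AM, which then negotiates containers for the real work.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
      }
    }

The MapReduce framework ships exactly this kind of client plus its own ApplicationMaster, which is why an ordinary MR job never needs this code.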
Related
AWS App2Container (A2C) is a recently launched AWS CLI tool to help you lift and shift applications that run in your on-premises data centres or on virtual machines, so that they run in containers managed by Amazon ECS or Amazon EKS. Since there is not much info on the internet about this apart from the AWS documentation, does anybody know how to implement it and what dependencies are required for it?
This is a fairly new service, so most people will be relying on the documentation at the moment.
For Java applications, the setup instructions for Linux indicate that you just download the app2container package and then run the following against your application:
sudo app2container containerize --application-id java-app-id
For .NET applications, the setup instructions for Windows indicate that it is exactly the same process: run the installer, which brings in all the dependencies.
The best way to implement this is to follow those tutorials step by step. Also remember that, at this time, it supports Java and .NET applications only.
Please let me know if this question is more appropriate for a different channel, but I was wondering what the recommended tools are for installing, configuring and deploying Hadoop/Spark across a large number of remote servers. I'm already familiar with how to set up all of the software, but I'm trying to determine what I should start using so that I can easily deploy across a large number of servers. I've started to look into configuration management tools (e.g. Chef, Puppet, Ansible) but was wondering what the best and most user-friendly option is to start off with. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IPs? Should I use pssh? pscp? etc. I just want to be able to ssh to as many servers as needed and install all of the software.
If you have some experience with a scripting language, then you can go for Chef. The recipes are already available for deployment and configuration of the cluster, and it's very easy to start with.
And if you want to do it on your own, you can use the sshxcute Java API, which runs scripts on a remote server. You can build up the commands there and pass them to the sshxcute API to deploy the cluster (see the sketch below).
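As a rough sketch of that approach, assuming the sshxcute API (ConnBean, SSHExec, ExecCommand) and placeholder hosts, credentials and commands:

    import net.neoremind.sshxcute.core.ConnBean;
    import net.neoremind.sshxcute.core.SSHExec;
    import net.neoremind.sshxcute.task.CustomTask;
    import net.neoremind.sshxcute.task.impl.ExecCommand;

    public class ClusterSetup {
      public static void main(String[] args) throws Exception {
        // Placeholder host list and credentials; in practice read them from a hosts file.
        String[] hosts = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};
        for (String host : hosts) {
          ConnBean cb = new ConnBean(host, "hadoop", "password");
          SSHExec ssh = SSHExec.getInstance(cb);
          try {
            ssh.connect();
            // Placeholder commands: download, unpack and configure Hadoop on each node.
            CustomTask setup = new ExecCommand(
                "wget -q http://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz",
                "tar -xzf hadoop-2.7.0.tar.gz -C /opt",
                "cp /tmp/conf/*.xml /opt/hadoop-2.7.0/etc/hadoop/");
            ssh.exec(setup);
          } finally {
            ssh.disconnect();
          }
        }
      }
    }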
Check out Apache Ambari. It's a great tool for central management of configs, adding new nodes, monitoring the cluster, etc. This would be your best bet.
Amazon Web Services offers a number of continuous deployment and management tools such as Elastic Beanstalk, OpsWorks, CloudFormation and CodeDeploy, depending on your needs. The basic idea is to facilitate code deployment and upgrades with zero downtime. They also help you follow architectural best practice using AWS resources.
For simplicity, let's assume a basic architecture where you have a two-tier structure: a collection of application servers behind a load balancer, and then a persistence layer using a multi-zone RDS DB.
The actual code upgrade across a fleet of instances (app servers) is easy to understand. As a very simplistic overview, the AWS service upgrades each node in turn, handing connections off so the instance in question is not being used during the upgrade.
However, I can't understand how DB upgrades are managed. Assume that we are going from version 1.0.0 to 2.0.0 of an application and that there is a requirement to change the DB structure. Normally you would use a script or a library like Flyway to perform the upgrade. However, if there is a fleet of servers to upgrade there is a point where both 1.0.0 and 2.0.0 applications exist across the fleet each requiring a different DB structure.
I need to understand how this is actually achieved (high level) to know what the best way/time of performing the DB migration is. I guess there are a couple of ways they could be achieving this but I am struggling to see how they can do it and allow both 1.0.0 and 2.0.0 to persist data without loss.
Perhaps they migrate the DB structure with the first app node upgrade and at the same time create a cached version of the 1.0.0 data. Users connected to the 1.0.0 app would persist using the cached version of the DB, and users connected to the 2.0.0 app would persist to the newly migrated DB. Once all the app nodes are migrated, the cached data would be merged into the DB.
It seems unlikely they can do this, as the merge would be pretty complex, but I can't see another way. Any pointers/help would be appreciated.
This is a common problem to encounter once your application infrastructure gets into multiple application nodes. In the olden days, you could take your application offline for "maintenance windows" during which you could:
Replace application with a "System Maintenance, back soon" page.
Perform database migrations (schema and/or data)
Deploy new application code
Put application back online
In 2015, and really for several years now, this approach has not been acceptable. Your users expect 24/7 operation, so there must be a better way. Of course there is: the answer is a series of patterns for Database Refactorings.
The basic concept to always keep in mind is to assume that you have to maintain two concurrent versions of your application, and that there can be no breaking changes between these two versions. This means that you have a current application (v1.0.0) in production and a new version (v2.0.0) that is scheduled to be deployed. Both of these versions must work against the same schema. Once v2.0.0 is fully deployed across all application servers, you can then develop v3.0.0, which allows you to complete any final database changes.
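To make that concrete, here is a minimal sketch of the "expand" side of such a refactoring, with hypothetical table and column names and connection details: suppose v2.0.0 renames a surname column to last_name. The expand migration adds last_name without dropping surname, and the v2.0.0 code writes both columns, so v1.0.0 nodes keep working during the rolling upgrade; only a later release (v3.0.0) drops the old column.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CustomerDao {
      // Hypothetical connection details; assumes a JDBC driver on the classpath.
      private static final String URL = "jdbc:mysql://db.example.com/app";

      public void saveCustomer(String firstName, String lastName) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 // v2.0.0 writes BOTH the old column (surname) and the new one (last_name),
                 // so v1.0.0 nodes that still read/write surname keep working mid-rollout.
                 "INSERT INTO customer (first_name, surname, last_name) VALUES (?, ?, ?)")) {
          ps.setString(1, firstName);
          ps.setString(2, lastName); // old column, kept until v3.0.0 removes it
          ps.setString(3, lastName); // new column introduced by the expand migration
          ps.executeUpdate();
        }
      }
    }

The corresponding "contract" step (dropping surname) only ships once no v1.0.0 node is left running.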
From a C++ application, what are the best ways to connect to the various service components of Hadoop, like the namenode, datanodes, jobtracker etc., so that their configuration can be changed or monitored?
Does one need to create JVMs for each of these components to interact with them from a C++ application, or do the web interfaces these components provide allow configuring parameters dynamically, for example changing the replication factor, updating XML files, reporting the status of a job, etc.?
You should check out:
Hadoop Streaming.
Hadoop-C++
Hadoop Streaming basically provides an interface to Hadoop for any language that can read from stdin and write to stdout.
We have a load balanced setup in AWS with two instances. We do pretty frequent code updates, utilizing SVN. I need to know how easy it is to update the code changes across all the instances in our cluster. Can we simply do 'snapshots' and create new volumes each time for the instances?...or?...
I would not do updates via EBS snapshots. Think of EBS volumes as hard disks: you would not swap out your hard disk just because you have an update for your software.
As you have your code in a version control system, code updates should be quite simple: log in to your (multiple) servers and do a git pull or svn update. This fetches the latest code files onto your servers. Depending on the type of application, you may have to do some other tasks afterwards: running build scripts, emptying caches, etc.
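If you do start with a homegrown script, a rough sketch of that log-in-and-pull loop in Java (using the JSch SSH library; hosts, credentials, key path and commands are placeholders) might look like this:

    import com.jcraft.jsch.ChannelExec;
    import com.jcraft.jsch.JSch;
    import com.jcraft.jsch.Session;

    public class FleetUpdate {
      public static void main(String[] args) throws Exception {
        // Placeholder host list; in practice read it from a hosts file.
        String[] hosts = {"10.0.1.10", "10.0.1.11"};
        for (String host : hosts) {
          JSch jsch = new JSch();
          jsch.addIdentity("/home/deploy/.ssh/id_rsa"); // key-based auth
          Session session = jsch.getSession("deploy", host, 22);
          session.setConfig("StrictHostKeyChecking", "no");
          session.connect();
          ChannelExec channel = (ChannelExec) session.openChannel("exec");
          // Placeholder update command: pull the latest code and rebuild.
          channel.setCommand("cd /var/www/app && svn update && ./build.sh");
          channel.connect();
          while (!channel.isClosed()) {
            Thread.sleep(100); // wait for the remote command to finish
          }
          channel.disconnect();
          session.disconnect();
        }
      }
    }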
The problem is that this kind of setup does not scale well. If you have n servers, you will have to log in and run these commands n times. Therefore it makes sense to look into remote management tools that you can use in one step. With a lot of these tools, you also get a complete configuration management stack: you define a set of recipes or tasks (installed packages, configuration files, fetching the latest code, necessary build steps) for each of your servers, and when you boot up a new server it fetches the latest version of its configuration and installs itself.
Popular configuration management tools include Puppet and Salt. Both tools include remote execution, which should make publishing your code base easier: you would only have to fire one command on your master server, and it automatically executes the task on all of its minions / slave servers.