From a C++ application, what are the best ways to connect to the various service components of Hadoop, like the NameNode, DataNodes, JobTracker etc., so that their configuration can be changed or monitored?
Does one need to create JVMs for each of these components to interact with them from a C++ application, or do the web interfaces these components provide allow parameters to be configured dynamically, for example changing the replication factor, updating XML files, reporting job status, etc.?
You should check out:
Hadoop Streaming.
Hadoop-C++
Hadoop Streaming basically provides an interface to Hadoop for any language that can read from stdin and write to stdout.
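For example, a word-count mapper for Streaming can be as simple as the following sketch (shown in Java only for consistency with the rest of this page; any language that reads stdin and writes stdout works the same way):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Minimal Hadoop Streaming mapper: reads input lines from stdin and emits
    // tab-separated "word<TAB>1" pairs on stdout, which Streaming interprets
    // as key/value pairs by default.
    public class StreamingWordCountMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        System.out.println(word + "\t1");
                    }
                }
            }
        }
    }

You would then pass this program (or an equivalent C++ binary) to the hadoop-streaming jar via its -mapper and -reducer options; the exact jar location depends on your Hadoop installation.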
I have an existing Windows desktop application written in C++ that needs to add support for SNMP so that a few pieces of status information are available on some SNMP OIDs. I found the net-snmp project and have been trying to understand how this can best fit into the existing program.
Questions:
Do I need to run snmpd, or can I just integrate the agent code into my application? I would prefer that starting my application does everything necessary rather than worry about deploying and running multiple processes, but the documentation doesn't speak much about doing this. The net-snmp agent daemon tutorial has an option for running the sample code as the full-agent rather than sub-agent, but I'm not sure about any limitations of doing this.
What would the PROs/CONs be for running a full agent in my application vs using snmpd and putting a subagent in my application? Is there a 3rd option I should also consider?
If I can integrate the full agent into the existing program, how do I pass it a configuration file via the API? Can I avoid the config file altogether by passing these parameters in via function calls instead?
I need to use JNI in my Dataflow pipeline. The JNI code uses a C++ library that has a ton of external dependencies on other system libraries. What would be the best way to make sure that the libraries are where they should be in the operating system when a worker runs the DoFn that uses the C++ library?
I found that DataflowPipelineOptions.setWorkerHarnessContainerImage might allow me to specify a custom Docker image from the Google Container Registry that I could potentially install a bunch of libraries on, but the documentation doesn't say much more. Are there any requirements for the Docker image in terms of installed packages, entry points, etc.?
Apache Beam recently published an example of calling sub-processes from a Dataflow worker. The solution downloads the binary dynamically within the DoFn's #Setup method and then executes the binary for each record processed by the pipeline. The solution also handles collecting the output from the process and propagating failures to the pipeline.
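The core of that pattern looks roughly like the following simplified sketch (this is not the published code; the staged binary path and the way the binary is obtained are placeholder assumptions):

    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.beam.sdk.transforms.DoFn;

    // Simplified sketch of the "call a native binary per record" pattern:
    // obtain the binary once per DoFn instance in @Setup, then shell out to
    // it in @ProcessElement and surface non-zero exit codes as failures.
    public class ExternalBinaryFn extends DoFn<String, String> {
        private transient Path binary;

        @Setup
        public void setup() {
            // The published example downloads the binary (e.g. from GCS) here;
            // for this sketch we assume it is already staged at a known path.
            binary = Paths.get("/tmp/staged/my-native-tool");
        }

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(binary.toString(), c.element());
            pb.redirectErrorStream(true);
            Process p = pb.start();
            String output = new String(p.getInputStream().readAllBytes());
            if (p.waitFor() != 0) {
                // Throwing here lets the runner retry the bundle or fail the job.
                throw new RuntimeException("Subprocess failed: " + output);
            }
            c.output(output.trim());
        }
    }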
Please let me know if this question is more appropriate for a different channel, but I was wondering what the recommended tools are for installing, configuring and deploying Hadoop/Spark across a large number of remote servers. I'm already familiar with how to set up all of the software, but I'm trying to determine what I should start using that would allow me to easily deploy across a large number of servers. I've started to look into configuration management tools (e.g. Chef, Puppet, Ansible), but was wondering what the best and most user-friendly option to start off with is. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IPs? Should I use pssh? pscp? etc. I want to just be able to SSH into as many servers as needed and install all of the software.
If you have some experience with a scripting language, then you can go for Chef. The recipes are already available for deployment and configuration of the cluster, and it's very easy to start with.
And if you want to do it on your own, you can use the sshxcute Java API, which runs scripts on a remote server. You can build up the commands there and pass them to the sshxcute API to deploy the cluster.
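A rough sketch of what that could look like with sshxcute (the host, credentials and commands below are placeholders, and the class names follow the sshxcute examples, so double-check them against the project's docs):

    import net.neoremind.sshxcute.core.ConnBean;
    import net.neoremind.sshxcute.core.SSHExec;
    import net.neoremind.sshxcute.task.CustomTask;
    import net.neoremind.sshxcute.task.impl.ExecCommand;

    // Runs installation commands on one remote node over SSH via sshxcute.
    public class RemoteInstaller {
        public static void main(String[] args) throws Exception {
            // Placeholder host and credentials.
            ConnBean conn = new ConnBean("10.0.0.11", "hadoop", "password");
            SSHExec ssh = SSHExec.getInstance(conn);
            try {
                ssh.connect();
                // Illustrative commands to fetch and unpack a Hadoop tarball.
                CustomTask install = new ExecCommand(
                        "wget -q http://example.com/hadoop.tar.gz -O /tmp/hadoop.tar.gz",
                        "tar -xzf /tmp/hadoop.tar.gz -C /opt");
                ssh.exec(install);
            } finally {
                ssh.disconnect();
            }
        }
    }

Looping the same code over a hosts file gives you a very small homegrown deployment tool.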
Check out Apache Ambari. It's a great tool for central management of configs, adding new nodes, monitoring the cluster, etc. This would be your best bet.
I'm a bit confused about how new MapReduce2 applications should be developed to work with YARN and what happens to the old ones.
I currently have MapReduce1 applications which basically consist of:
Drivers which configure the jobs to be submitted to the cluster (previously the JobTracker, now the ResourceManager).
Mappers + Reducers
On one side, I see that applications coded for MapReduce1 are compatible with MapReduce2 / YARN, with a few caveats, just by recompiling with the new CDH5 libraries (I work with the Cloudera distribution).
But on the other side, I see information about writing YARN applications in a different way than MapReduce ones (using YarnClient, ApplicationMaster, etc.):
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
But to me, YARN is just the architecture and how the cluster manages your MR app.
My questions are:
Do YARN applications include MapReduce applications?
Should I write my code like a YARN application, forgetting drivers and creating YarnClients, ApplicationMasters and so on?
Can I still develop the client classes with drivers + job settings?
Are MapReduce1 (recompiled with MR2 libraries) jobs managed by YARN in the same way as YARN applications?
What are the differences between MapReduce1 applications and YARN applications regarding the way YARN manages them internally?
Thanks in advance
Hadoop v1
The JobTracker is responsible for resource management (managing the slave nodes); its major functions involve:
tracking resource consumption/availability
job life-cycle management: scheduling individual tasks of the job, tracking progress, and providing fault tolerance for tasks.
Issues with Hadoop v1
The JobTracker is responsible for all spawned MR applications, so it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes a performance bottleneck. To address these issues of scalability and job management, Hadoop v2 was released.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, that is, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic operating system for managing applications in a distributed manner.
To interact with the new resource management and scheduling layer, a Hadoop YARN MapReduce application (MRv2) was developed; MRv2 does not change the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2; MRv2 is fully backward compatible. Yes, an MR application (.jar) can be run on both frameworks without any change in code.
Hadoop 2.x already contains the code for the MR client and ApplicationMaster; the programmer just needs to focus on their MapReduce application.
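For instance, an MRv2 driver is still just the familiar MapReduce driver; nothing YARN-specific appears in it. A minimal sketch (using the identity Mapper/Reducer only to keep it self-contained; substitute your own classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // A classic MapReduce driver. The same code runs under MRv1 and under YARN;
    // whether the job goes to a JobTracker or a ResourceManager is decided by
    // cluster configuration (mapreduce.framework.name), not by this code.
    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "my job");
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(Mapper.class);    // replace with your Mapper
            job.setReducerClass(Reducer.class);  // replace with your Reducer
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }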
MapReduce was previously integrated into Hadoop Core as the only API to interact with data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks (e.g. MPI) to process HDFS data.
Refer to the Apache documentation page on YARN architecture and related SE posts:
Hadoop gen1 vs Hadoop gen2
Do YARN applications include MapReduce applications?
YARN supports MapReduce applications. It also runs Spark jobs, unlike Hadoop 1.x.
Should I write my code like a YARN application, forgetting drivers and creating YarnClients, ApplicationMasters and so on?
Yes. You should forget about all these application components and just write your application. Have a look at the sample code.
Can I still develop the client classes with drivers + job settings? Are MapReduce1 (recompiled with MR2 libraries) jobs managed by YARN in the same way as YARN applications?
Yes, you can. But look at this compatibility article.
What are the differences between MapReduce1 applications and YARN applications regarding the way YARN manages them internally?
Refer to this SE post:
What additional benefit does Yarn bring to the existing map reduce?
YARN is just a cluster manager.
First, the applications have to be developed for YARN (if not already implemented). Here are a few of the applications which are supported on YARN. If you want a new application to run on YARN, this is the guide.
Then the same MR/Spark/Hama programs can be run on YARN.
My company uses a lot of different web services on a daily basis. I find that I repeat the same steps over and over again.
For example, when I start a new project, I perform the following actions:
Create a new client & project in Liquid Planner.
Create a new client in Freshbooks
Create a project in Github or Codebasehq
Add the developers who are going to be working on this project to Codebasehq or Github
Create tasks in the ticketing system on Codebasehq and tasks in Liquid Planner
This is just when starting new projects. When I have to track tasks, it gets even trickier because I have to monitor tasks in 2 different systems.
So my question is, is there a tool that I can use to create a web service that will automate some of these interactions? Ideally, it would be something that would allow me to graphically work with the web service API and produce an executable that I can run on a server.
I don't want to build it from scratch. I know, I can do it with Python or RoR, but I don't want to get that low level.
I would like to add my sources and pass data around from one service to another. What could I use? Any suggestions?
Progress DataXtend Semantic Integrator lets you build WebServices through an Eclipse-based GUI.
It is a commercial product, and I happen to work for the company that makes it. In some respects I think it might be overkill for you, as it's really an enterprise-level data mapping tool for mapping disparate data sources (web services, databases, XML files, COBOL) to a common model, as opposed to a simple web services builder, and it doesn't really support your GitHub bits any more than normal Eclipse plugins would.
That said, I do believe there are Mantis plugins for GitHub to do task tracking, and I know there's a Git plugin for Eclipse that works really well (JGit).
Couldn't you simply use Selenium to execute some of these tasks for you? Basically, as long as you can do something from the browser, Selenium will also be able to do it. Selenium comes with a language called "Selenese", so you can even use it to programmatically create an "API" for your tasks.
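For example, a minimal WebDriver-based sketch in Java (the URLs, field names and button ids are invented placeholders; a Selenese/IDE script could do the same thing):

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    // Automates one of the repetitive browser steps: log in and create a project.
    // All locators and URLs below are hypothetical and would need adjusting.
    public class CreateProjectTask {
        public static void main(String[] args) {
            WebDriver driver = new ChromeDriver();
            try {
                driver.get("https://app.example.com/login");
                driver.findElement(By.name("email")).sendKeys("me@company.com");
                driver.findElement(By.name("password")).sendKeys("secret");
                driver.findElement(By.id("login-button")).click();

                driver.get("https://app.example.com/projects/new");
                driver.findElement(By.name("project_name")).sendKeys("New Client Project");
                driver.findElement(By.id("create-project")).click();
            } finally {
                driver.quit();
            }
        }
    }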
I know this is a different approach from what you were originally looking for, but I've been using Selenium for a number of tasks, and found it's even good for executing Ant tasks or unit tests.
Hope this helps you
What about Apache Camel?
Camel lets you create Enterprise Integration Patterns to implement routing and mediation rules in either a Java-based Domain Specific Language (or Fluent API), via Spring-based XML configuration files, or via the Scala DSL. This means you get smart completion of routing rules in your IDE, whether in your Java, Scala or XML editor.
Apache Camel uses URIs so that it can easily work directly with any kind of transport or messaging model, such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or the CXF Bus API, together with pluggable data format options.
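As a taste of the Java DSL, a route gluing two of those services together might look roughly like this (the endpoint URIs are placeholders and depend on which Camel components you add):

    import org.apache.camel.CamelContext;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.impl.DefaultCamelContext;

    // A tiny Camel route: poll a directory for "new project" request files and
    // post each one to a (hypothetical) HTTP endpoint of another service.
    public class ProjectSyncRoute {
        public static void main(String[] args) throws Exception {
            CamelContext context = new DefaultCamelContext();
            context.addRoutes(new RouteBuilder() {
                @Override
                public void configure() {
                    from("file:data/new-projects?noop=true")            // placeholder source
                        .log("Forwarding project request ${file:name}")
                        .to("http://tracker.example.com/api/projects"); // placeholder target
                }
            });
            context.start();
            Thread.sleep(10000); // let the route run briefly for the demo
            context.stop();
        }
    }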