How to start MapReduce programs?

I have installed single node cluster in my system(VM->Ubuntu).
I have studied the basics of MapReduce and the Hadoop framework. How do I get started with MapReduce coding?

To understand MapReduce, start with why big data is needed and how traditional problems such as JOINs, which used to be solved with SQL, need a different approach as data grows huge.
These links helped me:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
http://bytepadding.com/map-reduce/

Ensure you have Hadoop installed. Create a new user in Ubuntu with sudo adduser hduser sudo and switch to hduser (or whichever user ID you used during your Hadoop configuration). Install Java, install ssh, and extract Hadoop into /home/hduser. Set the JAVA_HOME, HADOOP_HOME and PATH environment variables in your .bashrc file. Create a new directory, write the program, compile the Java files, and make sure the files have the right permissions. At this point everything is still in the local filesystem; the input has to be copied into HDFS, and the output directory will also be in HDFS, so check the output there.
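Once that environment is in place, a minimal sketch of compiling and running the standard WordCount example from the Hadoop MapReduce tutorial could look like the following (the file names and HDFS input/output paths are placeholders, and WordCount.java is assumed to have no package declaration):
# Compile the example against the Hadoop classpath and package it into a jar
mkdir -p wordcount_classes
javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
# Copy the input from the local filesystem into HDFS
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input
# Run the job; the output directory must not already exist in HDFS
hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output
# Inspect the output in HDFS
hdfs dfs -cat /user/hduser/output/part-r-00000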

Related

AWS Parallel Cluster software installation

I am very new to generic HPC concepts, and recently I needed to use AWS ParallelCluster to conduct some large-scale parallel computation.
I went through this tutorial and successfully built a cluster with the Slurm scheduler. I can successfully log in to the system with ssh, but I got stuck here. I need to install some software, but I can't figure out how. Should I do a sudo apt-get install xxx and expect it to be installed on every new node instantiated whenever a job is scheduled? On one hand, that sounds like magic, but on the other hand, do the master node and the newly initiated nodes share the same storage? If so, apt-get install might work, as they are using the same file system. The Internet seems to have very little material about this.
To conclude, my question is: if I want to install packages on the cluster I created on AWS, can I use sudo apt-get install xxx to do it? Do the newly instantiated nodes share the same storage as the head node? If so, is it good practice to do this? If not, what's the right way?
Thank you very much!
On a cluster deployed with ParallelCluster, the /home directory of the head node is shared by default as an NFS export across all compute nodes. So if you just install your application in the user's home folder (the ec2-user home folder), it will be available to all compute nodes. Once you have installed your application, you can run it using the scheduler.
Your next question may be that /home is limited in space. That is why it is recommended to attach an additional shared storage volume to the head node during cluster creation; this lets you control the attributes of the shared storage, such as size and type. For more details, see the Shared storage configuration section of the ParallelCluster documentation:
https://docs.aws.amazon.com/parallelcluster/latest/ug/SharedStorage-v3.html
Using additional shared storage is the recommended way to run your production workloads, as you have better control over the storage volume attributes. However, to get started you could just try running from your home folder first.
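As a rough sketch of that getting-started path (the download URL, tool name and Slurm options below are placeholders, not taken from your cluster), you could install into the shared home directory on the head node and submit through Slurm like this:
# On the head node: install into the NFS-shared home so all compute nodes see it
mkdir -p ~/mytool
wget https://example.com/mytool.tar.gz    # placeholder download URL
tar -xzf mytool.tar.gz -C ~/mytool
# Minimal Slurm batch script (placeholder resource values and binary path)
cat > run_mytool.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mytool
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
~/mytool/bin/mytool --input ~/data/input.dat
EOF
# Submit from the head node; the compute nodes read ~/mytool over the shared /home
sbatch run_mytool.sh
squeue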
Thanks

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR and have the dependency jar we build automatically loaded onto it, without them having to type in the Maven information manually every time (private repo location/credentials, package info, etc.).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load from a common default location (like S3, or something).
Is there a way to tell Zeppelin on an EMR cluster to load its interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses AWS command-line options to start an EMR cluster with all the necessary settings pre-made for you. (It could also upload the .jar dependency if we can't get Maven to work.)
Use an infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go, in my opinion: you would run aws emr create-cluster ..., and that creation would include a shell script step that replaces /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3, and then hard restarts Zeppelin (sudo stop zeppelin; sudo start zeppelin).
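A minimal sketch of what that step script could look like, assuming the curated interpreter.json is kept at an S3 path of your choosing (the bucket name below is a placeholder):
#!/bin/bash
# Pull the shared interpreter settings from S3 (placeholder bucket/key)
aws s3 cp s3://my-company-zeppelin-config/interpreter.json /tmp/interpreter.json
# Replace the default settings and hard restart Zeppelin
sudo cp /tmp/interpreter.json /etc/zeppelin/conf.dist/interpreter.json
sudo stop zeppelin
sudo start zeppelin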

Programming the Pepper robot without the "Choregraphe" software?

Usually the developer can use SoftBank's own software, Choregraphe, to put programs on the Pepper robot.
Isn't there a way to set up a different development environment? For example, accessing the robot via SSH, creating Python scripts with a simple text editor, and starting the scripts manually? In other words, writing and running Python scripts for Pepper without using Choregraphe.
You can also use qibuild (pip install qibuild): https://github.com/aldebaran/qibuild
It contains a qipkg command; just run
qipkg deploy-package path/to/your/file.pml --url USER@IP:/home/nao
A .pml file is a project; it is created by Choregraphe, or you can use this tool:
https://github.com/pepperhacking/robot-jumpstarter
in order to get a sample app.
Of course, using Choregraphe is not an obligation; you can use the different SDKs directly.
You can, for instance, create a Python script on your computer and copy it onto the robot:
scp path/to/script/myscript.py nao@robotIp:
Then ssh onto the robot and launch the script:
ssh nao@robotIp
python myscript.py
You can also ssh onto the robot, create a script (using nano for instance) and launch it from there.
I've been using PyCharm Pro for 6 months and I am happy with it. You get automatic deployment and remote debugging. The most basic setup must still be done with Choregraphe, but it takes less than a minute.

How to setup and use Kafka-Connect-HDFS in HDP 2.4

I want to use kafka-connect-hdfs on Hortonworks HDP 2.4. Can you please help me with the steps I need to follow to set it up in an HDP environment?
Other than building Kafka Connect HDFS from source, you can download and extract Confluent Platform's TAR.GZ files on your Hadoop nodes; that doesn't mean you are "installing Confluent".
Then you can cd /path/to/confluent-x.y.z/
And run Kafka Connect from there.
./bin/connect-standalone ./etc/kafka/connect-standalone.properties ./etc/kafka-connect-hdfs/quickstart-hdfs.properties
If that is working for you, then in order to run connect-distributed (the recommended way to run Kafka Connect), you need to download the same thing on the rest of the machines you want to run Kafka Connect on.
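For example, once the platform is extracted on each node, a sketch of running in distributed mode and registering the HDFS sink might look like this (the connector name, topic, HDFS URL and flush size are placeholder values, not taken from your environment):
cd /path/to/confluent-x.y.z/
# Start a distributed worker on each node (workers coordinate through Kafka)
./bin/connect-distributed ./etc/kafka/connect-distributed.properties &
# Submit the HDFS sink connector once, via the Connect REST API (default port 8083)
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "test_hdfs",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "3",
    "tasks.max": "1"
  }
}'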

WebHCat on Amazon's EMR?

Is it possible or advisable to run WebHCat on an Amazon Elastic MapReduce cluster?
I'm new to this technology and I was wondering if it was possible to use WebHCat as a REST interface to run Hive queries. The cluster in question is running Hive.
I wasn't able to get it working, but WebHCat is actually installed by default on Amazon's EMR instances.
To get it running you have to do the following,
chmod u+x /home/hadoop/hive/hcatalog/bin/hcat
chmod u+x /home/hadoop/hive/hcatalog/sbin/webhcat_server.sh
export TEMPLETON_HOME=/home/hadoop/.versions/hive-0.11.0/hcatalog/
export HCAT_PREFIX=/home/hadoop/.versions/hive-0.11.0/hcatalog/
/home/hadoop/hive/hcatalog/sbin/webhcat_server.sh start
You can then confirm that it's running on port 50111 using curl,
curl -i http://localhost:50111/templeton/v1/status
To hit 50111 on other machines you have to open the port up in the EC2 EMR security group.
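For example, opening that port from the AWS CLI could look like this (the security group ID and CIDR range are placeholders for your own values):
# Allow inbound TCP 50111 on the EMR master security group (placeholder values)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 50111 \
    --cidr 203.0.113.0/24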
You then have to configure the users you are going to "proxy" when you run queries in HCatalog. I didn't actually save this configuration, but it is outlined in the WebHCat documentation. I wish they had some concrete examples there, but basically I ended up configuring the local 'hadoop' user as the one that runs the queries. Not the most secure thing to do, I'm sure, but I was just trying to get it up and running.
Attempting a query then gave me this error,
{"error":"Server IPC version 9 cannot communicate with client version
4"}
The workaround was to switch off of the latest EMR image (3.0.4 with Hadoop 2.2.0) and switch to a Hadoop 1.0 image (2.4.2 with Hadoop 1.0.3).
I then hit another issue where it couldn't find the Hive jar properly. After struggling with the configuration some more, I decided I had dumped enough time into trying to get this to work and chose to communicate with Hive directly instead (using RBHive for Ruby and JDBC for the JVM).
To answer my own question: it is possible to run WebHCat on EMR, but it's not documented at all (Googling led me nowhere, which is why I created this question in the first place; it's currently the first hit when you search "WebHCat EMR"), and the WebHCat documentation leaves a lot to be desired. Getting it to work seems like a pain, though my hope is that by writing up the initial steps someone will come along, take it the rest of the way, and post a complete answer.
I did not test it, but it should be doable.
EMR allows you to customise the bootstrap actions, i.e. the scripts that run when the nodes are started. You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster.
See more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html.
I would create a shell script to install WebHCat and test it on a regular EC2 instance first (outside the context of EMR, just to ensure the script is OK).
You can use EC2's user-data properties to test your script, typically:
#!/bin/bash
curl http://path_to_your_install_script.sh | sh
Then, once you know the script is working, make it available to the cluster in an S3 bucket and follow these instructions to include your script as a custom bootstrap action of your cluster.
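A sketch of that last step, assuming the install script is called install-webhcat.sh and using placeholder bucket, AMI and instance values:
# Upload the tested install script to S3 (placeholder bucket name)
aws s3 cp install-webhcat.sh s3://my-emr-bootstrap-scripts/install-webhcat.sh
# Reference it as a custom bootstrap action when creating the cluster
aws emr create-cluster \
    --name "hive-webhcat-cluster" \
    --ami-version 2.4.2 \
    --instance-type m1.large \
    --instance-count 3 \
    --bootstrap-actions Path=s3://my-emr-bootstrap-scripts/install-webhcat.sh,Name=InstallWebHCat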
--Seb