Does an EMR master node know its cluster ID?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in this message so that the recipient knows which cluster the message is about.
Does the master node know its ID (j-*************)? If not, then is there some other piece of identifying information that could allow the message recipient to infer this ID?
I've taken a look through the config files in /home/hadoop/conf, and I haven't found anything useful. I found the ID in /mnt/var/log/instance-controller/instance-controller.log, but it looks like it'll be difficult to grep for. I'm wondering where instance-controller might get that ID from in the first place.

You may look at /mnt/var/lib/info/ on the master node to find a lot of information about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId (the cluster ID).
You can use the pre-installed JSON parser (jq) to get the job flow ID:
cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId"
(updated as per @Marboni)
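For the agent use case in the question, here is a minimal sketch of a master-node script that tags its messages with the cluster ID. The SQS queue URL is a placeholder, and it assumes the instance role is allowed to call sqs:SendMessage:
# Read the cluster ID from job-flow.json and send a tagged message
# to a central SQS queue (queue URL below is hypothetical).
CLUSTER_ID=$(jq -r '.jobFlowId' /mnt/var/lib/info/job-flow.json)
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/central-queue"
aws sqs send-message \
    --queue-url "$QUEUE_URL" \
    --message-body "{\"clusterId\": \"$CLUSTER_ID\", \"event\": \"heartbeat\"}"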

You can use the Amazon EC2 API to figure this out. The example below uses shell commands for simplicity; in real life you should use the appropriate API calls for these steps.
First you should find out your instance ID:
INSTANCE=`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`
Then you can use your instance ID to find the cluster ID:
ec2-describe-instances $INSTANCE | grep TAG | grep aws:elasticmapreduce:job-flow-id
Hope this helps.
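Note that ec2-describe-instances belongs to the legacy EC2 API tools. With the modern AWS CLI, the same tag lookup is roughly the following (a sketch, assuming the instance profile allows ec2:DescribeTags and a region is configured):
# Look up this instance's aws:elasticmapreduce:job-flow-id tag,
# which EMR sets on every instance in the cluster.
INSTANCE=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 describe-tags \
    --filters "Name=resource-id,Values=$INSTANCE" \
              "Name=key,Values=aws:elasticmapreduce:job-flow-id" \
    --query 'Tags[0].Value' --output text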

As specified above, the information is in the job-flow.json file, along with several other attributes. Knowing where it's located, you can extract the ID very easily:
cat /mnt/var/lib/info/job-flow.json | grep jobFlowId | cut -f2 -d: | cut -f2 -d'"'
Edit: This command works on core nodes as well.

Another option - query the metadata server:
curl -s http://169.254.169.254/2016-09-02/user-data/ | sed -r 's/.*clusterId":"(j-[A-Z0-9]+)",.*/\1/g'
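If you'd rather not maintain that regex, a looser extraction that just grabs the first j-... token also works (assuming the cluster ID appears anywhere in the user-data payload):
# Pull the first EMR cluster ID token out of the user data.
curl -s http://169.254.169.254/2016-09-02/user-data/ | grep -oE 'j-[A-Z0-9]+' | head -n 1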

Apparently the Hadoop MapReduce job has no way to know which cluster it is running on - I was surprised to find this out myself.
BUT: you can use other identifiers to uniquely identify the mapper that is running and the job it belongs to.
These are specified in the environment variables passed on to each mapper. If you are writing a job in Hadoop streaming, using Python, the code would be:
import os

# Hadoop Streaming exposes job/task metadata to each mapper as environment
# variables (dots in the Hadoop property names are replaced by underscores).
if 'map_input_file' in os.environ:
    fileName = os.environ['map_input_file']  # input file for this mapper
if 'mapred_tip_id' in os.environ:
    mapper_id = os.environ['mapred_tip_id'].split("_")[-1]  # task ID suffix
if 'mapred_job_id' in os.environ:
    jobID = os.environ['mapred_job_id']  # ID of the running job
That gives you: input file name, the task ID, and the job ID. Using one or a combination of those three values, you should be able to uniquely identify which mapper is running.
If you are looking for a specific job: "mapred_job_id" might be what you want.

Related

Using multiple SSH keys for different hosts with Ansible EC2 Inventory Plugin

I am trying to use Ansible to install applications across a number of existing AWS EC2 instances which use a number of different SSH keys and usernames on different Linux OSes. Because of the changing state of the existing instances I am attempting to use Ansible's Dynamic Inventory via the aws_ec2 inventory plugin as recommended.
I am able to group the hosts by key_name but now need to run the Ansible playbook against this inventory using the relevant SSH key and username according to the group, structured as the below example output from ansible-inventory -i inventory.aws_ec2.yml --graph:
#all:
|--#_SSHkey1:
| |--hostnameA
| |--hostnameB
|--#_SSHkey2:
| |--hostnameC
|--#_SSHkey3:
| |--hostnameD
| |--hostnameE
| |--hostnameF
|--#aws_ec2:
| |--hostnameA
| |--hostnameB
| |--hostnameC
| |--hostnameD
| |--hostnameE
| |--hostnameF
|--#ungrouped:
I have tried creating a separate hosts file (as per the below) using the groups as listed above, providing the path to the relevant SSH key, but I am unsure how you would use this with the dynamic inventory.
[SSHkey1]
ansible_user=ec2-user
ansible_ssh_private_key_file=/path/to/SSHkey1
[SSHkey2]
ansible_user=ubuntu
ansible_ssh_private_key_file=/path/to/SSHkey2
[SSHkey3]
ansible_user=ec2-user
ansible_ssh_private_key_file=/path/to/SSHkey3
This is not explained in the official Ansible documentation here and here but should be a common use case. A lot of the documentation I have found refers to an older method of using Dynamic Inventory using a python script (ec2.py) which is deprecated and so is no longer relevant (for instance this AWS post).
I have found a similar unanswered question here (Part 3).
Any links to examples, documentation or explanations would be greatly appreciated as this seems to be a relatively new way of creating a dynamic inventory and I am finding it hard to locate clear, detailed documentation.
Edit
Using group variables, as suggested by @larsks in the comments, worked. I was initially caught out by the fact that the SSH key names returned by the inventory plugin are prefixed with an underscore, so the group names need to be of the form _SSHkey.
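For reference, a minimal sketch of the group_vars layout that worked (the key names and paths are placeholders for your own):
# One group_vars file per key-name group produced by the aws_ec2 plugin;
# note the leading underscore in each group name.
mkdir -p group_vars
cat > group_vars/_SSHkey1.yml <<'EOF'
ansible_user: ec2-user
ansible_ssh_private_key_file: /path/to/SSHkey1
EOF
cat > group_vars/_SSHkey2.yml <<'EOF'
ansible_user: ubuntu
ansible_ssh_private_key_file: /path/to/SSHkey2
EOF
# Ansible then picks the right user and key per group automatically:
ansible-playbook -i inventory.aws_ec2.yml playbook.yml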
Have you considered using the SSH config file, ~/.ssh/config? You can put host-specific connection information there. Host, Hostname, User, and IdentityFile are the four options you need:
Host ec1
Hostname 10.10.10.10
User ubuntu
IdentityFile ~/.ssh/ec1-ubuntu.rsa
Then when you ssh to 'ec1', ssh will connect to host 10.10.10.10 as user ubuntu with the specified RSA key. 'ec1' can be any name you like; it does not have to be the actual host name, IP, or FQDN. Make it match your inventory name.
Warning: make certain the permissions on the ~/.ssh directory are 0700 and the files within it are 0600, and that the owner is correct, or ssh will give you fits. On Ubuntu, /var/log/auth.log will help with troubleshooting.

How to get ec2 instance details with price details using aws cli

How can I get EC2 instance details (like name, ID, type, region, volume, platform, on-demand/reserved) along with instance price details, using the AWS API from the CLI, and write the result to a CSV file?
Thanks in advance.
Similar to my answer here: get ec2 pricing programmatically?
You can do something like the following:
aws pricing get-products --service-code AmazonEC2 --filters "Type=TERM_MATCH,Field=instanceType,Value=m5.xlarge" "Type=TERM_MATCH,Field=location,Value=US East (N. Virginia)" --region us-east-1 | jq -rc '.PriceList[]' | jq -r '[ .product.attributes.servicecode, .product.attributes.location, .product.attributes.instancesku?, .product.attributes.instanceType, .product.attributes.usagetype, .product.attributes.operatingSystem, .product.attributes.memory, .product.attributes.physicalProcessor, .product.attributes.processorArchitecture, .product.attributes.vcpu, .product.attributes.currentGeneration, .terms.OnDemand[].priceDimensions[].unit, .terms.OnDemand[].priceDimensions[].pricePerUnit.USD, .terms.OnDemand[].priceDimensions[].description] | #csv'
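The pricing command above already emits CSV rows, so it can simply be redirected into a file (e.g. > pricing.csv). To also pull the basic per-instance attributes the question lists, a rough sketch with the standard AWS CLI (file name is a placeholder; the Name tag and Platform fields may be empty on some instances):
# One CSV row per instance: name tag, ID, type, AZ, platform.
aws ec2 describe-instances \
    --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value,InstanceId,InstanceType,Placement.AvailabilityZone,Platform]' \
    --output text | tr '\t' ',' > instances.csv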
I recommend you use Ansible with the EC2 dynamic inventory to do this.
Ansible can gather all of this information with requests like the following, which retrieves the platform, for example:
ansible -i ec2.py -m debug -a "var=ec2_platform" all
You'll then have to write a playbook in YAML to collect the information you need and write it to a CSV file (see the sketch after this answer).
I don't know any easy way to get the exact price of the servers for amazon-ec2; there are a lot of factors to take into account: the OS, the disk space, the server type, whether it is reserved or not, etc.
But I got a good approximation using what I described above.
Here is the explanation for dynamic inventory with ansible and ec2:
http://docs.ansible.com/ansible/intro_dynamic_inventory.html
Hope it helped!
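As a concrete sketch of that YAML step (file names are hypothetical, and the ec2_* variables are the host variables the old ec2.py inventory exposes, so treat this as an illustration rather than a drop-in):
cat > instance-report.yml <<'EOF'
- hosts: all
  gather_facts: false
  tasks:
    # Runs on the control machine; no SSH to the instances is needed,
    # because all the ec2_* values come from the dynamic inventory itself.
    - name: Append one CSV row per instance to a local file
      delegate_to: localhost
      lineinfile:
        path: ./instances.csv
        create: yes
        line: "{{ ec2_tag_Name | default('') }},{{ ec2_id }},{{ ec2_instance_type }},{{ ec2_region }},{{ ec2_platform | default('linux') }}"
EOF
# --forks 1 avoids concurrent appends to the same file.
ansible-playbook -i ec2.py --forks 1 instance-report.yml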
If your aim is not to automate the pricing of your servers, you can get a one-shot estimate from this URL:
https://aws.amazon.com/fr/ec2/pricing/on-demand/
You'll need to know:
the server type (e.g. m3.large)
the reservation type (reserved or on-demand)
the OS type (Linux, Windows, RHEL, ...)
the hour coverage (it depends on whether you shut down your server during the night, etc.)
Then you'll have a good approximation of the price.
If you want more detail, you'll have to look at your network and data activity, and that is not so easy to calculate...
Another approach would be to go to your billing menu and look at what you were charged for the past month. But this won't work if you want to estimate the price of a new server.
Hope it helped.

Is there a simple way to get an OpsWorks id from an instance?

Since many of the OpsWorks APIs take an OpsWorks ID (different from an EC2 instance ID), it seems like there should be an easy way to get that ID. There is an opsworks-agent-cli stack_state command that returns a JSON blob that includes the ID, but that still requires parsing, and I can't be sure what tools will be available on the instance. It is reasonably easy to parse the ID out of the JSON using shell commands, but that feels like an ugly hack. Are there any commands I'm missing, or other ways to get an instance to report its ID?
I think you have to parse it.
You can use jq to parse JSON data, as is typically done when reading EC2 instance metadata. The jq package is included in Amazon Linux AMIs (see the available packages).
In your case, try opsworks-agent-cli stack_state | jq '.stack.stack_id'.
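If you'd rather not assume jq is present, a sed fallback over the same output is possible (a sketch, assuming stack_id appears as a quoted JSON field in the blob):
# Extract the value of the "stack_id" field without jq.
opsworks-agent-cli stack_state | sed -n 's/.*"stack_id": *"\([^"]*\)".*/\1/p'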

Need to get name of cloudformation template used to deploy ec2 from the command line using aws cli or api

I used a cloudformation template to create an ec2 instance. Is there any way besides tagging that I can get the name of the cloudformation template via the command line?
Method 1: Tagging
Tagging is going to be the cleanest and easiest way to get that data. You do need to do some advance work and this won't work for existing instances, but it's going to be fast and reliable.
Method 2: Cross-referencing
If you have the instance ID, you can ask CloudFormation to search for its sibling stack resources, from which you can infer the stack name, ID, etc.
import boto.cloudformation

# Find the stack resource whose physical ID is this EC2 instance's ID,
# then read the owning stack's name from it.
c = boto.cloudformation.connect_to_region('us-east-1')
print(c.describe_stack_resources(physical_resource_id='i-830e2869')[0].stack_name)
If the instance is not part of a stack, you'll get a Stack for i-830e2869 does not exist 400 error.
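The same lookup with the modern AWS CLI would be roughly (a sketch; assumes credentials and a region are configured):
# Resolve the stack name from the instance's physical resource ID.
aws cloudformation describe-stack-resources \
    --physical-resource-id i-830e2869 \
    --query 'StackResources[0].StackName' \
    --output text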
Method 3: User data
I'll admit - this was pretty creative so kudos for thinking it up.
curl http://169.254.169.254/latest/user-data | grep 'cfn-init -s' | awk '{print $3}'
The reason this works is that instances created by CloudFormation need to run /opt/aws/bin/cfn-init to install packages and /opt/aws/bin/cfn-signal to report their successful creation, and one of the parameters passed to those scripts (the -s flag) is the stack name.
It'll fail if someone edits the user data, but despite feeling a bit hacky, it seems pretty reliable. I still wouldn't recommend using it in prod given its brittle reliance on a script parameter.