How to add connectors to Presto on Amazon EMR

I've set up a small EMR cluster with Hive and Presto installed; I want to query files on S3 and import them into Postgres on RDS.
To run queries on S3 and save the results in a table in Postgres, I've done the following:
Started a 3 node EMR cluster from the AWS console.
Manually SSH into the Master node to create an EXTERNAL table in hive, looking at an S3 bucket.
Manually SSH into each of the 3 nodes and add a new catalog file:
/etc/presto/conf.dist/catalog/postgres.properties
with the following contents
connector.name=postgresql
connection-url=jdbc:postgresql://ip-to-postgres:5432/database
connection-user=<user>
connection-password=<pass>
and edited this file
/etc/presto/conf.dist/config.properties
adding
datasources=postgresql,hive
Restart presto by running the following manually on all 3 nodes
sudo restart presto-server
This setup seems to work well.
In my application, there are multiple databases created dynamically. It seems that those configuration/catalog changes need to be made for each database and the server needs to be restarted to see the new config changes.
Is there a proper way for my application (using boto or other methods) to update configurations by
Adding a new catalog file in all nodes /etc/presto/conf.dist/catalog/ for each new database
Adding a new entry in all nodes in /etc/presto/conf.dist/config.properties
Gracefully restarting Presto across the whole cluster (ideally when it becomes idle, but that's not a major concern).

I believe you can run a simple bash script to achieve what you want. There is no other way except creating a new cluster with the --configurations parameter, where you provide the desired configurations. You can run the script below from the master node.
#!/bin/sh
# Write "cluster_nodes.txt" with the private IP address of each node.
aws emr list-instances --cluster-id <cluster-id> --instance-states RUNNING | grep PrivateIpAddress | sed 's/"PrivateIpAddress"://' | sed 's/\"//g' | awk '{gsub(/^[ \t]+|[ \t]+$/,""); print;}' > cluster_nodes.txt
# For each IP, connect with ssh and configure.
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Connecting $line"
    scp -i <PEM file> postgres.properties hadoop@$line:/tmp;
    ssh -i <PEM file> hadoop@$line "sudo mv /tmp/postgres.properties /etc/presto/conf/catalog; sudo chown presto:presto /etc/presto/conf/catalog/postgres.properties; sudo chmod 644 /etc/presto/conf/catalog/postgres.properties; sudo restart presto-server";
done < cluster_nodes.txt
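If you prefer to avoid the grep/sed pipeline, the same node list can be produced with a JMESPath --query (a minimal sketch; <cluster-id> is a placeholder as above):
aws emr list-instances \
    --cluster-id <cluster-id> \
    --instance-states RUNNING \
    --query 'Instances[].PrivateIpAddress' \
    --output text | tr '\t' '\n' > cluster_nodes.txt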

During provisioning of your cluster:
You can provide the configuration details at the time of provisioning.
Refer to Presto Connector Configuration for how to add this automatically when provisioning your cluster.

You can provide the configuration via the management console, or you can use the awscli to pass those configurations as follows:
#!/bin/bash
JSON=`cat <<JSON
[
  {
    "Classification": "presto-connector-postgresql",
    "Properties": {
      "connection-url": "jdbc:postgresql://ip-to-postgres:5432/database",
      "connection-user": "<user>",
      "connection-password": "<password>"
    },
    "Configurations": []
  }
]
JSON`
aws emr create-cluster --configurations "$JSON" # ... rest of params
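If the cluster already exists, runs EMR 5.21.0 or later, and uses uniform instance groups, the reconfiguration API may let you apply a new classification without recreating the cluster; a hedged sketch (the cluster and instance-group IDs are placeholders):
aws emr modify-instance-groups \
    --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups '[
      {
        "InstanceGroupId": "ig-XXXXXXXXXXXXX",
        "Configurations": [
          {
            "Classification": "presto-connector-postgresql",
            "Properties": {
              "connection-url": "jdbc:postgresql://ip-to-postgres:5432/database",
              "connection-user": "<user>",
              "connection-password": "<password>"
            }
          }
        ]
      }
    ]'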


How can I install aws cli, from WITHIN the ECS task?

Question:
How can I install aws cli from WITHIN the ECS task?
DESCRIPTION:
I'm using a docker container to run the logstash application (it is part of the elastic family).
The docker image name is "docker.elastic.co/logstash/logstash:7.10.2"
This logstash application needs to write to S3, thus it needs AWS CLI installed.
If aws is not installed, it crashes.
# STEP 1 #
To avoid the crash, when I used this application as a plain Docker container, I delayed the logstash start until after the container had started.
I did this by adding a "sleep" command to an external docker-entrypoint file, before it starts logstash.
This is how it looks in the docker-entrypoint file:
sleep 120
if [[ -z $1 ]] || [[ ${1:0:1} == '-' ]] ; then
    exec logstash "$@"
else
    exec "$@"
fi
# EOF
# STEP 2 #
Run the container with the "--entrypoint" flag so it uses my entrypoint file:
docker run \
-d \
--name my_logstash \
-v /home/centos/DevOps/psifas_logstash_docker-entrypoint:/usr/local/bin/psifas_logstash_docker-entrypoint \
-v /home/centos/DevOps/logstash.conf:/usr/share/logstash/pipeline/logstash.conf \
-v /home/centos/DevOps/logstash.yml:/usr/share/logstash/config/logstash.yml \
--entrypoint /usr/local/bin/psifas_logstash_docker-entrypoint \
docker.elastic.co/logstash/logstash:7.10.2
# STEP 3 #
Install and configure the AWS CLI from the server hosting the container:
docker exec -it -u root <DOCKER_CONTAINER_ID> yum install awscli -y
docker exec -it <DOCKER_CONTAINER_ID> aws configure set aws_access_key_id <MY_aws_access_key_id>
docker exec -it <DOCKER_CONTAINER_ID> aws configure set aws_secret_access_key <MY_aws_secret_access_key>
docker exec -it <DOCKER_CONTAINER_ID> aws configure set region <MY_region>
This worked for me.
Now I want to "translate" this flow into an AWS ECS task.
In ECS I will use parameters instead of running the above 3 "aws configure" commands.
MY QUESTION
How can I do my 3rd step, installing the aws cli, from WITHIN the ECS task? (Meaning: not running it on the EC2 server hosting the ECS cluster.)
When I was working with the plain docker container I also considered these options for using the aws cli:
find an official elastic docker image containing both logstash and aws cli. <-- I did not find one.
create such an image myself and use it. <-- I prefer not to, because I want to avoid the maintenance of creating new custom images when needed (e.g. when a new version of the logstash image is available).
Eventually I chose the 3 steps above, but I'm open to suggestions.
Also, my tests showed that running 2 containers within the same ECS task:
logstash
awscli (image "amazon/aws-cli")
and having the logstash container use the aws cli container, does not work.
THANKS A LOT IN ADVANCE :-)
Your option #2, create the image yourself, is really the best way to do this. Anything else is going to be a "hack". Also, you shouldn't be running aws configure for an image running in ECS; you should be assigning an IAM role to the task, and the AWS CLI will pick that up and use it.
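For illustration, a hedged sketch of registering a task definition that assigns such a role (the role ARN, account ID, and family name are placeholders, not from the question):
aws ecs register-task-definition \
    --family logstash \
    --requires-compatibilities EC2 \
    --task-role-arn arn:aws:iam::123456789012:role/logstash-task-role \
    --container-definitions '[
      {
        "name": "logstash",
        "image": "docker.elastic.co/logstash/logstash:7.10.2",
        "memory": 2048,
        "essential": true
      }
    ]'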
Mark B, your answer helped me to solve this. Thanks!
Writing the solution here in case it helps somebody else.
There is no need to install the AWS CLI in the logstash docker container running inside the ECS task.
Inside the logstash container (from image "docker.elastic.co/logstash/logstash:7.10.2") there is an AWS SDK to connect to S3.
The only thing required is to allow the ECS task execution role access to S3.
(I attached AmazonS3FullAccess policy)
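A hedged sketch of attaching that managed policy with the CLI (the role name is a placeholder for whatever role your task uses):
aws iam attach-role-policy \
    --role-name my-ecs-task-execution-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess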

Ansible Dynamic inventory with static group with dynamic children

I am sure many who work with Terraform and Ansible or just Ansible on a daily basis must have come across this question.
Some background:
I create my infrastructure on AWS using Terraform and configure my machines using Ansible. My inventory file contains hardcoded public IP addresses with some variables. As the business demands, I create and destroy my machines very often.
My question:
I don't want to update my inventory file with new public IP addresses every time I destroy and create my instances. So my fundamental requirement is: every time I destroy my machines, I should be able to run my Terraform script to recreate them, and when I run my Ansible playbook, Ansible should be able to pick up the right target machines and run the playbook. I need to know what to describe in my inventory file to achieve this automation. A domain name (www.fooexample.com) or static public IP addresses in the inventory file are not an option in my case. I have seen scripts that do it with what looks like a hostname (webserver1).
There are forums that talk about using the ec2.py option, but ec2.py gets all the public IP addresses associated with the account, and as you can imagine I only want to target some of the machines with my playbook, not all of them.
Any help regarding this would be appreciated.
Thanks in Advance
I do something similar in GCP but the concept should apply to AWS.
Starting with Ansible 2.7 there is a new inventory plugin architecture and some inventory plugins to replace the dynamic inventory scripts (such as ec2.py and gcp.py). The AWS plugin documentation is at https://docs.ansible.com/ansible/2.9/plugins/inventory/aws_ec2.html.
First, you need to tag the groups of hosts you want to target in AWS. You should be able to handle this with Terraform (such as Service = Web).
Next, enable the aws_ec2 plugin in ansible.cfg by adding:
[inventory]
enable_plugins = aws_ec2
Now, convert over to using the new plugin instead of ec2.py. This means creating an aws_ec2.yaml file based on the documentation. An example might look like:
plugin: aws_ec2
regions:
  - us-east-1
keyed_groups:
  - prefix: tag
    key: tags
# Set individual variables with compose
compose:
  ansible_host: public_ip_address
The key parts here are the keyed_groups and compose sections. These give you the public IP addresses as the hosts to connect to in the inventory, and groups you can limit to with -l or --limit.
Assuming you had some instances in us-east-1 tagged with Service = Web, you could target them like:
ansible -i aws_ec2.yaml -m ping -l tag_Service_Web
This would target just those tagged hosts on their public IP address. Any dynamic scaling you do (such as increasing the count in Terraform for that resource) will be picked up by the inventory plugin on next run.
You can also use the tag in playbooks. If you had a playbook that you always targeted at these hosts you can set hosts: tag_Service_Web in the playbook.
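A quick way to check the groups the plugin generated, and then limit a playbook run to the tagged hosts (site.yml is a placeholder playbook name):
ansible-inventory -i aws_ec2.yaml --graph
ansible-playbook -i aws_ec2.yaml -l tag_Service_Web site.yml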
Bonus:
I've been experimenting with an Ansible Pull model that automates some of this bootstrapping. The idea is to combine cloud-init with a special script to bootstrap the playbook for that host automatically.
Example script that cloud-init kicks off:
#!/bin/bash
set -euo pipefail

lock_files=(
    /var/lib/dpkg/lock
    /var/lib/apt/lists/lock
    /var/lib/dpkg/lock-frontend
    /var/cache/apt/archives/lock
    /var/lib/apt/daily_lock
)

export ANSIBLE_HOST_PATTERN_MISMATCH="ignore"
export PATH="/tmp/ansible-venv/bin:$PATH"

for file in "${lock_files[@]}"; do
    while fuser "$file" >/dev/null 2>&1; do
        echo "Waiting for lock $file to be available..."
        sleep 5
    done
done

apt-get update -qy
apt-get install --no-install-recommends -qy virtualenv python-virtualenv python-nacl python-wheel python-bcrypt

virtualenv -p /usr/bin/python --system-site-packages /tmp/ansible-venv
pip install ansible==2.7.10 apache-libcloud==2.3.0 jmespath==0.9.3

ansible-pull myplaybook.yaml \
    -U git@github.com:myorg/infrastructure.git \
    -i gcp_compute.yaml \
    --private-key /tmp/ansible-keys/infrastructure_ssh_deploy_key \
    --vault-password-file /tmp/ansible-keys/vault \
    -d /tmp/ansible-infrastructure \
    --accept-host-key
This script is a bit simplified from my actual one (leaving out some domain specific authentication and key providing stuff). But you can adapt it to AWS by doing something like bootstrapping keys from S3 or KMS or another boot time configuration service. I find that ansible-pull works well when the playbook only takes a minute or two to run and doesn't have any dependencies on external inventory (like references to other groups such as to gather IP addresses).
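To adapt this to AWS, one option is to keep the bootstrap script in S3 and have EC2 user data fetch and run it on first boot; a hedged sketch (the bucket and key are placeholders, and it assumes an AMI with the AWS CLI preinstalled and an instance profile that can read the bucket):
#!/bin/bash
# EC2 user data: pull the bootstrap script above from S3 and run it on first boot
aws s3 cp s3://my-bootstrap-bucket/ansible-pull-bootstrap.sh /tmp/ansible-pull-bootstrap.sh
chmod +x /tmp/ansible-pull-bootstrap.sh
/tmp/ansible-pull-bootstrap.sh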

AWS static IP address

I am using the AWS CodeDeploy agent and deploying my project to the server through the Bitbucket plugin.
The CodeDeploy agent first executes the script files, which contain the commands to run my Spring Boot project.
I have two environments, one development and one production, and I want the script to do things differently based on the environment, i.e. the two different instances.
My plan is to fetch the AWS static IP address mapped to the instance and from that determine the environment
(production or stage).
How do I fetch the Elastic IP address through sh commands?
Static IP will work.
A more natural CodeDeploy way to solve this is to set up 2 CodeDeploy deployment groups, one for your development env and the other for your production env. Then in your script you can use environment variables that CodeDeploy sets during the deployment to know which env you are deploying to.
Here is a blog post about how to use CodeDeploy environment variables: https://aws.amazon.com/blogs/devops/using-codedeploy-environment-variables/
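For example, a hedged sketch of a lifecycle hook script that branches on the deployment group (DEPLOYMENT_GROUP_NAME is set by the CodeDeploy agent; the group names are placeholders):
#!/bin/bash
if [ "$DEPLOYMENT_GROUP_NAME" = "myapp-production" ]; then
    echo "starting with production settings"
elif [ "$DEPLOYMENT_GROUP_NAME" = "myapp-development" ]; then
    echo "starting with development settings"
fi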
You could do the following:
id=$( curl http://169.254.169.254/latest/meta-data/instance-id )
eip=$( aws ec2 describe-addresses --filters Name=instance-id,Values=${id} | jq .Addresses[].PublicIp --raw-output )
The above gets the instance-id from metadata, then uses the aws cli to look for elastic IPs filtered by the id from metadata. Using jq this output can then be parsed down to the IP you are looking for.
Query the metadata server
eip=`curl -s 169.254.169.254/latest/meta-data/public-ipv4`
echo $eip
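To answer the original question of deciding the environment from that address, a hedged sketch (the IPs are placeholders for your two Elastic IPs):
eip=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
case "$eip" in
    203.0.113.10) profile=production ;;
    203.0.113.20) profile=stage ;;
    *)            profile=unknown ;;
esac
echo "current profile is $profile"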
The solution is completely off on a tangent from what I originally asked, but it was enough for my requirement.
I just needed to know which environment I am in to do certain actions. So what I did was set an environment variable in an independent script file, with its value being that of the environment.
e.g. let's say in a file env-variables.sh:
export profile=stage
In the script file where the commands have to be executed based on the environment, I access it this way:
source /test/env-variables.sh
echo "current profile is $profile"
if [ "$profile" = "stage" ]
then
    echo stage
elif [ "$profile" = "production" ]
then
    echo production
else
    echo failure
fi
Hope someone finds it useful.

Run cron task on AWS EMR master node

How can I run a periodic job in the background on an EMR cluster?
I have script.sh with a cron job and application.py in S3, and want to start the cluster with this command:
aws emr create-cluster \
    --name "Test cluster" \
    --release-label emr-5.12.0 \
    --applications Name=Hive Name=Pig Name=Ganglia Name=Spark \
    --use-default-roles \
    --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/script.sh"]
Finally, I want the cron job from script.sh to execute application.py.
What I don't understand is how to install cron on the master node; the Python file also needs some libraries, and they should be installed too.
You need to SSH into the master node, and then perform the crontab setup from there, not on your local machine:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
Connect to the Master Node Using SSH
Secure Shell (SSH) is a network protocol you can use to create a
secure connection to a remote computer. After you make a connection,
the terminal on your local computer behaves as if it is running on the
remote computer. Commands you issue locally run on the remote
computer, and the command output from the remote computer appears in
your terminal window.
When you use SSH with AWS, you are connecting to an EC2 instance,
which is a virtual server running in the cloud. When working with
Amazon EMR, the most common use of SSH is to connect to the EC2
instance that is acting as the master node of the cluster.
Using SSH to connect to the master node gives you the ability to
monitor and interact with the cluster. You can issue Linux commands on
the master node, run applications such as Hive and Pig interactively,
browse directories, read log files, and so on. You can also create a
tunnel in your SSH connection to view the web interfaces hosted on the
master node. For more information, see View Web Interfaces Hosted on
Amazon EMR Clusters.
By default, crontab is installed on Linux systems; you don't need to install it manually.
To schedule a Spark job in cron, please follow the steps below:
Log in to the master node (SSH into the master).
Run the command
crontab -e
Add the line below to the crontab, then save and exit (:wq)
*/15 * * * * /script-path/script.sh
Now cron will run the job every 15 minutes.
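If you want script.sh itself (run as the EMR step from the question) to set all of this up without an interactive SSH session, a hedged sketch might look like this (bucket, paths, and library names are placeholders):
#!/bin/bash
# Install whatever libraries application.py needs (assumes Amazon Linux on the master node)
sudo yum install -y python3-pip
sudo pip3 install boto3
# Fetch the application and register the cron entry non-interactively
aws s3 cp s3://mybucket/script-path/application.py /home/hadoop/application.py
( crontab -l 2>/dev/null; echo "*/15 * * * * python3 /home/hadoop/application.py" ) | crontab -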
Please refer to this link to learn more about cron.
Hope this helps.
Thanks
Ravi

How to setup Kubernetes Master HA on AWS

What I am trying to do:
I have set up a Kubernetes cluster using the documentation available on the Kubernetes website (http://kubernetes.io/v1.1/docs/getting-started-guides/aws.html). Using kube-up.sh, I was able to bring a Kubernetes cluster up with 1 master and 3 minions (as highlighted in the blue rectangle in the diagram below). From the documentation, as far as I know we can add minions as and when required, so from my point of view the k8s master instance is a single point of failure when it comes to high availability.
[diagram: Kubernetes Master HA on AWS]
So I am trying to set up an HA k8s master layer with the three master nodes as shown in the diagram above. To accomplish this I am following the Kubernetes high-availability cluster guide, http://kubernetes.io/v1.1/docs/admin/high-availability.html#establishing-a-redundant-reliable-data-storage-layer
What I have done:
Set up a k8s cluster using kube-up.sh and provider aws (master1 and minion1, minion2, minion3)
Set up two fresh master instances (master2 and master3)
I then started configuring the etcd cluster on master1, master2 and master3 by following the link below:
http://kubernetes.io/v1.1/docs/admin/high-availability.html#establishing-a-redundant-reliable-data-storage-layer
So in short I copied etcd.yaml from the Kubernetes website (http://kubernetes.io/v1.1/docs/admin/high-availability/etcd.yaml) and updated NODE_IP, NODE_NAME and the discovery token on all three nodes as shown below.
NODE_NAME  NODE_IP       DISCOVERY_TOKEN
Master1    172.20.3.150  https://discovery.etcd.io/5d84f4e97f6e47b07bf81be243805bed
Master2    172.20.3.200  https://discovery.etcd.io/5d84f4e97f6e47b07bf81be243805bed
Master3    172.20.3.250  https://discovery.etcd.io/5d84f4e97f6e47b07bf81be243805bed
And on running etcdctl member list on all the three nodes, I am getting:
$ docker exec <container-id> etcdctl member list
ce2a822cea30bfca: name=default peerURLs=http://localhost:2380,http://localhost:7001 clientURLs=http://127.0.0.1:4001
As per the documentation we need to keep etcd.yaml in /etc/kubernetes/manifests; this directory already contains etcd.manifest and etcd-event.manifest files. For testing I modified the etcd.manifest file with the etcd parameters.
After making the above changes I forcefully terminated the docker container; the container was exiting after a few seconds and I was getting the error below when running kubectl get nodes:
error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused
So please suggest how I can set up a highly available k8s master on AWS.
To configure an HA master, you should follow the High Availability Kubernetes Cluster document, in particular making sure you have replicated storage across failure domains and a load balancer in front of your replicated apiservers.
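As a rough illustration of the load-balancer part, a hedged sketch using the classic ELB CLI (the load balancer name, subnet, and instance IDs are placeholders for your three masters):
aws elb create-load-balancer \
    --load-balancer-name k8s-apiserver \
    --listeners "Protocol=TCP,LoadBalancerPort=443,InstanceProtocol=TCP,InstancePort=443" \
    --subnets subnet-0123456789abcdef0
aws elb register-instances-with-load-balancer \
    --load-balancer-name k8s-apiserver \
    --instances i-11111111 i-22222222 i-33333333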
Setting up HA controllers for kubernetes is not trivial and I can't provide all the details here but I'll outline what was successful for me.
Use kube-aws to set up a single-controller cluster: https://coreos.com/kubernetes/docs/latest/kubernetes-on-aws.html. This will create CloudFormation stack templates and cloud-config templates that you can use as a starting point.
Go to the AWS CloudFormation Management Console, click the "Template" tab and copy out the complete stack configuration. Alternatively, use $ kube-aws up --export to generate the cloudformation stack file.
Use the userdata cloud-config templates generated by kube-aws and replace the variables with actual values. This guide will help you determine what those values should be: https://coreos.com/kubernetes/docs/latest/getting-started.html. In my case I ended up with four cloud-configs:
cloud-config-controller-0
cloud-config-controller-1
cloud-config-controller-2
cloud-config-worker
Validate your new cloud-configs here: https://coreos.com/validate/
Insert your cloud-configs into the CloudFormation stack config. First compress and encode your cloud config:
$ gzip -k cloud-config-controller-0
$ cat cloud-config-controller-0.gz | base64 > cloud-config-controller-0.enc
Now copy the content of your encoded cloud-config into the CloudFormation config. Look for the UserData key for the appropriate InstanceController. (I added additional InstanceController objects for the additional controllers.)
Update the stack at the AWS CloudFormation Management Console using your newly created CloudFormation config.
You will also need to generate TLS assets: https://coreos.com/kubernetes/docs/latest/openssl.html. These assets will have to be compressed and encoded (same gzip and base64 as above), then inserted into your userdata cloud-configs.
When debugging on the server, journalctl is your friend:
$ journalctl -u oem-cloudinit # to debug problems with your cloud-config
$ journalctl -u etcd2
$ journalctl -u kubelet
Hope that helps.
There is also the kops project.
From the project README:
Operate HA Kubernetes the Kubernetes Way
also:
We like to think of it as kubectl for clusters
Download the latest release, e.g.:
cd ~/opt
wget https://github.com/kubernetes/kops/releases/download/v1.4.1/kops-linux-amd64
mv kops-linux-amd64 kops
chmod +x kops
ln -s ~/opt/kops ~/bin/kops
See kops usage, especially:
kops create cluster
kops update cluster
Assuming you already have an s3://my-kops bucket and a kops.example.com hosted zone.
Create configuration:
kops create cluster --state=s3://my-kops --cloud=aws \
    --name=kops.example.com \
    --dns-zone=kops.example.com \
    --ssh-public-key=~/.ssh/my_rsa.pub \
    --master-size=t2.medium \
    --master-zones=eu-west-1a,eu-west-1b,eu-west-1c \
    --network-cidr=10.0.0.0/22 \
    --node-count=3 \
    --node-size=t2.micro \
    --zones=eu-west-1a,eu-west-1b,eu-west-1c
Edit configuration:
kops edit cluster --state=s3://my-kops
Export terraform scripts:
kops update cluster --state=s3://my-kops --name=kops.example.com --target=terraform
Apply changes directly:
kops update cluster --state=s3://my-kops --name=kops.example.com --yes
List cluster:
kops get cluster --state s3://my-kops
Delete cluster:
kops delete cluster --state s3://my-kops --name=kops.example.com --yes