How to migrate Elasticsearch data to an AWS Elasticsearch domain? - amazon-web-services

I have Elasticsearch 5.5 running on a server with some data indexed in it. I want to migrate this data to an AWS Elasticsearch cluster. How can I perform this migration? I understand that one way is to create a snapshot of the ES cluster, but I have not been able to find any proper documentation for this.

The best way to migrate is by using snapshots. You will need to snapshot your data to Amazon S3 and then perform a restore from there. Documentation for snapshots to S3 can be found here. Alternatively, you can re-index your data, though this is a longer process and there are limitations depending on the version of AWS ES.
I also recommend looking at Elastic Cloud, the official hosted offering on AWS, which includes the additional X-Pack monitoring, management, and security features. The migration guide for moving to Elastic Cloud also covers snapshots and re-indexing.
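As a rough sketch of the snapshot route: register an S3 repository on the source cluster, take a snapshot, then register the same repository on the target and restore. The repository name migration-repo, the snapshot name, the host, and the role ARN below are placeholders, not values from the question. The sketch assumes the self-managed 5.5 cluster has the repository-s3 plugin installed; on an AWS domain the registration request also has to be SigV4-signed, which plain curl does not do.

```shell
#!/bin/bash
# Build the JSON body used to register an S3 snapshot repository.
# role_arn is optional: AWS-managed domains need it, while a self-managed
# cluster using the repository-s3 plugin does not.
snapshot_repo_body() {
  local bucket="$1" region="$2" role_arn="${3:-}"
  if [ -n "$role_arn" ]; then
    printf '{"type":"s3","settings":{"bucket":"%s","region":"%s","role_arn":"%s"}}\n' \
      "$bucket" "$region" "$role_arn"
  else
    printf '{"type":"s3","settings":{"bucket":"%s","region":"%s"}}\n' \
      "$bucket" "$region"
  fi
}

# Typical sequence against the source cluster (placeholder host and names):
#   curl -XPUT "http://localhost:9200/_snapshot/migration-repo" \
#        -H 'Content-Type: application/json' \
#        -d "$(snapshot_repo_body my-es-migration-bucket ap-south-1)"
#   curl -XPUT "http://localhost:9200/_snapshot/migration-repo/snapshot-1?wait_for_completion=true"
# Then, on the target, after registering the same repository there:
#   curl -XPOST "<target>/_snapshot/migration-repo/snapshot-1/_restore"
```

Keeping the body in a function makes it easy to reuse the same bucket and region for both the source and the target registration.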

I quickly put together a shell script for this:
Github - https://github.com/vivekyad4v/aws-elasticsearch-domain-migration/blob/master/migrate.sh
#!/bin/bash
#### Make sure you have Docker engine installed on the host ####
###### TODO - Support parameters ######
export AWS_ACCESS_KEY_ID=xxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxx
export AWS_DEFAULT_REGION=ap-south-1
export AWS_DEFAULT_OUTPUT=json
export S3_BUCKET_NAME=my-es-migration-bucket
export DATE=$(date +%d-%b-%H_%M)
old_instance="https://vpc-my-es-ykp2tlrxonk23dblqkseidmllu.ap-southeast-1.es.amazonaws.com"
new_instance="https://vpc-my-es-mg5td7bqwp4zuiddwgx2n474sm.ap-south-1.es.amazonaws.com"
delete=(.kibana)
es_indexes=$(curl -s "${old_instance}/_cat/indices" | awk '{ print $3 }')
es_indexes=${es_indexes//$delete/}
es_indexes=$(echo $es_indexes|tr -d '\n')
echo "index to be copied are - $es_indexes"
for index in $es_indexes; do
  # Export ES data to S3 (using s3urls)
  docker run --rm -ti taskrabbit/elasticsearch-dump \
    --s3AccessKeyId "${AWS_ACCESS_KEY_ID}" \
    --s3SecretAccessKey "${AWS_SECRET_ACCESS_KEY}" \
    --input="${old_instance}/${index}" \
    --output "s3://${S3_BUCKET_NAME}/${index}-${DATE}.json"
  # Import data from S3 into ES (using s3urls)
  docker run --rm -ti taskrabbit/elasticsearch-dump \
    --s3AccessKeyId "${AWS_ACCESS_KEY_ID}" \
    --s3SecretAccessKey "${AWS_SECRET_ACCESS_KEY}" \
    --input "s3://${S3_BUCKET_NAME}/${index}-${DATE}.json" \
    --output="${new_instance}/${index}"
  new_indexes=$(curl -s "${new_instance}/_cat/indices" | awk '{ print $3 }')
  echo $new_indexes
  curl -s "${new_instance}/_cat/indices"
done
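One possible refinement of the index-listing step, offered as a sketch rather than as part of the linked script: the helper below (filter_indices is a hypothetical name) reads _cat/indices output and drops excluded indices by exact name. It assumes the default column layout, where the index name is the third column.

```shell
#!/bin/bash
# Read `_cat/indices` output on stdin and print index names (third column),
# one per line, skipping any name passed as an argument.
filter_indices() {
  awk -v skip="$*" '
    BEGIN {
      n = split(skip, names, " ")
      for (i = 1; i <= n; i++) drop[names[i]] = 1
    }
    NF >= 3 && !($3 in drop) { print $3 }
  '
}

# Example (old_instance as defined in the script above):
#   es_indexes=$(curl -s "${old_instance}/_cat/indices" | filter_indices .kibana)
```

Matching whole names avoids the side effect of a substring replacement, which would also rewrite an index whose name merely contains ".kibana".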

Related

AWS Glue 3.0 container not working for Jupyter notebook local development

I am working with AWS Glue and trying to test and debug locally. I followed the instructions at https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/ to develop a Glue job locally. That post uses the Glue 1.0 image, and it works as it should. However, when I follow the same steps with the Glue 3.0 image, I can't open the Jupyter notebook on :8888 as the post describes, even though every step seems correct.
Here is my command to start a Jupyter notebook on the Glue 3.0 container:
docker run -itd -p 8888:8888 -p 4040:4040 -v ~/.aws:/root/.aws:ro --name glue3_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh
Nothing shows on http://localhost:8888.
I still have no idea why. I understand the differences between Glue versions; I just want to develop and test on the latest one. Has anybody run into the same issue?
Thanks.
It seems that the Glue 3.0 image has some issues with SSL. A workaround for working locally is to disable SSL (you also have to change the script paths, as the documentation has not been updated).
$ docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" \
-e AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id) \
-e AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key) \
-e AWS_DEFAULT_REGION=$(aws --profile default configure get region) \
--name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 \
/home/glue_user/jupyter/jupyter_start.sh
After a few seconds you should have a working jupyter notebook instance running on http://127.0.0.1:8888

EC2 user data is not executed

I am setting up a web app through CodePipeline. My CloudFormation template creates an EC2 instance. In that instance's user data, I have written logic to fetch the code from S3, copy it onto the instance, and start the server. The web app uses the Python Pyramid framework.
CodePipeline is connected to GitHub. It creates a zip file and uploads it to the S3 bucket (this is all defined in a buildspec.yml file).
When I change the user data script and run CodePipeline, it works fine.
But when I change a file in the web app (my code base) and re-run CodePipeline, the change is not reflected.
This is for an Ubuntu EC2 instance.
#cloud-boothook
#!/bin/bash -xe
echo "hello "
exec > /etc/setup_log.txt 2> /etc/setup_err.txt
sleep 5s
echo "User_Data starts"
rm -rf /home/ubuntu/c
mkdir /home/ubuntu/c
key=`aws s3 ls s3://bucket-name/pipeline-name/MyApp/ --recursive | sort | tail -n 1 | awk '{print $4}'`
aws s3 cp s3://bucket-name/$key /home/ubuntu/c/
cd /home/ubuntu/c
zipname="$(cut -d'/' -f3 <<<"$key")"
echo $zipname
mv /home/ubuntu/c/$zipname /home/ubuntu/c/c.zip
unzip -o /home/ubuntu/c/c.zip -d /home/ubuntu/c/
echo $?
python3 -m venv venv
venv/bin/pip3 install -e .
rm -rf cc.zip
aws configure set default.region us-east-1
venv/bin/pserve development.ini http_port=5000 &
The expected result is that every time I run CodePipeline, the user data script executes.
Please suggest a fix, or any other approach.
The user data script is executed exactly once, at instance creation. If you want to periodically synchronize your code changes to the instance, consider setting up a cron job in your user data script or using a service like AWS CodeDeploy to deploy new versions (this is the preferred approach).
CodePipeline uses a different S3 object for each pipeline execution artifact, so you can't hard-code a reference to it. You could publish the artifact to a fixed location. You might also want to consider using CodeDeploy to deploy the latest version of your application.
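If you go the cron route, a minimal sketch could look like the following. The helper names latest_key and needs_deploy and the state-file path are made up for illustration; the bucket and prefix are the placeholders from the question. It assumes object keys contain no spaces, since the key is taken from the fourth column of the aws s3 ls output.

```shell
#!/bin/bash
# Pick the newest object key from `aws s3 ls --recursive` output on stdin.
# Each line looks like: 2019-01-05 10:22:31   12345 path/to/object.zip
latest_key() {
  sort | tail -n 1 | awk '{ print $4 }'
}

# Succeed (exit 0) when the given key differs from the one recorded in the
# state file, i.e. when a new artifact should be deployed.
needs_deploy() {
  local key="$1" state_file="$2"
  [ -n "$key" ] || return 1
  [ ! -f "$state_file" ] || [ "$(cat "$state_file")" != "$key" ]
}

# Example cron-driven flow (bucket, prefix, and paths are placeholders):
#   key=$(aws s3 ls s3://bucket-name/pipeline-name/MyApp/ --recursive | latest_key)
#   if needs_deploy "$key" /var/lib/myapp/last_key; then
#     aws s3 cp "s3://bucket-name/$key" /home/ubuntu/c/c.zip
#     echo "$key" > /var/lib/myapp/last_key
#   fi
```

Recording the last deployed key means the job can run every few minutes without re-downloading and restarting the app when nothing has changed.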

Decrypting vars when installing a new AWS instance via a user-data script

I have Ansible playbooks ready, and they include several encrypted vars. Normally, I can supply a vault password file with --vault-password-file ~/.vault_pass.txt to decrypt them and deploy the change to a remote EC2 instance, so I don't need to expose the password file.
But my requirement here is different. I need to run ansible-playbook from the user-data script when creating a new EC2 instance. Ideally, all settings should be in place automatically once the instance is running.
I deploy the instances with Terraform using the simple user-data script below:
#!/usr/bin/bash
yum -y update
/usr/local/bin/aws s3 cp s3://<BUCKET>/ansible.tar.gz ansible.tar.gz
gtar zxvf ansible.tar.gz
cd ansible
ansible-playbook -i inventory/ec2.py -c local ROLE.yml
So if the playbook contains encrypted vars, I have to include my password file in the user-data script as well.
Is there anything I can do to avoid this? Would Ansible Tower help here?
I did test with CredStash, but it is still a chicken-and-egg problem.
If you want your instances to configure themselves, they are going to need either all the credentials or another way to get the credentials, ideally with some form of one-time pass.
The best I can think of off the top of my head is to use HashiCorp's Vault to store the credentials (potentially all of your secrets, or maybe just the Ansible Vault password, which can then be used to un-vault your Ansible variables) and have your deploy process create a one-time-use token that is injected into the user-data script via Terraform's templating.
To do this you'll probably want to wrap your terraform apply command with some form of helper script that might look like this (untested):
#!/bin/bash
vault_host="10.0.0.3"
vault_port="8200"
# Ask Vault for a single-use token from the ansible_vault_read role
response=$(curl \
  -X POST \
  -H "X-Vault-Token: $VAULT_TOKEN" \
  -d '{"num_uses": 1}' \
  "http://${vault_host}:${vault_port}/v1/auth/token/create/ansible_vault_read")
vault_token=$(echo "${response}" | jq --raw-output '.auth.client_token')
# Double quotes so the shell expands the variables before Terraform sees them
terraform apply \
  -var "vault_host=${vault_host}" \
  -var "vault_port=${vault_port}" \
  -var "vault_token=${vault_token}"
And then your user data script will want to be templated in Terraform with something like this (also untested):
template.tf:
resource "template_file" "init" {
  template = "${file("${path.module}/init.tpl")}"
  vars {
    vault_host = "${var.vault_host}"
    vault_port = "${var.vault_port}"
    vault_token = "${var.vault_token}"
  }
}
init.tpl:
#!/usr/bin/bash
yum -y update
# Shell variables are written without braces so that Terraform's template
# interpolation only touches vault_token, vault_host, and vault_port
response=$(curl \
  -H "X-Vault-Token: ${vault_token}" \
  -X GET \
  "http://${vault_host}:${vault_port}/v1/secret/ansible_vault_pass")
ansible_vault_password=$(echo "$response" | jq --raw-output '.data.ansible_vault_pass')
echo "$ansible_vault_password" > ~/.vault_pass.txt
/usr/local/bin/aws s3 cp s3://<BUCKET>/ansible.tar.gz ansible.tar.gz
gtar zxvf ansible.tar.gz
cd ansible
ansible-playbook -i inventory/ec2.py -c local ROLE.yml --vault-password-file ~/.vault_pass.txt
Alternatively, you could simply have the instances call something such as Ansible Tower to trigger a playbook run against themselves. This lets you keep the secrets on the central box doing the configuration rather than distributing them to every instance you deploy.
With Ansible Tower this is done using callbacks: you set up job templates and then have your user-data script curl the Tower to trigger the configuration run. You could change your user-data script to something like this instead:
template.tf:
resource "template_file" "init" {
  template = "${file("${path.module}/init.tpl")}"
  vars {
    ansible_tower_host = "${var.ansible_tower_host}"
    ansible_host_config_key = "${var.ansible_host_config_key}"
  }
}
init.tpl:
#!/usr/bin/bash
curl \
  -X POST \
  --data "host_config_key=${ansible_host_config_key}" \
  "http://${ansible_tower_host}/api/v1/job_templates/1/callback/"
The host_config_key may look like a secret at first glance, but it is a shared key that multiple hosts can use to access a job template. Ansible Tower will still only run the job if the host is defined in the job template's static inventory or, if you are using dynamic inventories, if the host is found in that lookup.

Reading revision string in post-deploy hook

I thought this would be easy, but I cannot find a way to get the revision string from a post-deploy hook on Elastic Beanstalk. The use case is straightforward: I want to notify Rollbar of a deploy.
Here is the current script:
# Rollbar deploy notifier
files:
  "/opt/elasticbeanstalk/hooks/appdeploy/post/90_notify_rollbar.sh":
    mode: "000755"
    content: |
      #!/bin/bash
      . /opt/elasticbeanstalk/support/envvars
      LOCAL_USERNAME=`whoami`
      REVISION=`date +%Y-%m-%d:%H:%M:%S`
      curl https://api.rollbar.com/api/1/deploy/ \
        -F access_token=$ROLLBAR_KEY \
        -F environment=$RAILS_ENV \
        -F revision=$REVISION \
        -F local_username=$LOCAL_USERNAME
So far I'm using the current date as revision number, but that isn't really helpful. I tried using /opt/elasticbeanstalk/bin/get-config but I couldn't find anything relevant in the environment and container section, and couldn't read anything from meta. Plus, I found no doc about those, so...
Ideally, I would also like the username of the deployer, not the one on the local machine, but that would be the cherry on the cake.
Thanks for your time!
You can update your Elastic Beanstalk instance profile role (aws-elasticbeanstalk-ec2-role) to allow it to call Elastic Beanstalk APIs. In the post-deploy hook you can then call DescribeEnvironments with the current environment name, using the AWS CLI or any of the AWS SDKs.
Let me know if you have any more questions about this or if this does not work for you.
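A minimal sketch of that call with the AWS CLI, assuming the CLI is available on the instance, a region is configured, and the instance profile allows elasticbeanstalk:DescribeEnvironments. The function name current_version_label and the environment name are placeholders. The value returned is whatever label was given to the application version at deploy time.

```shell
#!/bin/bash
# Print the version label currently deployed to an Elastic Beanstalk
# environment, using the DescribeEnvironments API via the AWS CLI.
current_version_label() {
  local env_name="$1"
  aws elasticbeanstalk describe-environments \
    --environment-names "$env_name" \
    --query 'Environments[0].VersionLabel' \
    --output text
}

# In the hook script (environment name is a placeholder):
#   REVISION=$(current_version_label "my-env-name")
```

The resulting REVISION could then replace the date-based value passed to the Rollbar deploy endpoint.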
I'm also looking for an easy alternative to the API. For now I do it from bash, on the machine I deploy from:
eb deploy && curl https://api.rollbar.com/api/1/deploy/ -F access_token=xxx -F environment=production -F revision=`git rev-parse --verify HEAD` -F rollbar_username=xxx
Replace xxx with your access token and your Rollbar username.

How to set an environment variable in Amazon EC2

I created a tag on the AWS console for one of my EC2 instances.
However, when I look on the server, no such environment variable is set.
The same thing works with elastic beanstalk. env shows the tags I created on the console.
$ env
[...]
DB_PORT=5432
How can I set environment variables in Amazon EC2?
You can retrieve this information from the instance metadata and then run your own commands to set the environment variables.
You can get the instance ID from the metadata (see here for details: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-data-retrieval):
curl http://169.254.169.254/latest/meta-data/instance-id
Then you can call describe-tags using the pre-installed AWS CLI (or install it on your AMI), filtering on the tag key:
aws ec2 describe-tags --filters "Name=resource-id,Values=i-5f4e3d2a" "Name=key,Values=DB_PORT"
Then you can use the OS command for setting an environment variable:
export DB_PORT=/what/you/got/from/the/previous/call
You can run all that in your user-data script. See here for details: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
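The steps above can be sketched as a single helper. This is an illustration under a few assumptions: the instance metadata endpoint answers plain (IMDSv1-style) requests, the AWS CLI has a default region configured, and the instance role allows ec2:DescribeTags. The function name get_instance_tag is made up.

```shell
#!/bin/bash
# Look up one tag on the current instance and print its value.
get_instance_tag() {
  local tag_key="$1" instance_id
  # Ask the instance metadata service which instance we are running on
  instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  # Filter tags by resource and key, and print only the value
  aws ec2 describe-tags \
    --filters "Name=resource-id,Values=${instance_id}" "Name=key,Values=${tag_key}" \
    --query 'Tags[0].Value' \
    --output text
}

# Usage, for example in a user-data script or a profile file:
#   export DB_PORT="$(get_instance_tag DB_PORT)"
```

Using --query with --output text avoids having to parse the JSON response by hand.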
These days, AWS Systems Manager Parameter Store seems to be a better solution.
There is now also AWS Secrets Manager, which manages sensitive configuration such as database credentials.
See this script, which uses SSM Parameter Store and is based on the previous solutions by Guy and PJ Bergeron:
https://github.com/lezavala/ec2-ssm-env
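For a rough idea of how the Parameter Store approach can work (this is a sketch, not the code from the linked repository), the helper below turns the text output of aws ssm get-parameters-by-path into export statements. The function name params_to_exports and the /myapp/prod/ path are hypothetical; values are assumed to be single-line.

```shell
#!/bin/bash
# Convert "name<TAB>value" lines on stdin into export statements.
# The variable name is the last path segment of the parameter name,
# upper-cased, with dashes turned into underscores.
params_to_exports() {
  local name value var
  while IFS=$'\t' read -r name value; do
    [ -n "$name" ] || continue
    var=${name##*/}
    var=${var//-/_}
    var=$(printf '%s' "$var" | tr '[:lower:]' '[:upper:]')
    # %q quotes the value so the output is safe to eval
    printf 'export %s=%q\n' "$var" "$value"
  done
}

# Usage (path is a placeholder):
#   eval "$(aws ssm get-parameters-by-path --path /myapp/prod/ --with-decryption \
#             --query 'Parameters[*].[Name,Value]' --output text | params_to_exports)"
```

Unlike tags, parameters can be stored encrypted, which is why --with-decryption appears in the usage line.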
I used a combination of the following tools:
Install jq library (sudo apt-get install -y jq)
Install the EC2 Instance Metadata Query Tool
Here's the gist of the code below in case I update it in the future: https://gist.github.com/marcellodesales/a890b8ca240403187269
######
# Author: Marcello de Sales (marcello.desales#gmail.com)
# Description: Create Environment Variables in EC2 Hosts from EC2 Host Tags
#
### Requirements:
# * Install jq library (sudo apt-get install -y jq)
# * Install the EC2 Instance Metadata Query Tool (http://aws.amazon.com/code/1825)
#
### Installation:
# * Add the Policy EC2:DescribeTags to a User
# * aws configure
# * Source it from the ~/.profile of the user that has permissions
####
# Reboot and verify the result of $(env).
# Loads the Tags from the current instance
getInstanceTags () {
  # http://aws.amazon.com/code/1825 EC2 Instance Metadata Query Tool
  INSTANCE_ID=$(./ec2-metadata | grep instance-id | awk '{print $2}')
  # Describe the tags of this instance
  aws ec2 describe-tags --region sa-east-1 --filters "Name=resource-id,Values=$INSTANCE_ID"
}
# Convert the tags to environment variables.
# Based on https://github.com/berpj/ec2-tags-env/pull/1
tags_to_env () {
  tags=$1
  for key in $(echo $tags | /usr/bin/jq -r ".[][].Key"); do
    value=$(echo $tags | /usr/bin/jq -r ".[][] | select(.Key==\"$key\") | .Value")
    key=$(echo $key | /usr/bin/tr '-' '_' | /usr/bin/tr '[:lower:]' '[:upper:]')
    echo "Exporting $key=$value"
    export $key="$value"
  done
}
# Execute the commands
instanceTags=$(getInstanceTags)
tags_to_env "$instanceTags"
If your EC2 instance runs Linux or macOS:
Go to your home directory and run:
vim .bash_profile
This opens your .bash_profile file. Press 'i' to enter insert mode, then add
export DB_PORT="5432"
After adding this line, save the file: press 'Esc', then ':' followed by 'w' and Enter. This saves the file without exiting.
To exit, press ':' again, type 'quit', and press Enter. To check whether your environment variable is set, open a new login shell (or run source ~/.bash_profile) and run:
python
>>> import os
>>> os.environ.get('DB_PORT')
'5432'
Following the instructions given by Guy, I wrote a small shell script. This script uses AWS CLI and jq. It lets you import your AWS instance and AMI tags as shell environment variables.
I hope it can help a few people.
https://github.com/12moons/ec2-tags-env