AWS Glue 3.0 container not working for Jupyter notebook local development - amazon-web-services

I am working with AWS Glue and trying to test and debug jobs locally. I followed the instructions at https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/ to develop a Glue job locally. That post uses the Glue 1.0 image, and it works as described. However, when I follow the same steps with the Glue 3.0 image, I can't open the Jupyter notebook on :8888 as the post describes, even though every step seems correct.
Here is my command to start a Jupyter notebook in the Glue 3.0 container:
docker run -itd -p 8888:8888 -p 4040:4040 -v ~/.aws:/root/.aws:ro --name glue3_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh
Nothing shows up at http://localhost:8888.
I still have no idea why. I understand there are differences between Glue versions; I just want to develop and test on the latest one. Has anybody run into the same issue?
Thanks.

It seems that the Glue 3.0 image has some issues with SSL. A workaround for working locally is to disable SSL (you also have to change the script paths, as the documentation has not been updated).
$ docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" \
-e AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id) \
-e AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key) \
-e AWS_DEFAULT_REGION=$(aws --profile default configure get region) \
--name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 \
/home/glue_user/jupyter/jupyter_start.sh
After a few seconds you should have a working jupyter notebook instance running on http://127.0.0.1:8888
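One thing that can go wrong silently with the command above: `aws configure get` prints nothing when the profile has no value for the requested setting, so the container would start with empty credentials. A minimal sketch of a guard, assuming a helper named `require_value` (hypothetical, not part of the AWS CLI or the Glue image):

```shell
#!/bin/bash
# Hypothetical helper: print VALUE, or fail with a message if it is empty.
# usage: require_value NAME VALUE
require_value() {
  local name="$1" value="$2"
  if [ -z "$value" ]; then
    echo "missing value for $name" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Example use before `docker run` (commented out because it needs the AWS CLI):
# key_id=$(require_value AWS_ACCESS_KEY_ID \
#   "$(aws --profile default configure get aws_access_key_id)") || exit 1
```

The checked values could then be passed with `-e AWS_ACCESS_KEY_ID="$key_id"` and so on, instead of the inline substitutions.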

Related

EC2 user data is not executed

I am setting up a web app through CodePipeline. My CloudFormation template creates an EC2 instance. In that instance's user data, I have written logic to fetch the code from S3, copy it onto the instance, and start the server. The web app uses the Python Pyramid framework.
CodePipeline is connected to GitHub. It creates a zip file and uploads it to the S3 bucket (that is all in a buildspec.yml file).
When I change the user data script and run CodePipeline, it works fine.
But when I change a file in the web app (my code base) and re-run CodePipeline, that change is not reflected.
This is for an Ubuntu EC2 instance.
#cloud-boothook
#!/bin/bash -xe
echo "hello "
exec > /etc/setup_log.txt 2> /etc/setup_err.txt
sleep 5s
echo "User_Data starts"
rm -rf /home/ubuntu/c
mkdir /home/ubuntu/c
key=`aws s3 ls s3://bucket-name/pipeline-name/MyApp/ --recursive | sort | tail -n 1 | awk '{print $4}'`
aws s3 cp s3://bucket-name/$key /home/ubuntu/c/
cd /home/ubuntu/c
zipname="$(cut -d'/' -f3 <<<"$key")"
echo $zipname
mv /home/ubuntu/c/$zipname /home/ubuntu/c/c.zip
unzip -o /home/ubuntu/c/c.zip -d /home/ubuntu/c/
echo $?
python3 -m venv venv
venv/bin/pip3 install -e .
rm -rf cc.zip
aws configure set default.region us-east-1
venv/bin/pserve development.ini http_port=5000 &
The expected result is that every time I run CodePipeline, the user data script executes.
Give me a suggestion, any other
The User-Data script gets executed exactly once upon instance creation. If you want to periodically synchronize your code changes to the instance you should think about implementing a CronJob in your User-Data script or use a service like AWS CodeDeploy to deploy new versions (this is the preferred approach).
CodePipeline uses a different S3 object for each pipeline execution artifact, so you can't hardcode a reference to it. You could publish the artifact to a fixed location. You might want to consider using CodeDeploy to deploy the latest version of your application.
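If you go the cron route rather than CodeDeploy, the "find the newest artifact" step from the question's script can be pulled out into a small function and reused by a script that cron runs. This is only a sketch: the function name `latest_key` is made up for illustration, and it assumes the usual `aws s3 ls --recursive` line layout (date, time, size, key) and keys without spaces.

```shell
#!/bin/bash
# Read `aws s3 ls --recursive` output on stdin and print the key of the
# most recently modified object. Each input line is expected to look like:
#   2019-05-01 10:00:00       1234 prefix/name.zip
# so a plain sort orders the lines by timestamp.
latest_key() {
  sort | tail -n 1 | awk '{print $4}'
}

# Example use (commented out because it needs the AWS CLI and a real bucket;
# bucket and prefix are the placeholders from the question):
# key=$(aws s3 ls s3://bucket-name/pipeline-name/MyApp/ --recursive | latest_key)
# aws s3 cp "s3://bucket-name/$key" /home/ubuntu/c/c.zip
```

A cron entry calling a script built around this would pick up each new pipeline artifact without relaunching the instance.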

Running Splash server and Scrapy spiders on the same Ec2 Instance

I'm deploying a web scraping application composed of Scrapy spiders that scrape content from websites as well as screenshot webpages with the Splash javascript rendering service. I want to deploy the whole application to a single Ec2 instance. But for the application to work I must run a splash server from a docker image at the same time I'm running my spiders. How can I run multiple processes on an Ec2 instance? Any advice on best practices would be most appreciated.
Total noob question. I found the best way to run a Splash server and Scrapy spiders on an Ec2 instance after configuration is via a bash script scheduled to run with a cronjob. Here is the bash script I came up with:
#!/bin/bash
# Change to proper directory to run Scrapy spiders.
cd /home/ec2-user/project_spider/project_spider
# Activate my virtual environment.
source /home/ec2-user/venv/python36/bin/activate # activate my virtual environment
# Create a shell variable to store date at runtime
LOGDATE=$(date +%Y%m%dT%H%M%S);
# Spin up splash instance from docker image.
sudo docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600
# Scrape first site and store dated log file in logs directory.
scrapy crawl anhui --logfile /home/ec2-user/project_spider/project_spider/logs/anhui_spider/anhui_spider_$LOGDATE.log
...
# Spin down splash instance via docker image.
sudo docker rm $(sudo docker stop $(sudo docker ps -a -q --filter ancestor=scrapinghub/splash --format="{{.ID}}"))
# Exit virtual environment.
deactivate
# Send an email to confirm cronjob was successful.
# Note that sending email from Ec2 is difficult and you can not use 'MAILTO'
# in your cronjob without setting up something like postfix or sendmail.
# Using Mailgun is an easy way around that.
curl -s --user 'api:<YOURAPIHERE>' \
https://api.mailgun.net/v3/<YOURDOMAINHERE>/messages \
-F from='<YOURDOMAINADDRESS>' \
-F to=<RECIPIENT> \
-F subject='Cronjob Run Successfully' \
-F text='Cronjob completed.'
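One gap in a script like the one above is that `docker run -d` returns as soon as the container starts, which may be before Splash is accepting requests. A minimal sketch of a generic retry helper that could sit between starting Splash and the first crawl; the name `retry` and its argument order are my own choice, not something from Scrapy or Splash:

```shell
#!/bin/bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs are used up.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # No need to wait after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry 30 1 curl -sf http://localhost:8050/_ping` would wait up to about 30 seconds, assuming the Splash version in use exposes the `_ping` endpoint on port 8050.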

How to migrate elasticsearch data to AWS elasticsearch domain?

I have Elasticsearch 5.5 running on a server with some data indexed in it. I want to migrate this data to an AWS Elasticsearch cluster. How can I perform this migration? I learned that one way is to create a snapshot of the ES cluster, but I am not able to find any proper documentation for this.
The best way to migrate is by using snapshots. You will need to snapshot your data to Amazon S3 and then perform a restore from there. Documentation for snapshots to S3 can be found here. Alternatively, you can also re-index your data, though this is a longer process and there are limitations depending on the version of AWS ES.
I also recommend looking at Elastic Cloud, the official hosted offering on AWS that includes the additional X-Pack monitoring, management, and security features. The migration guide for moving to Elastic Cloud also goes over snapshots and re-indexing.
I momentarily created a shell script for this -
Github - https://github.com/vivekyad4v/aws-elasticsearch-domain-migration/blob/master/migrate.sh
#!/bin/bash
#### Make sure you have Docker engine installed on the host ####
###### TODO - Support parameters ######
export AWS_ACCESS_KEY_ID=xxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxx
export AWS_DEFAULT_REGION=ap-south-1
export AWS_DEFAULT_OUTPUT=json
export S3_BUCKET_NAME=my-es-migration-bucket
export DATE=$(date +%d-%b-%H_%M)
old_instance="https://vpc-my-es-ykp2tlrxonk23dblqkseidmllu.ap-southeast-1.es.amazonaws.com"
new_instance="https://vpc-my-es-mg5td7bqwp4zuiddwgx2n474sm.ap-south-1.es.amazonaws.com"
delete=(.kibana)
es_indexes=$(curl -s "${old_instance}/_cat/indices" | awk '{ print $3 }')
es_indexes=${es_indexes//$delete/}
es_indexes=$(echo $es_indexes|tr -d '\n')
echo "index to be copied are - $es_indexes"
for index in $es_indexes; do
# Export ES data to S3 (using s3urls)
docker run --rm -ti taskrabbit/elasticsearch-dump \
--s3AccessKeyId "${AWS_ACCESS_KEY_ID}" \
--s3SecretAccessKey "${AWS_SECRET_ACCESS_KEY}" \
--input="${old_instance}/${index}" \
--output "s3://${S3_BUCKET_NAME}/${index}-${DATE}.json"
# Import data from S3 into ES (using s3urls)
docker run --rm -ti taskrabbit/elasticsearch-dump \
--s3AccessKeyId "${AWS_ACCESS_KEY_ID}" \
--s3SecretAccessKey "${AWS_SECRET_ACCESS_KEY}" \
--input "s3://${S3_BUCKET_NAME}/${index}-${DATE}.json" \
--output="${new_instance}/${index}"
new_indexes=$(curl -s "${new_instance}/_cat/indices" | awk '{ print $3 }')
echo $new_indexes
curl -s "${new_instance}/_cat/indices"
done
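A note on the index filtering in the script above: `${es_indexes//$delete/}` removes the text `.kibana` wherever it occurs, so an index named `.kibana_1` would turn into `_1` instead of being skipped. A sketch of an alternative that drops whole names only, assuming the default `_cat/indices` column order (index name in the third column); the function name `filter_indexes` is hypothetical:

```shell
#!/bin/bash
# Read `_cat/indices` output on stdin and print the index names (third
# column), skipping any name that exactly matches one of the arguments.
# usage: curl -s "$old_instance/_cat/indices" | filter_indexes .kibana
filter_indexes() {
  awk -v skip="$*" '
    BEGIN {
      # Build a lookup table of names to skip from the space-joined arguments.
      n = split(skip, names, " ")
      for (i = 1; i <= n; i++) drop[names[i]] = 1
    }
    !($3 in drop) { print $3 }
  '
}
```

The loop could then read `for index in $(curl -s "${old_instance}/_cat/indices" | filter_indexes .kibana); do`.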

How to customize the docker run command on Elastic Beanstalk?

Here's the thing, I need to tell Docker to not containerize the container’s networking, because it needs to connect to a MongoDB that is inside a VPN (enterprise private DB).
There is a Docker option that lets me do exactly that: --net=host. Reference here.
So, for example, when running the container on my local machine, I will do something like:
docker run --rm -it --net=host [image-name]:[version] bash -il
And that command will do the trick. Thanks to that, I can connect to the "private" MongoDB.
So, my question is: Is there a way to customize the docker run command of a Single Docker Environment on Elastic Beanstalk so I can add --net=host?
I have tried using container_commands in the config.yml file to add that instruction there, but I don't think that does what I need. Here is a snippet:
container_commands:
  00-test_command:
    command: bundle exec thin --net=host
  01-networking-fix:
    command: "docker run --rm -it --net=host [image-name]:[version] bash -il"
I ended up fixing it with two container commands
container_commands:
  00_fix_networking:
    command: sed -i 's/docker run -d/docker run --net=host -d/' /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh
  01_fix_docker_ip:
    command: sed -i 's/server $EB_CONFIG_NGINX_UPSTREAM_IP/server localhost/' /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh
Update:
I also had to fix the Upstart script. Unfortunately, I didn't write down what I did because I didn't end up needing to alter the docker run command. You would do a files directive for (I think) /etc/init/docker. AWS edits the Nginx configuration in the same manner as in 01flip.sh in that file as well.
Explanation:
In the 64bit Amazon Linux 2015.03 v2.0.2 running Docker 1.7.1 platform version, the file you need to edit is /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh. This file is now far more complex than Samar's version so I didn't want to put the actual contents in there. However, the change is basically the same. There's the line that starts with
docker run -d
I fixed it with a container command:
container_commands:
  00_fix_networking:
    command: sed -i 's/docker run -d/docker run --net=host -d/' /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh
This successfully adds the --net=host argument but now there's another problem. The system ends up with an invalid Nginx directive. Using --net=host means that when you run docker inspect <container id> there is no IP address in the NetworkSettings. AWS uses this to create the server directive for Nginx and ends up generating server :<some port you chose> (before adding --net=host it would look like server <ip>:<port>). I needed to patch that file, too. It's generated in /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh.
  01_fix_docker_ip:
    command: sed -i 's/server $EB_CONFIG_NGINX_UPSTREAM_IP/server localhost/' /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh
While Elastic Beanstalk is generally well suited for applications that work with a standard set of configurations, it's difficult to customize and to keep things current with the updates AWS provides to EB stacks. Having said that, I've done something like the below, which is a bit hacky but works fine.
files:
  "/opt/elasticbeanstalk/hooks/appdeploy/pre/04run.sh":
    mode: "000755"
    owner: root
    group: root
    encoding: plain
    content: |
      #script content of original 04run.sh along with modification on docker run cmd
      # eg. I injected multi-ports here
      docker run -d \
        "${EB_CONFIG_DOCKER_ENV_ARGS[@]}" \
        "${EB_CONFIG_DOCKER_VOLUME_MOUNTS[@]}" \
        "${EB_CONFIG_DOCKER_ENTRYPOINT_ARGS[@]}" \
        "${PORT_ARGS[@]}" \
        $EB_CONFIG_DOCKER_IMAGE_STAGING \
        "${EB_CONFIG_DOCKER_COMMAND_ARGS[@]}" 2>&1 | tee /tmp/docker_run.log | tee $EB_CONFIG_DOCKER_STAGING_APP_FILE
This is not very neat, at least I have to make sure that it does not break with updates on elastic beanstalk. The above one is for docker 1.5 stack but you can do something similar with the version you're running.
Note that the latest version of the AWS stack (with Docker 1.7.1) has a slightly different pre-deploy setup. You'll need to update the file at the location: /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh
commands:
  00001_add_privileged:
    cwd: /tmp
    command: 'sed -i "s/docker run -d/docker run --privileged -d/" /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh'
or, for example, if you want to pass args to your Docker image:
commands:
  00001_modify_docker_run:
    cwd: /tmp
    command: 'sed -i "s/\$EB_CONFIG_DOCKER_IMAGE_STAGING/\$EB_CONFIG_DOCKER_IMAGE_STAGING -gzip -enable-url-source/" /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh'
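Since these sed patches depend on the exact text of an AWS-managed hook script, they do nothing (and report no error) if a platform update changes the `docker run -d` line. A minimal sketch of a wrapper that applies the same substitution and then checks that the flag actually landed; the function name `patch_docker_run` is hypothetical, and it assumes the flag contains no `/` characters:

```shell
#!/bin/bash
# patch_docker_run FILE FLAG
# Rewrites "docker run -d" to "docker run FLAG -d" in FILE, then returns
# non-zero if the patched text is not present (for example because the
# hook script no longer contains the expected line).
patch_docker_run() {
  local file="$1" flag="$2" tmp
  tmp=$(mktemp) || return 1
  if ! sed "s/docker run -d/docker run $flag -d/" "$file" > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  # Overwrite the contents rather than replacing the file, so the
  # hook script keeps its existing mode and owner.
  cat "$tmp" > "$file"
  rm -f "$tmp"
  grep -q -- "docker run $flag -d" "$file"
}
```

Called as `patch_docker_run /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh --net=host` from a command, a non-zero exit would make the deployment fail loudly instead of silently running without the flag.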

cloudformation composer install

So I am using CloudFormation for my AWS setup. I am trying to run Composer, but no matter what command I put in my user data section I always get an error. This is my error:
php /usr/local/bin/composer.phar create-project composer/satis /var/www/satis --stability=dev
[RuntimeException]
The HOME or COMPOSER_HOME environment variable must be set for composer to run correctly
This is my code within the userdata section:
"#composer\n",
"curl -sS https://getcomposer.org/installer | php\n",
"mv composer.phar /usr/local/bin/composer.phar\n",
"#satis\n",
"php /usr/local/bin/composer.phar create-project composer/satis /var/www/satis --stability=dev\n",
Does anyone have any idea why this might not work and what I should be doing?
Composer is looking for the location of the .composer directory. Export the HOME or COMPOSER_HOME env variable, e.g. : HOME=/root php /usr/local/bin/composer.phar create-project composer/satis /var/www/satis --stability=dev, it will work fine then.
I had a similar issue with the Amazon Linux 2 AMI. The log showed "All settings correct for using Composer" followed by "The HOME or COMPOSER_HOME environment variable must be set for composer to run correctly", but Composer was not installed at all. Below is the way to fix it. It might save somebody from wasting 2 or 3 hours!
sudo curl -sS https://getcomposer.org/installer | sudo php
mv composer.phar /usr/bin/composer
chmod +x /usr/bin/composer
export COMPOSER_HOME=/root
Agree with Ntwobike's answer.
When launching AWS EC2 instances I was installing Composer by running an Ansible playbook during the user data script run. (The user data script is called by cloud-init during the instance build process.)
For some reason, at this point in the build the $HOME environment variable is not set, so I needed to add 'export HOME=/root', e.g.
# These need to be set to enable the composer installer to run. It is probably due to an issue
# with the $HOME variable not yet being set at this point in the instance creation.
export HOME=/root
ansible-playbook --extra-vars "target=localhost" playbooks/debian-9/drush.yml
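If the same script might also run in a normal login shell where HOME is already set, the export can be made conditional so it only fills in a missing value. A small sketch; the function name `ensure_home` is made up here, and `/root` as the fallback is taken from the answers above:

```shell
#!/bin/bash
# ensure_home [FALLBACK]
# Exports HOME as FALLBACK (default /root) only when HOME is unset or
# empty, and leaves an existing value untouched.
ensure_home() {
  local fallback="${1:-/root}"
  if [ -z "${HOME:-}" ]; then
    export HOME="$fallback"
  fi
}
```

Calling `ensure_home` near the top of the user data script, before the Composer or ansible-playbook line, would have the same effect as the unconditional export in the cloud-init case.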