Configuring AmazonLinux AMI instances - amazon-web-services

I am trying to setup an AMI such that, when booted it will auto configure itself with a defined "configuration" somewhere on a server. I came across Chef and Puppet. Considering Puppet, I was able to run though their examples but couldn't see one for auto configuration from master. I found out that Puppet Enterprise is not supported on "Amazon Linux". Team chose Amazon Linux and would like keep that instead of going to other OS just because one tool doesn't support it. Can someone please give me some idea about how I could achieve this? (I am trying to stay away from home grown shell scripts over a good industry adopted tool for maintainability)

What I have done in the past is to copy /etc/rc.local to /etc/rc.local.orig, and then configure /etc/rc.local to kick off a puppet run and then pave over itself.
/etc/rc.local:
#!/bin/bash
##
#add pre-puppeting stuff here, I add the hostname in "User-data" when creating the VM so I can set the hostname before checking in
##
/usr/bin/puppet agent --test
/bin/cp -f /etc/rc.local.orig /etc/rc.local
/sbin/init 6

AWS CloudFormation is one of Amazon's recommended ways to provision servers (and other cloud resources, too). You declare all the resources you need in a JSON file, and specify how to provision each server by declaring packages to install, services to run, files to create, and commands to run when the server is created. See the user guide for more information. I also wrote a couple of blog posts about getting started with it.

Related

AWS Batch Failing to launch Dockerfile - standard_init_linux.go:219: exec user process caused: exec format error

I am attempting to use AWS Batch to launch a linux server, which will in essence perform the fetch and go example included within AWS (to download a SH from S3 and run it).
Does AWS Batch work at all for anyone?
The aws fetch_and_go example always fails, even followed someone elses guide online which mimicked the aws example.
I have tried creating Dockerfile for amazonlinux:latest and ubuntu:20.04 with numerous RUN and CMD.
The scripts always seem to fail with the error:
standard_init_linux.go:219: exec user process caused: exec format error
I thought at first this was relevant to my deployment access rights maybe within the amazonlinux so have played with chmod 777, chmod -x etc on the she file.
The final nail in the coffin, my current script is litterely:
FROM ubuntu:20.04
Launch this using AWS Batch, no command or parameters passed through and it still fails with the same error code. This is almost hinting to me that there is either a setup issue with my AWS Batch (which im using default wizard settings, except changing to an a1.medium server) or that AWS Batch has some major issues.
Has anyone had any success with AWS Batch launching their own Dockerfiles ? Could they share their examples and/or setup parameters?
Thank you in advance.
A1 instances are ARM based first-generation Graviton CPU. It is highly likely the image you are trying to run something that is expecting x86 CPU (Intel or AMD). Any instance class with a "g" in it ("c6g" or "m5g") are Graviton2 which is also ARM based and will not work for the default examples.
You can test whether a specific container will run by launching an A1 instance yourself and running the container (after installing docker). My guess is that you will get the same error. Running on Intel or AMD instances should work.
To leverage Batch with ARM your containerized application will need to work on ARM. If you point me to the exact example, I can give more details on how to adjust to run on A1 or Graviton2 instances.
I had the same issue, and it was because I build the image locally on my M1 Mac.
Try adding --platform linux/amd64 to your docker build command before pushing if this is your case.
In addition to the other comment. You can create multi-arch images yourself which will provide the correct architecture.
https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/

AWS Docker Interpreter with Pycharm

I'm having difficulty to set this up correctly, and burning through AWS server time while I try to make it work. I have segmentation code that is heavily memory intensive that I'd like to temporarily spin up an AWS server with 192GB of ram. I understand that this is possible using docker, but the instructions on pycharm are non-existent with respect to the docker instructions necessary to tie it together (it references existing code as opposed to showing how to assemble it from scratch). What would the docker run command on the server look like to enable a connection to the 2375 port?
EDIT: I am using Pycharm Professional
UPD: Checking PyCharm options I found that there is an option to use Docker Machines. This seem to be exactly what you need. With Docker Machines you can make Docker spin up an EC2 instance for you with proper security out-of-the-box. Read official documentation on how to get started here and AWS driver options to learn how to set EC2 instance type, AMI, and other options here .
Original post:
To enable this feature you have to run Docker daemon with '-H' option:
sudo dockerd -H tcp://0.0.0.0:2375
You may read more on that in the Docker docs: https://docs.docker.com/engine/reference/commandline/dockerd/ .
Beware though, for EC2 you may also need to open that port using security group https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html .
I also want to add that what you want to achieve isn't good from security perspective. Exposing docker socket like that is like an invitation for bad guys to throw a party at your EC2 instance. But since you mentioned that this is temporary...

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR, and have the dependency jar we build automatically loaded onto it without them having to manually type in the maven information manually every time (private repo location/credentials, package info, etc).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load from a common default location (like S3, or something)
Is there a way to tell Zeppelin on an EMR cluster to load it's interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses aws cmd line options to start a EMR cluster with all the necessary settings pre-made for you. (Could also upload the .jar dependency if we can't get maven to work)
Use a infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go imo - you would aws emr create ... and that create would include a shell script step to replace /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3, and then hard restart zeppelin (sudo stop zeppelin; sudo start zeppelin).

How to run pdftk on elastic beanstalk

I am trying to run pdftk on an Elastic Beanstalk. The first problem I run into is that I cannot install pdftk on an instance of a Amazon Linux AMI because one of the dependencies (gcj) is not supported.
One of the options I am looking at is creating my own AMI and using that for my Elastic Beanstalk. Amazon recommends not doing this, and there are no community images for EB and Ubuntu.
Another option is using Docker. I am not as familiar with Docker, but I think I would be able to install pdftk in a container and then deploy that to EB. I am using Codeship for deployments and it looks like they have some options for Docker. (This is the options I'm currently exploring)
The last option I can think of is writing a library for encrypting pdfs on my own. I had a look at the encryption specifications for pdfs and I think this is not a time efficient option.
Has any one had a similar problem and found a good solution to the problem?
UPDATE:
After some more research I discovered that the issue was not with Amazon Linux bug with Fedora. Fedora dropped gcj because there was a lack of maintainers on the project, then dropped pdftk because it depends on gcj.
If you need another pdf tool kit I have found podofo to be a good replacement for what I've needed.
First I apologise for resurrecting an old thread! Recently we wanted to create an Elastic Beanstalk worker environment that uses pdftk. Of course we also stumbled on the same issue, so this is what we did and it works for us so far. I hope it'll work for others too.
In the .ebextensions folder add the linked configs:
The needed LaTeX packages:
packages.config
You'll also need to add the el5 library in order to install libgcj.
01_el5_yum.config
Next add this config with the commands to install libgcj, pdftk and pdfjam
02_pdftk.config
And that should be it.
In case anyone comes here having problems with pdftk - poppler-utils also cover some tasks done by pdftk (in my case it was pdf splitting) and can be easily set up on an EB instance through .ebextensions:
packages:
yum:
poppler-utils: []

WebHCat on Amazon's EMR?

Is it possible or advisable to run WebHCat on an Amazon Elastic MapReduce cluster?
I'm new to this technology and I was wonder if it was possible to use WebHCat as a REST interface to run Hive queries. The cluster in question is running Hive.
I wasn't able to get it working but WebHCat is actually installed by default on Amazon's EMR instance.
To get it running you have to do the following,
chmod u+x /home/hadoop/hive/hcatalog/bin/hcat
chmod u+x /home/hadoop/hive/hcatalog/sbin/webhcat_server.sh
export TEMPLETON_HOME=/home/hadoop/.versions/hive-0.11.0/hcatalog/
export HCAT_PREFIX=/home/hadoop/.versions/hive-0.11.0/hcatalog/
/home/hadoop/hive/hcatalog/webhcat_server.sh start
You can then confirm that it's running on port 50111 using curl,
curl -i http://localhost:50111/templeton/v1/status
To hit 50111 on other machines you have to open the port up in the EC2 EMR security group.
You then have to configure the users you going to "proxy" when you run queries in hcatalog. I didn't actually save this configuration, but it is outlined in the WebHCat documentation. I wish they had some concrete examples there but basically I ended up configuring the local 'hadoop' user as the one that run the queries, not the most secure thing to do I am sure, but I was just trying to get it up and running.
Attempting a query then gave me this error,
{"error":"Server IPC version 9 cannot communicate with client version
4"}
The workaround was to switch off of the latest EMR image (3.0.4 with Hadoop 2.2.0) and switch to a Hadoop 1.0 image (2.4.2 with Hadoop 1.0.3).
I then hit another issues where it couldn't find the Hive jar properly, after struggling with the configuration more, I decided I had dumped enough time into trying to get this to work and decided to communicate with Hive directly (using RBHive for Ruby and JDBC for the JVM).
To answer my own question, it is possible to run WebHCat on EMR, but it's not documented at all (Googling lead me nowhere which is why I created this question in the first place, it's currently the first hit when you search "WebHCat EMR") and the WebHCat documentation leaves a lot to be desired. Getting it to work seems like a pain, though my hope is that by writing up the initial steps someone will come along and take it the rest of the way and post a complete answer.
I did not test it but, it should be doable.
EMR allows to customise the bootstrap actions, i.e. the scripts run where the nodes are started. You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster
See more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html.
I would create a shell script to install WebHCat and test your script on a regular EC2 instance first (outside the context of EMR - just as a test to ensure your script is OK)
You can use EC2's user-data properties to test your script, typically :
#!/bin/bash
curl http://path_to_your_install_script.sh | sh
Then - once you know the script is working - make it available to the cluster on a S3 bucket and follow these instructions to include your script as custom bootstrap action of your cluster.
--Seb