Can you load standard zeppelin interpreter settings from S3? - amazon-web-services

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR and have the dependency jar we build automatically loaded onto it, without them having to type in the Maven information every time (private repo location/credentials, package info, etc).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load from a common default location (like S3, or something).
Is there a way to tell Zeppelin on an EMR cluster to load its interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses AWS command line options to start an EMR cluster with all the necessary settings pre-made for you. (Could also upload the .jar dependency if we can't get Maven to work.)
Use an infrastructure-as-code framework to start up the clusters with the required settings.

I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go imo: you would run aws emr create-cluster ... with a shell script step that replaces /etc/zeppelin/conf.dist/interpreter.json with the interpreter.json of interest downloaded from S3, and then hard restarts Zeppelin (sudo stop zeppelin; sudo start zeppelin).
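As a rough illustration of what that step would do, here is a minimal Python sketch that builds the three commands in order. The file path and the stop/start commands come from the answer above; the S3 URI is a hypothetical placeholder, and using sudo for the copy is an assumption (writing under /etc normally needs root).

```python
import subprocess

# Path taken from the answer above.
INTERPRETER_JSON = "/etc/zeppelin/conf.dist/interpreter.json"


def build_step_commands(s3_uri):
    """Return the commands the EMR step would run, in order."""
    return [
        # Overwrite the default interpreter settings with the shared copy.
        # (sudo is an assumption: the target directory is usually root-owned.)
        ["sudo", "aws", "s3", "cp", s3_uri, INTERPRETER_JSON],
        # Hard restart so Zeppelin re-reads the settings file.
        ["sudo", "stop", "zeppelin"],
        ["sudo", "start", "zeppelin"],
    ]


def run_step(s3_uri, runner=subprocess.check_call):
    """Run each command in order; check_call raises on the first failure."""
    for command in build_step_commands(s3_uri):
        runner(command)
```

On the master node you would call something like run_step("s3://example-bucket/zeppelin/interpreter.json"), where the bucket and key are made up for this example. Passing a different runner (for instance a list's append method) lets you inspect the commands without executing them.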

Related

AWS CodeDeploy agent is deleting files in the wrong folder during install

We have an unusual setup. We use git on Azure Devops for our code repositories, and AWS for our cloud-based services. In our arsenal we have a mixture of AWS Lambda functions, along with console apps, web apps, and Windows services running on EC2 instances. We have been able to create CI/CD pipelines for all three classes of apps. For the apps running on EC2 instances we use AWS CodeDeploy. These deployments are more complicated, but they all work -- except for one.
Another unusual thing about our setup is that both our development and QA environments are on the same EC2 instance. When the CodeDeploy agent running on that instance retrieves the deployment archive, it unpacks it, reads the appspec.yml file, runs our before install script, which backs up the existing installation and shuts down any services that might be using those files. Then, the install phase updates the files in the designated environment, then deletes -- or tries to delete -- all the files in the other environment folder.
In other words, if a DEV deployment is running, it replaces the files in the DEV folder and also tries to delete the files in the QA folder. I know this sounds like a scripting problem, but I have checked all the script and YAML files, and nowhere do I reference the opposing environment.
In this case, the app is a Windows service. Normally, I get a Ruby 'Permission denied # unlink_internal' error on a file in the other folder. As an experiment, I shut down the service in the other environment in my before install script and, as I expected, the agent deleted all the files in the other environment. It updated the files in the target environment, but left the folder in the other environment empty!
Here are my files. I suspect the problem is being caused by something I did, but I can't, for the life of me, find it.
These are all .NET projects. In my solution I have a ConfigFiles folder set up with subfolders for each environment. Then, in my pipeline yaml file I run a script to select the correct files to move into the archive based on the git branch that is being built.
Here's the code for the script that selects the correct files.
Here's the Azure pipeline YAML file.
Here's my before install script:
And, finally, here is my appspec.yml file, which the CodeDeploy agent uses to know where to update the files during installation. How I wish this had the wrong path, but in the deployment archive, the environment-specific values are all exactly right.
Any ideas on this one would be greatly appreciated.
I encountered the same problem where deployment of an app deletes files from another app in another folder unexpectedly. My solution is to use different deployment groups for each app, even though they are deploying to the same EC2 instance.
Deploying many apps on the same EC2 instance using the same deployment group results in files/folder deletion on other deployed projects.
From AWS Technical Support:
The reason is that CodeDeploy creates a cleanup file named '[deployment group 1 ID]_cleanup' in the directory '/opt/codedeploy-agent/deployment-root/deployment-instructions' every time a deployment is made to the deployment group, and this file is used to delete all the files that were installed during the previous deployment made to that deployment group. Since the deployment group is the same in your case, when you make a deployment to the deployment group that installs files to the folder '/var/www/project1', files installed by the previous deployment in the folder '/var/www/project2' are cleaned up, and vice versa. This is the expected behavior of the CodeDeploy agent.
You can find the explanation here: https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent.html#codedeploy-agent-install-files
Please consider creating two different applications/deployment groups and configuring the two pipelines to use different applications/deployment groups, which should fix your problem.
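To make the mechanism easier to picture, here is a toy Python model of the cleanup behaviour described in the support reply. It is only a simplified simulation, not the agent's actual code: the class, its fields, and the file names are all made up for illustration, and the only rule it encodes is "a deployment first removes whatever the previous deployment to the same deployment group installed".

```python
from collections import defaultdict


class ToyAgent:
    """Toy model of per-deployment-group cleanup (not real agent code)."""

    def __init__(self):
        self.disk = set()                # files currently on the instance
        self.cleanup = defaultdict(set)  # group ID -> files from its last deployment

    def deploy(self, group_id, files):
        # Remove everything the previous deployment to this group installed...
        self.disk -= self.cleanup[group_id]
        # ...then install the new revision and record it for the next cleanup.
        self.disk |= set(files)
        self.cleanup[group_id] = set(files)


# Two projects sharing one deployment group: the second deployment
# cleans up the first project's files.
shared = ToyAgent()
shared.deploy("group-1", ["/var/www/project1/index.html"])
shared.deploy("group-1", ["/var/www/project2/index.html"])
print(sorted(shared.disk))  # ['/var/www/project2/index.html']

# One deployment group per project: both sets of files survive.
separate = ToyAgent()
separate.deploy("group-1", ["/var/www/project1/index.html"])
separate.deploy("group-2", ["/var/www/project2/index.html"])
print(sorted(separate.disk))
```

The second run keeps both files, which matches the advice to give each app its own deployment group even when they share an EC2 instance.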

How do I access beanstalk application venv?

This last week I have been trying to upload a Flask app using AWS Beanstalk.
The main problem for me was loading a very heavy library as part of the bundle (there is a 500 MB limit for uploading the bundle code).
Instead, I tried to use a requirements.txt file so that it would download the library directly to the server.
Unfortunately, every time I tried to include the library name (torch) in the requirements file, it failed to load it.
On the PythonAnywhere server there is a console which allows you to access the virtual environment and simply type
pip install torch
which was very useful and comfortable.
I am looking for something similar in AWS Beanstalk, so that I could install the library directly instead of relying on the requirements.txt file.
I have been at it for a few days now and can't make any progress.
Your help would be much appreciated.
Another question:
is it possible to load the venv to Amazon S3 and then access the folder from the Beanstalk environment?
It's not good practice to "manually" install your dependencies or configure your EB environment from inside. This is only useful for testing and debugging purposes, so keep that in mind.
To get to your venv, you have to ssh to your EB instance, using regular ssh or the web-based clients available in the AWS EC2 console once you locate your EB EC2 instance. Session Manager should work out of the box to let you log in to the instance.
Once you are logged in to the instance, activate your venv with:
# start bash
bash
# source venv
source /var/app/venv/staging-*/bin/activate
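If you would rather not activate the environment interactively, a small helper can locate the same directory and call the venv's own pip directly. This is only a sketch: the /var/app/venv/staging-* layout is taken from the answer above, while the function names are hypothetical, and the same caveat applies (use it for testing and debugging only).

```python
from pathlib import Path


def find_venv_bin(root="/var/app/venv"):
    """Return the bin directory of the first staging-* venv under root, or None."""
    matches = sorted(Path(root).glob("staging-*/bin"))
    return matches[0] if matches else None


def pip_install_command(package, root="/var/app/venv"):
    """Build the command that installs a package with the venv's own pip."""
    bin_dir = find_venv_bin(root)
    if bin_dir is None:
        raise FileNotFoundError(f"no staging-* virtualenv found under {root}")
    return [str(bin_dir / "pip"), "install", package]
```

On the instance, pip_install_command("torch") would give you a command list that you can hand to subprocess.check_call.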

Is there a way to push changes to AWS Beanstalk instead of uploading an entire zip file on each deploy?

I'm migrating a Play! application from Heroku to AWS Beanstalk.
Heroku is really straightforward when it comes to deploying: just push changes to a remote git repository on Heroku and the build occurs on the server side.
This is very convenient because it is not necessary to upload the whole project (including all libraries!) for each tiny change.
On Beanstalk, for each change we are basically generating a huge 140 MB zipped Docker file that takes at least 10 minutes to upload.
Surely there must be a better way, but a long search on Google only returned options to automate the file generation with scripts, and alternatives like Jenkins. That does not solve the problem, it just automates it.
Does anyone have a better solution?
You can set up an AWS CodeCommit repository and use that as a remote for your local git repository. Next, you can set up AWS CodePipeline to build your application and deploy it to Elastic Beanstalk whenever there is a new commit to the AWS CodeCommit repository.
This way you don't have to upload everything every time. Whenever you do git push, only the changed files are uploaded to the AWS CodeCommit repository, and then AWS CodePipeline takes care of building your application and deploying it to Elastic Beanstalk.
So I got curious about this question too and had a conversation with an AWS specialist about different options here. Each option has its downsides, though.
The first option is to bake your application code into an AMI and carry out the deployment using the baked AMI.
You have to test this approach before adopting it. The downside is that you would have to maintain the AMI regularly. You might also miss out on critical patches from Beanstalk, since the AMI has been locked down.
A good read on this topic
The next approach would be to move out of Beanstalk and use CloudFormation, where you can just upload your application folder to S3. Your CloudFormation template has to take care of spinning up all the resources required, and using AWS::CloudFormation::Init together with the cfn helper scripts (such as cfn-init and cfn-signal) it is possible to install and set up software. Changes within the resource Metadata can be detected with the cfn-hup helper, which can also run user-specified actions when a change is detected.
(AWS::CloudFormation::Init)
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-helper-scripts-reference.html (set of helper scripts that can be used with CloudFormation)
Although these are not exactly a solution to what you asked for, they can be a good alternative. At least I made sure that you are not missing out on any options available in Beanstalk.
One more piece of advice I got from them was to consider splitting the application up into multiple components and sub-components. This would reduce your application size considerably.
Hope this helped.
Short answer: No.
Long Answer: I ended up packaging the app with activator and not using Docker.
Create a folder named "dist" in the root of the project.
Include a file named Procfile with the following line:
web: ./bin/YOUR_APP_NAME -Dhttp.port=5000 -Dconfig.file=conf/application.conf
Make sure to replace YOUR_APP_NAME with the name of your app as configured in build.sbt.
Package the Play app with the following command:
activator clean dist
That will generate a zip file inside the target/universal/ folder in the project.
Deploy that zip file to AWS Elastic Beanstalk.
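If you package several apps this way, the Procfile line from the steps above can be generated rather than edited by hand. A minimal sketch follows: the function name and the example app name are hypothetical, while the line format, port 5000, and conf/application.conf are the ones given above.

```python
def procfile_line(app_name, port=5000, config_file="conf/application.conf"):
    """Build the Procfile entry shown in the steps above for a given app name."""
    if not app_name or any(ch.isspace() for ch in app_name):
        raise ValueError("app_name must be a non-empty name without whitespace")
    return f"web: ./bin/{app_name} -Dhttp.port={port} -Dconfig.file={config_file}"


# "my-play-app" is a made-up name; use the one configured in build.sbt.
print(procfile_line("my-play-app"))
# web: ./bin/my-play-app -Dhttp.port=5000 -Dconfig.file=conf/application.conf
```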

Spark standalone mode on AWS EMR

I'm able to run Spark on AWS EMR without much trouble following the documentation but from what I see it always uses YARN instead of the standalone manager. Is there any way to use the standalone mode instead of YARN easily? I don't really feel like hacking the bootstrap scripts to turn off yarn and deploy spark master/workers myself.
I'm running into a weird YARN related bug and I was hoping it won't happen with standalone manager.
As far as I know, there is no way to run in standalone mode on EMR unless you go back to the old ami-versions instead of using the emr-release-label. The old ami-versions will, however, cause other problems with newer versions of Spark, so I wouldn't go that way.
What you can do is to launch ordinary EC2-instances with Spark instead of using EMR. If you have a local Spark installation, go to the ec2 folder and use spark-ec2 to launch the cluster, like this:
./spark-ec2 --copy-aws-credentials --key-pair=MY_KEY --identity-file=MY_PEM_FILE.pem --region=MY_PREFERED_REGION --instance-type=INSTANCE_TYPE --slaves=NUMBER_OF_SLAVES --hadoop-major-version=2 --ganglia launch NAME_OF_JOB
I suspect that you have jar-files that are needed, so they have to be copied onto the cluster (copy to master first, ssh to master and copy them onto the slaves from there. ./spark-ec2/copy-dir on master will copy a directory onto all slaves). Then restart Spark:
./spark/sbin/stop-master.sh
./spark/sbin/stop-slaves.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh
and you are ready to launch Spark in standalone mode:
./spark/bin/spark-submit --deploy-mode client ...

WebHCat on Amazon's EMR?

Is it possible or advisable to run WebHCat on an Amazon Elastic MapReduce cluster?
I'm new to this technology and I was wondering if it is possible to use WebHCat as a REST interface to run Hive queries. The cluster in question is running Hive.
I wasn't able to get it working but WebHCat is actually installed by default on Amazon's EMR instance.
To get it running, you have to do the following:
chmod u+x /home/hadoop/hive/hcatalog/bin/hcat
chmod u+x /home/hadoop/hive/hcatalog/sbin/webhcat_server.sh
export TEMPLETON_HOME=/home/hadoop/.versions/hive-0.11.0/hcatalog/
export HCAT_PREFIX=/home/hadoop/.versions/hive-0.11.0/hcatalog/
/home/hadoop/hive/hcatalog/sbin/webhcat_server.sh start
You can then confirm that it's running on port 50111 using curl:
curl -i http://localhost:50111/templeton/v1/status
To hit 50111 on other machines you have to open the port up in the EC2 EMR security group.
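The same status check can be scripted, which is handy once the port is open and you want to probe the master from another machine. A minimal sketch using only the Python standard library: the endpoint path and port come from the curl command above, and the function names are hypothetical.

```python
import json
import urllib.request


def status_url(host="localhost", port=50111):
    """Build the WebHCat status URL used by the curl check above."""
    return f"http://{host}:{port}/templeton/v1/status"


def webhcat_is_up(host="localhost", port=50111, timeout=5):
    """Return True if the status endpoint answers with HTTP 200 and valid JSON."""
    try:
        with urllib.request.urlopen(status_url(host, port), timeout=timeout) as response:
            json.load(response)  # raises ValueError if the body is not JSON
            return response.status == 200
    except (OSError, ValueError):
        # URLError/HTTPError (connection refused, timeouts, 4xx/5xx) are OSError subclasses.
        return False
```

Call webhcat_is_up() on the master itself, or pass the master's hostname as host when checking from elsewhere.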
You then have to configure the users you are going to "proxy" when you run queries in HCatalog. I didn't actually save this configuration, but it is outlined in the WebHCat documentation. I wish they had some concrete examples there, but basically I ended up configuring the local 'hadoop' user as the one that runs the queries. Not the most secure thing to do, I am sure, but I was just trying to get it up and running.
Attempting a query then gave me this error:
{"error":"Server IPC version 9 cannot communicate with client version 4"}
The workaround was to switch off of the latest EMR image (3.0.4 with Hadoop 2.2.0) and switch to a Hadoop 1.0 image (2.4.2 with Hadoop 1.0.3).
I then hit another issue where it couldn't find the Hive jar properly. After struggling with the configuration some more, I decided I had dumped enough time into trying to get this to work and chose to communicate with Hive directly (using RBHive for Ruby and JDBC for the JVM).
To answer my own question: it is possible to run WebHCat on EMR, but it's not documented at all (Googling led me nowhere, which is why I created this question in the first place; it's currently the first hit when you search "WebHCat EMR") and the WebHCat documentation leaves a lot to be desired. Getting it to work seems like a pain, though my hope is that by writing up the initial steps someone will come along, take it the rest of the way, and post a complete answer.
I did not test it, but it should be doable.
EMR lets you customise the bootstrap actions, i.e. the scripts that run when the nodes are started. You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster.
See more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html.
I would create a shell script to install WebHCat and test it on a regular EC2 instance first (outside the context of EMR, just as a test to ensure your script is OK).
You can use EC2's user-data property to test your script, typically:
#!/bin/bash
curl http://path_to_your_install_script.sh | sh
Then, once you know the script is working, make it available to the cluster in an S3 bucket and include it as a custom bootstrap action of your cluster.
--Seb