AWS Parallel Cluster software installation - amazon-web-services

I am very new to HPC concepts, and I recently needed to use AWS ParallelCluster to run some large-scale parallel computations.
I went through this tutorial and successfully built a cluster with the Slurm scheduler. I can log in to the system over SSH, but that is where I got stuck. I need to install some software and I can't work out how. Should I run sudo apt-get install xxx and expect the package to be installed on every new node that is instantiated when a job is scheduled? On one hand, that sounds like magic; on the other hand, do the head node and the newly launched nodes share the same storage? If so, apt-get install might work, since they would be using the same file system. There seems to be very little material about this on the Internet.
To summarize, my question is: if I want to install packages on the cluster I created on AWS, can I use sudo apt-get install xxx to do it? Do newly instantiated nodes share the same storage as the head node? If so, is this good practice? If not, what is the right way?
Thank you very much!

On a cluster deployed with ParallelCluster, the /home directory of the head node is shared by default as an NFS share across all compute nodes. So if you install your application in your user's home folder (for example, the ec2-user home folder), it will be available to all compute nodes. Packages installed with sudo apt-get, by contrast, go onto the head node's own root volume outside /home, so they are not automatically present on compute nodes. Once your application is installed, you can run it through the scheduler.
Your next question may be that /home is limited in space. That is why it is recommended to have an additional shared storage volume that you attach to the head node during cluster creation; this lets you control the attributes of the shared storage, such as its size and type. For more details, see the Shared storage configuration section of the ParallelCluster documentation:
https://docs.aws.amazon.com/parallelcluster/latest/ug/SharedStorage-v3.html
Using additional shared storage is the recommended way to run production workloads, as it gives you better control over the storage volume attributes. To get started, however, you can simply try running from your home folder first.
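As a minimal sketch of the home-folder approach (the directory name apps/hello and the placeholder script are hypothetical, not part of ParallelCluster), the following installs a small program under the shared /home and, only if Slurm's sbatch is on the PATH, submits a two-node job that runs it:

```shell
#!/bin/bash
# Hypothetical install prefix under the NFS-shared home directory.
PREFIX="${PREFIX:-$HOME/apps/hello}"

# "Install" a placeholder program. A real application would be built or
# unpacked here instead (for example with ./configure --prefix="$PREFIX").
mkdir -p "$PREFIX/bin"
cat > "$PREFIX/bin/hello.sh" <<'EOF'
#!/bin/bash
echo "hello from $(uname -n)"
EOF
chmod +x "$PREFIX/bin/hello.sh"

# Submit a job only where Slurm is installed. Each allocated compute node
# resolves the same path through the shared file system.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --nodes=2 --ntasks-per-node=1 --wrap "srun $PREFIX/bin/hello.sh"
fi
```

If the job output shows two different host names, the compute nodes are reading the program from the shared directory rather than from a local copy.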
Thanks

Related

Node.JS native addons on LINUX [duplicate]

I'm using AWS Lambda, which involves creating an archive of my node.js script, including the node_modules folder and uploading that to their infrastructure to run.
This works fine, except when it comes to node modules with native bindings (using node-gyp). Because the binding was compiled and the project archived on my local computer (OS X), it is not compatible with AWS's (Amazon Linux) servers.
How can I cross-compile/install a node module (specifically, node-sqlite3) so when I upload it to another server arch it runs?
While not really a solution to your problem, a very easy workaround could be to simply compile the native addons on a Linux machine.
For your particular situation, I would use Vagrant. Vagrant can create virtual machines and configure them within seconds.
Find an OS image that resembles Amazon's Linux distro (Fedora, CentOS, others that use yum as package manager - see Wiki)
Use a simple configuration script that, when run by Vagrant on machine startup, runs npm install (optionally, it can first remove the node_modules folder to ensure a clean installation)
For extra comfort, the script can also create the zip file for deployment
Once the installation finishes, the script shuts down the VM to avoid unnecessary consumption of system resources
Deploy!
It might require some tuning if the linked libraries are not at the same place on the target machine but generally this seems to me like the best and quickest solution.
While installing the app using Vagrant might be sufficient in some cases, I have found it necessary to build the app on a Linux system that is as close to Lambda's Amazon Linux AMI as possible.
You can read the original answer here: https://stackoverflow.com/a/34019739/303184
Steps to make it work:
Spawn new EC2 instance. Make sure it is based on exactly the same image as your AWS Lambda runtime. You can review Lambda env details here: http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html. In our case, it was Amazon Linux AMI called amzn-ami-hvm-2015.03.0.x86_64-gp2.
Install nvm and use it to install the same version of Node.js that AWS Lambda runs. At the time of writing, it was v0.10.36. You can refer to http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html again to find out.
You will probably need to install git and the g++ compiler on the EC2 instance. You can do this by running:
sudo yum install git gcc-c++
Finally, clone your app onto your new EC2 instance and install your app's dependencies:
nvm use 0.10.36
npm install --production
You can then easily download the node_modules using scp or such.
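The build steps on the EC2 instance could be collected into one helper along these lines. This is a sketch under assumptions: the function name build_lambda_bundle, the .git exclusion, and the use of the zip utility are illustrative choices that do not come from the answer, and nvm use is expected to have been run in the same shell beforehand.

```shell
#!/bin/bash
# Usage: build_lambda_bundle APP_DIR OUT_ZIP
# OUT_ZIP must be an absolute path, because the archive is created from
# inside APP_DIR.
build_lambda_bundle() {
    local app_dir="$1" out_zip="$2"

    # Refuse to run anywhere that does not look like a Node.js project.
    if [ ! -f "$app_dir/package.json" ]; then
        echo "no package.json in $app_dir" >&2
        return 1
    fi

    # Work in a subshell so the caller's working directory is unchanged.
    (
        cd "$app_dir" &&
        rm -rf node_modules &&          # force a clean native build
        npm install --production &&     # compile addons on this Linux host
        zip -qr "$out_zip" . -x '.git/*'
    )
}
```

For example, build_lambda_bundle "$HOME/my-app" /tmp/my-app.zip (both paths hypothetical) would leave an archive that can be fetched with scp and uploaded to Lambda.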
Along the same lines as Robert's answer: when I had to work in a different OS on my Mac, I used virtualization software such as Oracle's free VirtualBox to run Linux on the Mac at no cost. Alternatively, sign up for a new AWS account; you get a micro instance free for a year. Use that as your Linux box and do whatever you need there.
AWS has a page describing how to deal with native NPM modules: https://aws.amazon.com/blogs/compute/nodejs-packages-in-lambda/

How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?

Is there a way to do it right from a cell in the notebook, similar to pip install ... --upgrade?
I didn't know how to do what's instructed on https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3, and Pandas 0.20.1. I need to upgrade Pandas, and Matplotlib
In Qubole there are two ways to upgrade or install a package for the Python environment. Currently there is no interface available inside the notebook to install new packages.
New and Recommended Way (via Package Management): Users can enable the Package Management functionality for an account and add new packages to a cluster via the UI. Package management has many advantages over cluster versions in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old Way (via bootstrap): Users can configure a bootstrap, which is basically a shell script executed on each node when the cluster starts and/or upscales (when more nodes are added to the cluster). This is configured via the clusters UI and requires a cluster start for every change. This is what is described in the link you shared.
You cannot download or upgrade packages directly from a cell in the notebook. This is because your notebook is associated with a cluster. To ensure that all the nodes of the cluster have the package installed, you must use either package management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
Do let me know if you have any further questions.
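For the bootstrap route, the node bootstrap script could contain something like the sketch below. The function name upgrade_python_packages is hypothetical, and the sketch assumes that the python3 found on each node's PATH is the interpreter the notebook uses, which may not hold on a given Qubole cluster; check the documentation linked above for the correct interpreter or pip path.

```shell
#!/bin/bash
# Upgrade the given Python packages on the node where this script runs.
# Assumption: "python3" on PATH is the interpreter used by the notebook.
upgrade_python_packages() {
    if [ "$#" -eq 0 ]; then
        echo "usage: upgrade_python_packages PACKAGE..." >&2
        return 2
    fi
    python3 -m pip install --upgrade "$@"
}
```

Calling upgrade_python_packages pandas matplotlib at the end of the bootstrap would then apply the upgrade on every node as it joins the cluster.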

Deploy a C++ application to the Google Cloud Platform Kubernetes engine

As far as my understanding goes, the Kubernetes engine is meant for deploying applications that can be load balanced, for example, having an application which unhashes a string. If pod-a is on high load, it would be offloaded to pod-b. Correct me if I am wrong here, since if this is false, my following question will not make sense.
After exploring it for a few hours, I can't seem to figure out how to deploy a C++ application to the Kubernetes cluster. How would I do so?
What I tried:
I tried to follow the guide: Interactive Tutorial - Deploying an App, however, I couldn't understand how I would get my C++ app as an image that could be deployed.
What the C++ application is:
At the moment it proxies TCP traffic to another HOST designated by clients' HOSTNAME. It is pretty much a reverse proxy, however, this is NOT an HTTP application.
Is Kubernetes the right choice?
Kubernetes is really useful for load balancing workloads, providing high availability in case of failure, speeding up test processes, increasing safety during production rollouts through different strategies, and increasing security through segregation.
However, not every kind of workload can take advantage of all the features introduced by Kubernetes.
For example, if your application is built in such a way it needs a stable amount of RAM and CPU, the code as well is really stable and you need merely one replica, then maybe Kubernetes and containers are not the best choice (even if you can perfectly use them), and you should rather implement everything on a big monolithic server/virtual machine.
But if you need to deploy it on a different cloud provider, and it should run merely some hours every day, maybe then it can make use as well of those features. If you are willing to add a layer, make sure that you need the features it introduces, otherwise it would be merely an overhead.
Note that Kubernetes is not capable of splitting your workload on its own. I am therefore not sure what you mean by "If pod-a is on high load, it would be offloaded to pod-b"; it is likely possible, but you have to instruct Kubernetes to do so.
Kubernetes takes care of running your pods, making sure they are scheduled on nodes where enough memory and CPU are available according to your specification. You can also set up autoscaling to handle periods of high workload, or even to scale the cluster itself. Your application should be designed to support a divide-and-conquer pattern; otherwise you will likely end up with three nodes, one pod running on one of them, two nodes idle, and an overhead that you could have avoided.
If your C++ application pod unhashes a string and a single request could consume all the resources of a node, Kubernetes will not "split" the initial workload, and it will not create more pods for you and schedule them across the cluster! Of course you can achieve something similar, but it will not come for free, and you will likely need to modify your C++ code.
You can certainly take advantage of Kubernetes, and running your application on it is pretty easy, but you may have to modify something in the architecture to take full advantage of those features.
Deploy the C++ application
The process to deploy your application in Kubernetes is pretty standard. Develop it locally, create a Docker image with all the libraries and components you need, test it locally, push it to the registry, and create the deployment in Kubernetes.
Let's say that you have all the resources needed to run your application and your executable file in a local folder. Create the Docker file.
The following is only an example to show the syntax (it sets up nginx and php-fpm rather than a C++ program); modify it to fit your application:
# Download base image, Ubuntu 16.04 (Xenial Xerus)
FROM ubuntu:16.04
# Update software repository
RUN apt-get update
# Install nginx, php-fpm and supervisord from the Ubuntu repository
RUN apt-get install -y nginx php7.0-fpm supervisor
# Define the environment variable
ENV nginx_vhost /etc/nginx/sites-available/default
[...]
# Enable php-fpm on the nginx virtualhost configuration
COPY default ${nginx_vhost}
[...]
RUN chown -R www-data:www-data /var/www/html
# Volume configuration
VOLUME ["/etc/nginx/sites-enabled", "/etc/nginx/certs", "/etc/nginx/conf.d", "/var/log/nginx", "/var/www/html"]
# Configure services and port
COPY start.sh /start.sh
CMD ["./start.sh"]
EXPOSE 80 443
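Build it, push it, and deploy it by running:
export PROJECT_ID="$(gcloud config get-value project -q)"
# Build the image from the Dockerfile in the current directory
docker build -t gcr.io/${PROJECT_ID}/hello-app:v1 .
# Push the image to the registry (newer gcloud releases use "gcloud auth configure-docker" followed by a plain "docker push" instead)
gcloud docker -- push gcr.io/${PROJECT_ID}/hello-app:v1
# Run the image on the cluster
kubectl run hello --image=gcr.io/${PROJECT_ID}/hello-app:v1 --port [port number if needed]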
More information is here.
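Because the application in the question is a plain TCP proxy rather than an HTTP service, exposing and scaling it might look like the following sketch. The function name deploy_tcp_proxy, the deployment name tcp-proxy, the default port 9000, and the replica limits are all placeholders. The kubectl create deployment, kubectl expose, and kubectl autoscale subcommands are standard, but check the flags against your kubectl version.

```shell
#!/bin/bash
# Usage: deploy_tcp_proxy IMAGE [PORT]
# Creates a deployment, exposes it through a TCP load balancer, and adds
# CPU-based autoscaling. All names and numbers are illustrative.
deploy_tcp_proxy() {
    local image="$1" port="${2:-9000}"
    if [ -z "$image" ]; then
        echo "usage: deploy_tcp_proxy IMAGE [PORT]" >&2
        return 2
    fi
    kubectl create deployment tcp-proxy --image="$image" &&
    kubectl expose deployment tcp-proxy --type=LoadBalancer \
        --protocol=TCP --port="$port" --target-port="$port" &&
    kubectl autoscale deployment tcp-proxy --min=1 --max=5 --cpu-percent=80
}
```

CPU-based autoscaling only takes effect when the containers declare CPU requests, so the deployment would need a resources section before the last command becomes useful. Even then, new replicas only receive new connections; a single expensive request is still handled by one pod.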

Spark standalone mode on AWS EMR

I'm able to run Spark on AWS EMR without much trouble following the documentation but from what I see it always uses YARN instead of the standalone manager. Is there any way to use the standalone mode instead of YARN easily? I don't really feel like hacking the bootstrap scripts to turn off yarn and deploy spark master/workers myself.
I'm running into a weird YARN related bug and I was hoping it won't happen with standalone manager.
As far as I know, there is no way to run in standalone mode on EMR unless you go back to the old ami-versions instead of using the emr-release-label. The old ami-versions will, however, cause other problems with newer versions of Spark, so I wouldn't go that way.
What you can do is to launch ordinary EC2-instances with Spark instead of using EMR. If you have a local Spark installation, go to the ec2 folder and use spark-ec2 to launch the cluster, like this:
./spark-ec2 --copy-aws-credentials --key-pair=MY_KEY --identity-file=MY_PEM_FILE.pem --region=MY_PREFERED_REGION --instance-type=INSTANCE_TYPE --slaves=NUMBER_OF_SLAVES --hadoop-major-version=2 --ganglia launch NAME_OF_JOB
I suspect that you have jar files that are needed, so they have to be copied onto the cluster: copy them to the master first, then ssh to the master and copy them onto the slaves from there (./spark-ec2/copy-dir on the master will copy a directory onto all slaves). Then restart Spark:
./spark/sbin/stop-master.sh
./spark/sbin/stop-slaves.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh
and you are ready to launch Spark in standalone mode:
./spark/bin/spark-submit --deploy-mode client ...

Configuring AmazonLinux AMI instances

I am trying to set up an AMI such that, when booted, it will automatically configure itself from a defined "configuration" stored somewhere on a server. I came across Chef and Puppet. Considering Puppet, I was able to run through their examples but couldn't find one for automatic configuration from the master. I found out that Puppet Enterprise is not supported on "Amazon Linux". The team chose Amazon Linux and would like to keep it rather than move to another OS just because one tool doesn't support it. Can someone please give me some idea of how I could achieve this? (I am trying to stay away from home-grown shell scripts in favor of a good, industry-adopted tool, for maintainability.)
What I have done in the past is to copy /etc/rc.local to /etc/rc.local.orig, and then configure /etc/rc.local to kick off a puppet run and then pave over itself.
/etc/rc.local:
#!/bin/bash
##
#add pre-puppeting stuff here, I add the hostname in "User-data" when creating the VM so I can set the hostname before checking in
##
/usr/bin/puppet agent --test
/bin/cp -f /etc/rc.local.orig /etc/rc.local
/sbin/init 6
AWS CloudFormation is one of Amazon's recommended ways to provision servers (and other cloud resources, too). You declare all the resources you need in a JSON file, and specify how to provision each server by declaring packages to install, services to run, files to create, and commands to run when the server is created. See the user guide for more information. I also wrote a couple of blog posts about getting started with it.