Tasks not executed by the Compute Nodes in Ubuntu CfnCluster image

I'm trying to use CfnCluster 1.2.1 for GPU computing and I'm using a custom AMI based on the Ubuntu 14.04 CfnCluster AMI.
Everything is created correctly in the CloudFormation console, but when I submit a test task to Oracle Grid Engine using qsub from the master server, it is never executed. According to qstat, it always stays in state "qw" and never enters state "r".
It seems to work fine with the Amazon Linux AMI (using user ec2-user instead of ubuntu) and the exact same configuration. Also, the master instance announces the number of remaining tasks to the cluster as a metric, and new compute instances are auto-scaled as a result.
What mechanisms does CfnCluster or Oracle Grid Engine provide to further debug this? I took a look at the log files, but didn't find anything relevant. What could be the cause for this behavior?
Thank you,
Diego

Similar to https://stackoverflow.com/a/37324418/704265
From your qhost output, it looks like your machine "ip-10-0-0-47" is properly configured in SGE. However, on "ip-10-0-0-47" sge_execd is either not running or not configured properly. If it were, qhost would report statistics for "ip-10-0-0-47".
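To make the "qhost would report statistics" check concrete, here is a minimal sketch that scans qhost output for execution hosts whose LOAD column is still "-". The sample text is illustrative only (it is not the asker's actual output, and ip-10-0-0-48 is a made-up host), and the column layout is assumed to follow the usual qhost header, so it is located by name rather than by position.

```python
from typing import List

# Illustrative qhost output; only ip-10-0-0-47 comes from the question.
SAMPLE_QHOST = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
ip-10-0-0-47            lx-amd64        -     -       -       -       -       -
ip-10-0-0-48            lx-amd64        4  0.01    7.3G  310.0M     0.0     0.0
"""


def hosts_without_stats(qhost_output: str) -> List[str]:
    """Return hosts that are registered in SGE but report no load value."""
    lines = [ln for ln in qhost_output.splitlines() if ln.strip()]
    header = lines[0].split()
    load_idx = header.index("LOAD")  # find the column by name
    silent = []
    for line in lines[1:]:
        if set(line.strip()) == {"-"}:
            continue  # the dashed separator row
        fields = line.split()
        if fields[0] == "global":
            continue  # pseudo-host, never reports load
        if len(fields) > load_idx and fields[load_idx] == "-":
            silent.append(fields[0])
    return silent


print(hosts_without_stats(SAMPLE_QHOST))  # ['ip-10-0-0-47']
```

Any host this returns is worth logging in to, to check whether sge_execd is actually running there. On the master, qstat -j <job_id> also prints the scheduler's reason for leaving a job in "qw".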

I think I found the solution. It seems to be the same issue as the one described in https://github.com/awslabs/cfncluster/issues/86#issuecomment-196966385
I fixed it by adding the following line to the CfnCluster configuration file:
base_os = ubuntu1404
If custom_ami is specified without base_os, CfnCluster defaults to Amazon Linux, which uses a different method to configure SGE. If base_os does not match the operating system of the custom AMI, the SGE configuration performed by CfnCluster may run into problems.
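Since this mismatch is easy to miss, a small check can flag it before the cluster is created. This is only a sketch: it assumes the usual INI layout of the CfnCluster config file with "[cluster <name>]" sections, and the AMI IDs and the "gpu" section below are placeholders.

```python
import configparser
from typing import List

# Placeholder config; AMI IDs and the "gpu" cluster are hypothetical.
EXAMPLE_CONFIG = """\
[cluster default]
custom_ami = ami-00000000
base_os = ubuntu1404

[cluster gpu]
custom_ami = ami-11111111
"""


def clusters_missing_base_os(config_text: str) -> List[str]:
    """Return cluster sections that set custom_ami but not base_os."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        if not section.startswith("cluster "):
            continue  # skip [global], [aws], [vpc ...] and similar sections
        options = parser[section]
        if "custom_ami" in options and "base_os" not in options:
            missing.append(section)
    return missing


print(clusters_missing_base_os(EXAMPLE_CONFIG))  # ['cluster gpu']
```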

Related

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster that runs one of the outdated image versions (for example, 1.3.94-debian10) that contain the Apache Log4j 2 vulnerabilities. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how Security Command Center (SCC) works (this is just for a test environment).
I tried to run the command
gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10
but got the following message:
ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster
This makes sense, since it protects the cluster.
I did some research and found that I have to create a custom image with that version and create the cluster from it. However, I have read the documentation and looked for tutorials, and I still can't work out how to get started or how to run generate_custom_image.py, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you

How can you find out Azure-pipeline image content?

I'm new to Azure Pipelines and struggling to put together a C++ pipeline that uses CMake to compile, run tests, and build documentation on Ubuntu, macOS, and Windows.
I managed the macOS and Ubuntu cases rather easily but am struggling with the Windows case, not knowing what's installed and what's on the system PATH for the image and container I've selected.
Not being very familiar with the Azure platform, I'm basically relying on a commit-push-run-pipeline cycle for every little change to my YAML file, which wastes time and resources.
I can't imagine that the only way is to blindly try out commands by commit, push and run the pipeline.
I managed to find a basic description of the currently (hopefully) available images here. Following the included software link for Windows, you end up on a comprehensive list of what's supposedly installed (I have some doubts about whether this documentation actually matches the content of the image). Calling some of the tools on that list, such as cmake and choco, failed. Whether they're actually installed and on the system PATH, I have no idea.
Q1: Is there any way to locally test out an Azure-Pipeline YAML?
Q2: Is there any way to figure what is actually installed on a given image/container (without issuing a DIR /s from the root folder??)
Q3: Is it possible to connect to a running container (or is it a VM???) instance and directly tinker with it?
Q4: Alternatively, is it possible to run such an image locally (Docker)? Does it imply execution on a Windows machine or is that a standalone VM image?
EDIT: Found out about this question, although doesn't quite answer mine: Is there a tool to validate an Azure DevOps Pipeline locally?
Q1: Is there any way to locally test out an Azure-Pipeline YAML?
The answer is yes. You can set up your own private (self-hosted) agent to execute the Azure Pipelines YAML.
See: Self-hosted agents
Q2: Is there any way to figure what is actually installed on a given
image/container (without issuing a DIR /s from the root folder??)
As you know, the Software document lists the software installed on the agent. If you want to know the install path of a particular tool, check the debug log of the build task that uses it. For example, for cmake, check the build log of the cmake task.
Q3: Is it possible to connect to a running container (or is it a
VM???) instance and directly tinker with it?
For the hosted agent, I am afraid the answer is no.
Q4: Alternatively, is it possible to run such an image locally
(Docker)? Does it imply execution on a Windows machine or is that a
standalone VM image?
The answer is yes: you can run a self-hosted agent in Docker. This implies execution on a Windows machine.
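Regarding Q2, one way to shorten the commit-push-run loop is to have a single pipeline step dump the PATH and report where (or whether) each tool is found. Below is a minimal sketch. It assumes a Python interpreter is available on the agent, and the script name tools_probe.py is made up. The tool names cmake and choco are the ones from the question.

```python
import os
import shutil
from typing import Dict, Iterable, Optional


def locate_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def report(names: Iterable[str]) -> str:
    """Build a plain-text report of PATH entries and tool locations."""
    lines = ["PATH entries:"]
    for entry in os.environ.get("PATH", "").split(os.pathsep):
        if entry:
            lines.append("  " + entry)
    lines.append("Tools:")
    for name, path in locate_tools(names).items():
        lines.append("  {}: {}".format(name, path or "NOT FOUND on PATH"))
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(["cmake", "choco"]))
```

Run it from a script step in the YAML (for example, a step that calls python tools_probe.py) and read the result in the step's log. A tool can be installed on the image and still be reported as not found if its directory is not on PATH.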

How to apply rolling updates in VM instances instead of using Managed Instance group in GCP?

Problem: I want to apply patch updates to a VM instance that is not part of a Managed Instance Group. The patch update could be:
1. A change in the version of the current OS of a VM instance, that is, a change from Ubuntu-16-v1 to Ubuntu-16-v2.
2. An upgrade of the boot OS, that is, a change from Ubuntu-16 to Ubuntu-18.
3. Installation of a new package on the existing machine.
Exploration:
For Problem 1 & 2 stated above
I have explored and tried the rolling update feature of Managed Instance Groups in Google Cloud Platform, and it seems to be a good approach for the problem stated. But what is the best approach, following best practices, for someone who is not using a Managed Instance Group? You may find the details here.
For Problem 3 stated above
I have tried the OS patch management service of GCP, but is there any other method that I could use?
Create an "image" from the boot disks of your existing Compute Engine instances.
To roll out newer configurations and software, group images in an "image family", which always points to the latest image.
See https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#setting_families
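As a rough illustration of the image-family idea, the sketch below only assembles the two gcloud invocations involved (create an image in a family from a boot disk, then create a VM from the family) and prints them. Nothing is executed. All resource names (vm1-boot, ubuntu-16-v2, ubuntu-16, my-project, us-east1-b) are placeholders, and it assumes the source instance is stopped before its disk is imaged.

```python
import shlex
from typing import List


def create_image_cmd(image: str, source_disk: str, zone: str, family: str) -> List[str]:
    """gcloud command that snapshots a boot disk into an image family."""
    return [
        "gcloud", "compute", "images", "create", image,
        "--source-disk=" + source_disk,
        "--source-disk-zone=" + zone,
        "--family=" + family,
    ]


def create_instance_cmd(instance: str, zone: str, family: str, project: str) -> List[str]:
    """gcloud command that boots a VM from the newest image in a family."""
    return [
        "gcloud", "compute", "instances", "create", instance,
        "--zone=" + zone,
        "--image-family=" + family,
        "--image-project=" + project,
    ]


def as_shell(cmd: List[str]) -> str:
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    print(as_shell(create_image_cmd("ubuntu-16-v2", "vm1-boot", "us-east1-b", "ubuntu-16")))
    print(as_shell(create_instance_cmd("vm1", "us-east1-b", "ubuntu-16", "my-project")))
```

Because the instance is created from the family rather than from a specific image, recreating the VM after a new image is added to the family picks up the update without changing the command.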
For your use case, I think you should use an IaC (infrastructure-as-code) tool such as Terraform to recreate similar VMs with the same name, disk, internal address, etc., and either call the script directly from the repo automatically on a scheduled date or provide self-patching instructions.
Here is the likely process:
Send an email notification to all VM owners that auto-patching is scheduled on XYZ.
The email content should include the list of instances to be patched or updated, the list of actions, and the patch team's contact details.
The email should also include a link for skipping this auto-update and following the "Self Patching instruction" documents instead.
The self-patching documents should include a command that calls the auto-patch wrapper script, such as:
curl -u "encrypted-auth:x-oauth-basic" -k -H 'Accept: application/vnd.github.VERSION.raw' 'https://github.com/api/v3/repos/xyz/images/contents/gcp/patch_OS_update.sh?ref=master' | bash -s -- -q
The script can also offer other options, for example to query the patch set available for a particular VM or to scan the VM for pending updates.

Trouble configuring Presto's memory allocation on AWS EMR

I am really hoping to use Presto in an ETL pipeline on AWS EMR, but I am having trouble configuring it to fully utilize the cluster's resources. This cluster would exist solely for this one query, and nothing more, then die. Thus, I would like to claim the maximum available memory for each node and the one query by increasing query.max-memory-per-node and query.max-memory. I can do this when I'm configuring the cluster by adding these settings in the "Edit software settings" box of the cluster creation view in the AWS console. But the Presto server doesn't start, reporting in the server.log file an IllegalArgumentException, saying that max-memory-per-node exceeds the useable heap space (which, by default, is far too small for my instance type and use case).
I have tried to use the session setting set session resource_overcommit=true, but that only seems to override query.max-memory, not query.max-memory-per-node, because in the Presto UI, I see that very little of the available memory on each node is being used for the query.
Through Google, I've been led to believe that I need to also increase the JVM heap size by changing the -Xmx and -Xms properties in /etc/presto/conf/jvm.config, but it says here (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that it is not possible to alter the JVM settings in the cluster creation phase.
To change these properties after the EMR cluster is active and the Presto server has been started, do I really have to manually ssh into each node and alter jvm.config and config.properties, and restart the Presto server? While I realize it'd be possible to manually install Presto with a custom configuration on an EMR cluster through a bootstrap script or something, this would really be a deal-breaker.
Is there something I'm missing here? Is there not an easier way to make Presto allocate all of a cluster to one query?
As advertised, increasing query.max-memory-per-node, and also by necessity the -Xmx property, indeed cannot be achieved on EMR until after Presto has already started with the default options. To increase these, the jvm.config and config.properties found in /etc/presto/conf/ have to be changed, and the Presto server restarted on each node (core and coordinator).
One can do this with a bootstrap script using commands like
sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
sudo restart presto-server
and similarly for /etc/presto/conf/jvm.config. The only caveats are that the bootstrap action needs logic to run only after Presto has been installed, and that the server on the coordinator node needs to be restarted last (possibly with different settings if the master node's instance type differs from that of the core nodes).
You might also need to change resources.reserved-system-memory from the default by specifying a value for it in config.properties. By default, this value is .4*(Xmx value), which is how much memory is claimed by Presto for the system pool. In my case, I was able to safely decrease this value and give more memory to each node for executing the query.
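To sanity-check values before restarting the servers, the arithmetic above can be sketched as follows. The 0.4 default fraction is taken from this answer. The rule that query.max-memory-per-node must fit within the heap minus the reserved system memory is an assumption, inferred from the "useable heap space" error described in the question. The helper names are made up for illustration.

```python
import re

# Default share of -Xmx reserved for the system pool, per the answer above.
DEFAULT_RESERVED_FRACTION = 0.4


def usable_heap_gb(xmx_gb, reserved_gb=None):
    """Heap left for queries after the reserved system memory is taken out."""
    if reserved_gb is None:
        reserved_gb = DEFAULT_RESERVED_FRACTION * xmx_gb
    return xmx_gb - reserved_gb


def fits(max_memory_per_node_gb, xmx_gb, reserved_gb=None):
    """Assumed rule: the per-node limit must not exceed the usable heap."""
    return max_memory_per_node_gb <= usable_heap_gb(xmx_gb, reserved_gb)


def set_property(properties_text, key, value):
    """Replace an existing key=value line, or append one if the key is absent."""
    pattern = re.compile(r"^" + re.escape(key) + r"\s*=.*$", re.MULTILINE)
    line = "{}={}".format(key, value)
    if pattern.search(properties_text):
        return pattern.sub(lambda _match: line, properties_text)
    needs_newline = bool(properties_text) and not properties_text.endswith("\n")
    return properties_text + ("\n" if needs_newline else "") + line + "\n"


if __name__ == "__main__":
    # With -Xmx30G and the default reservation, 20GB per node would not fit;
    # lowering resources.reserved-system-memory to 4GB makes room for it.
    print(fits(20, 30))
    print(fits(20, 30, reserved_gb=4))
```

The same set_property helper could be applied to the contents of /etc/presto/conf/config.properties in place of the sed call, which also covers the case where resources.reserved-system-memory is not yet present in the file.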
As a matter of fact, there are configuration classifications available for Presto in EMR. However, note that these may vary depending on the EMR release version. For a complete list of the available configuration classifications per release version, see [1] (make sure to switch between the tabs according to your desired release version). Regarding the jvm.config properties specifically, you will see in [2] that these are not currently configurable via configuration classifications. That being said, you can always edit the jvm.config file manually as needed.
[1] Amazon EMR 5.x Release Versions
[2] Considerations with Presto on Amazon EMR - Some Presto Deployment Properties not Configurable

How does ElasticBeanStalk deploy your application version to instances?

I am currently using AWS Elastic Beanstalk and I am curious how, internally, it knows to unpack the zip I uploaded as a version when an instance is fired up (manually or automatically through scaling). Is there some environment setting that looks up my zip in my S3 bucket and unpacks it automatically on every instance running in that environment?
If so, could this also be used to automate a task such as running an SQL query on boot-up (instance deployment)? Are these automated tasks changeable or viewable at all?
Thanks
I don't know how Beanstalk knows which version to download and unpack, but running a task on start-up is trivial. Check out cloud-init, a tool originally developed by Canonical for Ubuntu that's now packaged in Amazon Linux. It allows you to pass arbitrary shell scripts in the UserData section of the instance configuration, and those shell scripts will run on startup.
It's a great way to bootstrap instances on startup, which avoids the soul-sucking misery of managing AMIs.
A quick (possibly non-applicable) warning: If you're running a SQL query on a database that lives on the beanstalk AMI, you're pretty much guaranteed to lose your database at some point. Those machines are designed to be entirely transient. Do not put databases on them. See this answer for more details.
Since your goal seems to be to run custom configuration tasks, the answer is yes, there is a way to do that. You can define custom actions in an .ebextensions file packaged with your app. For example, you can configure a command to run every time a new machine is deployed:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html#linux-commands
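For illustration, here is a small sketch that generates such a configuration file. It relies on .ebextensions config files being accepted in JSON as well as YAML, and uses the "commands" key described at the link above. The file name 01_on_deploy.config, the command name, and the shell command itself are placeholders.

```python
import json


def ebextension_command(name: str, command: str) -> str:
    """Return the JSON text of a config file that runs one shell command."""
    config = {"commands": {name: {"command": command}}}
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    text = ebextension_command(
        "01_log_deploy",  # hypothetical command name
        "echo 'instance deployed' >> /tmp/deploy.log",  # placeholder command
    )
    # Save this output as .ebextensions/01_on_deploy.config in the app bundle.
    print(text)
```

Keep in mind the warning above: a command like this runs on every new instance, so anything it touches should live outside the instance (for example, an external database) if it needs to persist.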