Incorrect image reference when launching Dataflow Flex templates - google-cloud-platform

We are using Dataflow Flex Templates and following this guide (https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) to stage and launch jobs. This is working in our environment. However, when I SSH onto the Dataflow VM and run docker ps, I see that it references a different Docker image from the one we specify in our template.
The template I am launching from is as follows and jobs are created using gcloud beta dataflow flex-template run:
{
"image": "gcr.io/<MY PROJECT ID>/samples/dataflow/streaming-beam-sql:latest",
"metadata": {
"description": "An Apache Beam streaming pipeline that reads JSON encoded messages from Pub/Sub, uses Beam SQL to transform the message data, and writes the results to a BigQuery",
"name": "Streaming Beam SQL",
"parameters": [
{
"helpText": "Pub/Sub subscription to read from.",
"label": "Pub/Sub input subscription.",
"name": "inputSubscription",
"regexes": [
".*"
]
},
{
"helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.",
"is_optional": true,
"label": "BigQuery output table",
"name": "outputTable",
"regexes": [
"[^:]+:[^.]+[.].+"
]
}
]
},
"sdkInfo": {
"language": "JAVA"
}
}
So I would expect the output of docker ps to show gcr.io/<MY PROJECT ID>/samples/dataflow/streaming-beam-sql as the image on Dataflow. When I launch the image from GCR to run on a GCE instance I get the following output when running docker ps:
Should I expect to see the name of the image I have referenced in the Dataflow template on the Dataflow VM? Or have I missed a step somewhere?
Thanks!

TL;DR: You are looking at the worker VM instead of the launcher VM.
With Flex Templates, running the job first creates a launcher VM, which pulls your container and runs it to generate the job graph. This VM is destroyed once that step completes. Then the worker VM is started to actually run the generated job graph. The worker VM has no need for your container; the container is used only to generate the job graph from the parameters passed.
In your case, you are looking for your image on the worker VM. The launcher VM is short-lived, and its name starts with launcher-. If you SSH into that VM and run docker ps, you will see your container image.
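As a small illustration of the distinction, the sketch below separates launcher VMs from the other VMs by name. Only the launcher- prefix comes from the answer above; the instance names and the function name are made up for the example.

```python
def split_launcher_and_other(instance_names):
    """Partition VM names into (launchers, others) using the 'launcher-' prefix."""
    launchers = [n for n in instance_names if n.startswith("launcher-")]
    others = [n for n in instance_names if not n.startswith("launcher-")]
    return launchers, others


# Hypothetical instance names, for illustration only.
names = ["launcher-2020081012345678", "streaming-beam-sql-harness-abcd"]
launchers, others = split_launcher_and_other(names)
print("SSH into one of these to see your image:", launchers)
```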

Related

startup scripts on Google cloud platform using Packer

I'm using HashiCorp's Packer to create machine images for Google Cloud (the equivalent of an AMI on Amazon). I want every instance to run a script once the instance is created in the cloud. As I understand from the Packer docs, I could use startup_script_file to do this. I got this working, but it seems that the script is only run once, at image creation, resulting in the same output on every running instance. How can I trigger this script on instance creation instead, so that I get different output for every instance?
packer config:
{
"builders": [{
"type": "googlecompute",
"project_id": "project-id",
"source_image": "debian-9-stretch-v20200805",
"ssh_username": "name",
"zone": "europe-west4-a",
"account_file": "secret-account-file.json",
"startup_script_file": "link to file"
}]
}
script:
#!/bin/bash
echo $((1 + RANDOM % 100)) > test.log #output of this remains the same on every created instance.

Run AWS EMR Cluster Using Step Functions

I am very new to AWS Step Functions and AWS Lambda Functions and could really use some help getting an EMR Cluster running through Step Functions. A sample of my current State Machine structure is shown by the following code
{
"Comment": "This is a test for running the structure of the CustomCreate job.",
"StartAt": "PreStep",
"States": {
"PreStep": {
"Comment": "Check that all the necessary files exist before running the job.",
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:XXXXXXXXXX:function:CustomCreate-PreStep-Function",
"Next": "Run Job Choice"
},
"Run Job Choice": {
"Comment": "This step chooses whether or not to go forward with running the main job.",
"Type": "Choice",
"Choices": [
{
"Variable": "$.FoundNecessaryFiles",
"BooleanEquals": true,
"Next": "Spin Up Cluster"
},
{
"Variable": "$.FoundNecessaryFiles",
"BooleanEquals": false,
"Next": "Do Not Run Job"
}
]
},
"Do Not Run Job": {
"Comment": "This step triggers if the PreStep fails and the job should not run.",
"Type": "Fail",
"Cause": "PreStep unsuccessful"
},
"Spin Up Cluster": {
"Comment": "Spins up the EMR Cluster.",
"Type": "Pass",
"Next": "Update Env"
},
"Update Env": {
"Comment": "Update the environment variables in the EMR Cluster.",
"Type": "Pass",
"Next": "Run Job"
},
"Run Job": {
"Comment": "Add steps to the EMR Cluster.",
"Type": "Pass",
"End": true
}
}
}
Which is shown by the following workflow diagram
The PreStep and Run Job Choice tasks use a simple Lambda Function to check that the files necessary to run this job exist on my S3 Bucket, then go to spin up the cluster provided that the necessary files are found. These tasks are working properly.
What I am not sure about is how to handle the EMR Cluster related steps.
In my current structure, the first task is to spin up an EMR Cluster. This could be done either directly in the Step Function JSON or, preferably, using a JSON Cluster Config file (titled EMR-cluster-setup.json) I have located on my S3 Bucket.
My next task is to update the EMR Cluster environment variables. I have a .sh script located on my S3 Bucket that can do this. I also have a JSON file (titled EMR-RUN-Script.json) located on my S3 Bucket that will add a first step to the EMR Cluster that will run and source the .sh script. I just need to run that JSON file from within the EMR Cluster, which I do not know how to do using the Step Functions. The code for EMR-RUN-SCRIPT.json is shown below
[
{
"Name": "EMR-RUN-SCRIPT",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
"Args": [
"s3://PATH/TO/env_configs.sh"
]
}
}
]
My third task is to add a step that contains a spark-submit command to the EMR Cluster. This command is described in a JSON config file (titled EMR-RUN-STEP.json) located on my S3 Bucket that can be uploaded to the EMR Cluster in a similar manner to uploading the environment configs file in the previous step. The code for EMR-RUN-STEP.json is shown below
[
{
"Name": "EMR-RUN-STEP",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"bash", "-c",
"source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
]
}
}
]
Finally, I want to have a task that makes sure the EMR Cluster terminates after it completes its run.
I know there may be a lot involved within this question, but I would greatly appreciate any assistance with any of the issues described above. Whether it be following the structure I outlined above, or if you know of another solution, I am open to any form of help. Thank you in advance.
You need a terminate cluster step. As the documentation (https://docs.aws.amazon.com/step-functions/latest/dg/connect-emr.html) states:
createCluster uses the same request syntax as runJobFlow, except for the following:
The field Instances.KeepJobFlowAliveWhenNoSteps is mandatory, and must have the Boolean value TRUE.
So the cluster will not shut down on its own, and you need a step to do this for you. The documentation lists two options:
terminateCluster: Shuts down a cluster (job flow); corresponds to terminateJobFlows.
terminateCluster.sync: The same as terminateCluster, but waits for the cluster to terminate.
For me, terminateCluster.sync is preferable over the simple terminateCluster, as it waits for the cluster to actually terminate and you can handle any hangs here. You'll be using Standard step functions, so the bit of extra time will not be billed.
P.S.: If you are using termination protection, you'll need an extra step to turn it off before you can terminate your cluster ;)
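To show how the pieces could fit together, here is a rough sketch that builds the three EMR-related states as a Python dictionary and prints them as JSON. The state names follow the question; the ResultPath names, the function name, and the cluster name are assumptions, and the createCluster parameters are deliberately incomplete (a real cluster also needs its release label, roles, and instance settings), so treat this as a skeleton rather than a deployable definition.

```python
import json


def build_emr_states(step_args, cluster_name="my-cluster"):
    """Return Step Functions states: create cluster, add a step, terminate."""
    return {
        "Spin Up Cluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": {
                "Name": cluster_name,
                # Mandatory for createCluster, per the documentation quoted above.
                "Instances": {"KeepJobFlowAliveWhenNoSteps": True},
                # Remaining runJobFlow settings intentionally omitted.
            },
            "ResultPath": "$.Cluster",  # assumed location for the cluster ID
            "Next": "Run Job",
        },
        "Run Job": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": "$.Cluster.ClusterId",
                "Step": {
                    "Name": "EMR-RUN-STEP",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": list(step_args),
                    },
                },
            },
            "ResultPath": "$.Step",
            "Next": "Terminate Cluster",
        },
        "Terminate Cluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
            "Parameters": {"ClusterId.$": "$.Cluster.ClusterId"},
            "End": True,
        },
    }


states = build_emr_states(["bash", "-c", "echo placeholder"])
print(json.dumps(states, indent=2))
```

These states would replace the three Pass states in the question's state machine; the environment-update step can be added as another addStep.sync task in the same way.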
'KeepJobFlowAliveWhenNoSteps': False
Add the above configuration to the EMR cluster creation script. It will automatically terminate the EMR cluster when all the steps are completed (see the EMR boto3 configuration).

How can I set up Continuous Integration of a Dockerized application to Elastic Beanstalk?

I'm new to Docker, and my previous experience is with deploying Java web applications (running in Tomcat containers) to Elastic Beanstalk. The pipeline I'm used to goes something like this: a commit is checked into git, which triggers a Jenkins job, which builds the application JAR (or WAR) file, publishes it to Artifactory, and then deploys that same JAR to an application in Elastic Beanstalk using eb deploy. (Apologies if "pipeline" is a reserved term; I'm using it conceptually.)
Incidentally, I'm also going to be using GitLab for CI/CD instead of Jenkins (due to organizational reasons out of my control), but the jump from Jenkins to GitLab seems straightforward to me -- certainly more so than the jump from deploying WARs directly to deploying Dockerized containers.
Moving over into the Docker world, I imagine the pipeline will go something like this: a commit is checked into git, which triggers the Gitlab CI, which will then build the JAR or WAR file, publish it to Artifactory, then use the Dockerfile to build the Docker image, publish that Docker image into Amazon ECR (maybe?)... and then I'm honestly not sure how the Elastic Beanstalk integration would proceed from there. I know it has something to do with the Dockerrun.aws.json file, and presumably needs to call the AWS CLI.
I just got done watching a webinar from Amazon called Running Microservices and Docker on AWS Elastic Beanstalk, which stated that in the root of my repo there should be a Dockerrun.aws.json file which essentially defines the integration to EB. However, it seems that JSON file contains a link to the individual Docker image in ECR, which is throwing me off. Wouldn't that link change every time a new image is built? I'm imagining that the CI would need to dynamically update the JSON file in the repo... which almost feels like an anti-pattern to me.
In the webinar I linked above, the host created his Docker image and pushed it to ECR manually, with the CLI. Then he manually uploaded the Dockerrun.aws.json file to EB. He didn't need to upload the application, however, since it was already contained within the Docker image. This all seems odd to me and I question whether I'm understanding things correctly. Will the Dockerrun.aws.json file need to change on every build? Or am I thinking about this the wrong way?
In the 8 months since I posted this question, I've learned a lot and we've already moved onto different and better technology. But I will post what I learned in answer to my original question.
The Dockerrun.aws.json file is almost exactly the same as an ECS task definition. It's important to use the Multi-Docker container deployment version of Beanstalk (as opposed to the single container), even if you're only deploying a single container. IMO they should just get rid of the single-container platform for Beanstalk as it's pretty useless. But assuming you have Beanstalk set to the Multi-Container Docker platform, then the Dockerrun.aws.json file looks something like this:
{
"AWSEBDockerrunVersion": 2,
"containerDefinitions": [
{
"name": "my-container-name-this-can-be-whatever-you-want",
"image": "my.artifactory.com/docker/my-image:latest",
"environment": [],
"essential": true,
"cpu": 10,
"memory": 2048,
"mountPoints": [],
"volumesFrom": [],
"portMappings": [
{
"hostPort": 80,
"containerPort": 80
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/aws/elasticbeanstalk/my-image/var/log/stdouterr.log",
"awslogs-region": "us-east-1",
"awslogs-datetime-format": "%Y-%m-%d %H:%M:%S.%L"
}
}
}
]
}
If you decide, down the road, to convert the whole thing to an ECS service instead of using Beanstalk, that becomes really easy, as the sample JSON above is converted directly to an ECS task definition by extracting the "containerDefinitions" part. So the equivalent ECS task definition might look something like this:
[
{
"name": "my-container-name-this-can-be-whatever-you-want",
"image": "my.artifactory.com/docker/my-image:latest",
"environment": [
{
"name": "VARIABLE1",
"value": "value1"
}
],
"essential": true,
"cpu": 10,
"memory": 2048,
"mountPoints": [],
"volumesFrom": [],
"portMappings": [
{
"hostPort": 0,
"containerPort": 80
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/aws/ecs/my-image/var/log/stdouterr.log",
"awslogs-region": "us-east-1",
"awslogs-datetime-format": "%Y-%m-%d %H:%M:%S.%L"
}
}
}
]
Key differences here are that with the Beanstalk version, you need to map port 80 to port 80 because a limitation of running Docker on Beanstalk is that you cannot replicate containers on the same instance, whereas in ECS you can. This means that in ECS you can map your container port to host port "zero," which really just tells ECS to pick a random port in the ephemeral range which allows you to stack multiple replicas of your container on a single instance. Secondly with ECS if you want to pass in environment variables, you need to inject them directly into the Task Definition JSON. In Beanstalk world, you don't need to put the environment variables in the Dockerrun.aws.json file, because Beanstalk has a separate facility for managing environment variables in the console.
In fact, the Dockerrun.aws.json file should really just be thought of as a template. Because Docker on Beanstalk uses ECS under-the-hood, it simply takes your Dockerrun.aws.json as a template and uses it to generate its own Task Definition JSON, which injects the managed environment variables into the "environment" property in the final JSON.
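The "template" idea can be illustrated with a few lines of Python: take the "containerDefinitions" part of a Dockerrun.aws.json document and inject environment variables into each container. This is only a sketch of the concept described above, not what Beanstalk actually runs internally; the function name and the sample values are invented for the example.

```python
import copy


def to_container_definitions(dockerrun, env=None):
    """Extract containerDefinitions and append env vars to each container."""
    definitions = copy.deepcopy(dockerrun["containerDefinitions"])
    extra = [{"name": k, "value": v} for k, v in (env or {}).items()]
    for container in definitions:
        container["environment"] = container.get("environment", []) + extra
    return definitions


# Trimmed-down sample modelled on the Dockerrun.aws.json shown earlier.
sample = {
    "AWSEBDockerrunVersion": 2,
    "containerDefinitions": [
        {
            "name": "my-container",
            "image": "my.artifactory.com/docker/my-image:latest",
            "environment": [],
        }
    ],
}
task_containers = to_container_definitions(sample, {"VARIABLE1": "value1"})
print(task_containers[0]["environment"])
```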
One of the big questions I had at the time when I first asked this question was whether you had to update this Dockerrun.aws.json file every time you deployed. What I discovered is that it comes down to a choice of how you want to deploy things. You can, but you don't have to. If you write your Dockerrun.aws.json file so that the "image" property references the :latest Docker image, then there's no need to ever update that file. All you need to do is bounce the Beanstalk instance (i.e. restart the environment), and it will pull whatever :latest Docker image is available from Artifactory (or ECR, or wherever else you publish your images). Thus, all a build pipeline would need to do is publish the :latest Docker image to your Docker repository, and then trigger a restart of the Beanstalk environment using the awscli, with a command like this:
$ aws elasticbeanstalk restart-app-server --region=us-east-1 --environment-name=myapp
However, there are a lot of drawbacks to that approach. If you have a dev/unstable branch that publishes a :latest image to the same repository, you risk deploying that unstable branch if the environment happens to restart on its own. Thus, I would recommend versioning your Docker tags and only deploying the version tags. So instead of pointing to my-image:latest, you would point to something like my-image:1.2.3. This does mean that your build process would have to update the Dockerrun.aws.json file on each build. And then you also need to do more than just a simple restart-app-server.
In this case, I wrote some bash scripts that made use of the jq utility to programmatically update the "image" property in the JSON, replacing the string "latest" with whatever the current build version was. Then I would have to make a call to the awsebcli tool (note that this is a different package than the normal awscli tool) to update the environment, like this:
$ eb deploy myapp --label 1.2.3 --timeout 1 || true
Here I'm doing something hacky: the eb deploy command unfortunately takes FOREVER. (This was another reason we switched to pure ECS; Beanstalk is unbelievably slow.) That command hangs for the entire deployment time, which in our case could take up to 30 minutes or more. That's completely unreasonable for a build process, so I force the process to timeout after 1 minute (it actually continues the deployment; it just disconnects my CLI client and returns a failure code to me even though it may subsequently succeed). The || true is a hack that effectively tells Gitlab to ignore the failure exit code, and pretend that it succeeded. This is obviously problematic because there's no way to tell if the Elastic Beanstalk deployment really did fail; we're assuming it never does.
One more thing on using eb deploy: by default this tool will automatically try to ZIP up everything in your build directory and upload that entire ZIP to Beanstalk. You don't need that; all you need is to update the Dockerrun.aws.json. In order to do this, my build steps were something like this:
Use jq to update Dockerrun.aws.json file with the latest version tag
Use zip to create a new ZIP file called deploy.zip and put Dockerrun.aws.json inside it
Make sure a file called .elasticbeanstalk/config.yml is in place (described below)
Run the eb deploy ... command
Then you need a file in the build directory at .elasticbeanstalk/config.yml which looks like this:
deploy:
  artifact: deploy.zip
global:
  application_name: myapp
  default_region: us-east-1
  workspace_type: Application
The awsebcli knows to automatically look for this file when you call eb deploy. And what this particular file says is to look for a file called deploy.zip instead of trying to ZIP up the whole directory itself.
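For anyone who would rather not script this with jq and zip, the first two build steps (update the image tag, then package Dockerrun.aws.json into deploy.zip) could be sketched in Python as follows. This is an alternative to the bash scripts mentioned above, not a copy of them; the function name and its arguments are hypothetical.

```python
import json
import zipfile
from pathlib import Path


def build_deploy_zip(dockerrun_path, version, zip_path="deploy.zip"):
    """Replace the ':latest' image tag with `version` and zip the result."""
    path = Path(dockerrun_path)
    dockerrun = json.loads(path.read_text())
    for container in dockerrun["containerDefinitions"]:
        repo, sep, tag = container["image"].rpartition(":")
        if sep and tag == "latest":
            container["image"] = f"{repo}:{version}"
    updated = json.dumps(dockerrun, indent=2)
    path.write_text(updated)
    # The archive contains only Dockerrun.aws.json, as described above.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("Dockerrun.aws.json", updated)
    return zip_path
```

A build job might call build_deploy_zip("Dockerrun.aws.json", "1.2.3") and then run the eb deploy command.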
So the :latest method of deployment is problematic because you risk deploying something unstable; the versioned method of deployment is problematic because the deployment scripts are more complicated, and because unless you want your build pipelines to take 30+ minutes, there's a chance that the deployment won't be successful and there's really no way to tell (outside of monitoring each deployment yourself).
Anyways, it's a bit more work to set up, but I would recommend migrating to ECS whenever you can. (Better still to migrate to EKS, though that's a lot more work.) Beanstalk has a lot of problems.

AWS CannotPullContainerError no space left on device Docker

I'm trying to use a large Docker image (the image is on Docker Hub and is about 18 GB) as a job definition for AWS Batch. I'm getting the following error about running out of space:
CannotPullContainerError: write /var/lib/docker/tmp/GetImageBlob#######: no space left on device
The Cloudformation JSON section that defines the job is here
"JobDef3": {
"Type": "AWS::Batch::JobDefinition",
"Properties": {
"Type": "container",
"ContainerProperties": {
"Image": {
"Fn::Join": [
"",
[
"cornhundred/",
"dockerized-cellranger-nick:latest"
]
]
},
"Vcpus": 1,
"Command": ["some command"],
"Memory": 3000
},
"RetryStrategy": {
"Attempts": 1
}
}
},
How can I get AWS to increase the amount of space available so that I can run this image?
I was able to run the docker container by moving the large files (~15GB reference genome files) out of the docker image and downloading them after running the container. I also needed to make a custom Amazon Machine Image (AMI, see AWS Batch Genomics for an example) and attach a volume to handle the large reference genome files since the default container was not large enough.
I had a similar issue. Clearing up unused Docker images and volumes (i.e. docker container prune and docker system prune) didn't work for me.
I saw another page saying that restarting Docker fixed it for that user, but when I ran service docker restart I got this error: /etc/init.docker: line 35: ulimit: open files: cannot modify limit: Operation not permitted
To try to fix that issue, I saw sites suggesting updating the ulimit values in some configuration files, but when I tried to save the file with the updated parameters I got a write error (file system full?).
At which point, I realized (as the initial error you showed) I needed to clean up and remove files.
I did a du -h from the root folder and saw that the /var/lib/docker/tmp/ folder (which is part of the error message I experienced and you posted above) used up way more disk space than other folders.
So I removed older files there and I no longer got that error message.
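If du is not handy, a rough Python equivalent for spotting the largest subdirectory might look like the sketch below. It sums apparent file sizes rather than allocated disk blocks, so the numbers can differ from du; the function name is made up for the example.

```python
import os


def dir_sizes(root):
    """Return {subdirectory path: total bytes of files below it}, largest first."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        sizes[entry.path] = total
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

Calling dir_sizes("/var/lib/docker") (with sufficient permissions) would list tmp first if it is the biggest consumer.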

How to write files from Docker image to EFS?

Composition
Jenkins server on EC2 instance, uses EFS
Docker image for above Jenkins server
Need
Write templates to directory on EFS each time ECS starts the task which builds the Jenkins server
Where is the appropriate place to put a step to do the write?
Tried
If I do it in the Dockerfile, it writes to the Docker image, but never propagates the changes to EFS so that the templates are available as projects on the Jenkins server.
I've tried putting the write command in jenkins.sh but I can't figure out how that is run, anyway it doesn't place the templates where I need them.
The original question included:
Write templates to directory on EFS each time ECS starts the task
In addition to @luke-peterson's answer, you can use a shell script as the entry point in your Dockerfile in order to copy files between the mounted EFS folder and the container.
Instead of ENTRYPOINT, use the following directive in your Dockerfile:
CMD ["sh", "/app/startup.sh"]
And inside startup.sh you can copy files freely and run the app (.net core app in my example):
cp -R /app/wwwroot/. /var/jenkins-home
dotnet /app/app.dll
Of course, you can also do it programmatically inside the app itself.
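As one way to do the copy programmatically, the sketch below uses Python's shutil.copytree (the dirs_exist_ok flag needs Python 3.8 or newer) to copy a template directory into the mounted volume at startup. The function name is hypothetical, and the paths in the usage note are the ones from the shell example above.

```python
import shutil
from pathlib import Path


def seed_templates(source_dir, target_dir):
    """Copy everything under source_dir into target_dir, keeping existing files
    that are not overwritten. Returns the target path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target
```

At container start you would call something like seed_templates("/app/wwwroot", "/var/jenkins-home") before launching the main process.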
You need to start the task with a volume, then mount that volume into the container. This way you have persistent storage across multiple Jenkins start/stop cycles.
Your task definition would look something like the below (I've removed the non-relevant parts). The important components are mountPoints and volumes. Note that this is not the same as volumesFrom, as you aren't mounting volumes from another container, but rather running them in a single task.
This also assumes you're running Jenkins in the default JENKINS_HOME directory as well as having mounted your EFS drive to /mnt/efs/jenkins-home on the EC2 instance.
{
"requiresAttributes": ...
"taskDefinitionArn": ... your ARN ...,
"containerDefinitions": [
{
"portMappings": ...
.... more config here .....
"mountPoints": [
{
"containerPath": "/var/jenkins_home",
"sourceVolume": "jenkins-home"
}
]
}
],
"volumes": [
{
"host": {
"sourcePath": "/mnt/efs/jenkins-home"
},
"name": "jenkins-home"
}
],
"family": "jenkins"
}
Task definition within ECS: