I am trying to get multiple ECS tasks to run on the same EC2 instance. It is a g4dn.xlarge, which has 1 GPU, 4 vCPUs, and 16 GB of memory.
I am using this workaround to allow the GPU to be shared between tasks: https://github.com/aws/containers-roadmap/issues/327
However, when I launch multiple tasks, the second one gets stuck in the PROVISIONING state until the first one finishes.
CloudWatch shows that the CPUUtilization is below 50% for the entire duration of each task.
This is my current CDK:
const taskDefinition = new TaskDefinition(this, 'TaskDefinition', {
  compatibility: Compatibility.EC2
})
const container = taskDefinition.addContainer('Container', {
  image: ContainerImage.fromEcrRepository(<image>),
  entryPoint: ["python", "src/script.py"],
  workingDirectory: "/root/repo",
  startTimeout: Duration.minutes(5),
  stopTimeout: Duration.minutes(60),
  memoryReservationMiB: 8192,
  logging: LogDriver.awsLogs({
    logGroup: logGroup,
    streamPrefix: 'prefix',
  }),
})
const startUpScript = UserData.forLinux()
// Hack for allowing tasks to share the same GPU
// https://github.com/aws/containers-roadmap/issues/327
startUpScript.addCommands(
  `(grep -q ^OPTIONS=\\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')`
)
const launchTemplate = new LaunchTemplate(this, 'LaunchTemplate', {
  machineImage: EcsOptimizedImage.amazonLinux2(
    AmiHardwareType.GPU
  ),
  detailedMonitoring: false,
  instanceType: InstanceType.of(InstanceClass.G4DN, InstanceSize.XLARGE),
  userData: startUpScript,
  role: <launchTemplateRole>,
})
const autoScalingGroup = new AutoScalingGroup(this, 'AutoScalingGroup', {
  vpc: vpc,
  minCapacity: 0,
  maxCapacity: 1,
  desiredCapacity: 0,
  launchTemplate: launchTemplate,
})
const capacityProvider = new AsgCapacityProvider(this, 'AsgCapacityProvider', {
  autoScalingGroup: autoScalingGroup,
})
cluster.addAsgCapacityProvider(capacityProvider)
Edit:
The issue still persists after assigning CPU and memory amounts to the task definition.
Got it working by setting both the task-level and container-level sizes to less than the total available on the instance. So, although the instance has 16 GB of RAM and 4 vCPUs, there must be RAM and CPU left over for the instance itself in order for it to place new tasks. Two tasks with 2 vCPUs and 8 GB of RAM each won't run concurrently, but two tasks with 1 vCPU and 4 GB of RAM each will.
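Roughly, the sizing that worked looks like this in the CDK (a sketch only, assuming the task-level cpu/memoryMiB props and the container-level cpu/memoryReservationMiB options; the numbers are just the ones that worked for me):
const taskDefinition = new TaskDefinition(this, 'TaskDefinition', {
  compatibility: Compatibility.EC2,
  cpu: '1024',       // 1 vCPU at the task level
  memoryMiB: '4096', // 4 GB at the task level
})
const container = taskDefinition.addContainer('Container', {
  image: ContainerImage.fromEcrRepository(<image>),
  cpu: 1024,                  // container-level CPU units
  memoryReservationMiB: 4096, // container-level soft memory limit
  // ...rest of the container options as above
})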
Related
I have a GKE cluster where I create jobs through Django; it runs my C++ code images, and the builds are triggered through GitHub. It was working just fine up until now. However, I recently pushed a new commit to GitHub (a really small change, three or four lines of basic operations) and it built an image as usual. But this time, it said Pod errors: BackoffLimitExceeded, Error with exit code 137 when trying to create the job, and the job does not complete.
I did some digging into the problem, and by running kubectl describe POD_NAME I got this output from a failed pod:
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-nqgnl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m32s default-scheduler Successfully assigned default/xvb8zfzrhhmz-jk9vf to gke-cluster-1-default-pool-ee7e99bb-xzhk
Normal Pulling 7m7s kubelet Pulling image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest"
Normal Pulled 4m1s kubelet Successfully pulled image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest" in 3m6.343917225s
Normal Created 4m1s kubelet Created container jobcontainer
Normal Started 4m kubelet Started container jobcontainer
Warning Evicted 3m29s kubelet The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
Normal Killing 3m29s kubelet Stopping container jobcontainer
Warning ExceededGracePeriod 3m19s kubelet Container runtime did not kill the pod within specified grace period.
The error occurs because of this line:
The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
I do not have a YAML file where I set my pod information; instead, a Django call handles the configuration, which looks like this:
def kube_create_job_object(name, container_image, namespace="default", container_name="jobcontainer", env_vars={}):
    # Body is the object Body
    body = client.V1Job(api_version="batch/v1", kind="Job")
    # Body needs Metadata
    # Attention: Each JOB must have a different name!
    body.metadata = client.V1ObjectMeta(namespace=namespace, name=name)
    # And a Status
    body.status = client.V1JobStatus()
    # Now we start with the Template...
    template = client.V1PodTemplate()
    template.template = client.V1PodTemplateSpec()
    # Passing Arguments in Env:
    env_list = []
    for env_name, env_value in env_vars.items():
        env_list.append(client.V1EnvVar(name=env_name, value=env_value))
    print(env_list)
    security = client.V1SecurityContext(privileged=True, allow_privilege_escalation=True, capabilities=client.V1Capabilities(add=["CAP_SYS_ADMIN"]))
    container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security)
    template.template.spec = client.V1PodSpec(containers=[container], restart_policy='Never')
    body.spec = client.V1JobSpec(backoff_limit=0, ttl_seconds_after_finished=600, template=template.template)
    return body

def kube_create_job(manifest, output_uuid, output_signed_url, webhook_url, valgrind, sleep, isaudioonly):
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform', ])
    credentials.refresh(google.auth.transport.requests.Request())
    cluster_manager = ClusterManagerClient(credentials=credentials)
    cluster = cluster_manager.get_cluster(name=f"path/to/cluster")
    with NamedTemporaryFile(delete=False) as ca_cert:
        ca_cert.write(base64.b64decode(cluster.master_auth.cluster_ca_certificate))
    config = client.Configuration()
    config.host = f'https://{cluster.endpoint}:443'
    config.verify_ssl = True
    config.api_key = {"authorization": "Bearer " + credentials.token}
    config.username = credentials._service_account_email
    config.ssl_ca_cert = ca_cert.name
    client.Configuration.set_default(config)
    # Setup K8 configs
    api_instance = kubernetes.client.BatchV1Api(kubernetes.client.ApiClient(config))
    container_image = get_first_success_build_from_list_builds(client)
    name = id_generator()
    body = kube_create_job_object(name, container_image,
                                  env_vars={
                                      "PROJECT": json.dumps(manifest),
                                      "BUCKET": settings.GS_BUCKET_NAME,
                                  })
    try:
        api_response = api_instance.create_namespaced_job("default", body, pretty=True)
        print(api_response)
    except ApiException as e:
        print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
    return body
What causes this and how can I fix it? Am I supposed to set resource/limit variables to a value, and if so, how can I do that inside my Django job call?
It looks like you are running out of ephemeral storage on the node itself. Since your job spec does not have a request for ephemeral storage, the pod is being scheduled on any node, and in this case it appears that particular node does not have enough storage available.
I'm not a Python expert, but it looks like you should be able to do something like:
storage_size = SOME_VALUE
requests = {'ephemeral-storage': storage_size}
resources = client.V1ResourceRequirements(requests=requests)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)
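Slotted into your kube_create_job_object helper, it would presumably look something like the sketch below (again, I'm not a Python expert, and the "500Mi"/"1Gi" quantities are purely illustrative):
# Illustrative values only: request (and optionally cap) ephemeral storage.
resources = client.V1ResourceRequirements(
    requests={"ephemeral-storage": "500Mi"},  # what the scheduler reserves on the node
    limits={"ephemeral-storage": "1Gi"},      # optional hard cap before eviction
)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)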
I have a problem understanding the interaction between ElastiCache and Elastic Beanstalk.
I have created a Memcached cluster in ElastiCache (cache.t3.medium, 1 node).
Elastic Beanstalk is based on PHP 8.0 running on 64bit Amazon Linux 2/3.3.12.
To get access to Memcached on EC2, I have an .ebextensions config file with:
packages:
  yum:
    libmemcached-devel: []
commands:
  01_install_memcached:
    command: /usr/bin/yes 'no'| /usr/bin/pecl install memcached
    test: '! /usr/bin/pecl info memcached'
  02_rmfromphpini:
    command: /bin/sed -i -e '/extension="memcached.so"/d' /etc/php.ini
  03_createconf:
    command: /bin/echo 'extension="memcached.so"' > /etc/php.d/41-memcached.ini
If I connect to Memcached, I get a connection very quickly.
But if I read some key, it takes 4 seconds to get the result!
I have tested with Symfony\Component\Cache\Adapter\MemcachedAdapter and with native PHP:
$time = microtime( true );
$m = new Memcached();
$m->addServer('<elasticache node endpoint>', 11211);
var_dump($m->get('foo'));
printf('%.5f', microtime( true ) - $time) ;
or
$time = microtime( true );
$memcachedClient = MemcachedAdapter::createConnection('memcached://<elasticache node endpoint>:11211');
$memcachedAdapter = new MemcachedAdapter($memcachedClient, $_ENV['MEMCACHED_NAMESPACE']);
$keyCache = 'utime';
$cacheItem = $memcachedAdapter->getItem($keyCache);
printf('%.5f', microtime( true ) - $time) ;
Any idea why it takes 4 seconds?
The 4.0 s is a timeout, which probably means you cannot reach the Memcached service. Check your security groups.
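One quick way to confirm it is a timeout (a sketch with illustrative values, using the stock php-memcached client options): lower the client-side timeouts and see whether the get() fails fast instead of hanging for ~4 seconds.
$m = new Memcached();
// Both values are in milliseconds and purely illustrative.
$m->setOption(Memcached::OPT_CONNECT_TIMEOUT, 500);
$m->setOption(Memcached::OPT_POLL_TIMEOUT, 500);
$m->addServer('<elasticache node endpoint>', 11211);
var_dump($m->get('foo'));
var_dump($m->getResultMessage()); // should then hint at a connection/timeout failure
If it does fail fast, the node endpoint is unreachable from the Elastic Beanstalk instances, and the security group on the ElastiCache cluster is the first thing to check.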
I am experiencing a strange behavior: when I run role B, it complains about role A's code, which I can run successfully! I have reduced this to the following minimal example:
$ cat playbooka.yml
- hosts:
    - host_a
  roles:
    - role: rolea
      tags:
        - taga
    - role: roleb
      tags:
        - tagb
I have tagged the two roles because I want to selectively run role A or role B; they consist of simple tasks, as shown below in this minimal example:
$ cat roles/rolea/tasks/main.yml
- name: Get service_facts
  service_facts:
- debug:
    msg: '{{ ansible_facts.services["amazon-ssm-agent"]["state"] }}'
- when: ansible_facts.services["amazon-ssm-agent"]["state"] != "running"
  meta: end_play
$ cat roles/roleb/tasks/main.yml
- debug:
    msg: "I am roleb"
The preview confirms that I can run individual roles as specified by tags:
$ ansible-playbook playbooka.yml -t taga -D -C --list-hosts --list-tasks
playbook: playbooka.yml
play #1 (host_a): host_a TAGS: []
pattern: ['host_a']
hosts (1):
3.11.111.4
tasks:
rolea : Get service_facts TAGS: [taga]
debug TAGS: [taga]
$ ansible-playbook playbooka.yml -t tagb -D -C --list-hosts --list-tasks
playbook: playbooka.yml
play #1 (host_a): host_a TAGS: []
pattern: ['host_a']
hosts (1):
3.11.111.4
tasks:
debug TAGS: [tagb]
I can run role A OK:
$ ansible-playbook playbooka.yml -t taga -D -C
PLAY [host_a] *************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************
ok: [3.11.111.4]
TASK [rolea : Get service_facts] ******************************************************************************************************************************************************************************************************************
ok: [3.11.111.4]
TASK [rolea : debug] ******************************************************************************************************************************************************************************************************************************
ok: [3.11.111.4] => {
"msg": "running"
}
PLAY RECAP ****************************************************************************************************************************************************************************************************************************************
3.11.111.4 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
But when I run role B, it complains about the code in role A which I just ran successfully!
$ ansible-playbook playbooka.yml -t tagb -D -C
PLAY [host_a] *************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************
ok: [3.11.111.4]
ERROR! The conditional check 'ansible_facts.services["amazon-ssm-agent"]["state"] != "running"' failed. The error was: error while evaluating conditional (ansible_facts.services["amazon-ssm-agent"]["state"] != "running"): 'dict object' has no attribute 'services'
The error appears to be in '<path>/roles/rolea/tasks/main.yml': line 9, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- when: ansible_facts.services["amazon-ssm-agent"]["state"] != "running"
^ here
We could be wrong, but this one looks like it might be an issue with
unbalanced quotes. If starting a value with a quote, make sure the
line ends with the same set of quotes. For instance this arbitrary
example:
foo: "bad" "wolf"
Could be written as:
foo: '"bad" "wolf"'
I have two questions:
Why should role A's code be involved at all?
Even if it gets involved, ansible_facts has services, and the service is "running", as shown above by running role A.
PS: I am using the latest Ansible 2.10.2 and the latest Python 3.9.1 locally on macOS. The remote Python can be either 2.7.12 or 3.5.2 (Ubuntu 16.04). I worked around the problem by testing whether the dictionary has the services key:
ansible_facts.services is not defined or ansible_facts.services["amazon-ssm-agent"]["state"] != "running"
but it still surprises me that role B interprets role A's code, and interprets it incorrectly. Is this a bug that I should report?
From the notes in meta module documentation:
Skipping meta tasks with tags is not supported before Ansible 2.11.
Since you run Ansible 2.10, the when condition for your meta task in rolea is always evaluated, whatever tag you use. When you use -t tagb, ansible_facts.services["amazon-ssm-agent"] does not exist because you skipped service_facts, and you then get the error you reported.
You can either:
upgrade to Ansible 2.11 (might be a little soon as I write this answer, since it is not yet available over pip...)
rewrite your condition so that the meta task skips when the var does not exist, e.g.
when:
  - ansible_facts.services["amazon-ssm-agent"]["state"] is defined
  - ansible_facts.services["amazon-ssm-agent"]["state"] != "running"
The second solution is still good practice IMO in any situation (e.g. sharing your work with someone running an older version, accidentally running against a host without the agent installed...).
One other possibility in your specific case is to move the service_facts task to another role higher in the play order, or to the pre_tasks section of your playbook, and tag it always. In this case the task will always play and the fact will always exist, whatever tag you use.
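A minimal sketch of that last option, reusing the names from your playbook (untested, adjust as needed):
- hosts:
    - host_a
  pre_tasks:
    - name: Get service_facts
      service_facts:
      tags:
        - always
  roles:
    - role: rolea
      tags:
        - taga
    - role: roleb
      tags:
        - tagb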
I have a Terraform script for my AWS EKS cluster with the following pieces:
provider "helm" {
alias = "helm"
debug = true
kubernetes {
host = module.eks.endpoint
cluster_ca_certificate = module.eks.ca_certificate
token = data.aws_eks_cluster_auth.cluster.token
load_config_file = false
}
}
and:
resource "helm_release" "prometheus_operator" {
provider = "helm"
depends_on = [
module.eks.aws_eks_auth
]
chart = "stable/prometheus-operator"
name = "prometheus-operator"
values = [
file("staging/prometheus-operator-values.yaml")
]
wait = false
version = "8.12.12"
}
With this setup it takes ~15 minutes to install the required chart with terraform apply, and sometimes it fails (with helm ls showing a pending-install status). On the other hand, if I use the following command:
helm install prometheus-operator stable/prometheus-operator -f staging/prometheus-operator-values.yaml --version 8.12.12 --debug
the required chart gets installed in ~3 minutes and never fails. What is the reason for this behavior?
EDIT
Here is a log file from a failed installation. It's quite big - 5.6 MB. What bothers me a bit is located at lines 47725 and 56045.
What's more, helm status prometheus-operator gives valid output (as if it were successfully installed); however, there are no pods defined.
EDIT 2
I've also raised an issue.
I have an ENI created, and I need to attach it as a secondary ENI to my EC2 instance dynamically using CloudFormation. As I am using a Red Hat AMI, I have to go ahead and manually configure RHEL, which includes the steps mentioned in the post below.
Manually Configuring secondary Elastic network interface on Red hat ami- 7.5
Can someone please tell me how to automate all of this using CloudFormation? Is there a way to do all of it using user data in a CloudFormation template? Also, I need to make sure that the configuration remains even if I reboot my EC2 instance (currently the configuration gets deleted after a reboot).
Though it's not complete automation, you can do the following to make sure that the ENI comes up after every reboot of your EC2 instance (only for RHEL instances). If anyone has a better suggestion, kindly share.
vi /etc/systemd/system/create.service
Add the below content:
[Unit]
Description=XYZ
After=network.target
[Service]
ExecStart=/usr/local/bin/my.sh
[Install]
WantedBy=multi-user.target
Change permissions and enable the service
chmod a+x /etc/systemd/system/create.service
systemctl enable /etc/systemd/system/create.service
The below shell script does the ENI configuration on RHEL:
vi /usr/local/bin/my.sh
Add the below content:
#!/bin/bash
my_eth1=`curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0e:3f:96:77:bb:f8/local-ipv4s/`
echo "this is the value--" $my_eth1 "hoo"
GATEWAY=`ip route | awk '/default/ { print $3 }'`
printf "NETWORKING=yes\nNOZEROCONF=yes\nGATEWAYDEV=eth0\n" >/etc/sysconfig/network
printf "\nBOOTPROTO=dhcp\nDEVICE=eth1\nONBOOT=yes\nTYPE=Ethernet\nUSERCTL=no\n" >/etc/sysconfig/network-scripts/ifcfg-eth1
ifup eth1
ip route add default via $GATEWAY dev eth1 tab 2
ip rule add from $my_eth1/32 tab 2 priority 600
Start the service
systemctl start create.service
You can check whether the script ran fine with:
journalctl -u create.service -b
I still need to figure out the joining of the secondary ENI from Linux, but this is the Python script I wrote to have the instance find the corresponding ENI and attach it to itself. Basically, the script works by taking a predefined Name tag for both the ENI and the instance, then joining the two together.
Pre-reqs for setting this up are:
IAM role on the instance to allow access to S3 bucket where script is stored
Install pip and the AWS CLI in the user data section
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py
pip install awscli --upgrade
aws configure set default.region YOUR_REGION_HERE
pip install boto3
sleep 180
Note on the sleep 180 command: I have my ENI swap between instances in an Auto Scaling group. This allows an extra 3 minutes for the other instance to shut down and drop the ENI, so the new one can pick it up. It may or may not be necessary for your use case.
AWS CLI command in user data to download the file onto the instance (example below)
aws s3api get-object --bucket YOURBUCKETNAME --key NAMEOFOBJECT.py /home/ec2-user/NAMEOFOBJECT.py
# coding: utf-8
import boto3
import sys
import time

client = boto3.client('ec2')

# Get the ENI ID
eni = client.describe_network_interfaces(
    Filters=[
        {
            'Name': 'tag:Name',
            'Values': ['Put the name of your ENI tag here']
        },
    ]
)
eni_id = eni['NetworkInterfaces'][0]['NetworkInterfaceId']

# Get ENI status
eni_status = eni['NetworkInterfaces'][0]['Status']
print('Current Status: {}\n'.format(eni_status))

# Detach if in use
if eni_status == 'in-use':
    eni_attach_id = eni['NetworkInterfaces'][0]['Attachment']['AttachmentId']
    eni_detach = client.detach_network_interface(
        AttachmentId=eni_attach_id,
        DryRun=False,
        Force=False
    )
    print(eni_detach)

# Wait until ENI is available
print('start\n-----')
while eni_status != 'available':
    print('checking...')
    eni_state = client.describe_network_interfaces(
        Filters=[
            {
                'Name': 'tag:Name',
                'Values': ['Put the name of your ENI tag here']
            },
        ]
    )
    eni_status = eni_state['NetworkInterfaces'][0]['Status']
    print('ENI is currently: ' + eni_status + '\n')
    if eni_status != 'available':
        time.sleep(10)
print('end')

# Get the instance ID
instance = client.describe_instances(
    Filters=[
        {
            'Name': 'tag:Name',
            'Values': ['Put the tag name of your instance here']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['running']
        }
    ]
)
instance_id = instance['Reservations'][0]['Instances'][0]['InstanceId']

# Attach the ENI
response = client.attach_network_interface(
    DeviceIndex=1,
    DryRun=False,
    InstanceId=instance_id,
    NetworkInterfaceId=eni_id
)
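As an aside, the manual polling loop above could probably be replaced with boto3's built-in EC2 waiter; something like this untested sketch:
# Wait until the ENI reports "available" before attaching it.
waiter = client.get_waiter('network_interface_available')
waiter.wait(NetworkInterfaceIds=[eni_id])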