The title has most of the question, but more context is below.
I tried following the directions found here:
https://cloud.google.com/compute/docs/gpus/monitor-gpus
I modified the code a bit, but haven't been able to get it working. Here's the abbreviated cloud config I've been running; it should show the relevant parts:
- path: /etc/scripts/gpumonitor.sh
  permissions: "0644"
  owner: root
  content: |
    #!/bin/bash
    echo "Starting script..."
    sudo mkdir -p /etc/google
    cd /etc/google
    sudo git clone https://github.com/GoogleCloudPlatform/compute-gpu-monitoring.git
    echo "Downloaded Script..."
    echo "Starting up monitoring service..."
    sudo systemctl daemon-reload
    sudo systemctl --no-reload --now enable /etc/google/compute-gpu-monitoring/linux/systemd/google_gpu_monitoring_agent.service
    echo "Finished Script..."
- path: /etc/systemd/system/install-monitoring-gpu.service
  permissions: "0644"
  owner: root
  content: |
    [Unit]
    Description=Install GPU Monitoring
    Requires=install-gpu.service
    After=install-gpu.service
    [Service]
    User=root
    Type=oneshot
    RemainAfterExit=true
    ExecStart=/bin/bash /etc/scripts/gpumonitor.sh
    StandardOutput=journal+console
    StandardError=journal+console
runcmd:
  - systemctl start install-monitoring-gpu.service
Edit:
It turned out to be best to build a Docker container with the monitoring script in it and run that container from my config script, passing the GPU into the container as shown in the following link:
https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus
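For reference, the run command ends up looking roughly like the pattern from that doc, with the GPU devices and NVIDIA libraries mounted into the container (the image name here is just a placeholder, and the exact device paths depend on the GPU setup):

docker run --rm \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  gcr.io/my-project/gpu-monitoring:latest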
Related
I'm using AWS CodeBuild to build a Docker image and push it to ECR. This is the CodeBuild configuration.
Here is the buildspec.yml:
version: 0.2
phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      - aws ecr get-login-password --region my-region | docker login --username AWS --password-stdin my-image-uri
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t pos-drf .
      - docker tag pos-drf:latest my-image-uri/pos-drf:latest
  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker images...
      - docker push my-image-uri/pos-drf:latest
It works up until the build command docker build -t pos-drf .
The error message I get is the following:
[Container] 2022/12/30 15:12:39 Running command docker build -t pos-drf .
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /codebuild/output/src696881611/src/Dockerfile: no such file or directory
[Container] 2022/12/30 15:12:39 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: docker build -t pos-drf .. Reason: exit status 1
I'm quite sure this is not a permission-related issue.
Please let me know if I need to share something else.
UPDATE:
This is the Dockerfile
# base image
FROM python:3.8
# setup environment variable
ENV DockerHOME=/home/app/webapp
# set work directory
RUN mkdir -p $DockerHOME
# where your code lives
WORKDIR $DockerHOME
# set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
# install dependencies
RUN pip install --upgrade pip
# copy whole project to your docker home directory.
COPY . $DockerHOME
RUN apt-get dist-upgrade
# RUN apt-get install mysql-client mysql-server
# run this command to install all dependencies
RUN pip install -r requirements.txt
# port where the Django app runs
EXPOSE 8000
# start server
CMD python manage.py runserver
My mistake was that I had the Dockerfile locally but hadn't pushed it.
CodeBuild worked successfully after pushing the Dockerfile.
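For anyone else who hits the unable to prepare context error, two quick things are worth checking; both are sketches rather than required changes, and the ./docker/Dockerfile path below is just an example of a non-root location:

ls -la                                              # add to pre_build to confirm the Dockerfile is actually in the source root
docker build -f ./docker/Dockerfile -t pos-drf .    # only needed if the Dockerfile lives outside the build context root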
I am trying to install an application through .ebextensions in my Elastic Beanstalk stack. I've followed the doc here for advanced environment customization. This is my config:
files:
  "/tmp/software.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/bin/bash
      wget https://website.net/software/software-LATEST-1.x86_64.rpm
      software-LATEST-1.x86_64.rpm
      sed -i -e '$a\
      *.* ##127.0.0.1:1514;RSYSLOG_FileFormat' /etc/rsyslog.conf
      /sbin/service rsyslog restart
      /sbin/service software start
container_commands:
  01_run:
    command: "/tmp/software.sh"
When applying the config I receive an error that the command "service" is not found, even though I point to the full path of the service command, /sbin/service. I've tried a lot of different things, but I always get this error. Running the script manually on the host works without any issue.
The image the stack is using is Amazon Linux release 2 (Karoo).
The specific error message is:
[3744211/3744211]\n\n/tmp/alert_software.sh: line 8: service: command not found\n/tmp/alert_software.sh: line 9: service: command not found. \ncontainer_command 02_run in .ebextensions/alert-software.config failed. For more detail, check /var/log/eb-activity.log using console or EB CLI","returncode":127,"events":
My co-worker tried to install the software a different way and it worked. This is what worked:
install.config >>
01_install_software:
  command: rpm -qa | grep -qw software || yum -y -q install https://website.net/software/software-LATEST-1.x86_64.rpm
02_update_rsyslog:
  command: sed -i -e '$a*.* ##127.0.0.1:1514;RSYSLOG_FileFormat' -e '/*.* ##127.0.0.1:1514;RSYSLOG_FileFormat/d' /etc/rsyslog.conf
03_restart_rsyslog:
  command: service rsyslog restart
services.config >>
services:
  sysvinit:
    rsyslog:
      enabled: "true"
      ensureRunning: "true"
    software:
      enabled: "true"
      ensureRunning: "true"
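Side note: since Amazon Linux 2 is systemd-based, the restart step could presumably also be written with systemctl instead of the sysvinit wrapper. This is just an untested alternative sketch, not something verified in this stack:

03_restart_rsyslog:
  command: systemctl restart rsyslog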
I'm new to Docker and I'm trying to understand the following setup.
I want to debug my docker container to see if it is receiving AWS credentials when running as a task in Fargate. It is suggested that I run the command:
curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
But I'm not sure how to do so.
The setup uses Gitlab CI to build and push the docker container to AWS ECR.
Here is the Dockerfile:
FROM rocker/tidyverse:3.6.3
RUN apt-get update && \
apt-get install -y openjdk-11-jdk && \
apt-get install -y liblzma-dev && \
apt-get install -y libbz2-dev && \
apt-get install -y libnetcdf-dev
COPY ./packrat/packrat.lock /home/project/packrat/
COPY initiate.R /home/project/
COPY hello.Rmd /home/project/
RUN install2.r packrat
RUN which nc-config
RUN Rscript -e 'packrat::restore(project = "/home/project/")'
RUN echo '.libPaths("/home/project/packrat/lib/x86_64-pc-linux-gnu/3.6.3")' >> /usr/local/lib/R/etc/Rprofile.site
WORKDIR /home/project/
CMD Rscript initiate.R
Here is the gitlab-ci.yml file:
image: docker:stable
variables:
  ECR_PATH: XXXXX.dkr.ecr.eu-west-2.amazonaws.com/
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: ""
services:
  - docker:dind
stages:
  - build
  - deploy
before_script:
  - docker info
  - apk add --no-cache curl jq py-pip
  - pip install awscli
  - chmod +x ./build_and_push.sh
build-rmarkdown-task:
  stage: build
  script:
    - export REPO_NAME=edelta/rmarkdown_report
    - export BUILD_DIR=rmarkdown_report
    - export REPOSITORY_URL=$ECR_PATH$REPO_NAME
    - ./build_and_push.sh
  when: manual
Here is the build and push script:
#!/bin/sh
$(aws ecr get-login --no-include-email --region eu-west-2)
docker pull $REPOSITORY_URL || true
docker build --cache-from $REPOSITORY_URL -t $REPOSITORY_URL ./$BUILD_DIR/
docker push $REPOSITORY_URL
I'd like to run this command on my docker container:
curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
How do I run this command on container startup in Fargate?
To run a command inside a Docker container, you need to be inside the container.
Step 1: Find the container ID or container name that you want to debug.
Run docker ps; a list of containers will be displayed, pick one of them.
Step 2: Run the following command:
docker exec -it <containerName/containerId> bash, then press Enter, wait a few seconds, and you will be inside the container in an interactive bash session.
For more info, read https://docs.docker.com/engine/reference/commandline/exec/
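For this particular question, the check can be combined into one line once you can reach the Docker daemon that runs the container, for example when testing the image locally or on an EC2-backed cluster (the container and cluster names are placeholders). On Fargate itself there is no host to exec from, but ECS Exec (aws ecs execute-command) provides an equivalent shell if it is enabled on the task:

docker exec -it my-task-container sh -c 'curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI'
aws ecs execute-command --cluster my-cluster --task <task-id> --container my-task-container --interactive --command "/bin/sh"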
Short answer: just replace the CMD.
CMD ["sh", "-c", "curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI && Rscript initiate.R"]
Long answer: you need to replace the CMD of the Dockerfile, as it currently only runs the Rscript.
You have two options: add an ENTRYPOINT or change the CMD; for the CMD, see above.
Create an entrypoint.sh and run the curl only when you want to debug:
#!/bin/sh
if [ "${IS_DEBUG}" == true ];then
echo "Container running in debug mode"
curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
# uncomment below section if you still want to execute R script.
# exec "$#"
else
exec "$#"
fi
Changes that will be required on the Dockerfile side:
WORKDIR /home/project/
ENV IS_DEBUG=true
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
entrypoint ["/entrypoint.sh"]
CMD Rscript initiate.R
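With that entrypoint in place, debug mode is driven purely by the IS_DEBUG environment variable (defaulted to true by the ENV line above). A hypothetical local sanity check, keeping in mind that the 169.254.170.2 credentials endpoint only answers when the task is actually running on ECS/Fargate (the image name is a placeholder):

docker run --rm -e IS_DEBUG=false my-rmarkdown-image   # skips the curl and just runs Rscript initiate.R
docker run --rm -e IS_DEBUG=true my-rmarkdown-image    # runs only the curl, unless you uncomment exec "$@" in the script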
I am trying to use ECS for deployment with Travis.
At one point everything was working, but now it has stopped.
I am following this tutorial: https://testdriven.io/part-five-ec2-container-service/
There are 2 tasks that keep stopping and starting.
These are the messages I see in the tasks:
STOPPED (CannotStartContainerError: API error (500): oci ru)
STOPPED (Essential container in task exited)
These are the messages I see in the logs:
FATAL: could not write to file "pg_wal/xlogtemp.28": No space left on device
container_linux.go:262: starting container process caused "exec: \"./entrypoint.sh\": permission denied"
Why is ECS stopping and starting so many new tasks? This was not happening before.
This is my docker_deploy.sh from my main microservice, which I am calling via Travis.
#!/bin/sh
if [ -z "$TRAVIS_PULL_REQUEST" ] || [ "$TRAVIS_PULL_REQUEST" == "false" ];
then
if [ "$TRAVIS_BRANCH" == "staging" ];
then
JQ="jq --raw-output --exit-status"
configure_aws_cli() {
aws --version
aws configure set default.region us-east-1
aws configure set default.output json
echo "AWS Configured!"
}
make_task_def() {
task_template=$(cat ecs_taskdefinition.json)
task_def=$(printf "$task_template" $AWS_ACCOUNT_ID $AWS_ACCOUNT_ID)
echo "$task_def"
}
register_definition() {
if revision=$(aws ecs register-task-definition --cli-input-json "$task_def" --family $family | $JQ '.taskDefinition.taskDefinitionArn');
then
echo "Revision: $revision"
else
echo "Failed to register task definition"
return 1
fi
}
deploy_cluster() {
family="testdriven-staging"
cluster="ezasdf-staging"
service="ezasdf-staging"
make_task_def
register_definition
if [[ $(aws ecs update-service --cluster $cluster --service $service --task-definition $revision | $JQ '.service.taskDefinition') != $revision ]];
then
echo "Error updating service."
return 1
fi
}
configure_aws_cli
deploy_cluster
fi
fi
This is my Dockerfile from my users microservice:
FROM python:3.6.2
# install environment dependencies
RUN apt-get update -yqq \
&& apt-get install -yqq --no-install-recommends \
netcat \
&& apt-get -q clean
# set working directory
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
# add requirements (to leverage Docker cache)
ADD ./requirements.txt /usr/src/app/requirements.txt
# install requirements
RUN pip install -r requirements.txt
# add entrypoint.sh
ADD ./entrypoint.sh /usr/src/app/entrypoint.sh
RUN chmod +x /usr/src/app/entrypoint.sh
# add app
ADD . /usr/src/app
# run server
CMD ["./entrypoint.sh"]
entrypoint.sh:
#!/bin/sh
echo "Waiting for postgres..."
while ! nc -z users-db 5432;
do
sleep 0.1
done
echo "PostgreSQL started"
python manage.py recreate_db
python manage.py seed_db
gunicorn -b 0.0.0.0:5000 manage:app
I tried deleting my cluster, deregistering my tasks, and restarting, but ECS still continuously stops and starts new tasks.
When it was working fine, the difference was that instead of CMD ["./entrypoint.sh"] in my Dockerfile, I had:
RUN python manage.py recreate_db
RUN python manage.py seed_db
CMD gunicorn -b 0.0.0.0:5000 manage:app
Travis is passing.
The errors are right there.
You don't have enough space on your host, and execution of the entrypoint.sh file is being denied.
Ensure your host has enough disk space (shell in and run df -h to check, then expand the volume or just bring up a new instance with more space). For entrypoint.sh, ensure that when building your image it is executable (chmod +x) and also readable by the user the container runs as.
Test your containers locally first; the second error should have been caught in development instantly.
I realize this answer isn't 100% relevant to the question asked, but some googling brought me here due to the title and I figure my solution might help someone later down the line.
I also had this issue, but the reason why my containers kept restarting wasn't a lack of space or other resources, it was because I had enabled dynamic host port mapping and forgotten to update my security group as needed. What happened then is that the health checks my load balancer sent to my containers inevitably failed and ECS restarted the containers (whoops).
Dynamic Port Mapping in AWS Documentation:
https://aws.amazon.com/premiumsupport/knowledge-center/dynamic-port-mapping-ecs/
https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_PortMapping.html Contents --> hostPort
tl;dr - Make sure your load balancer can health check ports 32768 - 65535.
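If you prefer to fix that from the CLI rather than the console, something along these lines opens the ephemeral port range to the load balancer's security group (both group IDs here are placeholders):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 32768-65535 \
  --source-group sg-0fedcba9876543210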
If too many tasks are running and they have consumed the space, then you will need to shell in to the host and do the following. Don't use -f on the docker rm, as that would remove the running ECS agent container:
docker rm $(docker ps -aq)
Run docker ps -a, which lists all the stopped containers that have exited; these also consume disk space. Use the command below to remove those zombie containers:
docker rm $(docker ps -a | grep Exited | awk '{print $1}')
Also remove older or unused images; these take up more disk space than containers:
docker rmi -f image_name
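On newer Docker versions there is also a single cleanup command that covers stopped containers, dangling images, unused networks and build cache; use it carefully on an ECS host, since anything stopped other than what you intend to keep will be removed:

docker system prune              # prompts, then removes stopped containers, dangling images, unused networks, build cache
docker system prune -a --volumes # more aggressive: also removes all unused images and volumes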
I translated this tutorial into a Chef recipe, and so far everything apart from starting gunicorn (the right way) seems to be working.
For example, when I shut down the machine after the initial setup and provisioning via vagrant halt and then start it up again with vagrant up, I always get a 502 Bad Gateway error.
Then I have to ssh into the box and run these commands manually:
sudo systemctl daemon-reload
sudo systemctl restart gunicorn
After that everything is working again.
What I don't understand is that when I run sudo systemctl status gunicorn before I reload the daemon and restart gunicorn, it tells me that gunicorn is running.
Here are the contents of my gunicorn.service file, which gets written to /etc/systemd/system/gunicorn.service:
[Unit]
Description=gunicorn daemon
After=network.target
[Service]
User=ubuntu
Group=www-data
WorkingDirectory=/vagrant_data
ExecStart=/vagrant_data/myprojectenv/bin/gunicorn --workers 5 --bind unix:/home/ubuntu/run/myproject.sock myproject.wsgi:application --reload
[Install]
WantedBy=multi-user.target
My project's folder structure is:
/home/ubuntu/myproject$ ls
manage.py  myproject  myprojectenv
/home/ubuntu/run$ ls
myproject.sock
I symlinked the myproject folder to /vagrant_data, which is set up as the vm.synced_folder in my Vagrantfile.
This is all running on an ubuntu/xenial64 Vagrant box.
UPDATE:
Here is my Chef recipe:
include_recipe 'locale'
include_recipe 'apt'
execute 'install requirements' do
  command 'sudo apt-get install -y python3-pip python3-dev libpq-dev postgresql postgresql-contrib nginx'
  not_if ('sudo dpkg -l | grep postgresql')
end
bash 'setup database and user' do
  user 'postgres'
  code <<-EOF
    echo "CREATE DATABASE #{node['dbname']};" | psql
    echo "CREATE USER #{node['dbuser']} WITH PASSWORD '#{node['dbpass']}';" | psql
    echo "ALTER ROLE #{node['dbuser']} SET client_encoding TO 'utf8';" | psql
    echo "ALTER ROLE #{node['dbuser']} SET default_transaction_isolation TO 'read committed';" | psql
    echo "ALTER ROLE #{node['dbuser']} SET timezone TO 'UTC';" | psql
    echo "GRANT ALL PRIVILEGES ON DATABASE #{node['dbname']} TO #{node['dbuser']};" | psql
  EOF
  not_if { `sudo -u postgres psql -tAc \"SELECT * FROM pg_database WHERE datname='#{node['dbname']}'\" | wc -l`.chomp == "1" }
end
execute 'install virtualenv' do
  command 'sudo pip3 install virtualenv'
  not_if ('sudo pip3 list | grep virtualenv')
end
link "/home/ubuntu/#{node['projectDir']}" do
  to '/vagrant_data'
  owner 'ubuntu'
  group 'www-data'
end
directory '/home/ubuntu/run' do
  owner 'ubuntu'
  group 'www-data'
  action :create
end
bash 'configure and install django' do
  code <<-EOF
    cd /home/ubuntu/#{node['projectDir']}
    virtualenv myprojectenv
    source myprojectenv/bin/activate
    pip install django gunicorn psycopg2
    django-admin.py startproject #{node['projectDir']} .
    deactivate
  EOF
  not_if { ::File::exists?("/home/ubuntu/#{node['projectDir']}/#{node['projectDir']}")}
end
###############
# Note : In development set workers to 1 which will reload the code after each request
# in production set it to cores x 2 + 1 ... which would mostly result in 5 workers
##########
template '/etc/systemd/system/gunicorn.service' do
  source 'gunicorn.erb'
  owner 'root'
  group 'root'
end
template '/etc/nginx/sites-available/myproject' do
  source 'test.erb'
  owner 'www-data'
  group 'www-data'
end
execute 'link to sites-enabled' do
  command 'sudo ln -s /etc/nginx/sites-available/myproject /etc/nginx/sites-enabled'
  not_if { ::File.symlink?('/etc/nginx/sites-enabled/myproject')}
end
execute 'remove default host' do
  command 'sudo rm /etc/nginx/sites-enabled/default'
  only_if { ::File.exists?('/etc/nginx/sites-enabled/default') }
end
bash 'enable gunicorn' do
  code <<-EOF
    sudo systemctl daemon-reload
    sudo systemctl start gunicorn
    sudo systemctl enable gunicorn
  EOF
  #not_if { ::File.exists?('/home/ubuntu/run/myproject.sock')}
end
execute 'test nginx' do
  command 'sudo nginx -t'
end
execute 'restart nginx' do
  command 'sudo service nginx restart'
end
Does anyone know what I am doing wrong?
UPDATE:
Still not working, after trying almost everything Google had to offer.
I have now switched to a kind of workaround with the vagrant-triggers plugin and defined the needed commands for gunicorn in the Vagrantfile:
config.trigger.after :up do
  run_remote "sudo systemctl daemon-reload"
  run_remote "sudo systemctl restart gunicorn"
end
That way I don't have to call vagrant up --provision every time I turn on the machine.
But I would still really like to know how to get this thing started the right way.
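In the meantime, the plan is to capture some state the next time the 502 shows up, before running the manual restart, to see what the unit actually did at boot (these are just standard systemd/journal inspection commands, nothing specific to this setup):

sudo systemctl status gunicorn --no-pager
sudo journalctl -u gunicorn --no-pager | tail -n 50
ls -l /vagrant_data /home/ubuntu/run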