How to tune/troubleshoot/optimize Docker block I/O on AWS

I have the following docker containers that I have set up to test my web application:
Jenkins
Apache 1 (serving a laravel app)
Apache 2 (serving a legacy codeigniter app)
MySQL (accessed by both Apache 1 and Apache 2)
Selenium HUB
Selenium Node — ChromeDriver
The Jenkins job runs a Behat command on Apache 1, which in turn connects to the Selenium Hub, which has a ChromeDriver node to actually hit the two apps: Apache 1 and Apache 2.
The whole system is running on an EC2 t2.small instance (1 core, 2GB RAM) with Amazon Linux.
The problem
The issue I am having is that if I run the pipeline multiple times, the first few times it runs just fine (the Behat stage takes about 20s), but on the third and subsequent runs the Behat stage starts slowing down (taking 1m30s) and then failing after 3m or 10m or whenever I lose patience.
If I restart the docker containers, it works again, but only for another 2-4 runs.
Clues
Monitoring docker stats each time I run the Jenkins pipeline, I noticed that the Block I/O, and specifically the 'I' (read) side, was growing exponentially after the first few runs.
For example, looking at docker stats after runs 1 through 4 (screenshots omitted), the Block I/O climbs on every run; after run 4, the Block I/O for the chromedriver container is 21GB and the driver hangs. While I might expect the Block I/O to grow, I wouldn't expect it to grow exponentially as it seems to be doing. It's like something is... exploding.
The same Docker configuration (using docker-compose) runs flawlessly every time on my personal MacBook Pro, where Block I/O does not 'explode', even though I constrain Docker there to only 1 core and 2GB of RAM.
What I've tried
This situation has sent me down the path of learning a lot more about Docker, filesystems and memory management, but I still haven't resolved the issue. Some of the things I have tried:
Memory
I set mem_limit options on all containers and tuned them so that during any given run, the memory would not reach 100%. Memory usage now seems fairly stable, and never 'blows up'.
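For reference, a minimal sketch of how this looks in a docker-compose v2 file (the service names and values here are illustrative, not the exact figures I used):

version: '2'
services:
  chromedriver:
    mem_limit: 512m
  apache1:
    mem_limit: 256m
  mysql:
    mem_limit: 512m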
Storage Driver
The default for AWS Linux Docker is devicemapper in loop-lvm mode. After reading this doc
https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/#configure-docker-with-devicemapper
I switched to the suggested direct-lvm mode.
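For reference, the switch essentially boils down to pointing the daemon at an LVM thin pool. A minimal sketch of /etc/docker/daemon.json (the pool name matches the docker info output below; on Amazon Linux the equivalent flags can also go in /etc/sysconfig/docker-storage):

{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.use_deferred_removal=true"
  ]
}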
docker-compose restart
This does indeed 'reset' the issue, allowing me to get a few more runs in, but it doesn't last. After 2-4 runs, things seize up and the tests start failing.
iotop
Running iotop on the host shows that reads are going through the roof.
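For anyone wanting to reproduce the measurement, the invocation is roughly the following (accumulated I/O, per process, showing only processes actually doing I/O):

sudo iotop -oPa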
My Question...
What is happening that causes the block I/O to grow exponentially? I'm not clear whether it's Docker, Jenkins, Selenium or ChromeDriver that is causing the problem. My first guess is ChromeDriver, although the other containers are also showing signs of 'exploding'.
What is a good approach to tuning a system like this with multiple moving parts?
Additional Info
My chromedriver container has the following environment set in docker-compose:
- SE_OPTS=-maxSession 6 -browser browserName=chrome,maxInstances=3
docker info:
$ docker info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-thinpool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 4.862 GB
Data Space Total: 20.4 GB
Data Space Available: 15.53 GB
Metadata Space Used: 2.54 MB
Metadata Space Total: 213.9 MB
Metadata Space Available: 211.4 MB
Thin Pool Minimum Free Space: 2.039 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay null host bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.51-40.60.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.03
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.956 GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/

Related

volume sizing attached to EC2

Inside an EC2 instance I have Docker, with a container that I can't lose, and I noticed that I was out of space on the attached volume dedicated to Docker.
So I increased it by another 15GB and executed the commands according to the AWS documentation, but my container is not getting this new size.
sudo growpart /dev/nvme1n1 1
sudo pvresize /dev/nvme1n1p1
sudo lvextend -L+15G /dev/docker/docker-pool
The output of lsblk (screenshot omitted) shows that the "dm" devices still have the old sizes. Does anyone know what could be causing this, and how to solve it?
Considerations:
--storage-opt dm.basesize=75GB is already defined in /etc/sysconfig/docker-storage
docker system info
Server Version: 19.03.13-ce
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 75.16GB
Backing Filesystem: ext4
Udev Sync Supported: true
Data Space Used: 67.24GB
Data Space Total: 80.38GB
Data Space Available: 13.14GB
Metadata Space Used: 11.81MB
Metadata Space Total: 67.11MB
Metadata Space Available: 55.3MB
Thin Pool Minimum Free Space: 8.038GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
To solve the problem, in addition to enlarging the disk and performing the manual resizing process mentioned in the question, I saved the container to a new image with docker commit, removed the container, and then brought it up again from the previously saved image. This way the new container got all the space available on the instance's disk.
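A sketch of those steps (the container and image names are illustrative; any ports, volumes and environment variables from the original docker run command need to be repeated):

docker commit my-container my-container-backup
docker stop my-container && docker rm my-container
docker run -d --name my-container my-container-backup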

Implicit Process creation when pushing a Spring boot application

I am pushing a minimalistic Spring Boot web application on Cloud Foundry. My manifest looks like
---
applications:
- name: training-app
  path: target/spring-boot-initial-0.0.1-SNAPSHOT.jar
  instances: 1
  memory: 1G
  buildpacks:
  - java_buildpack
  env:
    TRAINING_KEY_3: from manifest
When I push the application with the Java Buildpack (https://github.com/cloudfoundry/java-buildpack/releases/tag/v4.45), I see that it creates an additional process of type task, which does not have any running instances though.
name: training-app
requested state: started
isolation segment: trial
routes: ***************************
last uploaded: Thu 20 Jan 21:29:31 IST 2022
stack: cflinuxfs3
buildpacks:
isolation segment: trial
name version detect output buildpack name
java_buildpack v4.45-offline-https://github.com/cloudfoundry/java-buildpack.git#f1b695a0 java java
type: web
sidecars:
instances: 1/1
memory usage: 1024M
start command: JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc)
-Djava.ext.dirs=$PWD/.java-buildpack/container_security_provider:$PWD/.java-buildpack/open_jdk_jre/lib/ext -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" &&
CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=13109 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration:
$CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/. org.springframework.boot.loader.JarLauncher
state since cpu memory disk details
#0 running 2022-01-20T15:59:55Z 0.0% 62.2M of 1G 130M of 1G
type: task
sidecars:
instances: 0/0
memory usage: 1024M
start command: JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc)
-Djava.ext.dirs=$PWD/.java-buildpack/container_security_provider:$PWD/.java-buildpack/open_jdk_jre/lib/ext -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" &&
CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=13109 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration:
$CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/. org.springframework.boot.loader.JarLauncher
There are no running instances of this process.
I understand that it is a Spring Boot web application, and that corresponds to the process of type web; however, I do not know:
Who is creating the process of type task?
What is the purpose of this process?
It would be great if someone is able to help me here.
Regards
AM
Who is creating the process of type task
The buildpack creates both. This is what's been happening for a while, but recent cf cli changes are making this more visible.
What is the purpose of this process ?
I didn't add that into the buildpack so I can't say its purpose with 100% certainty, but I believe it is meant to be used in conjunction with running Java apps as tasks on CF.
See this commit.
When you run a task, there is a --process flag to the cf run-task command which can be used to set a process to use as the command template. I believe the idea is that you'd set it to task so it can use that command to run your task. See here for reference to that flag.
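For example, a minimal sketch with cf CLI v7 or later, using the app name from the question (the task name is illustrative):

cf run-task training-app --process task --name demo-task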

Can ml-engine HP optimization be run locally?

I am attempting to tune HPs for my model using the ml-engine on a local server. In my case the model trains in a single pass, but no HP trials are performed. Is this a configuration issue, or is HP optimization not supported in local mode?
My local command:
gcloud ml-engine local train --package-path $PWD --module-name example.train --configuration example/hpconfig.yaml -- --param1 16 --param2 2
My config file:
trainingInput:
  workerCount: 1
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: param1
      type: INTEGER
      minValue: 4
      maxValue: 128
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: param2
      type: INTEGER
      minValue: 1
      maxValue: 4
      scaleType: UNIT_LINEAR_SCALE
Unfortunately, HP Tuning cannot be run in local mode. I would recommend a workflow like so:
1. Run locally with small data, etc. to ensure everything is working (I recommend using GCS paths).
2. Run a small test on cloud (a single job) to ensure dependencies are correct, data files properly point to GCS instead of local paths, etc.
3. Run an HP Tuning job.
Once 1 and 2 are working, 3 generally will, too.
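For step 3, a hedged sketch of the corresponding cloud submission (the job name, bucket and region are placeholders):

gcloud ml-engine jobs submit training hp_tuning_job_1 \
  --package-path $PWD \
  --module-name example.train \
  --staging-bucket gs://my-bucket \
  --region us-central1 \
  --config example/hpconfig.yaml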
Also, as a side note: Kubeflow supports Katib for running HP tuning jobs from any Kubernetes deployment, including Minikube (for local development).

SCDF on PCF - bits have not been uploaded

I'm running through a simple (and admittedly useless) toy exercise using PCF on Azure, trying to create and run the stream 'time | log'.
I successfully get SCDF started and the stream created, but when I try to deploy the stream, SCDF creates two (cf) apps that won't run; they exist as far as cf apps is concerned:
○ → cf apps
Getting apps in org tess / space tess as admin...
OK
name requested state instances memory disk urls
yascdf-server started 1/1 2G 2G yascdf-server.apps.cf.tess.info
yascdf-server-LE7xs4r-tess-log stopped 0/1 512M 2G yascdf-server-LE7xs4r-tess-log.apps.cf.tess.info
yascdf-server-LE7xs4r-tess-time stopped 0/1 512M 2G yascdf-server-LE7xs4r-tess-time.apps.cf.tess.info
If I try to view the logs for either, nothing ever returns, but the logs in Apps Manager look like this:
2017-08-10T10:24:42.147-04:00 [API/0] [OUT] Created app with guid de8fee78-0902-4df7-a7ae-bba8a7710dca
2017-08-10T10:24:43.314-04:00 [API/0] [OUT] Updated app with guid de8fee78-0902-4df7-a7ae-bba8a7710dca ({"route"=>"97e1d26b-d950-479e-b9df-fe1f3b0c8a74", :verb=>"add", :relation=>"routes", :related_guid=>"97e1d26b-d950-479e-b9df-fe1f3b0c8a74"})
The routes don't work:
404 Not Found: Requested route ('yascdf-server-LE7xs4r-tess-log.apps.cf.tess.info') does not exist.
And trying to (re)start the app, I get:
○ → cf start yascdf-server-LE7xs4r-tess-log
Starting app yascdf-server-LE7xs4r-tess-log in org tess / space tess as admin...
Staging app and tracing logs...
The app package is invalid: bits have not been uploaded
FAILED
Here's the SCDF shell session I ran, if it helps:
server-unknown:>dataflow config server http://yascdf-server.apps.cf.tess.info/
Successfully targeted http://yascdf-server.apps.cf.cfpush.info/
dataflow:>app import --uri http://.../1-0-4-GA-stream-applications-rabbit-maven
Successfully registered applications: [<chop>]
dataflow:>stream create tess --definition "time | log"
Created new stream 'tess'
dataflow:>stream deploy tess
Deployment request has been sent for stream 'tess'
dataflow:>
Does anyone know what's going on here? I'd be grateful for a nudge...
Spring Cloud Data Flow: Server
1.2.3 (using built spring-cloud-dataflow-server-cloudfoundry-1.2.3.BUILD-SNAPSHOT.jar)
Spring Cloud Data Flow: Shell
1.2.3 (using downloaded spring-cloud-dataflow-shell-1.2.3.RELEASE.jar)
Deployment Environment
PCF v1.11.6 (on Azure)
pcf dev v0.26.0 (on mac)
App Starters
http://bit-dot-ly/1-0-4-GA-stream-applications-rabbit-maven
Logs
stream deploy log
It has been identified that the OP was using java-buildpack 4.4 (JBP4). When running SCDF against this version, there is an issue with memory allocation in reactor-netty (used internally by JBP4), which causes the out-of-memory error. The Reactor team is addressing this issue in the upcoming 0.6.5 release, and JBP4 will adapt to it eventually.
With all this said, SCDF is not yet compatible with JBP4. It is recommended to downgrade to JBP 3.19, or the latest release in that line, instead.
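If it helps, one way to pin the stream apps that SCDF deploys to the 3.x buildpack line is via the Cloud Foundry deployer's buildpack setting in the SCDF server's environment. A sketch, assuming the deployer property spring.cloud.deployer.cloudfoundry.stream.buildpack is available in this server version:

env:
  SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_STREAM_BUILDPACK: https://github.com/cloudfoundry/java-buildpack.git#v3.19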

celery eats up memory

I have a t2.medium instance on AWS, where two of my Python applications and their Celery workers run inside separate Docker containers; four containers are running in total.
For no apparent reason, Celery eats up a lot of the instance's memory.
(Screenshot of ps command output omitted.)
I have checked that Django is running with DEBUG set to False. I have configured worker_max_tasks_per_child to 200 and worker_max_memory_per_child to 200MB.
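For reference, a minimal sketch of those settings in a Celery configuration module (the file name is illustrative; note that worker_max_memory_per_child is expressed in kilobytes, so roughly 200MB corresponds to 200000, not 200):

# celeryconfig.py
worker_max_tasks_per_child = 200       # recycle a worker process after it has run 200 tasks
worker_max_memory_per_child = 200000   # ~200 MB; this value is in kilobytes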
I have:
Ubuntu Version: 16.04
Python Version: 3.5
As of now I am not running any tasks, yet it still eats up the instance's memory. Kindly help me debug the problem.
Output of celery report
software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:3.5.2
billiard:3.5.0.2 py-amqp:2.1.4
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.default.Loader
settings -> transport:amqp results:disabled