I have a t2.medium instance on AWS, where two of my Python applications and their Celery workers run inside separate Docker containers; four containers are running in total.
For no apparent reason, Celery eats up a lot of the instance's memory.
(Screenshot of ps command output omitted.)
I have checked that Django is running with DEBUG set to False. I have configured worker_max_tasks_per_child to 200 and worker_max_memory_per_child to 200 MB.
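For reference, a minimal sketch of those settings in a Celery 4.x configuration module; note that worker_max_memory_per_child is specified in kilobytes, so a 200 MB cap is written as 200000 (the broker URL and module name here are assumptions):

# celeryconfig.py -- illustrative settings module
broker_url = 'amqp://guest:guest@localhost:5672//'  # assumed broker

# Recycle each worker child process after 200 tasks to contain slow leaks.
worker_max_tasks_per_child = 200

# Kilobytes, not megabytes: 200000 KB is roughly 200 MB per child.
worker_max_memory_per_child = 200000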
I have:
Ubuntu Version: 16.04
Python Version: 3.5
As of now I am not running any tasks, yet Celery still eats up the instance's memory. Kindly help me debug the problem.
Output of celery report:
software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:3.5.2
billiard:3.5.0.2 py-amqp:2.1.4
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.default.Loader
settings -> transport:amqp results:disabled
Related
I'm running Docker on a t2.micro AWS EC2 instance with Ubuntu.
I'm running several containers. One of my long-running containers (always the same one) has just disappeared for the third time, after running for about 2-5 days. It is simply gone, with no sign of a crash.
The machine has not been restarted (uptime says 15 days).
I do not use the --rm flag: docker run -d --name mycontainer myimage.
There is no exited zombie of this container when running docker ps -a.
There is no log, i.e. docker logs mycontainer does not find any container.
There is no log entry in journalctl -u docker.service within the time frame
where the container disappeared. However, there are some other log entries
regarding another container (let's call it othercontainer) which occur
repeatedly, about every 6 minutes (it comes from a cronjob; I don't know if that's relevant):
could not remove cluster networks: This node is not a swarm manager. Use
"docker swarm init" or "docker swarm join" to connect this node to swarm
and try again
Handler for GET /v1.24/networks/othercontainer_default returned error:
network othercontainer_default not found
Firewalld running: false
Even if there were, say, an out-of-memory issue, or if my application just exited, I would still have an exited Docker container zombie in the ps -a overview, probably with exit status 0 or != 0, right?
I also don't want to use --restart to bring it back automatically; I just want to see the exited container.
Where can I look for more details to trace the issue?
Versions:
OS: Ubuntu 16.04.2 LTS (Kernel: 4.4.0-1013-aws)
Docker: Docker version 17.03.1-ce, build c6d412e
Thanks to a hint to look at dmesg, or maybe the general journalctl output, I think I finally found the issue.
Somehow, one of the cronjobs had been running docker system prune -f at its end, every 5 minutes. This command removes everything unused and non-running, including all stopped containers.
I didn't know about this command before, but this certainly has to be how my exited containers got removed without me ever seeing them.
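For anyone tracing a similar disappearance, a rough sketch of the kind of checks that narrow it down (time windows and cron paths are assumptions; adjust for your machine):

# Check the kernel log for OOM kills that could have taken a container down
dmesg | grep -iE 'oom|killed process'

# Review Docker daemon activity around the time the container vanished
journalctl -u docker.service --since "3 days ago"

# Search cron definitions for anything that prunes Docker state
grep -rn "docker system prune" /etc/cron* /var/spool/cron 2>/dev/null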
I have the following docker containers that I have set up to test my web application:
Jenkins
Apache 1 (serving a laravel app)
Apache 2 (serving a legacy codeigniter app)
MySQL (accessed by both Apache 1 and Apache 2)
Selenium HUB
Selenium Node — ChromeDriver
The Jenkins job runs a Behat command against Apache 1, which in turn connects to the Selenium Hub, whose ChromeDriver node actually hits the two apps: Apache 1 and Apache 2.
The whole system is running on an EC2 t2.small instance (1 core, 2GB RAM) with Amazon Linux.
The problem
The issue I am having is that if I run the pipeline multiple times, the first few runs go just fine (the Behat stage takes about 20s), but on the third and subsequent runs, the Behat stage starts slowing down (taking 1m30s) and then failing after 3m, or 10m, or whenever I lose patience.
If I restart the docker containers, it works again, but only for another 2-4 runs.
Clues
Monitoring docker stats each time I ran the Jenkins pipeline, I noticed that the Block I/O, and specifically the 'I' (input) side, was growing exponentially after the first few runs.
(Screenshots of docker stats after runs 1 through 4 omitted; they show the Block I/O figure climbing with each run.)
The Block I/O for the chromedriver container is 21GB and the driver hangs. While I might expect the Block I/O to grow, I wouldn't expect it to grow exponentially as it seems to be doing. It's like something is... exploding.
The same docker configuration (using docker-compose) runs flawlessly every time on my personal MacBook Pro. Block I/O does not 'explode'. I constrain Docker to only use 1 core and 2GB of RAM.
What I've tried
This situation has sent me down the path of learning a lot more about Docker, filesystems, and memory management, but I still haven't resolved the issue. Some of the things I have tried:
Memory
I set mem_limit options on all containers and tuned them so that during any given run, the memory would not reach 100%. Memory usage now seems fairly stable, and never 'blows up'.
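As a sketch of what that looks like in a version 2 docker-compose.yml (service names and values here are illustrative, not my exact file):

# docker-compose.yml (excerpt) -- illustrative per-container caps
version: '2'
services:
  hub:
    image: selenium/hub
    mem_limit: 256m
  chrome:
    image: selenium/node-chrome
    mem_limit: 512m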
Storage Driver
The default for Docker on Amazon Linux is devicemapper in loop-lvm mode. After reading this doc
https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/#configure-docker-with-devicemapper
I switched to the suggested direct-lvm mode.
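For the record, on Docker 1.12 the switch boils down to pointing the daemon at a pre-created thin pool. Per that guide, /etc/docker/daemon.json ends up looking something like this (the pool device path depends on your volume group and logical volume names):

{
    "storage-driver": "devicemapper",
    "storage-opts": [
        "dm.thinpooldev=/dev/mapper/docker-thinpool",
        "dm.use_deferred_removal=true"
    ]
}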
docker-compose restart
This does indeed 'reset' the issue, allowing me to get a few more runs in, but it doesn't last. After 2-4 runs, things seize up and the tests start failing.
iotop
Running iotop on the host shows that reads are going through the roof.
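For reference, a typical invocation (flags may vary with your iotop version):

# -o: only show processes actually doing I/O, -P: per-process, -a: accumulated totals
sudo iotop -oPa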
My Question...
What is happening that causes the Block I/O to grow exponentially? I'm not clear whether it's Docker, Jenkins, Selenium, or ChromeDriver that is causing the problem. My first guess is ChromeDriver, although the other containers are also showing signs of 'exploding'.
What is a good approach to tuning a system like this with multiple moving parts?
Additional Info
My chromedriver container has the following environment set in docker-compose:
- SE_OPTS=-maxSession 6 -browser browserName=chrome,maxInstances=3
docker info:
$ docker info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-thinpool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 4.862 GB
Data Space Total: 20.4 GB
Data Space Available: 15.53 GB
Metadata Space Used: 2.54 MB
Metadata Space Total: 213.9 MB
Metadata Space Available: 211.4 MB
Thin Pool Minimum Free Space: 2.039 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay null host bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.51-40.60.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.03
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.956 GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
I clone this repo (it's pretty much based on docker docs here) and run docker-compose up. Docker builds the 2 containers and I see the output from db_1 (psql looks to be completely ready), but nothing at all from web_1, no output whatsoever.
I go to my host IP on port 8000 and nothing is running there. I am using Docker Toolbox for Mac. It's pretty much the simplest possible example of using Docker; any idea why I'm not seeing anything from my Django container?
Thanks in advance,
It might be that STDOUT of the web_1 container is mapped to display only WARN and ERROR levels. You say you're using Docker Toolbox for Mac? Have you tried reaching the website via the IP of the Docker Toolbox VM rather than the host IP? I'm not that familiar with Docker Toolbox, since there is a native Mac client (https://docs.docker.com/engine/installation/mac/). Maybe try the Docker Toolbox IP, not the host IP. I would also recommend using the native Docker for Mac, since I had problems with the Toolbox but none with the native client.
Hope I could help.
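For reference, the Toolbox VM's IP can be printed with docker-machine (assuming the default machine name):

# Print the IP of the Docker Toolbox VM (machine name 'default' assumed)
docker-machine ip default
# then browse to http://<that-ip>:8000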
After taking a better look at the documentation, I was able to start your containers.
After the git clone:
cd sane-django-docker
docker-compose up -d
This is the output:
Starting sanedjangodocker_db_1
Starting sanedjangodocker_web_1
[root@localhost sane-django-docker]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cde9e93c1a70 sanedjangodocker_web "python3 manage.py ru" 19 seconds ago Up 1 seconds 0.0.0.0:8000->8000/tcp sanedjangodocker_web_1
73ad8cafe798 postgres:9.4 "/docker-entrypoint.s" 20 seconds ago Up 1 seconds 5432/tcp sanedjangodocker_db_1
When I just performed docker-compose up (running in the foreground), I saw this issue:
LOG: shutting down
LOG: database system is shut down
After taking a better look at the documentation I saw the problem:
Django will complain about the postgres database not existing so we'll
create one:
docker exec sanedjangodocker_db_1 createdb -Upostgres webapp
Now Postgres is fine, but I had to restart the web app so it would find the db.
docker restart sanedjangodocker_web_1
Now I'm able to access it at IP:8000:
It worked!
Congratulations on your first Django-powered page.
I don't know how the Django app really works, but the setup is pretty strange.
My current objective is to have Travis deploy our Django+Docker-Compose project upon successful merge of a pull request to our Git master branch. I have done some work setting up our AWS CodeDeploy, since Travis has built-in support for it. When I got to the AppSpec and actual deployment part, at first I tried to have an AfterInstall script do docker-compose build and then have an ApplicationStart script do docker-compose up. The containers whose images are pulled from the web are our PostgreSQL container (named db, image aidanlister/postgres-hstore, which is the usual postgres image plus the hstore extension), the Redis container (using the redis image), and the Selenium container (image selenium/standalone-firefox). The other two containers, web and worker, which are the Django server and the Celery worker respectively, use the same Dockerfile to build an image. The main command is:
CMD paver docker_run
which uses a pavement.py file:
from paver.easy import task
from paver.easy import sh

@task
def docker_run():
    migrate()
    collectStatic()
    updateRequirements()
    startServer()

@task
def migrate():
    sh('./manage.py makemigrations --noinput')
    sh('./manage.py migrate --noinput')

@task
def collectStatic():
    sh('./manage.py collectstatic --noinput')

# find any updates to existing packages, install any new packages
@task
def updateRequirements():
    sh('pip install --upgrade -r requirements.txt')

@task
def startServer():
    sh('./manage.py runserver 0.0.0.0:8000')
Here is what I (think I) need to make happen each time a pull request is merged:
Have Travis deploy changes using CodeDeploy, based on deploy section in .travis.yml tailored to our CodeDeploy setup
Start our Docker containers on AWS after successful deployment using our docker-compose.yml
How do I get this second step to happen? I'm pretty sure ECS is not what is needed here. My current status is that I can get Docker started with sudo service docker start, but I cannot get docker-compose up to succeed. Though deployments are reported as "successful", this is only because the docker-compose up command is run in the background by the Validate Service section script. In fact, when I try to run docker-compose up manually while ssh'd into the EC2 instance, I get stuck building one of the containers, right before the CMD paver docker_run part of the Dockerfile.
This took a long time to work out, but I finally figured out a way to deploy a Django+Docker-Compose project with CodeDeploy without Docker-Machine or ECS.
One thing that was important was to make an alternate docker-compose.yml that excluded the selenium container--all it did was cause problems and was only useful for local testing. In addition, it was important to choose an instance type that could handle building containers. The reason why containers couldn't be built from our Dockerfile was that the instance simply did not have the memory to complete the build. Instead of a t1.micro instance, an m3.medium is what worked. It is also important to have sufficient disk space--8GB is far too small. To be safe, 256GB would be ideal.
It is important to have an After Install script run service docker start when doing the necessary Docker installation and setup (including installing Docker-Compose). This is to explicitly start running the Docker daemon--without this command, you will get the error Could not connect to Docker daemon. When installing Docker-Compose, it is important to place it in /opt/bin/ so that the binary is used via /opt/bin/docker-compose. There are problems with placing it in /usr/local/bin (I don't exactly remember what problems, but it's related to the particular Linux distribution for the Amazon Linux AMI). The After Install script needs to be run as root (runas: root in the appspec.yml AfterInstall section).
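As a sketch, the After Install script described above could look roughly like this (the Compose version pin and the script path are assumptions, not the exact script):

#!/bin/bash
# scripts/install_docker.sh -- AfterInstall hook, run as root
yum install -y docker
service docker start

# Install Docker-Compose under /opt/bin, not /usr/local/bin
mkdir -p /opt/bin
curl -L "https://github.com/docker/compose/releases/download/1.8.1/docker-compose-$(uname -s)-$(uname -m)" -o /opt/bin/docker-compose
chmod +x /opt/bin/docker-compose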
Additionally, the final phase of deployment, which is starting up the containers with docker-compose up (more specifically /opt/bin/docker-compose -f docker-compose-aws.yml up), needs to be run in the background with stdin and stdout redirected to /dev/null:
/opt/bin/docker-compose -f docker-compose-aws.yml up -d > /dev/null 2> /dev/null < /dev/null &
Otherwise, once the server is started, the deployment will hang because the final script command (in the ApplicationStart section of my appspec.yml in my case) doesn't exit. This will probably result in a deployment failure after the default deployment timeout of 1 hour.
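Putting the hooks together, a minimal sketch of the appspec.yml shape this describes (the file destination and script names are assumptions):

version: 0.0
os: linux
files:
  - source: /
    destination: /home/ec2-user/app
hooks:
  AfterInstall:
    - location: scripts/install_docker.sh
      runas: root
  ApplicationStart:
    - location: scripts/start_containers.sh
      runas: root

Here scripts/start_containers.sh would contain the backgrounded docker-compose command shown above.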
If all goes well, then the site can finally be accessed at the instance's public DNS and port in your browser.
I have installed Ruby and RubyGems and also installed VMC following the documentation on the Cloud Foundry website. I could deploy a simple hello world application successfully. Several commands seem to work fine. However, a few commands just fail, and I have no clue why.
When I run the following command:
vmc instances hellor 3
I get an error: Unknown app '3'
When I just run:
vmc instances hellor
It retrieves the instance fine and displays it without any error. But when I specify a number after that to increase the instances, it just seems to treat that number as an app name and gives me the error. What could be the reason? I could not find anyone else facing this issue on any of the forums. Any help on this will be highly appreciated. I am deploying on cloudfoundry.com.
The behavior of this command depends on the version of vmc you are using. You can see the version of vmc you are running with vmc --version.
With vmc version 0.3.x, the instances command works as you are expecting it to in your question. If you run vmc help with version 0.3.x, you will see this among other output:
instances <appname> <num|delta> Scale the application instances up or down
With vmc version 0.4.x (also known as vmc-ng), the instances command works differently and the scale command is introduced, as Hitesh says. If you run vmc help --all with version 0.4.x, you will see this among other output:
instances APPS... List an app's instances
scale [APP] Update the instances/memory limit for an application
"vmc instances [APP]" is used to list the number of instances you have. To actually scale your application you can do "vmc scale [APP]" as shown below:
hghia@SEA-007~/workgalaxy/hello$ vmc scale hello
Instances> 3
1: 64M
2: 128M
3: 256M
4: 512M
5: 1G
6: 2G
Memory Limit> 64M
Scaling hello... OK
hghia@SEA-007~/workgalaxy/hello$ vmc instances hello
Getting instances for hello... OK
instance #0: running
started: 2012-12-10 03:41:39 PM
instance #1: running
started: 2012-12-10 03:46:56 PM
instance #2: running
started: 2012-12-10 03:46:56 PM
Thanks,
- Hitesh