I'm seeing stalled builds on GCR and I don't know how to debug them. My last three builds have stalled while installing YUM packages fairly early in the build process. The packages are not exotic: libXrender, libfontenc, etc.
I think our timeout is set to 43200s (12 hours). Is it possible the GCR builders are stalling because we consumed all the disk space? Does anyone know what resources a given GCR builder machine has?
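To rule out disk exhaustion, one quick check is a disk-usage report added as an early build step; the 1 GiB threshold below is an arbitrary example value, not a documented GCR limit:

```shell
# Report free space on the build root and flag low-disk conditions early.
# The threshold is an arbitrary example, not a GCR-specific number.
check_disk() {
  local threshold_kb=1048576   # 1 GiB
  local avail_kb
  avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')
  if [ "$avail_kb" -lt "$threshold_kb" ]; then
    echo "low disk: ${avail_kb} KiB available on /"
  else
    echo "disk ok: ${avail_kb} KiB available on /"
  fi
}
check_disk
```

If the stall correlates with "low disk" messages appearing in the build log, disk pressure on the builder is a likely culprit.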
Related
I have an Azure DevOps pipeline that runs a YAML file in my repo. Part of the process runs some PowerShell scripts with the AWS DevOps Toolkit, but installing the AWS CLI on the agent takes a really long time every run. I'm wondering if there's a way to parallelize the process, or to start installing the module while the other agent is building.
I'd really like to close the feedback loop on testing this thing out if possible. Everything else in my pipeline takes under a minute, while this step can sometimes take up to 6 or 7 minutes.
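One low-tech way to shorten the loop is to skip the install when a previous run already did it. A minimal sketch using a marker file (the marker path and messages are hypothetical, and the slow install command itself is elided):

```shell
# Skip the expensive module install when a marker from a previous run exists.
# The marker path is a hypothetical example; the real install command is elided.
install_aws_tools() {
  local marker="${1:-$HOME/.aws-tools-installed}"
  if [ -f "$marker" ]; then
    echo "cached: skipping install"
  else
    # ...run the slow AWS module install here...
    touch "$marker"
    echo "installed: marker written"
  fi
}
```

Note this only helps on self-hosted agents, where the filesystem survives between runs; Microsoft-hosted agents are wiped each run, so there the usual route is the pipeline's built-in caching support.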
Project Detail
I am running the open-source code of a GAN-based research paper named "Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition".
source code: here
The dependencies include:
Python 2.7
TensorFlow 1.4.0
I pulled a Docker image of TensorFlow 1.4.0 with Python 2.7 on my GPU virtual machine (connected over SSH) with this command:
docker pull tensorflow/tensorflow:1.4.0-gpu
I am running
bash rsrgan/run_gan_rnn_placeholder.sh
according to the README of the source code.
Issue Details
Everything is working: the model is training and the loss is decreasing. The only issue is that after some iterations the terminal shows no output; the GPU still shows the PID but no memory is freed, and sometimes GPU utilization drops to 0%. Training on the VM's GPU and on its CPU behaves the same way.
It is not a memory issue, because the model uses 5,400 MB of the 11,000 MB of GPU memory, and the CPU's RAM is also very large.
When I ran 21 iterations on my local computer (1st-gen i5, 4 GB RAM), each iteration took 0.09 hours and all iterations completed. But whenever I run it over SSH inside Docker, the issue happens again and again, on both GPU and CPU.
Keep in mind that the issue happens inside Docker on a computer connected over SSH, and the SSH connection does not disconnect very often.
Exact Numbers
If an iteration takes 1.5 hours, the issue happens after two to three iterations; if a single iteration takes 0.06 hours, the issue happens after exactly 14 of 25 iterations.
Perform operations inside Docker container
The first thing you can try is to build the Docker image and then enter the container, by specifying the -ti flag and /bin/bash as the command in your docker run invocation.
Clone the repository inside the container, and while building the image also copy your training data from the local machine into the image. Run the training there and commit the changes, so that you need not repeat these steps in future runs; after you exit the container, all changes are lost if not committed.
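Putting those pieces together, here is a sketch of the run command. The host data path is hypothetical, and --runtime=nvidia is an assumption based on how nvidia-docker 2 exposed GPUs; adjust both for your setup:

```shell
# Assemble the interactive run command: -ti allocates a terminal, /bin/bash
# drops you into a shell, and the training data is bind-mounted at /data.
IMAGE=tensorflow/tensorflow:1.4.0-gpu
DATA_DIR=/path/to/training_data          # hypothetical host path
RUN_CMD="docker run --runtime=nvidia -ti -v ${DATA_DIR}:/data ${IMAGE} /bin/bash"
echo "$RUN_CMD"   # printed here for review; execute it on the GPU VM
```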
You can find the reference for docker commit here.
$ docker commit <container-id> <image-name:tag>
While training is going on, check the GPU and CPU utilization of the VM and see if everything is working as expected.
Use an Anaconda environment on your VM
Anaconda is a great package manager. You can install Anaconda, create a virtual environment, and run your code in the virtual environment.
$ wget <url_of_anaconda.sh>
$ bash <path_to_sh>
$ source anaconda3/bin/activate   # or anaconda2/bin/activate, depending on the installer
$ conda create -n <env_name> python=2.7
$ conda activate <env_name>
Install all the dependencies via conda (recommended) or pip.
Run your code.
Q1: GAN Training with Tensorflow 1.4 inside Docker Stops without Prompting
Although Docker provides OS-level virtualization, some processes that run with ease on the host system can hit issues inside a container. So to debug the issue, you should go inside the container and perform the steps above.
Q2: Training stops without releasing memory on a VM connected over SSH
Yes, this is an issue I have also faced. The best way to release the memory is to stop the Docker container (docker stop <container-id>). You can find more resource allocation options here.
Also, earlier versions of TensorFlow had issues with allocating and releasing memory properly. You can find some references here and here. These issues have been fixed in recent versions of TensorFlow.
Additionally, check for Nvidia bug reports
Step 1: Install nvidia-utils via the following command. You can find the driver version in the nvidia-smi output (also shown in the question).
$ sudo apt install nvidia-utils-<driver-version>
Step 2: Run the nvidia-bug-report.sh script
$ sudo /usr/bin/nvidia-bug-report.sh
A log file named nvidia-bug-report.log.gz will be generated in your current working directory. You can also find the installer log at /var/log/nvidia-installer.log.
You can find additional information about Nvidia logs at these links:
Nvidia Bug Report Reference 1
Nvidia Bug Report Reference 2
Log GPU load
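A small sampling loop makes that concrete: it appends one utilization line per interval, so you can line the log up with the iteration at which output stopped. The nvidia-smi query fields are standard; the fallback message is just a placeholder for machines without the tool:

```shell
# Append GPU utilization samples to a log file; falls back to a note when
# nvidia-smi is absent (e.g. when testing the script off the GPU VM).
sample_gpu_load() {
  # $1 = number of samples, $2 = interval in seconds, $3 = log file
  local i
  for i in $(seq 1 "$1"); do
    if command -v nvidia-smi >/dev/null 2>&1; then
      nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
                 --format=csv,noheader >> "$3"
    else
      echo "$(date) nvidia-smi not available" >> "$3"
    fi
    sleep "$2"
  done
}
# e.g. sample_gpu_load 60 60 gpu_load.log   # an hour of minute-by-minute samples
```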
Hope this helps.
Beaker is an automation tool: https://beaker-project.org/. Does Beaker support running a job/task without re-installing the operating system on a machine?
In a scheduled job, Beaker will always re-install the machine; there is currently no way to avoid that. (I would like to implement optionally skipping the installation for a recipe, one day.)
If you want to run Beaker tasks on some existing system without re-installing it (maybe because you are testing changes to a task and don't want to wait for Anaconda over and over again), you can use the restraint harness. It has a client mode where you give it a Beaker recipe XML file and it will run it.
Restraint can also fetch task source from git directly, which is particularly handy if you are testing your own patches for a task.
You can grab pre-built restraint packages from the Beaker harness yum repos.
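From memory of the restraint documentation, client mode is invoked roughly like this. Treat the flag syntax, the <recipe-id>=<host>:<port> form of --host, and the port 8081 as assumptions to be confirmed against `restraint --help` on your version:

```shell
# Assemble a restraint client-mode invocation. Flag syntax is a recollection
# of the restraint docs, not a verified interface; paths and host are examples.
RECIPE=/path/to/recipe.xml                  # exported Beaker recipe XML
TARGET="1=test-system.example.com:8081"     # hypothetical <recipe-id>=<host>:<port>
CLIENT_CMD="restraint --host ${TARGET} --job ${RECIPE}"
echo "$CLIENT_CMD"   # run this where the restraint client is installed
```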
I am trying to debug a problem where Sidekiq processes jobs slowly, while on the development machine the jobs start immediately and run fast.
My setup saves the original image on S3; the worker pulls it, processes the styles, and saves them back to S3. The jobs eventually finish, but when a job ends the next job does not start immediately, making processing a lot of images really slow.
This problem happens no matter how many Sidekiq workers I start (tested from 1 to 20).
I run ImageMagick with -limit memory 64 -limit map 128 due to Heroku's limited-memory dynos.
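The same caps can also be applied through ImageMagick's resource environment variables, so every convert process Paperclip spawns inherits them without per-call flags. The MiB suffix below is an assumption about the units intended by the flags above:

```shell
# Environment-variable equivalent of `-limit memory 64 -limit map 128`.
# On Heroku these would be set as config vars, e.g.:
#   heroku config:set MAGICK_MEMORY_LIMIT=64MiB MAGICK_MAP_LIMIT=128MiB
export MAGICK_MEMORY_LIMIT=64MiB
export MAGICK_MAP_LIMIT=128MiB
echo "memory=${MAGICK_MEMORY_LIMIT} map=${MAGICK_MAP_LIMIT}"
```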
The latest Heroku cedar-14 stack has an ImageMagick version that supports parallel processing.
Is there any special configuration I need to take into consideration when dealing with Sidekiq + Heroku + Paperclip/ImageMagick?
Heroku's 1X dynos are quite slow; they don't have the luxury of an SSD or several cores, unlike your laptop. Use -c 3 (Sidekiq's concurrency flag) and expect to wait, or upgrade to a PX dyno.
I am searching for a way to create a job in Hudson that is a copy of another job but deploys the installation to another machine (no problem until here). I need the second build (the development environment) to start only at night (no problem either), and ONLY if the first build (the check environment) completed successfully.
The first build is triggered by SVN check-ins, and due to the volume of check-ins it runs several times during the day. The second build must not start during our office hours, because that would derail our development. At night, the development environment should be reinstalled if the last build of the check environment worked properly.
Does anyone have an idea how to solve this issue? At the moment I have to check the check environment manually and reinstall our development environment by hand if the check environment has no errors.
Thanks in advance.
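One possible approach, sketched under the assumption that a nightly cron job can reach Hudson's remote API: gate the trigger on the last check build's result. The job names and the exact JSON shape of /job/<name>/lastBuild/api/json are illustrative:

```shell
# Decide whether to trigger the nightly deploy from the last check build's
# result. Hudson's remote API returns JSON containing "result":"SUCCESS" for
# a green build; job names and the cron wiring below are hypothetical.
should_deploy() {
  case "$1" in
    *'"result":"SUCCESS"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# A nightly cron entry would do something like:
#   json=$(curl -s "$HUDSON/job/check-env/lastBuild/api/json")
#   should_deploy "$json" && curl -s "$HUDSON/job/deploy-dev/build"
```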