I am attempting to tune hyperparameters (HPs) for my model using ml-engine on a local server. In my case the model trains for a single pass, but no HP trials are performed. Is this a configuration issue, or is HP optimization not supported in local mode?
My local command:
gcloud ml-engine local train --package-path $PWD --module-name example.train --configuration example/hpconfig.yaml -- --param1 16 --param2 2
My config file:
trainingInput:
  workerCount: 1
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: param1
      type: INTEGER
      minValue: 4
      maxValue: 128
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: param2
      type: INTEGER
      minValue: 1
      maxValue: 4
      scaleType: UNIT_LINEAR_SCALE
Unfortunately, HP tuning cannot be run in local mode. I would recommend a workflow like so:
1. Run locally with small data, etc. to ensure everything is working (I recommend using GCS paths).
2. Run a small test on cloud (a single job) to ensure dependencies are correct, data files properly point to GCS instead of local paths, etc.
3. Run an HP tuning job (see the sketch below).
Once 1 and 2 are working, 3 generally will, too.
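Since local mode only does a single run, step 3 has to go through the cloud service. A minimal sketch of submitting the job with the question's config file; the job name, region, and GCS bucket are hypothetical, and when tuning is enabled the service supplies the --param1/--param2 values to each trial as command-line arguments, so they are not passed here:
gcloud ml-engine jobs submit training hp_tuning_job_1 \
  --package-path $PWD \
  --module-name example.train \
  --region us-central1 \
  --job-dir gs://YOUR_BUCKET/hp-tuning \
  --config example/hpconfig.yaml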
Also, as a side note: Kubeflow supports Katib for running HP tuning jobs on any Kubernetes deployment, including Minikube (for local development).
My Flask app deployment via App Engine Flex is timing out. After setting debug=True, I see the following line repeating over and over until it fails. I am not sure, however, what this is and cannot find anything useful in Logs Explorer.
Updating service [default] (this may take several minutes)...working DEBUG: Operation [apps/enhanced-bonito-349015/operations/81b83124-17b1-4d90-abdc-54b3fa28df67] not complete. Waiting to retry.
Could anyone share advice on where to look to resolve this issue?
Here is my app.yaml (I thought this was due to a memory issue..):
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app

runtime_config:
  python_version: 3

resources:
  cpu: 4
  memory_gb: 12
  disk_size_gb: 1000

readiness_check:
  path: "/readines_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
Error logs:
ERROR: (gcloud.app.deploy) Error Response: [4] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-10T23:21:10.941Z47607.vt.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
There are a few possible ways to resolve such deployment errors:
Increase the value of app_start_timeout_sec up to the maximum value, which is 1800 (see the sketch after this list).
Make sure that all the Google Cloud services that Endpoints and ESP require are enabled on your project.
Assuming that the splitHealthChecks feature is enabled, make sure to follow all the steps needed when migrating from the legacy version.
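For the first point, a minimal sketch of the adjusted readiness_check block, keeping the other values from the question's app.yaml:
readiness_check:
  path: "/readines_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 1800  # raised from 300 to the documented maximum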
I am trying to include the Container Analysis API in a Cloud Build pipeline. This is a beta component and on the command line I need to install it first:
gcloud components install beta local-extract
Then I can run the on-demand container analysis (if the container is present locally):
gcloud beta artifacts docker images scan ubuntu:latest
My question is: how can I use a component like beta local-extract within Cloud Build?
I tried to do a first step and install the missing component:
## Update components
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['components', 'install', 'beta', 'local-extract', '-q']
  id: Update component
but as soon as I move to the next step the update is gone (since it is not persisted in the container).
I also tried to install the component and then run the scan in the same step (using & or ;), but it is failing:
## Run vulnerability scan
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['components', 'install', 'beta', 'local-extract', '-q', ';', 'gcloud', 'beta', 'artifacts', 'docker', 'images', 'scan', 'ubuntu:latest', '--location=europe']
  id: Run vulnerability scan
and I get:
Already have image (with digest): gcr.io/cloud-builders/gcloud
ERROR: (gcloud.components.install) unrecognized arguments:
;
gcloud
beta
artifacts
docker
images
scan
ubuntu:latest
--location=europe (did you mean '--project'?)
To search the help text of gcloud commands, run:
gcloud help -- SEARCH_TERMS
So my questions are:
how can I run "gcloud beta artifacts docker images scan ubuntu:latest" within Cloud Build?
bonus: from the previous command, how can I get the "scan" output value that I will need to pass as a parameter to my next step? (I guess it should be something with --format)
You should try the cloud-sdk docker image:
https://github.com/GoogleCloudPlatform/cloud-sdk-docker
The Cloud Build team (implicitly?) recommends it:
https://github.com/GoogleCloudPlatform/cloud-builders/tree/master/gcloud
With the cloud-sdk-docker container you can change the entrypoint to bash and pipe gcloud commands together. Here is an (ugly) example:
https://github.com/GoogleCloudPlatform/functions-framework-cpp/blob/d3a40821ff0c7716bfc5d2ca1037bcce4750f2d6/ci/build-examples.yaml#L419-L432
As to your bonus question: yes, --format=value(the.name.of.the.field) is probably what you want. The trick is to know the name of the field. I usually start with --format=json on my development workstation to figure out the name.
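Putting both pieces together, a minimal sketch of a single step that installs the component, runs the scan, and writes the scan name to a workspace file for a later step; the --format field name and the /workspace file name are assumptions to verify (start with --format=json as noted above):
## Run vulnerability scan using the full Cloud SDK image
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: 'bash'
  args:
    - -c
    - |
      gcloud components install beta local-extract -q
      gcloud beta artifacts docker images scan ubuntu:latest \
        --location=europe \
        --format='value(response.scan)' > /workspace/scan_id.txt
  id: Run vulnerability scan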
The problem comes from Cloud Build. It caches some often-used images, and if you want to use a brand-new feature of the gcloud CLI, the cached image can be too old.
I performed a test tonight: the cached version is 326, while 328 has just been released. So the cached version is 2 weeks old, maybe too old for your feature. It could be worse in your region!
The solution is to explicitly request the latest version:
Go to this URL: gcr.io/cloud-builders/gcloud
Copy the latest version
Paste the full version name into the step of your Cloud Build pipeline
The side effect is a longer build. Indeed, because this latest image isn't cached, it has to be downloaded by Cloud Build.
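For example, a step pinned to an explicit version copied from the registry (the tag below is a placeholder, not a real value):
## Run vulnerability scan with a pinned, up-to-date gcloud image
- name: 'gcr.io/cloud-builders/gcloud:<VERSION_COPIED_FROM_THE_REGISTRY>'
  args: ['beta', 'artifacts', 'docker', 'images', 'scan', 'ubuntu:latest', '--location=europe']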
I have a pretty standard CI pipeline using Cloud Build for my Machine Learning training model, based on containers:
check Python errors using flake8
check syntax and style issues using pylint, pydocstyle, ...
build a base container (CPU/GPU)
build a specialized ML container for my model
check the vulnerabilities of the installed packages
run unit tests
Now, in Machine Learning it is impossible to validate a model without testing it on real data. Normally we add 2 extra checks:
Fix all random seeds and run on test data to see if we get the exact same results
Train the model on a single batch and see if we can overfit and drive the loss to zero
This allows catching issues inside the model code. In my setup, I have my Cloud Build in one GCP project and the data in another GCP project.
Q1: has somebody managed to use the AI Platform training service from Cloud Build to train on data sitting in another GCP project?
Q2: how do I tell Cloud Build to wait until the AI Platform training job has finished and then check its status (succeeded/failed)? It seems that the only option in the documentation is to use --stream-logs, but that seems suboptimal (using that option, I saw some huge delays).
When you submit an AI platform training job, you can specify a service account email to use.
Be sure that the service account has enough authorization in the other project to use data from there.
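For example, a minimal sketch of granting read access on a data bucket in the other project to the job's service account; both names below are hypothetical placeholders:
# Grant the training job's service account read access to the data bucket
# that lives in the other GCP project (names are placeholders).
gsutil iam ch \
  serviceAccount:training-sa@build-project.iam.gserviceaccount.com:objectViewer \
  gs://data-project-training-data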
For your second question, you have 2 solutions:
Use --stream-logs as you mentioned. If you don't want the logs in your Cloud Build output, you can redirect stdout and/or stderr to /dev/null:
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - -c
    - |
      gcloud ai-platform jobs submit training <your params> --stream-logs >/dev/null 2>/dev/null
Or you can create a loop that checks the status:
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - -c
    - |
      JOB_NAME=<UNIQUE JOB NAME>
      gcloud ai-platform jobs submit training $${JOB_NAME} <your params>
      # check the job status every 60 seconds until it reaches SUCCEEDED
      while [ -z "$$(gcloud ai-platform jobs describe $${JOB_NAME} | grep SUCCEEDED)" ]; do sleep 60; done
Here my test is simple, but you can customize the status test to match your requirements.
Don't forget to set the build timeout accordingly.
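For instance, a build-level timeout in the cloudbuild.yaml; the value is a placeholder, so pick one longer than your slowest expected training run:
# Top-level setting in cloudbuild.yaml, alongside the steps
timeout: 7200s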
I am thinking of using Ansible to manage my AWS infrastructure; I have 2 servers with auto scaling.
I will deploy using ansible-playbook -i hosts deploy-plats.yml --limit spring-boot
Here is my deploy-plats.yml:
---
- hosts: bastion:apache:spring-boot
  vars:
    remote_user: ec2-user
  tasks:
    - name: Copies the .jar to the Spring Boot boxes
      copy: dest=~/ src=~/dev/plats/target/plats.jar mode=0777
    - name: Restarts the plats service
      service: name=plats state=restarted enabled=yes
      become: yes
      become_user: root
and I am wondering whether using this approach will give a blue-green deployment, or whether the servers will be restarted at the same time, producing downtime.
By default, Ansible will try to manage all of the machines referenced in a play in parallel. For a rolling update use case, you can define how many hosts Ansible should manage at a single time by using the serial keyword (maybe you are looking for something like this rather than a blue-green deployment):
- name: test play
  hosts: webservers
  serial: 1
ansible-serial-link
Also, your playbook is not a blue-green deployment; I suggest you read about it a
little bit. A blue/green deployment is a software deployment strategy
that relies on two identical production configurations that alternate
between active and inactive. One environment is referred to as blue,
and the duplicate environment is dubbed green. The two environments,
blue and green, can each handle the entire production workload and are
used in an alternating manner rather than as a primary and secondary
space. One environment is live and the other is idle at any given
time. When a new software release is ready, the team deploys this
release to the idle environment, where it is thoroughly tested. Once
the new release has been vetted, the team will make the idle
environment active, typically by adjusting a router configuration to
redirect application traffic. This leaves the alternate environment
idle.
By default, Ansible will run each task on all hosts in parallel. You can set the play-level directive "serial" to force it to run the play on one node at a time. This is described in detail here:
"Delegation, Rolling Updates, and Local Actions"
The serial keyword should solve your problem. Set the value to 1 so that the restart task gets executed in a rolling fashion, as in the sketch below.
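A minimal sketch of the question's playbook with serial added (same hosts and tasks as above):
---
- hosts: bastion:apache:spring-boot
  serial: 1   # run the whole play on one host at a time (rolling update)
  vars:
    remote_user: ec2-user
  tasks:
    - name: Copies the .jar to the Spring Boot boxes
      copy: dest=~/ src=~/dev/plats/target/plats.jar mode=0777
    - name: Restarts the plats service
      service: name=plats state=restarted enabled=yes
      become: yes
      become_user: root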
I have the following docker containers that I have set up to test my web application:
Jenkins
Apache 1 (serving a laravel app)
Apache 2 (serving a legacy codeigniter app)
MySQL (accessed by both Apache 1 and Apache 2)
Selenium HUB
Selenium Node — ChromeDriver
The jenkins job runs a behat command on Apache 1 which in turn connects to Selenium Hub, which has a ChromeDriver node to actually hit the two apps: Apache 1 and Apache 2.
The whole system is running on an EC2 t2.small instance (1 core, 2GB RAM) with AWS linux.
The problem
The issue I am having is that if I run the pipeline multiple times, the first few times it runs just fine (the behat stage takes about 20s), but on the third and consecutive runs, the behat stage starts slowing down (taking 1m30s) and then failing after 3m or 10m or whenever I lose patience.
If I restart the docker containers, it works again, but only for another 2-4 runs.
Clues
Monitoring docker stats each time I run the Jenkins pipeline, I noticed that the Block I/O, and specifically the 'I', was growing exponentially after the first few runs.
(Screenshots of docker stats taken after runs 1 through 4 show the Block I/O growing with each run; images omitted.)
The Block I/O for the chromedriver container is 21GB and the driver hangs. While I might expect the Block I/O to grow, I wouldn't expect it to grow exponentially as it seems to be doing. It's like something is... exploding.
The same docker configuration (using docker-compose) runs flawlessly every time on my personal MacBook Pro. Block I/O does not 'explode'. I constrain Docker to only use 1 core and 2GB of RAM.
What I've tried
This situation has sent me down the path of learning a lot more about docker, filesystems and memory management, but I'm still not resolving the issue. Some of the things I have tried:
Memory
I set mem_limit options on all containers (see the sketch below) and tuned them so that during any given run, memory usage would not reach 100%. Memory usage now seems fairly stable and never 'blows up'.
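A minimal docker-compose sketch of how such a limit can be set, assuming Compose file version 2 syntax; the service name, image, and value are illustrative only:
# Excerpt from docker-compose.yml (version: '2')
chromedriver:
  image: selenium/node-chrome
  mem_limit: 512m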
Storage Driver
The default for AWS Linux Docker is devicemapper in loop-lvm mode. After reading this doc
https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/#configure-docker-with-devicemapper
I switched to the suggested direct-lvm mode.
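For reference, a sketch of the /etc/docker/daemon.json used for direct-lvm against a pre-created thin pool; the pool name matches the docker info output below, but the exact options should be taken from the linked documentation:
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.use_deferred_removal=true"
  ]
}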
docker-compose restart
This does indeed 'reset' the issue, allowing me to get a few more runs in, but it doesn't last. After 2-4 runs, things seize up and the tests start failing.
iotop
Running iotop on the host shows that reads are going through the roof.
My Question...
What is happening that causes the Block I/O to grow exponentially? I'm not clear whether it's Docker, Jenkins, Selenium or ChromeDriver that is causing the problem. My first guess is ChromeDriver, although the other containers are also showing signs of 'exploding'.
What is a good approach to tuning a system like this with multiple moving parts?
Additional Info
My chromedriver container has the following environment set in docker-compose:
- SE_OPTS=-maxSession 6 -browser browserName=chrome,maxInstances=3
docker info:
$ docker info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-thinpool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 4.862 GB
Data Space Total: 20.4 GB
Data Space Available: 15.53 GB
Metadata Space Used: 2.54 MB
Metadata Space Total: 213.9 MB
Metadata Space Available: 211.4 MB
Thin Pool Minimum Free Space: 2.039 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay null host bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.51-40.60.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.03
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.956 GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/