Submit gcp ai-platform training job with network/subnetwork tags? - google-cloud-ml

I would like to have access to certain databases (not in GCP) in my network when running a training job. How can I specify the subnet or network tags when the training instance is spun up?
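A minimal sketch of what the job submission could look like, assuming the AI Platform Training API's trainingInput accepts a network field for VPC Network Peering (verify against the current docs); the project, bucket, module, and network names below are placeholders:

from googleapiclient import discovery

# Assumption: the peered VPC (plus Cloud VPN/Interconnect for databases outside GCP)
# is identified by its full Compute Engine network name.
ml = discovery.build('ml', 'v1')
job_spec = {
    'jobId': 'my_training_job',
    'trainingInput': {
        'scaleTier': 'BASIC',
        'packageUris': ['gs://my-bucket/trainer-0.1.tar.gz'],  # placeholder package
        'pythonModule': 'trainer.task',                        # placeholder module
        'region': 'us-central1',
        'runtimeVersion': '2.1',
        'pythonVersion': '3.7',
        'network': 'projects/123456789012/global/networks/my-vpc',  # placeholder
    },
}
ml.projects().jobs().create(parent='projects/my-project', body=job_spec).execute()

Per-instance network tags do not appear to be something you can set on the managed training instances, so reaching databases outside GCP would go through the peered network rather than tag-based firewall rules.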

Related

How do I periodically refresh the data in an RDS table automatically?

I have a web application that uses two databases. I want to schedule an SQL query to be executed at set intervals to select data from the first database and put it into a table in the second one, and the web application takes its data from there.
You can do that using a scheduled Lambda function.
From Schedule jobs for Amazon RDS and Aurora PostgreSQL using Lambda and Secrets Manager
For on-premises databases and databases that are hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances, database administrators often use the cron utility to schedule jobs. For example, a job for data extraction or a job for data purging can easily be scheduled using cron. For these jobs, database credentials are typically either hard-coded or stored in a properties file. However, when you migrate to Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL, you lose the ability to log in to the host instance to schedule cron jobs. This pattern describes how to use AWS Lambda and AWS Secrets Manager to schedule jobs for Amazon RDS and Aurora PostgreSQL databases after migration.
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/schedule-jobs-for-amazon-rds-and-aurora-postgresql-using-lambda-and-secrets-manager.html
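A minimal sketch of such a scheduled Lambda, assuming PostgreSQL, credentials stored as JSON in two Secrets Manager secrets, and an EventBridge (CloudWatch Events) rule as the schedule trigger; the secret IDs, table, and column names are placeholders:

import json
import boto3
import psycopg2  # packaged with the Lambda as a layer or in the deployment package

secrets = boto3.client('secretsmanager')

def _connect(secret_id):
    # Assumes the secret stores host/port/dbname/username/password as JSON.
    cfg = json.loads(secrets.get_secret_value(SecretId=secret_id)['SecretString'])
    return psycopg2.connect(host=cfg['host'], port=cfg.get('port', 5432),
                            dbname=cfg['dbname'], user=cfg['username'],
                            password=cfg['password'])

def handler(event, context):
    src = _connect('source-db-secret')   # placeholder secret names
    dst = _connect('target-db-secret')
    try:
        with src.cursor() as read_cur, dst.cursor() as write_cur:
            read_cur.execute('SELECT id, value FROM source_table')
            rows = read_cur.fetchall()
            write_cur.execute('TRUNCATE target_table')  # one possible refresh strategy
            write_cur.executemany(
                'INSERT INTO target_table (id, value) VALUES (%s, %s)', rows)
        dst.commit()
    finally:
        src.close()
        dst.close()

The function needs network access to both databases (for example, run it in a VPC that can reach them) and an execution role that is allowed to read the two secrets.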

AWS Sagemaker custom training job container emit loss metric

I have created a custom Docker container using an Amazon TensorFlow container as a starting point:
763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04
Inside the container I run a custom Keras (with TF backend) training job via the SAGEMAKER_PROGRAM entry point. I can access the training data fine (from an EFS mount) and can generate output into /opt/ml/model that gets synced back to S3. So input and output are good; what I am missing is real-time monitoring.
A SageMaker training job emits system metrics like CPU and GPU load, which you can conveniently view in real time on the SageMaker training job console. But I cannot find a way to emit metrics about the progress of the training job, i.e. loss, accuracy, etc., from my Python code.
Ideally I would like to use TensorBoard, but as SageMaker doesn't expose the instance on the EC2 console, I cannot see how to find the IP address of the instance to connect to for TensorBoard.
So the fallback is to try to emit relevant metrics from the training code so that we can monitor the job as it runs.
The basic question is: how do I monitor key metrics in real time for my custom training job running in a container as a SageMaker training job?
- Is a TensorBoard solution possible? If so, how?
- If not, how do I emit metrics from my Python code and have them show up in the training job console or directly as CloudWatch metrics?
BTW: so far I have not been able to get sufficient credentials inside the training job container to access either S3 or CloudWatch.
If you're using custom images for training, you can specify a name and a regular expression for each metric you want to track during training.
byo_estimator = Estimator(image_name=image_name,
                          role='SageMakerRole',
                          train_instance_count=1,
                          train_instance_type='ml.c4.xlarge',
                          sagemaker_session=sagemaker_session,
                          metric_definitions=[{'Name': 'test:msd', 'Regex': '#quality_metric: host=\S+, test msd <loss>=(\S+)'},
                                              {'Name': 'test:ssd', 'Regex': '#quality_metric: host=\S+, test ssd <loss>=(\S+)'}])
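For SageMaker to pick those up, the training code has to print matching lines to stdout/stderr, which the job streams to CloudWatch Logs. A sketch of a Keras callback that emits the format matched by the example regexes above (adapt the metric names and the logs keys to your own model):

import socket
import tensorflow as tf

class QualityMetricLogger(tf.keras.callbacks.Callback):
    # Prints one line per tracked metric at the end of each epoch, in the
    # format the metric_definitions regexes above expect.
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        host = socket.gethostname()
        print('#quality_metric: host=%s, test msd <loss>=%s' % (host, logs.get('loss')))
        print('#quality_metric: host=%s, test ssd <loss>=%s' % (host, logs.get('val_loss')))

# usage inside the training script:
# model.fit(..., callbacks=[QualityMetricLogger()])

SageMaker itself parses the log stream against the regexes, so no extra S3 or CloudWatch credentials are needed inside the container; the tracked metrics then appear on the training job page and as CloudWatch metrics.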

How to login/ssh into Dataflow VM workers in GCP?

I have a Dataflow job. How do I define metadata/tags for it so that I can log in via SSH into the Dataflow workers in GCP?
If you go to the Compute Engine section in your GCP console, you will be able to find the list of running VMs.
The workers for a Dataflow job run under the same instance group, and next to each instance you can see a button with a set of options to SSH into it.
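If you prefer the command line, something along these lines should also work; the worker names are assumed to start with the job name, and the names and zone below are placeholders:

gcloud compute instances list --filter="name~'my-dataflow-job'"
gcloud compute ssh my-dataflow-job-01234567-abcd-harness-xyz --zone=us-central1-a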

Prometheus metrics scraping of docker tasks in ECS

I have multiple clusters in ECS; each cluster has multiple services, and each service runs more than one task. Each task exposes /metrics with different values, on a random port. I'd like to do some kind of dynamic discovery and scrape those metrics (each task has a different port and a different IP, because they run on multiple container instances), group together the tasks' metrics from the same service, and scrape them using Prometheus. How should I do that?
We had the same challenge, and there were two approaches:
Tag the EC2 instance based on the running tasks, then find EC2 instances in Prometheus based on tags. This worked well when we had one task per instance because the metrics port is known. There are possibly ways to extend this and support multiple tasks.
Run a task per EC2 instance that is used as the exporter for all the tasks running on that instance. It interrogates ECS, finds the tasks and the listening port per task, and scrapes all of them. In Prometheus, you can then find all EC2 instances in the cluster and scrape this exporter on each one. Obviously you will need to label the metrics based on the task they were read from.
If I had to do it again, I would consider using Consul to register the tasks and discover them in Prometheus. If you are already using Consul, this direction could be a good one to try.
Hope this helps.
If you are not willing to go for a proper service discovery mechanism like Consul or AWS native service discovery (see https://aws.amazon.com/blogs/aws/amazon-ecs-service-discovery/), you can leverage Prometheus file-based service discovery and a service that queries the AWS API, retrieves all required information, and prepares the target files for Prometheus. One example of such a tool can be found here: https://pypi.org/project/prometheus-ecs-discoverer/ (created by me, based on another similar project).
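As an illustration of that file-based approach (a rough sketch, not the tool linked above), the following queries the ECS and EC2 APIs with boto3 and writes a Prometheus file_sd target file; the cluster name, container name, and output path are placeholders, and pagination and error handling are omitted:

import json
import boto3

ecs = boto3.client('ecs')
ec2 = boto3.client('ec2')

def write_targets(cluster, metrics_container, output_path='/etc/prometheus/ecs_targets.json'):
    targets = []
    task_arns = ecs.list_tasks(cluster=cluster, desiredStatus='RUNNING')['taskArns']
    if not task_arns:
        return
    for task in ecs.describe_tasks(cluster=cluster, tasks=task_arns)['tasks']:
        # Resolve the private IP of the EC2 instance the task landed on.
        ci = ecs.describe_container_instances(
            cluster=cluster,
            containerInstances=[task['containerInstanceArn']])['containerInstances'][0]
        reservations = ec2.describe_instances(InstanceIds=[ci['ec2InstanceId']])['Reservations']
        ip = reservations[0]['Instances'][0]['PrivateIpAddress']
        for container in task['containers']:
            if container['name'] != metrics_container:
                continue
            # With bridge networking, the dynamically assigned host port is in networkBindings.
            for binding in container.get('networkBindings', []):
                targets.append({
                    'targets': ['%s:%s' % (ip, binding['hostPort'])],
                    'labels': {'ecs_service': task.get('group', '')},
                })
    with open(output_path, 'w') as fh:
        json.dump(targets, fh)

Point a file_sd_configs entry in prometheus.yml at the generated file and run a script like this on a schedule; the prometheus-ecs-discoverer package linked above does essentially the same thing with more care.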

Autoscaling a running Hadoop cluster setup on AWS EC2

My goal is to understand how I can auto-scale a Hadoop cluster on AWS EC2.
I am exploring AWS offerings from an elastic-scaling perspective, both for Hadoop as a service (EMR) and Hadoop on EC2.
For EMR, I gathered that performance metrics can be monitored using CloudWatch and the user can be alerted once they reach a set threshold; thereafter the cluster can be scaled up or down depending on its utilization.
This approach would require some custom implementation to automate the steps (correct me if I am missing anything here; a sketch of such automation follows below).
For Hadoop on EC2, I came across the Auto Scaling option, which can add or remove instances according to configured scaling policies.
But I am not clear how a newly added node would get bootstrapped into the cluster automatically. How would YARN know that it can spawn a new container on this newly added node?
Does auto scaling work for a master-slave kind of setup as well, or is it limited to web applications?
There is also 'Qubole', which offers services to manage Hadoop on AWS... should that be used for automatically managing and scaling the cluster?
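To make the EMR/CloudWatch route concrete: the custom implementation can be as small as a Lambda function triggered by a CloudWatch alarm on one of EMR's YARN metrics, resizing the cluster's TASK instance group. A hedged boto3 sketch, with the alarm wiring left out and the cluster ID as a placeholder:

import boto3

emr = boto3.client('emr')

def scale_task_group(cluster_id, delta):
    # Resize the TASK instance group of an EMR cluster by `delta` nodes.
    # EMR bootstraps new nodes and registers them with YARN itself, which is
    # what makes this simpler than rolling your own Hadoop on EC2.
    groups = emr.list_instance_groups(ClusterId=cluster_id)['InstanceGroups']
    task_group = next(g for g in groups if g['InstanceGroupType'] == 'TASK')
    new_count = max(0, task_group['RunningInstanceCount'] + delta)
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{'InstanceGroupId': task_group['Id'],
                         'InstanceCount': new_count}])

# e.g. scale_task_group('j-XXXXXXXXXXXXX', 2) from the alarm-triggered Lambda

For plain Hadoop on EC2 you would additionally have to handle the bootstrapping yourself (configure the new NodeManager and join it to YARN from the instance's user data or a prepared AMI), which is where EMR or a managed offering like Qubole saves effort.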