Control a GPU machine to start and stop from one function? - google-cloud-platform

Thanks to Google Cloud we get free credits for running GPUs in the cloud, but we are getting stuck at the very beginning.
We receive images daily for processing through a machine learning model, but the GPU systems are not being used throughout the day. Is there any way we can make these systems start and stop once all the images are processed, from one function that we can call through cron at a specific day and time?
I have heard about AWS Lambda, but I am not sure what Google Cloud provides for this problem.
Thanks in advance.

If you are willing to spend the effort, you can achieve this using Google Kubernetes Engine. As far as I know, this is currently the only way to have self-starting and self-stopping GPU instances on GCP. To achieve this, you add a GPU node pool with autoscaling to your Kubernetes cluster.
gcloud container node-pools create gpu-pool \
  --cluster=${GKE_CLUSTER_NAME} \
  --machine-type=n1-highmem-96 \
  --accelerator=type=nvidia-tesla-v100,count=8 \
  --node-taints=reserved-pool=true:NoSchedule \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=4 \
  --zone=${GCP_ZONE} \
  --project=${PROJECT_ID}
Make sure to substitute the environment variables with your actual project ID etc., and make sure to use a GCP zone that actually offers the GPU types you want (not all zones have all GPU types). Also specify the zone like europe-west1-b, not europe-west1.
This command will start all the nodes at once, but they will be shut down automatically after whatever the default timeout for autoscaling nodes is in your cluster configuration (for me I think it was about 5 minutes). However, you can change that setting.
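GKE does not expose the scale-down timeout itself directly; one knob it does expose, assuming your GKE version supports autoscaling profiles, is the cluster-wide autoscaling profile. A sketch:
# Scale idle nodes down more aggressively (the default profile is "balanced")
gcloud container clusters update ${GKE_CLUSTER_NAME} \
  --autoscaling-profile=optimize-utilization \
  --zone=${GCP_ZONE} \
  --project=${PROJECT_ID}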
You can then start a Kubernetes Job (NOT a Deployment), which explicitly requests GPU resources, from the CLI or using any of the available Kubernetes API client libraries.
Here is an example job.yaml with the main necessary components; you will need to tweak it according to your cluster config:
apiVersion: batch/v1
kind: Job
metadata:
  name: some-job
spec:
  parallelism: 1
  template:
    metadata:
      name: some-job
      labels:
        app: some-app
    spec:
      containers:
      - name: some-image
        image: gcr.io/<project-id>/some-image:latest
        resources:
          limits:
            cpu: 3500m
            nvidia.com/gpu: 1
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - some-app
            topologyKey: "kubernetes.io/hostname"
      tolerations:
      - key: reserved-pool
        operator: Equal
        value: "true"
        effect: NoSchedule
      - key: nvidia.com/gpu
        operator: Equal
        value: "present"
        effect: NoSchedule
      restartPolicy: OnFailure
It is vital that the tolerations are set up like this and that the resource limit is set to the number of GPUs you want; otherwise it won't work.
The nodes will then be started (if none are available) and the job will be computed. Idle nodes will once again be shut down after the specified autoscaling timeout.
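Submitting and cleaning up the job is then plain kubectl; a minimal sketch (job and file names match the example above):
# Submit the job; the autoscaler brings up a GPU node if none is free
kubectl apply -f job.yaml
# Watch the pod get scheduled once the node is ready
kubectl get pods --watch
# Delete the finished job; idle nodes scale down after the timeout
kubectl delete job some-job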
I got the idea from here.

You can use Cloud Scheduler for this use case, or you can trigger a Cloud Function when images are available and process them.
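As a rough sketch of the scheduler route, assuming a hypothetical GPU VM named gpu-worker and a service account with permission to start and stop instances, a Cloud Scheduler job can call the Compute Engine REST API directly:
# Hypothetical instance and service-account names; a second job pointing
# at the .../stop URL handles the shutdown
gcloud scheduler jobs create http start-gpu-worker \
  --schedule="0 2 * * *" \
  --uri="https://compute.googleapis.com/compute/v1/projects/${PROJECT_ID}/zones/europe-west1-b/instances/gpu-worker/start" \
  --http-method=POST \
  --oauth-service-account-email=${SA_EMAIL}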
However, note that the free $300 credit is for training and innovation purposes, not for actual production applications.

You can try to optimize the GPU usage of the instances by following the guide over here; however, you would need to manage it through a cron job or something similar on the instance.
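For instance, a crontab entry (instance name and zone are placeholders) could stop the GPU VM outside working hours:
# Stop the GPU instance every day at 18:00 (crontab fields: m h dom mon dow)
0 18 * * * gcloud compute instances stop gpu-worker --zone=europe-west1-b --quiet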
Also, watch out for your credit usage when using GPUs on the free trial. The free trial gives you only $300 USD in credits, and as seen over here, GPU usage is expensive, so you may spend all your credits in one or two weeks if you are not careful.
Hope you find this useful!

Related

Is 'No Workload identity for a node level' or 'failure to load CA secret' stopping service mesh from working?

This is the first time I have tried to install managed Anthos into one of the clusters in GKE. I admit I do not fully understand the full installation process, or the troubleshooting I have already done.
It looks like a managed service has failed to install.
When I run:
kubectl describe controlplanerevision asm-managed -n istio-system
I get this status:
Status:
  Conditions:
    Last Transition Time:  2022-03-15T14:16:21Z
    Message:               The provisioning process has not completed successfully
    Reason:                NotProvisioned
    Status:                False
    Type:                  Reconciled
    Last Transition Time:  2022-03-15T14:16:21Z
    Message:               Provisioning has finished
    Reason:                ProvisioningFinished
    Status:                True
    Type:                  ProvisioningFinished
    Last Transition Time:  2022-03-15T14:16:21Z
    Message:               Workload identity is not enabled at node level
    Reason:                PreconditionFailed
    Status:                True
    Type:                  Stalled
Events: <none>
However, I have Workload Identity enabled at the cluster level, and I cannot see any option in the GCP Console to set it at just the node level.
I am not sure if this is related to istiod-asm-1125-0 logging some errors. One of them is about a failure to load the CA secret.
Nevertheless, the service mesh does not show as added or connected in Anthos Dashboard. The cluster is registered with Anthos.
I created a new node pool with more CPU and more nodes, as I was getting warnings about not having enough CPU. The Istio service mesh increases the need for CPU.
I migrated my deployment from the old node pool to the new one.
I ran istioctl analyze -A and found a few warnings about istio-injection not being enabled in a few namespaces. I fixed that.
I re-ran the asmcli install command without a CA:
./asmcli install --project_id my-app --cluster_name my-cluster --cluster_location europe-west1-b --fleet_id my-app --output_dir anthos-service-mesh --enable_all
All or some of the above did the trick.
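For the "Workload identity is not enabled at node level" condition specifically: node-level Workload Identity corresponds to the node pool's workload metadata setting, which can be changed from the CLI. A sketch using the cluster name and zone from the install command above (the pool name default-pool is an assumption):
# Point the node pool at the GKE metadata server so Workload Identity
# is enabled at node level as well
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --zone=europe-west1-b \
  --workload-metadata=GKE_METADATA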

Managing volume rollbacks in K8s using persistent volumes

I have a Kubernetes deployment, managed by a Helm chart, that I am planning an upgrade of. The app has two persistent volumes attached, which are EBS volumes in AWS. If the deployment goes wrong and needs rolling back, I might also need to roll back the EBS volumes. How would one manage that in K8s? I can easily create the volume manually in AWS from the snapshot I've taken pre-deployment, but for the deployment to use it, would I need to edit the PV YAML file to point to my new volume ID? Or would I need to create a new PV using the volume ID, plus a new PVC, and then edit my deployment to use that claim name?
First you need to define a storage class with reclaimPolicy: Delete
https://kubernetes.io/docs/concepts/storage/storage-classes/
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - debug
volumeBindingMode: Immediate
Then, in your Helm chart, you need to use that storage class. When you delete the Helm release, the persistent volume claim will be deleted, and because reclaimPolicy is Delete for the storage class used, the corresponding persistent volume will also be deleted.
Be careful though: once a PV is deleted, you will not be able to recover that volume's data. There is no "recycle bin".
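On the rollback question itself, one common route is to statically provision a PV that points at the volume restored from the snapshot, plus a matching PVC, and switch the deployment to the new claim name. A sketch under those assumptions (volume ID, size, and names are placeholders):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-pv
spec:
  capacity:
    storage: 20Gi                        # must match the restored volume
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain  # keep the data if the claim goes away
  awsElasticBlockStore:
    volumeID: vol-0abc123de456f7890      # placeholder: volume created from the snapshot
    fsType: ext4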

eksctl: Updating node definitions via cluster config file not working

I am using eksctl to create our EKS cluster.
For the first run it works out fine, but if I want to update the cluster config later, it does not work.
I have a cluster config file, but any changes made to it are not reflected by the update/upgrade command.
What am I missing?
Cluster.yaml:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: supplier-service
  region: eu-central-1
vpc:
  subnets:
    public:
      eu-central-1a: {id: subnet-1}
      eu-central-1b: {id: subnet-2}
      eu-central-1c: {id: subnet-2}
nodeGroups:
  - name: ng-1
    instanceType: t2.medium
    desiredCapacity: 3
    ssh:
      allow: true
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-1', 'sg-2']
    iam:
      withAddonPolicies:
        autoScaler: true
Now if, in the future, I would like to change the instance type or the number of replicas, I have to destroy the entire cluster and recreate it, which becomes quite cumbersome.
How can I do in-place upgrades with clusters created by eksctl? Thank you.
I'm looking into the exact same issue as yours.
After a bunch of searching, I found that it is not yet possible to in-place upgrade an existing node group in EKS.
First, eksctl update has become deprecated. When I executed eksctl update cluster --help, it gave a warning like this:
DEPRECATED: use 'upgrade cluster' instead. Upgrade control plane to the next version.
Second, as mentioned in this GitHub issue and the eksctl documentation, eksctl upgrade nodegroup is so far used only for upgrading the Kubernetes version of a managed node group.
So unfortunately, you'll have to create a new node group to apply your changes, migrate your workload / switch your traffic to the new node group, and decommission the old one. In your case it's not necessary to nuke the entire cluster and recreate it.
If you're seeking a seamless upgrade/migration with minimal or zero downtime, I suggest you try a managed node group, where the graceful draining of workloads seems promising:
Node updates and terminations gracefully drain nodes to ensure that your applications stay available.
Note: in your config file above, if you specify nodeGroups rather than managedNodeGroups, an unmanaged node group will be provisioned; a minimal managed variant is sketched below.
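A minimal managed counterpart to the unmanaged group above might look like this (only a subset of fields shown):
managedNodeGroups:
  - name: ng-1-managed
    instanceType: t2.medium
    desiredCapacity: 3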
However, don't lose hope. An active issue in the eksctl GitHub repository has been lodged to add an eksctl apply option. At this stage it has not been released yet; it would be really nice if this came true.
To upgrade the cluster using eksctl:
Upgrade the control plane version
Upgrade coredns, kube-proxy and aws-node
Upgrade the worker nodes
If you just want to update a nodegroup while keeping the same configuration, you can simply change the nodegroup name, e.g. append -v2 to it (see the sketch after the links below). [0]
If you want to change the node group configuration, such as the instance type, you likewise need to create a new node group: eksctl create nodegroup --config-file=dev-cluster.yaml [1]
[0] https://eksctl.io/usage/cluster-upgrade/#updating-multiple-nodegroups-with-config-file
[1] https://eksctl.io/usage/managing-nodegroups/#creating-a-nodegroup-from-a-config-file
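Per [0], the rename-and-replace flow is roughly the following, assuming the nodegroup in dev-cluster.yaml has been renamed (e.g. ng-1 to ng-1-v2):
# Create the renamed nodegroup(s) declared in the config file
eksctl create nodegroup --config-file=dev-cluster.yaml
# Drain and delete the nodegroups that are no longer in the file
eksctl delete nodegroup --config-file=dev-cluster.yaml --only-missing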

Change ElastiCache node DNS record in CloudFormation template

I need to create a CNAME record for an ElastiCache cluster. However, I built a Redis cluster and there is only one node. As far as I can tell, there is no ConfigurationEndpoint.Address for a Redis cluster. Is there any way to create a DNS name for the node in the cluster, and how would I do it?
Currently the template looks like:
"ElastiCahceDNSRecord" : {
"Type" : "AWS::Route53::RecordSetGroup",
"Properties" : {
"HostedZoneName" : "example.com.",
"Comment" : "Targered to ElastiCache",
"RecordSets" : [{
"Name" : "elche01.example.com.",
"Type" : "CNAME",
"TTL" : "300",
"ResourceRecords" : [
{
"Fn::GetAtt": [ "myelasticache", "ConfigurationEndpoint.Address" ]
}
]
}]
}
}
For folks coming to this page for a solution: there is now a way to get the Redis endpoint directly from within the CFN template.
You can get RedisEndpoint.Address from AWS::ElastiCache::CacheCluster, or PrimaryEndPoint.Address from AWS::ElastiCache::ReplicationGroup.
Per the documentation (http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticache-cache-cluster.html):
RedisEndpoint.Address - The DNS address of the configuration endpoint for the Redis cache cluster.
RedisEndpoint.Port - The port number of the configuration endpoint for the Redis cache cluster.
or
Per the documentation (http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticache-replicationgroup.html):
PrimaryEndPoint.Address - The DNS address of the primary read-write cache node.
PrimaryEndPoint.Port - The number of the port that the primary read-write cache engine is listening on.
An example CFN (other bits not included):
Resources:
  DnsRedis:
    Type: 'AWS::Route53::RecordSetGroup'
    Properties:
      HostedZoneName: 'a.hosted.zone.name.'
      RecordSets:
        - Name: 'a.record.set.name'
          Type: CNAME
          TTL: '300'
          ResourceRecords:
            - !GetAtt
              - RedisCacheCluster
              - RedisEndpoint.Address
    DependsOn: RedisCacheCluster
  RedisCacheCluster:
    Type: 'AWS::ElastiCache::CacheCluster'
    Properties:
      ClusterName: cluster-name-redis
      AutoMinorVersionUpgrade: 'true'
      AZMode: single-az
      CacheNodeType: cache.t2.small
      Engine: redis
      EngineVersion: 3.2.4
      NumCacheNodes: 1
      CacheSubnetGroupName: !Ref ElastiCacheSubnetGroupId
      VpcSecurityGroupIds:
        - !GetAtt
          - elasticacheSecGrp
          - GroupId
Looks like the ConfigurationEndpoint.Address is only supported for Memcached clusters, not for Redis. Please see this relevant discussion in the AWS forums.
Also, the AWS Auto Discovery docs (still) state:
Note
Auto Discovery is only available for cache clusters running the Memcached engine. Redis cache clusters are single node clusters, thus there is no need to identify and track all the nodes in a Redis cluster.
It looks like your 'best' solution is to query the individual endpoint(s) in use, in order to determine the addresses to connect to, using AWS::CloudFormation::Init, as is suggested in the AWS forums thread.
UPDATE
As @slimdrive pointed out below, this IS now possible through AWS::ElastiCache::CacheCluster. Please read further below for more details.
You should be able to use PrimaryEndPoint.Address instead of ConfigurationEndpoint.Address in the template provided to get the DNS address of the primary read-write cache node as documented on the AWS::ElastiCache::ReplicationGroup page.
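In the question's template that is a one-attribute change, assuming myelasticache is an AWS::ElastiCache::ReplicationGroup:
"Fn::GetAtt": [ "myelasticache", "PrimaryEndPoint.Address" ]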
This can be extremely confusing: depending on what you're trying to do, you use either ConfigurationEndPoint or PrimaryEndPoint. I'm adding my findings here, as this was one of the first posts I found when searching. I'll also detail some other issues I had with the ElastiCache Redis engine setup in CloudFormation. I was trying to set up a CloudFormation resource of type AWS::ElastiCache::ReplicationGroup.
Let me preface this with the fact that I had previously set up a clustered Redis ElastiCache instance using a t2.micro node type with no problems. In fact, I received an error from the node-redis npm package saying that clusters weren't supported, so I also implemented the redis-clustr wrapper around it. Anyway, all of that was working fine.
We then moved forward with trying to create a CloudFormation template for this, and I ran into all sorts of limitations that the AWS Console UI must be hiding from people. In chronological order of how I ran into the problems, here were my struggles:
t2.micro instances are not supported with auto-failover, so I set AutomaticFailoverEnabled to false.
Fix: t2.micro instances actually can use auto-failover. Use a parameter group that has cluster mode enabled; the default one for me was default.redis3.2.cluster.on (I used version 3.2.6, as this is the most recent version that supports encryption at rest and in transit). The parameter group cannot be changed after the instance is created, so don't forget this part.
We received an error from the redis-clustr/node-redis package: this instance has cluster support disabled.
(This is how I found that the parameter group needed cluster mode turned on.)
We received an error in the CF template that cluster mode cannot be used if auto-failover is off.
This is what made me try a t2.micro instance again, since I knew I had auto-failover turned on in my other instance, which also used a t2.micro. Sure enough, this combination does work together.
I had stack outputs, and Parameter Store parameters, for the connection URL and port. This failed with: x attribute/property does not exist on the ReplicationGroup.
Fix: it turns out that if cluster mode is disabled (using parameter group default.redis3.2, for example), you must use the PrimaryEndPoint.Address and PrimaryEndPoint.Port values. If cluster mode is enabled, use ConfigurationEndPoint.Address and ConfigurationEndPoint.Port. I had tried RedisEndpoint.Address and RedisEndpoint.Port with no luck, though these may work with a single Redis node with no replica (I also could have had the casing wrong; see the note below).
NOTE
A major issue that affected me is the casing: the P in EndPoint must be capitalized in the PrimaryEndPoint and ConfigurationEndPoint variations if you are creating an AWS::ElastiCache::ReplicationGroup, but the p is lower case if you are creating an AWS::ElastiCache::CacheCluster: RedisEndpoint, ConfigurationEndpoint. I'm not sure why there's a discrepancy, but it may be the cause of some problems.
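In short-form YAML, with placeholder resource names:
# ReplicationGroup attributes use a capital P:
#   !GetAtt MyReplicationGroup.PrimaryEndPoint.Address
# CacheCluster attributes use a lower-case p:
#   !GetAtt MyCacheCluster.RedisEndpoint.Address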
Link to AWS docs for GetAtt, which lists available attributes for different CloudFormation resources

Unreliable discovery for elasticsearch nodes on ec2

I'm using elasticsearch (0.90.x) with the cloud-aws plugin. Sometimes nodes running on different machines aren't able to discover each other ("waited for 30s and no initial state was set by the discovery"). I've set "discovery.ec2.ping_timeout" to "15s", but this doesn't seem to help. Are there other settings that might make a difference?
discovery:
  type: ec2
  ec2:
    ping_timeout: 15s
Not sure if you are aware of this blog post: http://www.elasticsearch.org/tutorials/elasticsearch-on-ec2/. It explains the plugin settings in depth.
Adding the cluster name, like so,
cluster.name: your_cluster_name
discovery:
  type: ec2
  ...
might help.
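Beyond the cluster name, restricting EC2 discovery to a known security group tends to make it more deterministic. A sketch for elasticsearch.yml, assuming the cluster's instances share a security group (the group name is a placeholder; the groups setting is part of the cloud-aws plugin):
cluster.name: your_cluster_name
discovery:
  type: ec2
  ec2:
    ping_timeout: 15s
    groups: my-es-security-group   # only ping instances in this security group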