I tried to downsize the node type of my ElastiCache cluster from cache.r4.large to cache.t2.medium, but the change never goes through.
What happens is that after I apply the modification, the cluster status changes to Modifying, but after a few minutes it goes back to Available and the node type has not changed at all.
By default, the change isn't applied until the next maintenance window. Select the "Apply immediately" option to scale down right away.
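If you're doing this from a script rather than the console, a minimal boto3 sketch looks roughly like this (the cluster id is a placeholder):

    import boto3

    elasticache = boto3.client("elasticache")
    # Change the node type and apply it right away instead of waiting
    # for the next maintenance window.
    elasticache.modify_cache_cluster(
        CacheClusterId="my-redis-cluster",   # placeholder id
        CacheNodeType="cache.t2.medium",
        ApplyImmediately=True,
    )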
I am using Slurm on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that?
When starting out, I made the mistake of not making my script executable, so sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly? Is the proper approach to, e.g., stop the idle node after some time and record that in a log? How can I achieve that?
Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have had no jobs for a period longer than scaledown_idletime (the default is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster if 10 minutes is too long. Just think about your workload first, because you don't want small gaps between jobs to cause a lot of churn while you wait for nodes to die and then get created again shortly after; hence the 10-minute default.
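For reference, the knob lives in the cluster config file. A rough sketch for a ParallelCluster 2.x-style INI config (the section names here are illustrative and may differ in your version):

    [cluster default]
    scaling_settings = custom

    [scaling custom]
    # Minutes a compute node may sit idle before it is terminated
    scaledown_idletime = 5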
I've been lurking for years and the time has finally come to post my first question!
So, my GitLab/Terraform/AWS pipeline pushes containers to Fargate. Once the task definition gets updated, new containers go live and pass health checks. At this point both the old and the new containers are up.
It takes several minutes until the auto-scaler shuts down the old containers. This is a dev environment, so nobody is accessing anything and there are no connections to drain. Other than doing it manually, is there a way to make this faster, or even instant?
Thanks in advance!
There is a way to reduce the time you have to wait for tasks to drain. Go to EC2 -> Target Groups -> (select your target group) -> Description and scroll down. At the bottom is a property called "Deregistration Delay". This is the amount of time the target group allows connections to drain before a target is deregistered (it defaults to 300 seconds, i.e. 5 minutes). Reduce that value and you should be able to deploy much more quickly. Hope this helps!
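If you'd rather not click through the console, the same attribute can be set with boto3; a sketch, with a placeholder target group ARN:

    import boto3

    elbv2 = boto3.client("elbv2")
    # Lower the connection-draining window so old tasks are deregistered sooner.
    elbv2.modify_target_group_attributes(
        TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef",  # placeholder
        Attributes=[
            {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},
        ],
    )

Since your pipeline is Terraform, you could also set deregistration_delay on the aws_lb_target_group resource so the value survives rebuilds.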
I recently started using Spot Block instances because they are guaranteed for a defined number of hours.
When I submit a request with the On-Demand price as the max price and a block duration of a few hours, the request stays stuck in the capacity-not-available status.
I kept trying from around noon until evening, and the request remained in the capacity-not-available status.
Then I requested regular Spot Instances with the same parameters, and the request was fulfilled immediately.
Does anyone know if this behavior is expected? If so, I don't see much value in Spot Block instances.
I am using the us-west-2 region, by the way.
Thanks in advance, everyone, for your advice.
Source: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-bid-status.html
Holding: If one or more request constraints are valid but can't be met yet, or if there is not enough capacity, the request goes into a holding state waiting for the constraints to be met.
The request options affect the likelihood of the request being fulfilled. For example, if you specify a maximum price below the current Spot price, your request stays in a holding state until the Spot price goes below your maximum price. If you specify an Availability Zone group, the request stays in a holding state until the Availability Zone constraint is met.
Maybe you could try another Availability Zone?
You could also check the current prices at https://aws.amazon.com/ec2/spot/pricing/ to see whether your bid is in range.
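To see exactly why a request is sitting in a holding state, you can also pull its status code and message programmatically; a boto3 sketch with a placeholder request id:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    resp = ec2.describe_spot_instance_requests(
        SpotInstanceRequestIds=["sir-0123456789example"]  # placeholder id
    )
    for req in resp["SpotInstanceRequests"]:
        status = req["Status"]
        # Prints the status code (e.g. capacity-not-available) and its explanatory message.
        print(req["SpotInstanceRequestId"], status["Code"], "-", status["Message"])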
I have a scheduled AWS Data Pipeline that failed partway through its execution. I fixed the problem without modifying the Pipeline in any way (changed a script in S3). However, there seems to be no good way to restart the Pipeline from the beginning.
I tried Deactivating/Reactivating the Pipeline, but the previously "FINISHED" nodes were not restarted. This is expected; according to the docs, this only pauses and un-pauses execution of the Pipeline, which is not what we want.
I tried Rerunning one of the nodes (call it x) individually, but it did not respect dependencies: none of the nodes x depends on reran, nor did the nodes that depend on x.
I tried activating it from a time in the past, but received the error: startTimestamp should be later than any Schedule StartDateTime in the pipeline (Service: DataPipeline; Status Code: 400; Error Code: InvalidRequestException; Request ID: <SANITIZED>).
I would rather not change the Schedule node, since I want the Pipeline to continue to respect it; I only need this one manual execution. How can I restart the Pipeline from the beginning, once?
So far, the best way to accomplish this that I've found is to Clone the Pipeline, make it On-Demand (instead of Scheduled) and activate that one. This new Pipeline will activate and run immediately. This seems cumbersome, however; I'd be happy to hear a better way.
The ActivatePipeline API has a startTimestamp parameter that lets you restart execution from a previous time interval. Please see http://docs.aws.amazon.com/datapipeline/latest/APIReference/API_ActivatePipeline.html
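A minimal boto3 sketch of that call (the pipeline id and timestamp are placeholders; note that, per the error you hit, the timestamp must not be earlier than the Schedule's StartDateTime):

    import boto3
    from datetime import datetime, timezone

    dp = boto3.client("datapipeline")
    # Re-activate the pipeline so execution restarts from the given point in time.
    dp.activate_pipeline(
        pipelineId="df-0123456789EXAMPLE",                          # placeholder id
        startTimestamp=datetime(2016, 1, 1, tzinfo=timezone.utc),   # placeholder timestamp
    )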
I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode stays in safe mode until the DataNodes report which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is actually sufficient (but unreported) replication.
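From the master node you can check (and, once you're confident the DataNodes are healthy, clear) safe mode with the standard HDFS admin commands:

    hdfs dfsadmin -safemode get     # report whether the NameNode is in safe mode
    hdfs dfsadmin -safemode leave   # force it out, only if the DataNodes really are fine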
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking or otherwise) and, therefore, the cluster never left safe mode. The bad news is that this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news is that these errors should be quite uncommon.
The likely reason is that you started the master node (which houses the NameNode) before starting the workers. If you shut down all the nodes, start the workers first, and then start the master node, it should work. I suspect the master node, when starting first, checks whether the workers are there; if they are offline, it goes into safe mode. In general this should not happen because of the heartbeat mechanism, but it is what it is, and a restart of the master node will resolve the matter. In my case it was with Spark on Dataproc.
HTH