I have a clustered NFS server with Corosync and Pacemaker.
I installed the environment successfully, then found a problem while testing.
The screenshot was captured after adding the resources.
The nfs1 server is working well, and all resources are monitored by Pacemaker.
The problem occurs after stopping the NFS service.
If I run the command "systemctl stop nfs", the NFS service stops.
Then the cluster fails over to nfs2 automatically (this is OK).
Then I run the command "pcs cluster standby bp-nfs2"; as a result the cluster moves to bp-nfs1, and all resources are started except nfsserver.
Even if I start the NFS service again, Pacemaker's nfsserver resource is still stopped.
I want Pacemaker to start the NFS service when Pacemaker starts.
This is the command I used to create the NFS resource:
pcs resource create nfsserver ocf:heartbeat:nfsserver \
nfs_shared_infodir="/mnt/sharedisk/" \
--group resource-group
If somebody knows about this issue, please let me know.
I'm sorry if my English is hard to understand. Thank you.
I solved it myself.
Reason:
If a resource is stopped because of a failure, a failed action is recorded for that resource.
You can see which resources have failed actions with the "pcs status" command.
Pacemaker will not start or monitor a resource that has a recorded failed action; that is Pacemaker's designed behavior.
Solution:
Clear the failed action manually with the command "pcs resource cleanup [resource name]".
If you want the cleanup to happen automatically, run the command "pcs resource defaults failure-timeout=60s".
When a failure occurs on your resource, the active node fails over to another node.
Pacemaker then resumes monitoring the resources, and the failed action is cleaned up automatically after 60 seconds.
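A minimal sketch of both options, assuming the nfsserver resource created above (the per-resource meta form is an alternative I am adding for illustration, not part of the original answer):

# clear the recorded failed action for one resource by hand
pcs resource cleanup nfsserver

# or let Pacemaker expire failures automatically after 60 seconds,
# either cluster-wide ...
pcs resource defaults failure-timeout=60s
# ... or only for this resource
pcs resource meta nfsserver failure-timeout=60s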
As the title says, I am attempting to install a CloudWatch agent (CW agent) on my on-premises server (OPS).
After running this line of code that I got from the AWS User Guide to start the CW agent:
& $Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a start
I got this error:
****** processing cwagent-otel-collector ******
cwagent-otel-collector will not be started as it has not been configured yet.
****** processing amazon-cloudwatch-agent ******
AmazonCloudWatchAgent has been started
I did not know what this was, so I searched and found that when someone else had this issue, they had not created a config file.
I did create a config file (named config.json by default) using the configuration wizard, and I am still having the issue.
I have tried looking into a number of pages on that user guide, but nothing has resolved the issue.
Thank you in advance for any assistance you can provide.
This message is informational, not an error.
The CloudWatch agent is bundled with the AWS OpenTelemetry collector agent, so they are actually two agents. The CloudWatch agent and the OTel collector have separate configuration files. If you provide a config for one and not the other, only the configured one will be started. This is expected behavior.
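For example, the config written by the wizard still has to be loaded into the agent explicitly; a minimal sketch on Windows, assuming the default install location (the config path is illustrative, and -m should be onPremise for an on-premises server):

& "$Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1" `
    -a fetch-config -m onPremise -s -c file:"C:\path\to\config.json"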
Thank you for taking the time to answer. I have since resolved the issue.
Everything from the command I was using to the path where the file resided was incorrect.
Starting over and going through all the steps again with background information helped.
The first installation combined with learning everything for the first time produced the issue.
To anyone having this issue: when you hit a wall like this, I recommend starting over. I know it is not what anyone wants to do, but in the end it saved me time.
My goal is to execute a benchmark deployed as a Docker image. While doing so, I ran into too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain a couple of times and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ I created a new cluster (networking only)
in Task Definitions, I created a task definition
with the Docker image alpine:latest and the command ping -c 4 google.com
then I selected the cluster, switched to the "Tasks" tab, and opened the run dialog
with one of the pre-created subnets
After executing:
the task appears in the cluster's task list in the PENDING state
it takes a couple of minutes
eventually (using the refresh button), it changes to the message mentioned above - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach the outside network
What could I be doing wrong? How can I fix it?
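(For reference, a minimal sketch of the same run via the AWS CLI; the cluster, task definition, subnet and security group names are placeholders. For a Fargate task in a public subnet without a NAT gateway, assignPublicIp=ENABLED is what allows the image to be pulled at all.)

aws ecs run-task \
    --cluster my-cluster \
    --launch-type FARGATE \
    --task-definition ping-task \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}'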
In my case, too, the log group was the problem. The one I had configured wasn't working, so I enabled the "Auto-configure CloudWatch Logs" option under "Log Configuration" in the container settings.
Also, if you open the stopped task, navigate to the container section and expand it, you can see a detailed error message under the Details section.
It could be a problem with the entry point in the task definition, as pointed out in the comments on the question, e.g. Entrypoint: ["sh","-c"].
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
I just created the log group in my CloudWatch console, because it had not been created, and now everything is working well.
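If you prefer the CLI to the console, a minimal sketch of creating the missing log group (the group name is only an example - it has to match the awslogs-group set in the task definition's LogConfiguration):

aws logs create-log-group --log-group-name /ecs/ping-task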
One Spark job was running for more than 23 days and eventually caused the ResourceManager to crash. After restarting the ResourceManager instances (there are two of them in our cluster), both of them stayed in standby state.
We are getting this error:
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Failed to load/recover state
org.apache.hadoop.yarn.exceptions.YarnException: Application with id application_1470300000724_40101 is already present! Cannot add a duplicate!
We could not kill 'application_1470300000724_40101' from YARN, as the ResourceManager is not working, so we killed all the processes at the Unix level on all nodes, but that didn't work. We have tried rebooting all the nodes, and the result is still the same.
Somewhere an entry for that job is still present and is preventing the ResourceManager from being elected as active. We are using Cloudera 5.3.0, and I can see that this issue has been addressed and resolved in Cloudera 5.3.3, but at this moment we need a workaround to get past it.
To resolve this issue, you can format the RMStateStore by executing the command below:
yarn resourcemanager -format-state-store
But be careful, as this will clear the history of all applications that ran before this command is executed.
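A rough outline of the procedure, assuming you control the daemons via Cloudera Manager or the usual service scripts (only the yarn command itself comes from the original answer):

# 1. stop both ResourceManager instances
# 2. wipe the RMStateStore - this erases the stored history of all previous applications
yarn resourcemanager -format-state-store
# 3. start the ResourceManagers again; one of them should now become active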
I'm trying to create a simple data-flow pipeline with a single activity of the ShellCommandActivity type. I've attached the configuration of the activity and the EC2 resource.
When I execute this, the Ec2Resource sits in the WAITING_ON_DEPENDENCIES state and after some time changes to TIMEDOUT. The ShellCommandActivity is always in the CANCELED state. I see the instance launch and very quickly change to the terminated state.
I've specified an S3 log file URL, but that never gets updated.
Can anyone give me any pointers? Also, is there any guidance out there on debugging this?
Thanks!!
You are currently forcing your instance to shut down after 1 minute, which gives the TIMEDOUT status if it can't execute in that time. Try increasing it to 50 minutes.
Also make sure you are using an AMI that runs Amazon Linux and that you are using full absolute paths in your scripts.
S3 log files are written as:
s3://bucket/folder/
I want to use AWS Auto Scaling to scale down a group of instances when the SQS queue is short.
These instances do some heavy work that sometimes takes 5-10 minutes to complete, and I want this work to be completed before the instance is terminated.
I expect a lot of people have faced the same problem. Is it possible on EC2 to handle the termination request from AWS and complete all my running processes before the instance is actually terminated? What is the best approach to this?
You could also use lifecycle hooks. You would need a way to control a specific worker remotely, because AWS will select a particular instance to put into the Terminating:Wait state and you need to manage that instance. You would want to take the following actions:
instruct the worker process running on the instance not to accept any more work
wait for the worker to finish the work it is already handling
call the complete-lifecycle action
AWS will take care of the rest for you.
PS: if you are using Celery to power your workers, you can remotely ask a worker to shut down gracefully. It won't shut down until it finishes the tasks it has already started executing.
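A minimal sketch of the final step with the AWS CLI (hook name, group name and instance ID are placeholders; run it once the worker has drained):

aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name my-termination-hook \
    --auto-scaling-group-name my-asg \
    --instance-id i-0123456789abcdef0 \
    --lifecycle-action-result CONTINUE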
Assuming you are using Linux, you can create a pre-baked AMI that you use in the launch configuration attached to your Auto Scaling group.
In the AMI you can put a script under /etc/init.d, say /etc/init.d/servicesdown. This script would execute anything that you need to shut down, for example scripts under /usr/share/services.
Here is roughly the gist of it: the servicesdown script (sketched below) would always get executed when doing a graceful shutdown.
Then say on Ubuntu/Debian you would do something like this to add it to your shutdown sequence:
/usr/sbin/update-rc.d servicesdown stop 25 0 1 6 .
On CentOS/RedHat you can use the chkconfig command to add it to the right shutdown runlevel.
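A minimal sketch of what such an init script could look like, using the /usr/share/services location from the example above (adapt the header and commands to whatever your workers actually need):

#!/bin/sh
### BEGIN INIT INFO
# Provides:          servicesdown
# Default-Start:
# Default-Stop:      0 1 6
# Short-Description: run shutdown hooks for worker services
### END INIT INFO

case "$1" in
  stop)
    # run every shutdown hook dropped into /usr/share/services
    for script in /usr/share/services/*; do
      [ -x "$script" ] && "$script"
    done
    ;;
  *)
    # nothing to do on start/restart; this script only handles shutdown
    ;;
esac
exit 0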
I stumbled onto this problem because I didn't want to terminate an instance that was doing work. I thought I'd share my findings here. There are two ways to look at this, though:
I need to terminate a worker, but I only want to terminate one that's not working
I need to terminate a SPECIFIC worker and I want that specific worker to wait until it's done with the work.
If your goal is #1, Amazon's new "Instance Protection" looks like it was designed to resolve this.
See the link below for an example; they give this code snippet:
https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling/
while (true)
{
SetInstanceProtection(False);
Work = GetNextWorkUnit();
SetInstanceProtection(True);
ProcessWorkUnit(Work);
SetInstanceProtection(False);
}
I haven't tested this myself, but I see API calls related to setting the protection, so it appears this could be integrated into the EC2 worker application's code base; then, when scaling in, instances that are protected (currently working) should not be terminated.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/autoscaling/AmazonAutoScaling.html
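A minimal sketch of the same loop as a shell script running on the instance (the group name and the two work functions are placeholders for your own queue handling; assumes the instance role is allowed to call autoscaling:SetInstanceProtection):

#!/bin/sh
# instance ID from the EC2 instance metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
ASG_NAME=my-asg   # placeholder

while true; do
  # unprotected while idle, so scale-in may pick this instance
  aws autoscaling set-instance-protection --instance-ids "$INSTANCE_ID" \
      --auto-scaling-group-name "$ASG_NAME" --no-protected-from-scale-in
  WORK=$(get_next_work_unit)    # placeholder for your own queue polling
  # protected while busy, so scale-in leaves this instance alone
  aws autoscaling set-instance-protection --instance-ids "$INSTANCE_ID" \
      --auto-scaling-group-name "$ASG_NAME" --protected-from-scale-in
  process_work_unit "$WORK"     # placeholder for the actual work
  aws autoscaling set-instance-protection --instance-ids "$INSTANCE_ID" \
      --auto-scaling-group-name "$ASG_NAME" --no-protected-from-scale-in
done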
As far as I know, there is currently no option to terminate an instance with a graceful shutdown that lets the process complete its work.
I suggest you look at http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html.
We implemented it for Resque workers by moving the instance to an unhealthy state and then downsizing the Auto Scaling group. There is a script that constantly checks the health state on each instance. Once the instance is moved to an unhealthy state, it stops all services gracefully and sends a terminate signal to EC2.
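The "mark it unhealthy" step can be done from the instance itself with a single CLI call, roughly like this (the instance ID is a placeholder):

aws autoscaling set-instance-health --instance-id i-0123456789abcdef0 --health-status Unhealthy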
Hope it helps you.