AWS CloudWatch -- Signature Expired

I'm setting up CloudWatch for several of my EC2 instances, using AWS custom CloudWatch metrics.
Everything is going fine, except one of my instances gives me the below error:
ubuntu@my-host:~$ /etc/aws-scripts-mon/mon-put-instance-data.pl --mem-util --disk-space-util --disk-path=/ --aws-credential-file=/etc/aws-scripts-mon/awscreds.template
ERROR: Failed to call CloudWatch: HTTP 400. Message: Signature expired: 20150515T204709Z is now earlier than 20150515T204917Z (20150515T205417Z - 5 min.)
For more information, run 'mon-put-instance-data.pl --help'
I've tried searching around the Internet, including this link, but no luck. Any ideas?
One hint: I have a cron job invoking this same command every 5 minutes, and it fails too. That may be related to the "- 5 min." part of the error message above.

The system time of the problem EC2 instance is off by several minutes. See AWS SDK Error - Signature not yet current
This is likely the solution!
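To see how far off the clock is, the timestamps in the error message can be compared directly. This is only a rough sketch: it assumes the first timestamp is the time the request was signed with (taken from the instance clock) and the one in parentheses is the server's current time, which is how the message reads. The function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# One AWS-style compact timestamp, e.g. 20150515T204709Z
STAMP = r"(\d{8}T\d{6}Z)"
PATTERN = re.compile(
    r"Signature expired: " + STAMP
    + r" is now earlier than " + STAMP
    + r" \(" + STAMP + r" - 5 min\.\)"
)


def parse_stamp(text):
    """Turn 20150515T204709Z into a timezone-aware UTC datetime."""
    return datetime.strptime(text, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def clock_skew_seconds(message):
    """Seconds by which the signing clock lags the server clock."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not a 'Signature expired' message")
    signed, _cutoff, server_now = (parse_stamp(g) for g in match.groups())
    return (server_now - signed).total_seconds()


message = (
    "ERROR: Failed to call CloudWatch: HTTP 400. Message: Signature expired: "
    "20150515T204709Z is now earlier than 20150515T204917Z "
    "(20150515T205417Z - 5 min.)"
)
skew = clock_skew_seconds(message)
print(f"signing clock is about {skew:.0f} s behind the server")
```

For the message in the question this gives a lag of a bit over seven minutes, which is past the 5 minute window the error mentions.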

Related

AWS EKS Returns Error 'certificate has expired or is not yet valid'

When I create new deployments or edit any settings, it returns the following error:
Error creating: Internal error occurred: failed calling webhook
"mpod.elbv2.k8s.aws": Post
"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s":
x509: certificate has expired or is not yet valid: current time
2022-01-28T02:05:13Z is after 2022-01-20T10:00:30Z
How can I fix it?
I think the reason may be that your time and date are not right. As I can see in the log, the current time reported by the server is about 8 days past the certificate's expiry date.
If the server clock is wrong, please sync the time on this server and try again.
You need a new certificate for aws-load-balancer-webhook-service. We have an issuer set up in the cluster, and when we get a similar error in OPA we do a rollout restart for OPA.
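A quick way to read the x509 message is to subtract the two timestamps it quotes. The sketch below only does that arithmetic. It assumes the first timestamp is the clock of the caller doing the TLS verification and the second is the certificate's expiry, and it cannot tell you whether the clock or the certificate is the one at fault.

```python
from datetime import datetime


def parse_utc(text):
    """Parse timestamps in the form used by the error, e.g. 2022-01-28T02:05:13Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


# Both values are copied from the error message in the question.
current_time = parse_utc("2022-01-28T02:05:13Z")
not_after = parse_utc("2022-01-20T10:00:30Z")

# How long the certificate has been past its expiry, by the caller's clock.
expired_for = current_time - not_after
print(f"certificate expired {expired_for} ago")
```

If the current time in the message matches the real date, the clock is fine and the certificate itself needs renewing.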

Error while doing aws iam list-users using AWS_CLI

I am trying to run aws iam list-users in the AWS CLI but got an error. The error is:
An error occurred (SignatureDoesNotMatch) when calling the ListUsers operation: Signature not yet current: 20210606T055848Z is still later than 20210605T174350Z (20210605T172850Z + 15 min.)
If anyone knows the solution, please tell me.
The error is pretty clear: the request is signed for 20210606T055848Z, but it "currently" is 20210605T172850Z. In a different format: 05:58:48 @ 06.06.2021 (signed) vs. 17:28:50 @ 05.06.2021 (current). There is a difference of about twelve and a half hours between the two timestamps.
That means either the local time of your computer / the process creating the request is incorrect or the request is intentionally scheduled for the future and is simply not intended to be submitted yet. Solution: fix your clock, change the code to not sign for the future or submit the request at a later point in time.
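The gap can be checked with a few lines of Python. This is just the arithmetic on the timestamps from the error, with the 15 minute allowance taken from the "+ 15 min." part of the message. The helper name is made up.

```python
from datetime import datetime, timedelta


def parse_stamp(text):
    """Parse the compact timestamp format used in the error message."""
    return datetime.strptime(text, "%Y%m%dT%H%M%SZ")


signed = parse_stamp("20210606T055848Z")      # time the request was signed for
server_now = parse_stamp("20210605T172850Z")  # time the service saw
allowed = timedelta(minutes=15)               # from "+ 15 min." in the message

ahead_by = signed - server_now
print(f"request is signed {ahead_by} ahead of the server")
print("within allowance" if ahead_by <= allowed else "outside allowance")
```

The middle timestamp in the error, 20210605T174350Z, is simply the server time plus that allowance.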

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain a couple of times, and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of pre-created subnets
After executing:
the task appears in the cluster's tasks list in PENDING state
it stays there for a couple of minutes
eventually (using refresh button), it changes to the mentioned message - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach outside net
What could I be doing wrong? How can I fix it?
In my case, too, the log group was the problem. The one I had configured wasn't working, so I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" of the container settings.
Also, if you open the stopped task, navigate to the container section and expand it, you can see a detailed error message under the Details section.
It could be a problem with the entry point in the task definition, as pointed out in the comments on the question: Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
I just created the log group in my CloudWatch console because it had not been created, and now everything is working.
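If you want to see which log groups a task definition refers to before running it, something like the following works on the task definition JSON. It is a sketch only: the task definition below is a made-up stand-in for the ping example, "/ecs/ping-example" is a hypothetical group name, and the set of existing groups has to come from wherever you look them up (the console, for instance).

```python
def awslogs_groups(task_definition):
    """Map container name -> awslogs-group for containers using the awslogs driver."""
    groups = {}
    for container in task_definition.get("containerDefinitions", []):
        log_config = container.get("logConfiguration") or {}
        if log_config.get("logDriver") == "awslogs":
            options = log_config.get("options") or {}
            groups[container["name"]] = options.get("awslogs-group")
    return groups


def missing_groups(task_definition, existing_groups):
    """Containers whose configured log group is not among the existing ones."""
    return {
        name: group
        for name, group in awslogs_groups(task_definition).items()
        if group not in existing_groups
    }


# Hypothetical task definition resembling the ping example from the question.
task_definition = {
    "family": "ping-example",
    "containerDefinitions": [
        {
            "name": "ping",
            "image": "alpine:latest",
            "command": ["ping", "-c", "4", "google.com"],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {"awslogs-group": "/ecs/ping-example"},
            },
        }
    ],
}

# Pretend no log groups exist yet, as in the answers above.
print(missing_groups(task_definition, existing_groups=set()))
```

Any container listed in the result points at a log group that still has to be created (or auto-configured) before the task can start.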

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service setup and I have a Cloud Scheduler task that calls an endpoint on that service. When the task completes (http handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and exits successfully. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out that Node versions prior to 13 have a default server.timeout of 120 seconds. I installed Node 13 in my Docker image and the problem was gone.
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.
I ran an incremental sleep test on my Flask endpoint, which returns 200 after waiting 1 min, 2 min and 10 min. When I triggered the endpoint via Cloud Scheduler, the job failed only in the 10 min test. I found that one of the properties of my Cloud Scheduler job was causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline' which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update
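To make the reasoning explicit: the scheduler marks the attempt as failed once the handler runs longer than the attempt deadline, even if the handler itself later returns 200. A tiny sketch of that comparison, using the durations from my test and assuming the deadline is given in the same "180s" style shown by the describe command (the helper names are made up):

```python
def parse_seconds(value):
    """Convert a duration like '180s' into a number of seconds."""
    if not value.endswith("s"):
        raise ValueError(f"expected a value like '180s', got {value!r}")
    return float(value[:-1])


def exceeds_deadline(handler_seconds, attempt_deadline):
    """True when the handler would still be running at the attempt deadline."""
    return handler_seconds > parse_seconds(attempt_deadline)


# Sleep durations from the test above: 1 min, 2 min and 10 min.
for duration in (60, 120, 600):
    for deadline in ("180s", "1000s"):
        outcome = "fails" if exceeds_deadline(duration, deadline) else "ok"
        print(f"{duration:>4} s handler, deadline {deadline}: {outcome}")
```

With 180s only the 10 min run is over the deadline, which matches what I saw, and with 1000s all three fit.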

Authentication with Cognito - where to find logs

We have 2 React Native apps that use AWS Cognito for authentication. We use the library react-native-aws-cognito-js in our code. The apps were working fine until two days ago. Now they are experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
From CloudTrail, each API call has an event "CreateNetworkInterface". Many of such API calls have error code "Client.NetworkInterfaceLimitExceeded". What is the cause and solution to this?
According to this AWS doc (in Chinese), nothing is written to the CloudWatch logs when the error is caused by insufficient IP addresses or ENIs. That explains why the number of errors increased while no logs appeared in CloudWatch.
Update 2
We found a scheduled Lambda job that may have exhausted the IP addresses, and we stopped that batch job. However, we still can't have many users log in to the server because of the "Client.NetworkInterfaceLimitExceeded" error. I noticed that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I clean up or reset all network interfaces in the VPC?
Short answer: CloudTrail.
Long answer with a suggestion
Assuming your application code is fine, the most likely cause of your 500 error is Cognito's limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to confirm the limits are the cause first, I would suggest adding some logging around the API call yourself and, in development, calling your app/API a large number of times; most likely you will see the 500 error caused by the limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done
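If you would rather count the response codes than eyeball the curl output, the same loop can be sketched in Python. The HTTP call is passed in as a function so the counting can be tried without a server. fetch_status is one possible way to make the call with the standard library, and the URL and port in the comment are placeholders, just like in the curl line.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, cookie):
    """Send one GET with the given Cookie header and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"Cookie": cookie})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; keep only the code.
        return error.code


def count_statuses(call, attempts):
    """Invoke call() the given number of times and tally the returned status codes."""
    return Counter(call() for _ in range(attempts))


def fake_call_factory(limit):
    """Stand-in for a real endpoint: 200 for the first `limit` calls, then 500."""
    calls = {"n": 0}

    def call():
        calls["n"] += 1
        return 200 if calls["n"] <= limit else 500

    return call


# Dry run against the stand-in; no network traffic involved.
demo = count_statuses(fake_call_factory(limit=3), attempts=5)
print(dict(demo))

# Against a real endpoint (placeholder port and path, adjust to your setup):
# count_statuses(
#     lambda: fetch_status("http://localhost:8080/SecuredPath",
#                          "SecureCookie=TokenValueFromAWS"),
#     1000,
# )
```

A sudden switch from 200 to 500 partway through the run would point at a limit being hit rather than at a bug in the handler.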