EC2 Instance Vanished - amazon-web-services

We have a peculiar situation today: one of our EC2 instances has disappeared from the console, and we aren't sure what caused it. CloudTrail doesn't have any termination event for this instance ID.
The last CloudTrail event recorded for the instance ID that went down looks like this:
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AWSService",
"invokedBy": "ec2.amazonaws.com"
},
"eventTime": "2022-03-23T05:46:40Z",
"eventSource": "sts.amazonaws.com",
"eventName": "AssumeRole",
"awsRegion": "ap-south-1",
"sourceIPAddress": "ec2.amazonaws.com",
"userAgent": "ec2.amazonaws.com",
"requestParameters": {
"roleArn": "arn:aws:iam::2************:role/ec2-instance-***********",
"roleSessionName": "i-06135ad01bb90****"
},
"responseElements": {
"credentials": {
"accessKeyId": "<redacted>",
"sessionToken": "<redacted>",
"expiration": "Mar 23, 2022, 12:01:34 PM"
}
},
"requestID": "d9882911-39e7-449b-9701-***********",
"eventID": "0fa1b79b-08aa-48e6-8232-***********",
"readOnly": true,
"resources": [
{
"accountId": "2************",
"type": "AWS::IAM::Role",
"ARN": "arn:aws:iam::2************:role/ec2-instance-***********"
}
],
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "2************",
"sharedEventID": "4b842373-e89d-438b-be3b-*********",
"eventCategory": "Management"
}
The only causes I can think of are a hardware failure on the AWS side or some crude command run within the OS by a user that took the instance down. Unfortunately, we don't have AWS Developer Support, as that's quite costly.
Has anyone faced anything similar? Any leads on how I can find the root cause?

For anyone that is interested or may face this in future, we had to opt for AWS developer support to get an answer, here is what they had to say
From the case notes, I understood that the instance
'i-06135ad01bb******' was missing from yesterday. However, you tried
to check cloudtrail and could not find any traces of termination.
Please correct me if I misunderstood your query.
Upon reviewing the case description, I started checking the instance
using our internal tools and observed that the instance got terminated
on '2022-03-23 06:33 UTC' with the reason
'INSTANCE-INITIATED-SHUTDOWN'.
This means that the shutdown got initiated from OS. Please allow me to
inform you that AWS engineers do not have access or visibility to the
customer's instance/OS level due to data privacy[1] and shared
responsibility model[2]. Hence, I will not be in a position to check
how shutdown call got initiated.
To further investigate this, I checked using our internal tools and
could see that you have selected the 'termination' option: on shut
down which means that when an instance gets shutdown it will be
automatically terminated. So, I would request you to change the option
'termination' : on shut down to 'STOP' : on shut down. With this, if
the OS initiates a shutdown by any chance, the instance will be
stopped instead of getting terminated.
As per my analysis, I can confirm that the AWS infrastructure was healthy and there weren't any issues from our end. However, for the future, I would request you to consider the following best practices which you already might be aware of, however I am mentioning them here for the sake of completeness:
Enable termination protection
Regularly back up your data
Preserve the root volume after termination

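Thank you, fannymug. I think this was also the case for me.
Just to add:
In CloudTrail, search for the instance ID and select the RunInstances event name. The event details show the settings the instance was launched with.
Double-check the disableApiTermination value. If it is false or absent, termination protection is not enabled. (deleteOnTermination is a separate setting that controls whether a volume is deleted when the instance terminates.)
To avoid this, during the EC2 creation process, look in Advanced details > Termination protection > Enable. Note that termination protection alone does not stop an instance from being terminated when it is shut down from within the OS while the shutdown behavior is set to Terminate.
Setting the Shutdown behavior option to Stop covers that case.
I hope it helps.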
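The two settings discussed above can also be applied to an existing instance through the API. The following is a minimal Python sketch, assuming a boto3 EC2 client is passed in; the function name harden_instance and the instance ID in the usage comment are made up for illustration and do not come from the thread.

```python
def harden_instance(ec2, instance_id):
    """Enable termination protection and make OS shutdowns stop the instance.

    `ec2` is assumed to be a boto3 EC2 client (boto3.client("ec2")).
    Returns the shutdown behavior that was configured before the change.
    """
    current = ec2.describe_instance_attribute(
        InstanceId=instance_id,
        Attribute="instanceInitiatedShutdownBehavior",
    )
    previous = current["InstanceInitiatedShutdownBehavior"]["Value"]

    # modify_instance_attribute changes one attribute per call,
    # so the two settings are applied separately.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceInitiatedShutdownBehavior={"Value": "stop"},
    )
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )
    return previous


# Usage (hypothetical instance ID, requires boto3 and AWS credentials):
#   import boto3
#   harden_instance(boto3.client("ec2", region_name="ap-south-1"),
#                   "i-0123456789abcdef0")
```

Returning the previous shutdown behavior makes it easy to log whether an instance was actually at risk ("terminate") before the change.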

Related

Does Cloud Run have an equivalent of Cloud Functions' execution_id?

Any record logged from a GCP Cloud Function contains a labels.execution_id, e.g.:
{
"textPayload": "Function execution started",
"insertId": "12mylqhfm6hy8i",
"resource": {
"type": "cloud_function",
"labels": {
"function_name": "redacted",
"region": "europe-west2",
"project_id": "redacted"
}
},
"timestamp": "2022-09-26T10:57:26.917823762Z",
"severity": "DEBUG",
"labels": {
"execution_id": "1l1qb00ft6kv"
},
"logName": "projects/redacted/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
"trace": "projects/redacted/traces/d2f793cf6e2fb149a8ce8dc6fd0498b4",
"receiveTimestamp": "2022-09-26T10:57:26.920210899Z"
}
This is very useful for correlating all logs from a single invocation of the cloud function because it can be filtered upon in Logs Explorer:
labels.execution_id="1l1qb00ft6kv"
I see no equivalent for Cloud Run, though. Cloud Run logs do have labels.instance_id, but my understanding is that it pertains to the Cloud Run app instance, so it will be the same for all invocations on that instance. Hence it's not the same as Cloud Functions' labels.execution_id.
Does Cloud Run have an equivalent of Cloud Functions' execution_id, or would I have to roll my own? If the latter, does anyone have any strategies for doing so?
No, there isn't an execution ID, only the instance ID. To get one, you can use instrumentation tools such as OpenTelemetry, as mentioned by guillaume in a related Stack Overflow question. You can also add a custom or random execution ID to your app logs (similar to what OpenTelemetry does).
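One way to roll your own correlation key is sketched below in Python. It assumes the service writes structured JSON lines to stdout and that incoming requests may carry the X-Cloud-Trace-Context header (format TRACE_ID/SPAN_ID;o=OPTIONS). The function names, the execution_id field name, and the project ID are illustrative choices, not part of any Cloud Run API.

```python
import json
import uuid


def request_context(project_id, trace_header=None):
    """Build the fields shared by every log line of a single request."""
    # The trace ID is the part of the header before the first "/".
    trace_id = (trace_header or "").split("/", 1)[0].strip()
    if trace_id:
        # Special structured-logging field used for trace correlation.
        return {
            "logging.googleapis.com/trace":
                f"projects/{project_id}/traces/{trace_id}"
        }
    # No trace header: fall back to a random ID, generated once per request.
    return {"execution_id": uuid.uuid4().hex}


def format_log(context, message, severity="INFO"):
    """Serialize one structured log line; print the result to stdout."""
    return json.dumps({"severity": severity, "message": message, **context})


# Usage inside a request handler (header value read from the request):
#   ctx = request_context("my-project", headers.get("X-Cloud-Trace-Context"))
#   print(format_log(ctx, "request started"))
```

With the fallback, every line of a request carries the same execution_id, which should then be filterable in Logs Explorer under jsonPayload, assuming the JSON line is parsed as a structured entry.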

Google Cloud VM Shutdown by "Integrity Event"

It seems like my Google Cloud VM was shut down by an "integrity event":
{
"insertId": "3",
"jsonPayload": {
"lateBootReportEvent": {
"policyEvaluationPassed": false,
"policyMeasurements": [
],
"actualMeasurements": [
]
},
"@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
"bootCounter": "3"
},
"resource": {
"type": "gce_instance",
"labels": {
"zone": "us-central1-a",
"instance_id": "xxx",
"project_id": "xxx"
}
},
"timestamp": "2022-02-09T03:58:16.830409192Z",
"severity": "ERROR",
"logName": "projects/xxx/logs/compute.googleapis.com%2Fshielded_vm_integrity",
"receiveTimestamp": "2022-02-09T03:58:18.846995634Z"
}
Can those be prevented or even disabled somehow?
The answer depends on what you mean. You are using a Shielded VM, which protects against:
Tampering with the guest VM image.
Altering sensitive crypto operations.
Exfiltrating secrets sealed in the vTPM.
Modifying the system with UEFI drivers.
Modifying guest firmware.
Modifying the kernel.
Those actions will trigger an integrity event. To prevent an integrity event, do not modify the system.
Refer to logName for more information.
Note: lateBootReportEvent compares the original baseline to the latest boot sequence. The integrity policy baseline is used for comparison with measurements from subsequent VM boots to determine if anything has changed.
What is Shielded VM?

SNS mail notification when a step is not kicked off within a threshold timeframe

I have an EMR step that is submitted through a Step Functions state machine. During the run I can see that the task is submitted, but the EMR step is not executed and the EMR console doesn't show any information about it.
How can I debug this?
How can I send an SNS notification when a step doesn't start executing within a threshold timeframe? In my case, Step Functions shows the EMR task as submitted, but there is no information in the EMR console, and the pipeline keeps running without failing for more than half an hour.
You could start debugging from the Step Functions execution log to identify the specific step that has failed, and then move on to the EMR console or whichever service has failed. When the EMR step doesn't appear in the EMR console, it is usually due to a runtime error caused by an exception raised when calling the EMR step.
For this scenario, you can use Step Functions error handling with the Catch and TimeoutSeconds fields; the AWS documentation has more details.
Basically, you need to add these fields as shown below:
{
"StartAt": "EmrStep",
"States": {
"EmrStep": {
"Type": "Task",
"Resource": "arn:aws:emr:execute-X-step",
"Comment": "This is your EMR step",
"TimeoutSeconds": 10,
"Catch": [ {
"ErrorEquals": ["States.Timeout"],
"Next": "ShutdownClusterAndSendSNS"
} ],
"End": true
},
"ShutdownClusterAndSendSNS": {
"Type": "Pass",
"Comment": "This step handles the timeout exception raised",
"Result": "You can shutdown the EMR cluster to avoid increased cost here and later send a sns notification!",
"End": true
}
}
}
Note: To catch the timeout exception, you have to catch the States.Timeout error, but you can also define the same Catch field for other types of error.
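The definition above uses a placeholder resource and a Pass state. As a rough sketch of how the same Catch and TimeoutSeconds pattern might look with the Step Functions service integrations for EMR (elasticmapreduce:addStep.sync) and SNS (sns:publish), the Python function below builds such a definition as a dictionary. The step name, jar arguments, S3 path, and message text are all invented for illustration, and the timeout here covers the whole task, so it only approximates "the step did not start in time".

```python
import json


def build_definition(cluster_id, topic_arn, timeout_seconds=1800):
    """Return a state machine definition that notifies SNS on timeout."""
    return {
        "StartAt": "EmrStep",
        "States": {
            "EmrStep": {
                "Type": "Task",
                # .sync makes the task wait until the EMR step finishes.
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": cluster_id,
                    "Step": {
                        "Name": "example-step",  # hypothetical step
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit",
                                     "s3://example-bucket/job.py"],
                        },
                    },
                },
                "TimeoutSeconds": timeout_seconds,
                "Catch": [
                    {"ErrorEquals": ["States.Timeout"],
                     "Next": "NotifyTimeout"}
                ],
                "End": True,
            },
            "NotifyTimeout": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Message": "EMR step did not finish within the timeout.",
                },
                "End": True,
            },
        },
    }


# Usage: print(json.dumps(build_definition("j-XXXXXXXX", "<topic-arn>"), indent=2))
```

Generating the JSON from code keeps the threshold in one parameter, so a half-hour limit is simply timeout_seconds=1800.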

Cloud Tasks Not Triggering HTTPrequest Endpoints

I have one simple cloud task queue and have successfully submitted a task to the queue. It is supposed to deliver a JSON payload to my API to perform a basic database update. The task is created at the end of a process in a .net core 3.1 app running locally on my desktop triggered by postman and the API is a golang app running in cloud run. However, the task never seems to fire and never registers an error.
The tasks in queue is always 0 and the tasks running is always blank. I have hit the "Run Now" button dozens of times but it never changes anything and no log entries or failed attempts are ever registered.
The task is created with the OIDCToken with a service account and audience set for the service account that has the authorization to create tokens and execute the cloud run instance.
Task creation log entry shows that it was created OK:
{
"insertId": "efq7sxb14",
"jsonPayload": {
"taskCreationLog": {
"targetAddress": "PUT https://{readacted}",
"targetType": "HTTP",
"scheduleTime": "2020-04-25T01:15:48.434808Z",
"status": "OK"
},
"@type": "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog",
"task": "projects/{readacted}/locations/us-central1/queues/database-updates/tasks/0998892809207251757"
},
"resource": {
"type": "cloud_tasks_queue",
"labels": {
"target_type": "HTTP",
"project_id": "{readacted}",
"queue_id": "database-updates"
}
},
"timestamp": "2020-04-25T01:15:48.435878120Z",
"severity": "INFO",
"logName": "projects/{readacted}/logs/cloudtasks.googleapis.com%2Ftask_operations_log",
"receiveTimestamp": "2020-04-25T01:15:49.469544393Z"
}
Any ideas as to why the tasks are not running? This is my first time using Cloud Tasks so don't rule out the idiot between the keyboard and the chair.
Thanks!
You might be using a non-default service. See Configuring Cloud Tasks queues
Try creating a task from the command line and watch the logs e.g.
gcloud tasks create-app-engine-task --queue=default \
--method=POST --relative-uri=/update_counter --routing=service:worker \
--body-content=10
In my own case, I used --routing=service:api and it worked straight away. Then I added AppEngineRouting to the AppEngineHttpRequest.

AWS EFS - Script to create mount target after creating the file system

I am writing a script that will create an EFS file system with a name from input. I am using the AWS SDK for PHP Version 3.
I am able to create the file system using the createFileSystem command. This new file system is not usable until it has a mount target created. If I run the CreateMountTarget command after the createFileSystem command then I receive an error that the file system's life cycle state is not in the 'available' state.
I have tried using createFileSystemAsync to create a promise and calling the wait function on that promise to force the script to run synchronously. However, the promise is always fulfilled while the file system is still in 'creating' life cycle state.
Is there a way to force the script to wait for the file system to be in the available state using the AWS SDK?
One way is to check the status of the file system using the DescribeFileSystems API. In the response, look at LifeCycleState; if it is available, call the CreateMountTarget API. You can keep calling DescribeFileSystems in a loop with a few seconds' delay until LifeCycleState is available.
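The polling loop described here can be sketched as follows. This is Python with a boto3-style EFS client rather than the PHP SDK used in the question, but the same DescribeFileSystems and CreateMountTarget operations are involved; the function name, the default delay, and the attempt limit are arbitrary choices for illustration.

```python
import time


def wait_until_available(efs, file_system_id, delay=5, max_attempts=40,
                         sleep=time.sleep):
    """Poll DescribeFileSystems until the file system is 'available'.

    `efs` is assumed to be a boto3 EFS client (boto3.client("efs")).
    Returns True once available, False if the attempts run out.
    """
    for _ in range(max_attempts):
        response = efs.describe_file_systems(FileSystemId=file_system_id)
        state = response["FileSystems"][0]["LifeCycleState"]
        if state == "available":
            return True
        if state in ("deleting", "deleted"):
            # The file system will never become available.
            raise RuntimeError(f"file system is {state}")
        sleep(delay)
    return False


# Usage (IDs are placeholders, requires boto3 and AWS credentials):
#   import boto3
#   efs = boto3.client("efs")
#   if wait_until_available(efs, "fs-01234567"):
#       efs.create_mount_target(FileSystemId="fs-01234567",
#                               SubnetId="subnet-01234567")
```

The sleep function is injectable only so the loop can be exercised without real delays.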
It looks like you want a waiter for FileSystemAvailable, but the elasticfilesystem files don't specify one. I'd file an issue on GitHub asking for one. You'd need to wait for DescribeFileSystems to have a LifeCycleState of available.
In the meantime, you can probably write your own by following the waiters guide, with something like the following:
{
"version": 2,
"waiters": {
"FileSystemAvailable": {
"delay": 15,
"operation": "DescribeFileSystems",
"maxAttempts": 40,
"acceptors": [
{
"expected": "available",
"matcher": "pathAll",
"state": "success",
"argument": "FileSystems[].LifeCycleState"
},
{
"expected": "deleted",
"matcher": "pathAny",
"state": "failure",
"argument": "FileSystems[].LifeCycleState"
},
{
"expected": "deleting",
"matcher": "pathAny",
"state": "failure",
"argument": "FileSystems[].LifeCycleState"
}
]
}
}
}
Promises in the AWS SDK for PHP are used for making the HTTP request concurrently. This doesn't help in this case because the behavior of the API call is to start an asynchronous task in EFS.