It seems like my Google Cloud VM was shut down by an "integrity event":
{
"insertId": "3",
"jsonPayload": {
"lateBootReportEvent": {
"policyEvaluationPassed": false,
"policyMeasurements": [
],
"actualMeasurements": [
]
},
"@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
"bootCounter": "3"
},
"resource": {
"type": "gce_instance",
"labels": {
"zone": "us-central1-a",
"instance_id": "xxx",
"project_id": "xxx"
}
},
"timestamp": "2022-02-09T03:58:16.830409192Z",
"severity": "ERROR",
"logName": "projects/xxx/logs/compute.googleapis.com%2Fshielded_vm_integrity",
"receiveTimestamp": "2022-02-09T03:58:18.846995634Z"
}
Can those be prevented or even disabled somehow?
The answer depends on what you mean. You are using a Shielded VM, which protects against:
Tampering with the guest VM image.
Altering sensitive crypto operations.
Exfiltrating secrets sealed in the vTPM.
Modifying the system with UEFI drivers.
Modifying guest firmware.
Modifying the kernel.
Those actions will trigger an integrity event. To prevent an integrity event, do not modify the system.
Refer to logName for more information.
Note: lateBootReportEvent compares the original baseline to the latest boot sequence. The integrity policy baseline is used for comparison with measurements from subsequent VM boots to determine if anything has changed.
What is Shielded VM?
Any record logged from a GCP Cloud Function contains a labels.execution_id, e.g.:
{
"textPayload": "Function execution started",
"insertId": "12mylqhfm6hy8i",
"resource": {
"type": "cloud_function",
"labels": {
"function_name": "redacted",
"region": "europe-west2",
"project_id": "redacted"
}
},
"timestamp": "2022-09-26T10:57:26.917823762Z",
"severity": "DEBUG",
"labels": {
"execution_id": "1l1qb00ft6kv"
},
"logName": "projects/redacted/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
"trace": "projects/redacted/traces/d2f793cf6e2fb149a8ce8dc6fd0498b4",
"receiveTimestamp": "2022-09-26T10:57:26.920210899Z"
}
This is very useful for correlating all logs from a single invocation of the cloud function because it can be filtered upon in Logs Explorer:
labels.execution_id="1l1qb00ft6kv"
I see no equivalent for Cloud Run though. Cloud Run logs do have labels.instance_id, but my understanding is that it pertains to the Cloud Run app instance, so it will be the same for all invocations on that instance. Hence it's not the same as Cloud Functions' labels.execution_id.
Does Cloud Run have an equivalent of Cloud Functions' execution_id or would I have to roll my own? If the latter, does anyone have any strategies for doing so?
No, Cloud Run has no execution ID, only the instance ID. To get per-request correlation, you can use instrumentation tools such as OpenTelemetry, as mentioned by guillaume in this Stack Overflow question; you can also refer to this video. Alternatively, you can add a custom or random execution ID to the app's logs yourself (similar to what OpenTelemetry does).
Also have a look at link1 and link2, which might help.
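As a rough illustration of the "roll your own" option, here is a minimal Python sketch that tags every log line of a request with one ID. It assumes the request carries the X-Cloud-Trace-Context header (shaped like TRACE_ID/SPAN_ID;o=1) and that the app writes structured JSON logs to stdout; the function name build_log_entry, the project_id argument and the execution_id label name are my own choices for illustration, not anything Cloud Run defines.

```python
import json
import uuid


def build_log_entry(message, headers, project_id, severity="INFO"):
    """Return a JSON log line tagged with a per-request correlation ID."""
    # The trace header is assumed to look like "TRACE_ID/SPAN_ID;o=1".
    trace_header = headers.get("X-Cloud-Trace-Context", "")
    trace_id = trace_header.split("/", 1)[0]

    entry = {"severity": severity, "message": message}
    if trace_id:
        # Special structured-logging field that is mapped onto the entry's trace.
        entry["logging.googleapis.com/trace"] = (
            f"projects/{project_id}/traces/{trace_id}"
        )
        execution_id = trace_id
    else:
        # No trace header (e.g. a local run): fall back to a random ID.
        execution_id = uuid.uuid4().hex

    # Hypothetical label name, chosen to mirror Cloud Functions.
    entry["logging.googleapis.com/labels"] = {"execution_id": execution_id}
    return json.dumps(entry)
```

In a request handler you would call print(build_log_entry("started", request.headers, "my-project")) for each log line; if the labels field is picked up as expected, the same labels.execution_id="..." filter should then work in Logs Explorer. Note that a plain dict lookup is case-sensitive, whereas most web frameworks expose a case-insensitive headers object.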
We have a peculiar situation today: one of our EC2 instances has disappeared from the console and we aren't sure what caused it. CloudTrail doesn't have any termination event for this instance ID.
The last CloudTrail event recorded for the instance ID that went down looks like this:
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AWSService",
"invokedBy": "ec2.amazonaws.com"
},
"eventTime": "2022-03-23T05:46:40Z",
"eventSource": "sts.amazonaws.com",
"eventName": "AssumeRole",
"awsRegion": "ap-south-1",
"sourceIPAddress": "ec2.amazonaws.com",
"userAgent": "ec2.amazonaws.com",
"requestParameters": {
"roleArn": "arn:aws:iam::2************:role/ec2-instance-***********",
"roleSessionName": "i-06135ad01bb90****"
},
"responseElements": {
"credentials": {
"accessKeyId": "<redacted>",
"sessionToken": "<redacted>",
"expiration": "Mar 23, 2022, 12:01:34 PM"
}
},
"requestID": "d9882911-39e7-449b-9701-***********",
"eventID": "0fa1b79b-08aa-48e6-8232-***********",
"readOnly": true,
"resources": [
{
"accountId": "2************",
"type": "AWS::IAM::Role",
"ARN": "arn:aws:iam::2************:role/ec2-instance-***********"
}
],
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "2************",
"sharedEventID": "4b842373-e89d-438b-be3b-*********",
"eventCategory": "Management"
}
The only things I can think of are either a hardware failure on the AWS side or some crude command run within the OS by a user that took the instance down. Unfortunately we don't have AWS developer support, as that's quite costly.
Has anyone faced anything similar? Any leads on how I can go about finding the root cause?
For anyone who is interested or may face this in the future: we had to opt for AWS developer support to get an answer. Here is what they had to say:
From the case notes, I understood that the instance
'i-06135ad01bb******' was missing from yesterday. However, you tried
to check cloudtrail and could not find any traces of termination.
Please correct me if I misunderstood your query.
Upon reviewing the case description, I started checking the instance
using our internal tools and observed that the instance got terminated
on '2022-03-23 06:33 UTC' with the reason
'INSTANCE-INITIATED-SHUTDOWN'.
This means that the shutdown got initiated from OS. Please allow me to
inform you that AWS engineers do not have access or visibility to the
customer's instance/OS level due to data privacy[1] and shared
responsibility model[2]. Hence, I will not be in a position to check
how shutdown call got initiated.
To further investigate this, I checked using our internal tools and
could see that you have selected the 'termination' option: on shut
down which means that when an instance gets shutdown it will be
automatically terminated. So, I would request you to change the option
'termination' : on shut down to 'STOP' : on shut down. With this, if
the OS initiates a shutdown by any chance, the instance will be
stopped instead of getting terminated.
As per my analysis, I can confirm that the AWS infrastructure was healthy and there weren't any issues from our end. However, for the future, I would request you to consider the following best practices which you already might be aware of, however I am mentioning them here for the sake of completeness:
Enable termination protection
Regularly back up your data
Preserving root volume after termination:
Thank you, fannymug. I think this was also the case for me.
Just to add:
In CloudTrail, search for the instance ID and select the RunInstances event name; there you can check the event details.
Double-check the deleteOnTermination value. If it is set to true, the volume is deleted when the instance terminates. Note that this is a separate setting from termination protection.
To enable termination protection during the EC2 creation process, look in Advanced details > Termination protection > Enable.
The Shutdown behavior option can also help and can be set to Stop.
I hope it helps.
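For an instance that already exists, the same two settings (shutdown behavior and termination protection) can be changed through the API. Below is a minimal Python sketch, assuming a boto3 EC2 client is passed in; the helper name harden_instance is hypothetical. It relies on the modify_instance_attribute call, which takes one attribute per request.

```python
def harden_instance(ec2_client, instance_id):
    """Make an OS-level shutdown stop the instance and block API termination."""
    # modify_instance_attribute accepts only one attribute per call,
    # so the two settings are applied in two separate requests.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceInitiatedShutdownBehavior={"Value": "stop"},
    )
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )
```

With boto3 installed and credentials configured, this would be called as harden_instance(boto3.client("ec2"), "<your-instance-id>"). Of the two, the shutdown behavior is the setting that matters for an instance-initiated shutdown like the one described above.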
I've been tasked with looking at a service built on AWS Lambda that performs a long-running task of turning VMs on and off. Mind you, I come from the Azure team, so I am not familiar with the style or best practices of AWS services.
The approach the original developer has taken is to send the entire workload to one Lambda function and then have that function take a section of the workload and then recursively call itself with the remaining workload until all items are gone (workload = 0).
Pseudo-ish Code:
// Assume this gets sent to a HTTP Lambda endpoint as a whole
let workload = [1, 2, 3, 4, 5, 6, 7, 8]
// The Lambda HTTP endpoint
function Lambda(workload) {
if (!workload.length) {
return "No more work!"
}
const toDo = workload.splice(0, 2) // get first two items
doWork(toDo)
// Then... except it builds a new HTTP request with aws sdk
Lambda(workload) // 3, 4, 5, 6, 7, 8, etc.
}
This seems highly inefficient and unreliable (correct me if I am wrong). There is a lot of state being stored in this process and in my opinion that creates a lot of failure points.
My plan is to suggest we re-engineer the entire service to use a Queue/Worker type framework instead, where ideally the endpoint would handle one workload at a time, and be stateless.
The queue would be populated by a service (Jenkins? Lambda? Manually?), then a second service would read from the queue (and ideally scale-out as well, as needed).
UPDATE: AWS EventBridge now looks like the preferred solution.
It's "Coupling" that I was thinking of, see here: https://www.jeffersonfrank.com/insights/aws-lambda-design-considerations
Coupling
Coupling goes beyond Lambda design considerations—it’s more about the system as a whole. Lambdas within a microservice are sometimes tightly coupled, but this is nothing to worry about as long as the data passed between Lambdas within their little black box of a microservice is not over-pure HTTP and isn’t synchronous.
Lambdas shouldn’t be directly coupled to one another in a Request Response fashion, but asynchronously. Consider the scenario when an S3 Event invokes a Lambda function, then that Lambda also needs to call another Lambda within that same microservice and so on.
You might be tempted to implement direct coupling, like allowing Lambda 1 to use the AWS SDK to call Lambda 2 and so on. This introduces some of the following problems:
If Lambda 1 is invoking Lambda 2 synchronously, it needs to wait for the latter to be done first. Lambda 1 might not know that Lambda 2 also called Lambda 3 synchronously, and Lambda 1 may now need to wait for both Lambda 2 and 3 to finish successfully. Lambda 1 might timeout as it needs to wait for all the Lambdas to complete first, and you’re also paying for each Lambda while they wait.
What if Lambda 3 has a concurrency limit set and is also called by another service? The call between Lambda 2 and 3 will fail until it has concurrency again. The error can be returned all the way back to Lambda 1, but what does Lambda 1 then do with the error? It has to store that the S3 event was unsuccessful and that it needs to replay it.
This process can be redesigned to be event-driven.
Not only is this the solution to all the problems introduced by the direct coupling method, but it also provides a method of replaying the DLQ if an error occurred for each Lambda. No message will be lost or need to be stored externally, and the demand is decoupled from the processing.
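To make the queue/worker idea concrete, here is a minimal Python sketch of a stateless worker Lambda fed by an SQS queue. It assumes each message body is a JSON object describing one VM, and that the event source mapping has partial batch responses (ReportBatchItemFailures) enabled; do_work and the "vm" field are hypothetical stand-ins for the real on/off logic.

```python
import json


def do_work(item):
    """Hypothetical stand-in for switching one VM on or off."""
    if item.get("vm") is None:
        raise ValueError("missing vm")


def handler(event, context):
    """Process each SQS record independently and report only the failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            do_work(json.loads(record["body"]))
        except Exception:
            # Only the listed message IDs are made visible again for retry
            # (and can end up in a DLQ if one is configured); the rest of
            # the batch is treated as successfully processed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Compared with the recursive approach, no invocation carries the remaining workload: the queue holds the state, and a failed item is retried on its own instead of failing everything that comes after it.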
AWS Step Functions is one way you can achieve this. Step Functions are used to orchestrate multiple Lambda functions in any manner you want - parallel executions, sequential executions or a mix of both. You can also put wait steps, condition checks, retries in between if you need.
Your overall step function might look something like this (say you want 1, 2 and 3 to execute in parallel; then, when all of these are complete, you want to execute 4, and then 5 and 6 in parallel).
Configuring this is also pretty simple. It accepts JSON like the following:
{
"Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
"StartAt": "Parallel",
"States": {
"Parallel": {
"Type": "Parallel",
"Next": "Task4",
"Branches": [
{
"StartAt": "Task1",
"States": {
"Task1": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
},
{
"StartAt": "Task2",
"States": {
"Task2": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
},
{
"StartAt": "Task3",
"States": {
"Task3": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
}
]
},
"Task4": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"Next": "Parallel2"
},
"Parallel2": {
"Type": "Parallel",
"Next": "Final State",
"Branches": [
{
"StartAt": "Task5",
"States": {
"Task5": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
},
{
"StartAt": "Task6",
"States": {
"Task6": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
}
]
},
"Final State": {
"Type": "Pass",
"End": true
}
}
}
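Since the definition above is quite repetitive, it can also be generated rather than written by hand. The following is a minimal Python sketch that builds the same shape of state machine from a mapping of task names to Lambda ARNs; the helper names task, parallel and build_definition are hypothetical, and the ARNs are the same placeholders as above.

```python
import json


def task(arn, next_state=None):
    """Build a Task state that either continues to next_state or ends."""
    state = {"Type": "Task", "Resource": arn}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def parallel(names_to_arns, next_state):
    """Build a Parallel state with one single-task branch per entry."""
    branches = [
        {"StartAt": name, "States": {name: task(arn)}}
        for name, arn in names_to_arns.items()
    ]
    return {"Type": "Parallel", "Next": next_state, "Branches": branches}


def build_definition(arns):
    """Tasks 1-3 in parallel, then Task4, then Tasks 5-6 in parallel."""
    first = {name: arns[name] for name in ("Task1", "Task2", "Task3")}
    second = {name: arns[name] for name in ("Task5", "Task6")}
    return {
        "StartAt": "Parallel",
        "States": {
            "Parallel": parallel(first, "Task4"),
            "Task4": task(arns["Task4"], next_state="Parallel2"),
            "Parallel2": parallel(second, "Final State"),
            "Final State": {"Type": "Pass", "End": True},
        },
    }
```

Calling json.dumps(build_definition(arns), indent=2) with a dictionary of six ARNs gives a document you can paste in as the state machine definition.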
I have one simple Cloud Tasks queue and have successfully submitted a task to it. The task is supposed to deliver a JSON payload to my API to perform a basic database update. It is created at the end of a process in a .NET Core 3.1 app running locally on my desktop, triggered by Postman, and the API is a Go app running in Cloud Run. However, the task never seems to fire and never registers an error.
The tasks in queue count is always 0 and the tasks running count is always blank. I have hit the "Run Now" button dozens of times, but it never changes anything, and no log entries or failed attempts are ever registered.
The task is created with an OIDC token, with the service account and audience set for a service account that is authorized to create tokens and invoke the Cloud Run service.
Task creation log entry shows that it was created OK:
{
"insertId": "efq7sxb14",
"jsonPayload": {
"taskCreationLog": {
"targetAddress": "PUT https://{readacted}",
"targetType": "HTTP",
"scheduleTime": "2020-04-25T01:15:48.434808Z",
"status": "OK"
},
"@type": "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog",
"task": "projects/{readacted}/locations/us-central1/queues/database-updates/tasks/0998892809207251757"
},
"resource": {
"type": "cloud_tasks_queue",
"labels": {
"target_type": "HTTP",
"project_id": "{readacted}",
"queue_id": "database-updates"
}
},
"timestamp": "2020-04-25T01:15:48.435878120Z",
"severity": "INFO",
"logName": "projects/{readacted}/logs/cloudtasks.googleapis.com%2Ftask_operations_log",
"receiveTimestamp": "2020-04-25T01:15:49.469544393Z"
}
Any ideas as to why the tasks are not running? This is my first time using Cloud Tasks so don't rule out the idiot between the keyboard and the chair.
Thanks!
You might be using a non-default service. See Configuring Cloud Tasks queues.
Try creating a task from the command line and watch the logs, e.g.:
gcloud tasks create-app-engine-task --queue=default \
--method=POST --relative-uri=/update_counter --routing=service:worker \
--body-content=10
In my own case, I used --routing=service:api and it worked straight away. Then I added AppEngineRouting to the AppEngineHttpRequest.
If you use a "Push" subscription to Google Cloud Pub/Sub, you'll be registering an HTTPS endpoint that receives messages from Google's managed service. This is great if you wish to avoid dependencies on Google Cloud's SDKs and instead trigger your asynchronous services via a traditional web request. However, the intended casing of the properties of the payload is not clear, and since I'm using Push subscriptions I don't have an SDK to defer to for deserialization.
If you look at this documentation, you see references to message_id using snake_case (Update 9/18/18: As stated in Kamal's answer, the documentation was updated since this was incorrect), e.g.:
{
"message": {
"attributes": {
"key": "value"
},
"data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
"message_id": "136969346945",
"publish_time": "2014-10-02T15:01:23.045123456Z"
},
"subscription": "projects/myproject/subscriptions/mysubscription"
}
If you look at this documentation, you see references to messageId using camelCase, e.g.:
{
"message": {
"attributes": {
"key": "value"
},
"data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
"messageId": "136969346945",
"publishTime": "2014-10-02T15:01:23.045123456Z"
},
"subscription": "projects/myproject/subscriptions/mysubscription"
}
If you subscribe to the topics and log the output, you actually get both formats, e.g.:
{
"message": {
"attributes": {
"key": "value"
},
"data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
"messageId": "136969346945",
"message_id": "136969346945",
"publishTime": "2014-10-02T15:01:23.045123456Z",
"publish_time": "2014-10-02T15:01:23.045123456Z"
},
"subscription": "projects/myproject/subscriptions/mysubscription"
}
An ideal response would answer both of these questions:
Why are there two formats?
Is one more correct or authoritative?
The officially correct names for the variables should be camel case (messageId), based on the Google JSON style guide. In the early phases of Cloud Pub/Sub, snake case was used for message_id and publish_time, but this was changed later in order to conform to style standards. The snake case names were kept in addition to the camel case ones to ensure that push endpoints depending on the original format did not break. The first documentation link you point to apparently was not updated at the time, and it will be fixed shortly.
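If you deserialize the push payload by hand, one defensive option is to prefer the camel case names and fall back to the snake case ones. Here is a minimal Python sketch of that idea; the function name parse_push and the shape of the returned dictionary are my own choices, and it assumes the data field is base64-encoded UTF-8 text as in the examples above.

```python
import base64
import json


def parse_push(body):
    """Parse a Pub/Sub push request body, tolerating either field casing."""
    envelope = json.loads(body)
    message = envelope["message"]
    return {
        # Prefer the camel case names; fall back to the legacy snake case.
        "message_id": message.get("messageId", message.get("message_id")),
        "publish_time": message.get("publishTime", message.get("publish_time")),
        "attributes": message.get("attributes", {}),
        # The payload arrives base64-encoded.
        "data": base64.b64decode(message.get("data", "")).decode("utf-8"),
        "subscription": envelope.get("subscription"),
    }
```

For the sample payloads in the question, the decoded data is "Hello Cloud Pub/Sub! Here is my message!" whichever casing the message uses.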