GCP dataflow quota issue - google-cloud-platform

I'm new to the GCP platform.
I am setting up a Dataflow job that collects data from a Pub/Sub topic and sends it to BigQuery, but I am running into a quota issue.
The error is below. Any recommendation would be appreciated.
{
"textPayload": "Project gold-apparatus-312313 has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/7 instances, 4/4 CPUs, 1230/818 disk GB, 0/250 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/3 in-use IP addresses.\n\nPlease see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.",
"insertId": "fjxvatcdw1",
"resource": {
"type": "dataflow_step",
"labels": {
"project_id": "76303513563",
"region": "us-central1",
"step_id": "",
"job_id": "2021-05-07_16_07_09-4793338291722263008",
"job_name": "my_job1"
}
},
"timestamp": "2021-05-07T23:16:20.877954859Z",
"severity": "WARNING",
"labels": {
"dataflow.googleapis.com/region": "us-central1",
"dataflow.googleapis.com/log_type": "system",
"dataflow.googleapis.com/job_id": "2021-05-07_16_07_09-4793338291722263008",
"dataflow.googleapis.com/job_name": "my_job1"
},
"logName": "projects/gold-apparatus-312313/logs/dataflow.googleapis.com%2Fjob-message",
"receiveTimestamp": "2021-05-07T23:16:21.809934742Z"
}

Your error shows an insufficient persistent disk quota in us-central1: the workflow requires 1230 disk GB but only 818 GB are available. The recommended fix is to request an increase of the Compute Engine disk quota for that region.
To submit a GCE quota increase request:
https://support.google.com/cloud/answer/6075746
Reference:
Quotas & limits for Dataflow - https://cloud.google.com/dataflow/quotas
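Before filing the request, you can check the region's current disk quota and usage from the command line; a quick sketch using the project ID from the error message (look for DISKS_TOTAL_GB in the quotas list):
gcloud compute regions describe us-central1 \
  --project=gold-apparatus-312313 \
  --format=json
Alternatively, if your pipeline can tolerate smaller worker disks, you may be able to fit under the existing quota by lowering the per-worker disk size with the Beam pipeline option --disk_size_gb (Python SDK) or --diskSizeGb (Java SDK).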

Related

Find out which IAM account manually triggered a scheduled function

I have a GCP Cloud function which runs on a schedule every morning. The logs show that it has been triggered off-schedule three other times in the last week, which I presume can only happen if someone has gone to the Cloud Scheduler page and clicked 'Run now' on that function. How can I find out who did this? The Logs Explorer doesn't show this information. (Heads will not roll, but IAM permissions may be stripped. Bonus points if it turns out to have been me.)
For scheduled functions, there are two sets of logs - one for the cloud function triggered by the schedule, and one for the Cloud Scheduler itself. In the logs for the Cloud Scheduler, only the daily schedule shows up, not the extra triggers.
Here is the log entry for the function starting, from the Logs Explorer for the Cloud Function:
{
"textPayload": "Function execution started",
"insertId": "REDACTED",
"resource": {
"type": "cloud_function",
"labels": {
"region": "REDACTED",
"function_name": "REDACTED",
"project_id": "REDACTED"
}
},
"timestamp": "2022-05-04T08:49:37.980952884Z",
"severity": "DEBUG",
"labels": {
"execution_id": "REDACTED"
},
"logName": "projects/REDACTED/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
"trace": "projects/REDACTED/traces/REDACTED",
"receiveTimestamp": "2022-05-04T08:49:37.981500851Z"
}
If your Cloud Function has an HTTP trigger without authentication, it will be very hard (or nearly impossible) to figure out who called it.
Additionally, it is possible that there were not enough available instances when it was scheduled, and the Cloud Function ran later as a retry.
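If the manual runs were started from the Cloud Scheduler page, they may also show up in the project's audit logs. As a sketch (assuming "Run now" / "Force run" is recorded as the RunJob method; verify the exact method name in your project), a Logs Explorer query like the following should show the caller in protoPayload.authenticationInfo.principalEmail:
resource.type="cloud_scheduler_job"
protoPayload.methodName="google.cloud.scheduler.v1.CloudScheduler.RunJob"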

Dataflow logs from Stackdriver

The resource.labels.region field for the dataflow_step logs in Stackdriver points to global, even though the specified regional endpoint is europe-west2.
Any idea what exactly it is pointing to?
Once you supply the GCP Logs Viewer with the desired filter, for example the simplest query based on your inputs, looking for the dataflow_step resource type:
resource.type="dataflow_step"
resource.labels.region="europe-west2"
you should see query results retrieved from the Cloud Dataflow REST API: log entries, formatted as JSON, for all Dataflow jobs in your GCP project that run against the europe-west2 regional endpoint:
{
"insertId": "insertId",
"jsonPayload": {
....
"message": "Message content",
....
},
"resource": {
"type": "dataflow_step",
"labels": {
"job_id": "job_id",
"region": "europe-west2",
"job_name": "job_name",
"project_id": "project_id",
"step_id": "step_id"
}
},
"timestamp": "timestamp",
"severity": "severity_level",
"labels": {
"compute.googleapis.com/resource_id": "resource_id",
"dataflow.googleapis.com/job_id": "job_id",
"compute.googleapis.com/resource_type": "resource_type",
"compute.googleapis.com/resource_name": "resource_name",
"dataflow.googleapis.com/region": "europe-west2",
"dataflow.googleapis.com/job_name": "job_name"
},
"logName": "logName",
"receiveTimestamp": "receiveTimestamp"
According to the Cloud Logging documentation, each monitored resource type derives its labels from the underlying service API; dataflow.googleapis.com corresponds to the Dataflow service. Therefore, when you run a Dataflow job and set the region for the job's metadata, the logging service picks up that regional endpoint from the job description via the dataflow.googleapis.com REST methods.
The resource.labels.region field on dataflow_step logs should refer to the regional endpoint that the job is using; global is not an expected value there.
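To double-check what the region label contains for your jobs, the same filter can be run from the command line with the Cloud SDK (a minimal sketch, assuming gcloud is authenticated against the right project):
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.region="europe-west2"' \
  --limit=10 \
  --format=json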

Streaming Cloudwatch Logs to Amazon ES

I'm using Fargate to deploy my application. To log the container logs, I'm using awslogs as the log-driver. Now I want to ship my logs to Amazon ES service. While going through the docs for shipping, I encountered a note that mentions
Streaming large amounts of CloudWatch Logs data to other
destinations might result in high usage charges.
I want to understand what I will be billed for while shipping the logs to the ELK stack. How do they define "large amounts"?
Will I be billed for
a) Cloudwatch?
b) Log driver?
c) Lambda function? Does every log line trigger a Lambda function?
Lastly, is there still a possibility to lower the cost more?
Personally, I would look at running Fluentd or Fluent Bit in another container alongside your application: https://docs.fluentbit.io/manual/pipeline/outputs/elasticsearch
You can then send your logs directly to ES without any CloudWatch costs.
EDIT
Here's the final solution, in case someone is looking for a cheaper option.
Run Fluentd/Fluent Bit in another container alongside your application.
Using the GitHub config, I was able to forward the logs to ES with the task definition below.
{
"family": "workflow",
"cpu": "256",
"memory": "512",
"containerDefinitions": [
{
"name": "log_router",
"image": "docker.io/amazon/aws-for-fluent-bit:latest",
"essential": true,
"firelensConfiguration": {
"type": "fluentbit",
"options":{
"enable-ecs-log-metadata":"true"
}
},
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-create-group": "true",
"awslogs-group": "your_log_group",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"memoryReservation": 50
},
{
"name": "ContainerName",
"image": "YourImage",
"cpu": 0,
"memoryReservation": 128,
"portMappings": [
{
"containerPort": 5005,
"protocol": "tcp"
}
],
"essential": true,
"command": [
"YOUR COMMAND"
],
"environment": [],
"logConfiguration": {
"logDriver": "awsfirelens",
"secretOptions": [],
"options": {
"Name": "es",
"Host": "YOUR_ES_DOMAIN_URL",
"Port": "443",
"tls": "On",
"Index": "INDEX_NAME",
"Type": "TYPE"
}
},
"resourceRequirements": []
}
]
}
The log_router container collects the logs and ships them to ES. For more info, refer to Custom Log Routing.
Please note that the log_router container is required in the case of Fargate, but not with ECS.
This is the cheapest solution I know of that does not involve CloudWatch, Lambdas, or Kinesis.
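If you keep the task definition above as a JSON file, it can be registered and rolled out with the AWS CLI; a minimal sketch (the file, cluster and service names are placeholders):
aws ecs register-task-definition --cli-input-json file://workflow-task-def.json
aws ecs update-service --cluster your_cluster --service your_service --task-definition workflow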
Like every resource, AWS charges for use and for maintenance; therefore, the charges will be for the execution of the Lambda function and for storing the data in CloudWatch. The reason they mention that "streaming large amounts of CloudWatch Logs data to other destinations might result in high usage charges" is that it takes time for the Lambda function to process a log and insert it into ES; when you stream a large number of logs, the Lambda function runs for longer.
Lambda function? Does every log line trigger a Lambda function?
Yes, when you enable streaming from CloudWatch to ES, every log inserted into CloudWatch triggers the Lambda function.
Image from the demonstration (see the trigger).
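For reference, the console's "Stream to Amazon Elasticsearch Service" wizard wires this up with a subscription filter on the log group that forwards every event to the Lambda function. A rough CLI equivalent (the filter name and destination ARN below are placeholders, not the exact resources the wizard creates):
aws logs put-subscription-filter \
  --log-group-name "your_log_group" \
  --filter-name "stream-to-es" \
  --filter-pattern "" \
  --destination-arn "arn:aws:lambda:us-east-1:123456789012:function:LogsToElasticsearch_your-domain"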
Is there still a possibility to lower the cost more?
The only way to lower the cost (with this implementation) is to write your own Lambda function that is triggered every X seconds/minutes and inserts the logs into ES. As far as I can tell, the cost difference will be negligible.
More information:
Lambda code
How this works behind the scenes

AWS Batch Job stuck RUNNABLE when Launch template is configured

I have configured a Step Function with AWS Batch jobs. The whole configuration works well, but I need to customize the starting instance. For this purpose I use the Launch Template service and built a simple (empty) configuration based on the instance type used in the AWS Batch configuration. When the Compute Environment is built with the Launch Template, the Batch job gets stuck in the RUNNABLE stage. When I run the AWS Batch job without the Launch Template, everything works OK. Launching an instance from the template also works OK. Could anyone give me any advice on what is wrong or missing? Below are the definitions of all the stack's elements.
Launch Template definition
Compute environment details Overview
Compute environment name senet-cluster-r5ad-2xlarge-v3-4
Compute environment ARN arn:aws:batch:eu-central-1:xxxxxxxxxxx:compute-environment/senet-cluster-r5ad-2xlarge-v3-4
ECS Cluster name arn:aws:ecs:eu-central-1:xxxxxxxxxxxx:cluster/senet-cluster-r5ad-2xlarge-v3-4_Batch_3323aafe-d7a4-3cfe-91e5-c1079ee9d02e
Type MANAGED
Status VALID
State ENABLED
Service role arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole
Compute resources
Minimum vCPUs 0
Desired vCPUs 0
Maximum vCPUs 25
Instance types r5ad.2xlarge
Allocation strategy BEST_FIT
Launch template lt-023ebdcd5df6073df
Launch template version $Default
Instance role arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
Spot fleet role
EC2 Keypair senet-test-keys
AMI id ami-0b418580298265d5c
vpcId vpc-0917ea63
Subnets subnet-49332034, subnet-8902a7e3, subnet-9de503d1
Security groups sg-cdbbd9af, sg-047ea19daf36aa269
AWS Batch Job Definition
{
"jobDefinitionName": "senet-cluster-job-def-3",
"jobDefinitionArn": "arn:aws:batch:eu-central-1:xxxxxxxxxxxxxx:job-definition/senet-cluster-job-def-3:9",
"revision": 9,
"status": "ACTIVE",
"type": "container",
"parameters": {},
"containerProperties": {
"image": "xxxxxxxxxxx.dkr.ecr.eu-central-1.amazonaws.com/senet/batch-process:latest",
"vcpus": 4,
"memory": 60000,
"command": [],
"jobRoleArn": "arn:aws:iam::xxxxxxxxxxxxx:role/AWSS3BatchFullAccess-senet",
"volumes": [],
"environment": [
{
"name": "BATCH_FILE_S3_URL",
"value": "s3://senet-batch/senet_jobs.sh"
},
{
"name": "AWS_DEFAULT_REGION",
"value": "eu-central-1"
},
{
"name": "BATCH_FILE_TYPE",
"value": "script"
}
],
"mountPoints": [],
"ulimits": [],
"user": "root",
"resourceRequirements": [],
"linuxParameters": {
"devices": []
}
}
}
For those of you who have the same problem, here is the solution that worked for me. It took me a few days to figure it out.
The default AWS AMI snapshot needs at least 30 GB of storage. When you do not use a launch template, CloudFormation uses the correct storage size.
In my case, I defined only 8 GB of storage in my launch template, and when that launch template was used, the jobs got stuck in RUNNABLE.
Simply change the storage in your launch template to anything bigger than 30 GB and it should work.
Also, do not forget that IamInstanceProfile and SecurityGroupIds are required in the launch template for the job to start.
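A minimal sketch of such a launch template created with the AWS CLI, reusing the key pair and security groups from the compute environment above (the device name assumes the ECS-optimized Amazon Linux 2 AMI; adjust everything to your environment):
aws ec2 create-launch-template \
  --launch-template-name senet-batch-lt \
  --launch-template-data '{
    "BlockDeviceMappings": [
      {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 30, "VolumeType": "gp2", "DeleteOnTermination": true}}
    ],
    "IamInstanceProfile": {"Arn": "arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole"},
    "SecurityGroupIds": ["sg-cdbbd9af", "sg-047ea19daf36aa269"],
    "KeyName": "senet-test-keys"
  }'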

After updating Fargate TaskDefinition, CloudWatch events that trigger tasks fail because of inactive task definitions

I have a series of tasks defined in ECS that run on a recurring schedule. I recently made a minor change to update my task definition in Terraform to change default environment variables for my container (from DEBUG to PRODUCTION):
"environment": [
{"name": "ENVIRONMENT", "value": "PRODUCTION"}
]
I had this task running using the Scheduled Tasks feature of Fargate, setting it at a rate of every 4 hours. However, after updating my task definition, I began to see that the tasks were not being triggered by CloudWatch, since my last container log was from several days ago.
I dug deeper into the issue using CloudTrail, and noticed one particular part of the entry for a RunTask event:
"eventTime": "2018-12-10T17:26:46Z",
"eventSource": "ecs.amazonaws.com",
"eventName": "RunTask",
"awsRegion": "us-east-1",
"sourceIPAddress": "events.amazonaws.com",
"userAgent": "events.amazonaws.com",
"errorCode": "InvalidParameterException",
"errorMessage": "TaskDefinition is inactive",
Further down in the log, I noticed that the task definition ECS was attempting to run was
"taskDefinition": "arn:aws:ecs:us-east-1:XXXXX:task-
definition/important-task-name:2",
However, in my ECS task definitions, the latest version of important-task-name was 3. So it looks like the events are not triggering because I am using an "inactive" version of my task definition.
Is there any way for me to schedule tasks in AWS Fargate without having to manually go through the console and stop/restart/update each cluster's scheduled update? Isn't there any way to simply ask CloudWatch to pull the latest active task definition?
You can use CloudWatch Event Rules to control scheduled tasks and whenever you update a task definition you can also update your rule. Say you have two files:
myRule.json
{
"Name": "run-every-minute",
"ScheduleExpression": "cron(0/1 * * * ? *)",
"State": "ENABLED",
"Description": "a task that will run every minute",
"RoleArn": "arn:aws:iam::${IAM_NUMBER}:role/ecsEventsRole",
"EventBusName": "default"
}
myTargets.json
{
"Rule": "run-every-minute",
"Targets": [
{
"Id": "scheduled-task-example",
"Arn": "arn:aws:ecs:${REGION}:${IAM_NUMBER}:cluster/mycluster",
"RoleArn": "arn:aws:iam::${IAM_NUMBER}:role/ecsEventsRole",
"Input": "{\"containerOverrides\":[{\"name\":\"myTask\",\"environment\":[{\"name\":\"ENVIRONMENT\",\"value\":\"production\"},{\"name\":\"foo\",\"value\":\"bar\"}]}]}",
"EcsParameters": {
"TaskDefinitionArn": "arn:aws:ecs:${REGION}:${IAM_NUMBER}:task-definition/myTaskDefinition",
"TaskCount": 1,
"LaunchType": "FARGATE",
"NetworkConfiguration": {
"awsvpcConfiguration": {
"Subnets": [
"subnet-xyz1",
"subnet-xyz2",
],
"SecurityGroups": [
"sg-xyz"
],
"AssignPublicIp": "ENABLED"
}
},
"PlatformVersion": "LATEST"
}
}
]
}
Now, whenever there's a new revision of myTaskDefinition you may update your rule, e.g.:
aws events put-rule --cli-input-json file://myRule.json --region $REGION
aws events put-targets --cli-input-json file://myTargets.json --region $REGION
echo 'done'
But of course, replace IAM_NUMBER and REGION with your own account ID and region.
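Note that the TaskDefinitionArn in myTargets.json above omits the revision number; as far as I know, ECS then runs the latest ACTIVE revision of that family when the rule fires, which avoids the "TaskDefinition is inactive" error. If you ever need the exact current revision, a quick sketch with the CLI (describe-task-definition without a revision returns the latest ACTIVE one):
aws ecs describe-task-definition --task-definition myTaskDefinition \
  --query 'taskDefinition.taskDefinitionArn' --output text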
Cloud Map seems like a solution for these types of problems.
https://aws.amazon.com/about-aws/whats-new/2018/11/aws-fargate-and-amazon-ecs-now-integrate-with-aws-cloud-map/