Understanding AWS Glue detailed job metrics

Please see the attached screenshot of the CPU Load: Driver and Executors chart. It looks fine for the first 6 minutes: multiple executors are active. But after 6 minutes the chart only shows the Executor Average and Driver lines. When I hover over the line, there is no usage data for any of the 17 executors. Does that mean all the executors are inactive after 6 minutes? How is the Executor Average calculated?
Thank you.

After talking to AWS support, I finally got the answer to why, after 04:07, there are no lines for individual executors but only the Executor Average and the Driver.
I was told there are 62 executors for the job, but at any given moment at most 17 of them are in use. So the Executor Average is the average over a different set of 17 executors at each point in time. The default CPU Load chart only shows Executors 1 to 17, not 18 to 62. To see the other executors, you need to add their metrics manually.
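For anyone who needs to do the same, here is a minimal boto3 sketch of pulling the CPU load for executors 18 to 62 straight from CloudWatch. The job name and run ID are placeholders, and the metric and dimension names (glue.<executorId>.system.cpuSystemLoad with JobName, JobRunId, and Type dimensions) follow the pattern described in the Glue job metrics documentation, so verify them against what your job actually publishes:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Hypothetical job name and run ID; replace with your own values.
JOB_NAME = "my-glue-job"
JOB_RUN_ID = "jr_0123456789abcdef"

end = datetime.utcnow()
start = end - timedelta(hours=1)

# Pull CPU load for executors 18-62, which the default chart does not display.
for executor_id in range(18, 63):
    stats = cloudwatch.get_metric_statistics(
        Namespace="Glue",
        MetricName=f"glue.{executor_id}.system.cpuSystemLoad",
        Dimensions=[
            {"Name": "JobName", "Value": JOB_NAME},
            {"Name": "JobRunId", "Value": JOB_RUN_ID},
            {"Name": "Type", "Value": "gauge"},
        ],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    if datapoints:
        latest = datapoints[-1]
        print(f"executor {executor_id}: {latest['Average']:.2f} at {latest['Timestamp']}")
```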

Related

How to reduce system lag in dataflow job

How to reduce System Lag from Dataflow streaming job?
Job details:
Machine Type - n1-highmem-2
num_workers - 120
max_num_workers - 120
region - us-central1
worker_zone - us-central1-a
System lag at the start of the job is usually under 20 seconds; as the job runs longer, the system lag starts to creep up, reaching a maximum of about an hour (60 minutes).
We tried troubleshooting the code to see whether something gets stuck while processing the messages, but everything seems to be fine.
Are there any specific metrics we should check that would hint at why this system lag is increasing, and how can we learn more about this lag and why it is occurring?
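Not a full answer, but one place to look is the job-level lag metric that Dataflow exports to Cloud Monitoring. Below is a rough sketch of pulling it with the google-cloud-monitoring client; the project ID and job name are placeholders, and the metric type and label names should be checked against what your project actually exposes:

```python
import time
from google.cloud import monitoring_v3

# Hypothetical project and job name; replace with your own.
PROJECT_ID = "my-project"
JOB_NAME = "my-streaming-pipeline"

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# System lag reported by Dataflow for the job over the last hour.
series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type="dataflow.googleapis.com/job/system_lag" '
            f'AND resource.labels.job_name="{JOB_NAME}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    for point in ts.points:
        # The value type depends on the metric; system_lag is reported in seconds.
        lag = point.value.int64_value or point.value.double_value
        print(point.interval.end_time, lag, "seconds of lag")
```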

The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of 10 seconds

I have multiple clusters in AWS (managed by AWS - EKS). Some nodes generate the following alert (each node is m6i.xlarge, 4 CPU, 16 GB, running 5.4.156-83.273.amzn2.x86_64, kubelet version v1.21.5-eks-bc4871b):
The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of 10 seconds
Other nodes in the same cluster continue to work fine. The node is small (16 GB, 4 CPU) but does not have any real load (load average was less than 0.6 when I noticed one such alert). The alert came from multiple clusters. The number of pods had reached about 70 when the alert fired.
The /var/log/messages file on the node contains many messages similar to the one below:
Feb 21 04:53:39 ip-10-11-19-219.prod.us-east-1.xxxxx kubelet[3394]: I0221 04:53:39.904698 3394 kubelet.go:1973] "SyncLoop (PLEG): event for pod" pod="zcr-aqua/scan-configauditreport-fd9bd5d4b-rc6zm" event=&{ID:0731658f-5eeb-412f-b85a-4a16f1a298b6 Type:ContainerStarted Data:9adeeeb232ff8ed47a160ee2894f6537dd433e4a1f3321e3778baf0d68cd92f0}
Is it possible that pods from one bad job (zcr-aqua/scan-configauditreport*) can create this situation and trigger the alert? All messages in the kubelet log reference this name.
(There are some PLEG-related observations in https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes#conclusions, but they did not appear to be applicable in this case.)
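One way to confirm whether the PLEG relist duration is actually elevated on that node (rather than going by the alert alone) is to read the kubelet's own metrics through the API server's node proxy. A minimal sketch, assuming kubectl is already configured for the cluster and using a hypothetical node name:

```python
import subprocess

# Hypothetical node name; replace with the node that raised the alert.
NODE = "ip-10-11-19-219.example.internal"

# Read the kubelet's Prometheus metrics via the API server node proxy.
metrics = subprocess.run(
    ["kubectl", "get", "--raw", f"/api/v1/nodes/{NODE}/proxy/metrics"],
    check=True,
    capture_output=True,
    text=True,
).stdout

# Print the PLEG relist duration/interval histograms, which are what the
# "PLEG ... 99th percentile duration" alert is typically computed from.
for line in metrics.splitlines():
    if "kubelet_pleg_relist" in line and not line.startswith("#"):
        print(line)
```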

Network data out - nmon/nload vs AWS CloudWatch disparity

We are running a video conferencing server in an EC2 instance.
Since this is a data out (egress) heavy app, we want to monitor the network data out closely (since we are charged heavily for that).
As seen in the screenshot above, in our test, nmon (top right) and nload (left) on our EC2 server show the network out as 138 Mbits/s in nload and 17263 KB/s in nmon, which are very close (138 / 8 = 17.25).
But when we check the network out (bytes) in AWS CloudWatch (bottom right), the number shown is very high (~1 GB/s, which makes more sense for the test we are running), and this is the number we are ultimately charged for.
Why is there such a big difference between nmon/nload and AWS Cloudwatch?
Are we missing some understanding here? Are we not looking at the AWS Cloudwatch metrics correctly?
Thank you for your help!
Edit:
Adding a screenshot of a longer test, which shows the average network-out metric in AWS CloudWatch flat at around 1 GB for the test duration, while nmon shows an average network out of 15816 KB/s.
Just figured out the answer to this.
The following link talks about the periods of data capture in AWS:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
Periods
A period is the length of time associated with a specific Amazon CloudWatch statistic. Each statistic represents an aggregation of the metrics data collected for a specified period of time. Periods are defined in numbers of seconds, and valid values for period are 1, 5, 10, 30, or any multiple of 60. For example, to specify a period of six minutes, use 360 as the period value. You can adjust how the data is aggregated by varying the length of the period. A period can be as short as one second or as long as one day (86,400 seconds). The default value is 60 seconds.
Only custom metrics that you define with a storage resolution of 1 second support sub-minute periods. Even though the option to set a period below 60 is always available in the console, you should select a period that aligns to how the metric is stored. For more information about metrics that support sub-minute periods, see High-resolution metrics.
As described in the link above, unless you publish a custom metric with 1-second storage resolution, AWS does not capture sub-minute data, so the finest resolution available is one data point per minute.
So, in our case, the network-out traffic within each 60-second window is aggregated and captured as a single data point.
Even if I change the statistic to Average and the period to 1 second, it still shows one data point per minute.
Now, if I divide the 1.01 GB shown by AWS by 60, I get the per-second rate, roughly 16.8 MB/s, which is very close to the rate shown by nmon and nload.
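The same conversion can be done directly against the CloudWatch API: request the Sum of NetworkOut per period and divide by the period length in seconds. A rough boto3 sketch, with a placeholder instance ID:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

PERIOD = 60  # seconds; use 300 if the instance only has basic (5-minute) monitoring
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=PERIOD,
    Statistics=["Sum"],
)

# Each datapoint's Sum is the total bytes sent in that period; dividing by the
# period length gives an average throughput comparable to nmon/nload output.
for dp in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
    bytes_per_second = dp["Sum"] / PERIOD
    print(dp["Timestamp"], f"{bytes_per_second / 1e6:.1f} MB/s")
```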
From the AWS docs:
NetworkOut: The number of bytes sent out by the instance on all network interfaces. This metric identifies the volume of outgoing network traffic from a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
The NetworkOut graph in your case does not represent the current speed; it represents the number of bytes sent out by all network interfaces in the last 5 minutes. If my calculations are correct, we should get the following values:
1.01 GB ~= 1027 MB (reading from your graph)
To get the average speed for the last 5 minutes:
1027 MB / 300 = 3.42333 MB/s ~= 27.38 Mbits/s
It is still more than what you are expecting, although this is just an average for the last 5 minutes.

AWS GlueJob Error - Command failed with exit code 137

I am executing an AWS Glue job with the Python shell. It fails intermittently with the error "Command failed with exit code 137", yet at other times it executes perfectly fine with no changes.
What does this error signify? Are there any changes we can make in the job configuration to handle it?
Exit code 137 generally means the process was killed with SIGKILL (137 = 128 + 9), which usually indicates the job ran out of memory. Adding the worker type to the Job Properties will resolve the issue. Based on the file size, please select the worker type as below:
Standard – When you choose this type, you also provide a value for Maximum capacity. Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs and jobs that run ML transforms.
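For reference, here is a rough boto3 sketch of creating a Glue job with an explicit worker type and number of workers; the job name, script location, and role ARN are placeholders, and the WorkerType/NumberOfWorkers settings shown apply to Spark (glueetl) jobs:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; replace with your own job, script location, and IAM role.
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={
        "Name": "glueetl",              # Spark ETL job
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",                  # 1 DPU per worker: 4 vCPU, 16 GB memory, 64 GB disk
    NumberOfWorkers=10,
)
```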

Do Google Cloud background functions have max timeout?

We have been using Google Cloud Functions with HTTP triggers, but ran into the limitation of a maximum timeout of 540 seconds.
Our jobs are background jobs, typically data pipelines, with processing times often longer than 9 minutes.
Do background functions have this limit, too? It is not clear to me from the documentation.
All functions have a maximum configurable timeout of 540 seconds.
If you need something to run longer than that, consider delegating that work to run on another product, such as Compute Engine or App Engine.
2nd generation Cloud Functions that are triggered over HTTPS can have a maximum timeout of 1 hour instead of the 9-minute (540-second) limit.
See also: https://cloud.google.com/functions/docs/2nd-gen/overview
You can then trigger this 2nd gen Cloud Function with for example Cloud Scheduler.
When creating the job in Cloud Scheduler you can set the Attempt deadline to 30 minutes. This is the deadline for each job attempt; if an attempt exceeds it, the attempt is cancelled and considered a failed job.
See also: https://cloud.google.com/scheduler/docs/reference/rest/v1/projects.locations.jobs#Job
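As an illustration, here is a minimal sketch of creating such a Scheduler job with the Python client library; the project, region, schedule, and function URL are placeholders, and attempt_deadline is the field behind the "Attempt deadline" setting mentioned above:

```python
from google.cloud import scheduler_v1
from google.protobuf import duration_pb2

# Hypothetical project, region, and function URL; replace with your own.
PROJECT_ID = "my-project"
REGION = "us-central1"
FUNCTION_URL = "https://my-function-abcdefghij-uc.a.run.app"

client = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{PROJECT_ID}/locations/{REGION}"

job = scheduler_v1.Job(
    name=f"{parent}/jobs/trigger-pipeline",
    schedule="0 * * * *",          # hourly, as an example
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        uri=FUNCTION_URL,
        http_method=scheduler_v1.HttpMethod.POST,
    ),
    # Give each attempt up to 30 minutes before it is cancelled and marked failed.
    attempt_deadline=duration_pb2.Duration(seconds=1800),
)

created = client.create_job(parent=parent, job=job)
print("Created", created.name)
```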
The maximum run time of 540 seconds applies to all Cloud Functions, no matter how they're triggered. If you want to run something longer you will have to either chop it into multiple parts, or run it on a different platform.