How to reduce system lag in dataflow job - google-cloud-platform

How to reduce System Lag from Dataflow streaming job?
Job details -
Machine Type - n1-highmem-2 ,
num_workers - 120 ,
max_num_workers - 120,
region - us-central1,
worker_zone - us-central1-a
System lag at the start of the job is usually under 20 sec, as the job time increases the system lag starts to creep up max upto an hour i.e. 60 minutes.
Tried troubleshooting the code whether something is stuck while processing the messages but everything seems to be fine.
Are there any specific metrics we need to check which will give us hint on why this system lag is increasing and how can we know more about this lag and why it is occuring?

Related

Network data out - nmon/nload vs AWS Cloudwatch disparity

We are running a video conferencing server in an EC2 instance.
Since this is a data out (egress) heavy app, we want to monitor the network data out closely (since we are charged heavily for that).
As seen in the screenshot above, in our test, using nmon (top right) or nload (left) in our EC2 server shows the network out as 138 Mbits/s in nload and 17263 KB/s in nmon which are very close (138/8 = 17.25).
But, when we check the network out (bytes) in AWS Cloudwatch (bottom right), the number shown is very high (~ 1 GB/s) (which makes more sense for the test we are running), and this is the number for which we are finally charged.
Why is there such a big difference between nmon/nload and AWS Cloudwatch?
Are we missing some understanding here? Are we not looking at the AWS Cloudwatch metrics correctly?
Thank you for your help!
Edit:
Adding the screenshot of a longer test which shows the average network out metric in AWS Cloudwatch to be flat around 1 GB for the test duration while nmon shows average network out of 15816 KB/s.
Just figured out the answer to this.
The following link talks about the periods of data capture in AWS:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
Periods
A period is the length of time associated with a specific
Amazon CloudWatch statistic. Each statistic represents an aggregation
of the metrics data collected for a specified period of time. Periods
are defined in numbers of seconds, and valid values for period are 1,
5, 10, 30, or any multiple of 60. For example, to specify a period of
six minutes, use 360 as the period value. You can adjust how the data
is aggregated by varying the length of the period. A period can be as
short as one second or as long as one day (86,400 seconds). The
default value is 60 seconds.
Only custom metrics that you define with a storage resolution of 1
second support sub-minute periods. Even though the option to set a
period below 60 is always available in the console, you should select
a period that aligns to how the metric is stored. For more information
about metrics that support sub-minute periods, see High-resolution
metrics.
As seen in the link above, if we don't set a custom metric with custom periods, AWS by default does not capture sub-minute data. So, the lowest resolution of data available is every 1 minute.
So, in our case, the network out data within 60 seconds is aggregated and captured as a single data point.
Even if I change the statistic to Average and the period to 1 second, it still shows every 1 minute data.
Now, if I divide 1.01 GB (shown by AWS) with 60, I get the per second data which is roughly around 16.8 MBps which is very close to the data shown by nmon or nload.
From the AWS docs:
NetworkOut: The number of bytes sent out by the instance on all network interfaces. This metric identifies the volume of outgoing network traffic from a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
The NetworkOut graph in your case does not represent the current speed, it represents the number of bytes sent out by all network interfaces in the last 5 minutes. If my calculations are correct, we should get the following values:
1.01 GB ~= 1027 MB (reading from your graph)
To get the average speed for the last 5 minutes:
1027 MB / 300 = 3.42333 MB/s ~= 27.38 Mbits/s
It is still more than what you are expecting, although this is just an average for the last 5 minutes.

Do Google Cloud background functions have max timeout?

We have been using Google Cloud Functions with http-triggers, but ran into the limitation of a maximum timeout of 540 s.
Our jobs are background jobs, typically datapipelines, with processing times often longer than 9 minutes.
Do background functions have this limit, too? It is not clear to me from the documentation.
All functions have a maximum configurable timeout of 540 seconds.
If you need something to run longer than that, consider delegating that work to run on another product, such as Compute Engine or App Engine.
2nd Generation Cloud Functions that are triggered by https can have a maximum timeout of 1 hour instead of the 10 minute limit.
See also: https://cloud.google.com/functions/docs/2nd-gen/overview
You can then trigger this 2nd gen Cloud Function with for example Cloud Scheduler.
When creating the job on Cloud Scheduler you can set the Attempt deadline config to 30 minutes. This is the deadline for job attempts. Otherwise it is cancelled and considered a failed job.
See also: https://cloud.google.com/scheduler/docs/reference/rest/v1/projects.locations.jobs#Job
The maximum run time of 540 seconds applies to all Cloud Functions, no matter how they're triggered. If you want to run something longer you will have to either chop it into multiple parts, or run it on a different platform.

high cpu in redis 2.8 (elasticache) cache.r3.large

looking for some help in ElasticCache
We're using ElasticCache Redis to run a Resque based Qing system.
this means it's a mix of sorted sets and Lists.
at normal operation, everything is OK and we're seeing good response times & throughput.
CPU level is around 7-10%, Get+Set commands are around 120-140K operations. (All metrics are cloudwatch based. )
but - when the system experiences a (mild) burst of data, enqueing several K messages, we see the server become near non-responsive.
the CPU is steady # 100% utilization (metric says 50, but it's using a single core)
number of operation drops to ~10K
response times are slow to a matter of SECONDS per request
We would expect, that even IF the CPU got loaded to such an extent, the throughput level would have stayed the same, this is what we experience when running Redis locally. redis can utilize CPU, but throughput stays high. as it is natively single-cored, not context switching appears.
AFAWK - we do NOT impose any limits, or persistence, no replication. using the basic config.
the size: cache.r3.large
we are nor using periodic snapshoting
This seems like a characteristic of a rouge lua script.
having a defect in such a script could cause a big CPU load, while degrading the overall throughput.
are you using such ? try to look in the Redis slow log for one

AWS EC2 reaches high CPU uses during nighttime

I just implemented a few alarms with CloudWatch last week and I noticed a strange behavior with EC2 small instances everyday between 6h30 and 6h45 (UTC time).
I implemented one alarm to warn me when a AutoScallingGroup have its CPU over 50% during 3 minutes (average sample) and another alarm to warn me when the same AutoScallingGroup goes back to normal, which I considered to be CPU under 30% during 3 minutes (also average sample). Did that 2 times: one for zone A, and another time for zone B.
Looks OK, but something is happening during 6h30 to 6h45 that takes certain amount of processing for 2 to 5 min. The CPU rises, sometimes trigger the "High use alarm", but always triggers the "returned to normal alarm". Our system is currently under early stages of development, so no user have access to it, and we don't have any process/backups/etc scheduled. We barely have Apache+PHP installed and configured, so it can only be something related to the host machines I guess.
Anybody can explain what is going on and how can we solve it besides increasing the sample time or % in the "return to normal" alarm? Folks at Amazon forum said the Service Team would have a look once they get a chance, but it's been almost a week with no return.

Auto scaling batch jobs on Elastic Beanstalk

I am trying to set up a scalable background image processing using beanstalk.
My setup is the following:
Application server (running on Elastic Beanstalk) receives a file, puts it on S3 and sends a request to process it over SQS.
Worker server (also running on Elastic Beanstalk) polls the SQS queue, takes the request, load original image from S3, processes it resulting in 10 different variants and stores them back on S3.
These upload events are happening at a rate of about 1-2 batches per day, 20-40 pics each batch, at unpredictable times.
Problem:
I am currently using one micro-instance for the worker. Generating one variant of the picture can take anywhere from 3 seconds to 25-30 (it seems first ones are done in 3, but then micro instance slows down, I think this is by its 2 second bursty workload design). Anyway, when I upload 30 pictures that means the job takes: 30 pics * 10 variants each * 30 seconds = 2.5 hours to process??!?!
Obviously this is unacceptable, I tried using "small" instance for that, the performance is consistent there, but its about 5 seconds per variant, so still 30*10*5 = 26 minutes per batch. Still not really acceptable.
What is the best way to attack this problem which will get fastest results and will be price efficient at the same time?
Solutions I can think of:
Rely on beanstalk auto-scaling. I've tried that, setting up auto scaling based on CPU utilization. That seems very slow to react and unreliable. I've tried setting measurement time to 1 minute, and breach duration at 1 minute with thresholds of 70% to go up and 30% to go down with 1 increments. It takes the system a while to scale up and then a while to scale down, I can probably fine tune it, but it still feels weird. Ideally I would like to get a faster machine than micro (small, medium?) to use for these spikes of work, but with beanstalk that means I need to run at least one all the time, since most of the time the system is idle that doesn't make any sense price-wise.
Abandon beanstalk for the worker, implement my own monitor of of the SQS queue running on a micro, and let it fire up larger machine(or group of larger machines) when there are enough pending messages in the queue, terminate them the moment we detect queue is idle. That seems like a lot of work, unless there is a solution for this ready out there. In any case, I lose the benefits of beanstalk of deploying the code through git, managing environments etc.
I don't like any of these two solutions
Is there any other nice approach I am missing?
Thanks
CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
Needless to say, if you can budget for a bigger base-line machine everything would run that much faster.