CloudWatch Agent: batch size equal to "1" - is it a bad idea?

CloudWatch Agent: batch size equal to "1" - is it a bad idea? - amazon-web-services

If I correctly understand, a CloudWatch Agent publishes events to CloudWatch by using a of kind of batching, the size of which is specified by the two params:
batch_count:
Specifies the max number of log events in a batch, up to 10000. The
default value is 1000.
batch_size
Specifies the max size of log events in a batch, in bytes, up to
1048576 bytes. The default value is 32768 bytes. This size is
calculated as the sum of all event messages in UTF-8, plus 26 bytes
for each log event.
I guess, that in order to eliminate a possibility of loosing any log data in case of a EC2 instance termination, the batch_count should be equal to 1 (because in case of the instance termination all logs will be destroyed). Am I right that this is only one way to achieve it, and how this can affect the performance? Will it have any noticeable side-effects?

Yes, it's a bad idea. You are probably more likely to lose data that way. The PutLogEvents API that the agent uses is limited to 5 requests per second per log stream (source). With a batch_count of 1, you'd only be able to publish 5 log events per second. If the application were to produce more than that consistently, the agent wouldn't be able to keep up.
If you absolutely can't afford to lose any log data, maybe you should be writing that data to a database instead. There will always be some risk of losing log data, even if with a batch_count of 1. The host could always crash before the agent polls the log file... which BTW is every 5 seconds by default (source).

Related

What does "batch size" exactly mean in case of AWS Lambda?

I read about batch size of getting events from triggers into AWS Lambda, over here: link
If the batch size is set to a 10 (say), then does it mean that lambda will wait until there are 10 messages available to process or is it the upper limit - i.e. when fetching events to process, it'll fetch up to 10 at a time?
If 10 events are fetched to process, then how will they be provided as input to the lambda? Will they be provided as an array of 10 elements?

Batch size is an upper limit:
The number of items in the event can be smaller than the batch size if there aren't enough items available, or if the batch is too large to send in one event and has to be split up.
Yes, as an array. For example, format of such an event is shown for SQS.

AWS Elasticsearch publishing wrong total request metric

We have an AWS Elasticsearch cluster setup. However, our Error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:
((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100
However, if you look at the screenshot below, at 7:15 4xx was 4, however ElasticsearchRequests value is only 2. Based on the metrics info on AWS Elasticsearch documentation page, ElasticsearchRequests should be total number of requests, so it should clearly be greater than or equal to 4xx.
Can someone please help me understand in what I am doing wrong here?

AWS definitions of these metrics are:
OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum
2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum
Please note the different terms used for the subjects of the metrics: cluster vs domain
To my understanding, OpenSearchRequests only considers requests that actually reach the underlying OpenSearch/ElasticSearch cluster, so some the 4xx requests might not (e.g. 403 errors), hence the difference in metrics.
Also, AWS only recommends comparing 5xx to OpenSearchRequests:
5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.

I know this was posted a while back but I've additionally struggled with this issue and maybe I can add a few pointers.
First off, make sure your metrics are properly configured. For instance, some responses (4xx for example) take up to 5 minutes to register, while OpensearchRequests are refershed every minute. This makes for a very wonky graph that will definitely throw off your error rate.
In the picture above, I send a request that returns 400 every 5 seconds, and send a response that returns 200 every 0.5 seconds. The period in this case is 1 minute. This makes it so on average it should be around a 10% error rate. As you can see by the green line, the requests sent are summed up every minute, whereas the the 4xx are summed up every 5 minute, and in between every minute they are 0, which makes for an error rate spike every 5 minutes (since the opensearch requests are not multiplied by 5).
In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.
When I look at your graph, I see metrics that look like they are based off of a different period.
The second pointer I may add is to make sure to account for when no data is coming in. The behavior the alarm has may vary based on your how you define the "treat missing data" parameter. In some cases, if no data comes in, your expression might make it so it stays in alarm when in fact there is only no new data coming in. Some metrics might return no value when no requests are made, while some may return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is returned. Experiment with what happens to your error rate if you divide by zero.
Hope this message helps clarify a bit.

Azure EventHub / Azure Monitoring - Create alert when incoming bytes exceed outgoing bytes, preferrably by a certain amount?

I want to know when my consumer group fails to handle the amount of events coming in into EventHub. Looking at the metrics, I think the symptom is incoming bytes exceed outgoing bytes.
In Azure portal, I only see alert condition when incoming bytes greater than a static number, which is not what I want. Is it even possible to set up condition like this?

I have checked Azure Monitor docs and alerts don't seem to support metric comparison.
One solution might be to create an Azure Logic App which can store and remember incoming/outgoing byte values and then compare them on a scheduled basis.
For variable support you can look at - https://learn.microsoft.com/en-us/azure/logic-apps/logic-apps-create-variables-store-values

Estimate SQS processing time and load

I am going to use AWS SQS(regular queue, not FIFO) to process different client side metrics.
I’m expect to have ~400 messages per second (worst case).My SQS message will contain S3 location of the file.
I created an application, which will listen to my SQS Queue, and process messages from it.
By process I mean:
read SQS message ->
take S3 location from that SQS message ->
call S3 client ->
Read that file ->
Add a few additional fields —>
Publish data from this file to AWS Kinesis Firehose.
Similar process will be for each SQS message in the Queue. The size of S3 file is small, less than 0,5 KB. How can calculate if I will be able to process those 400 messages per second? How can I estimate that my solution would handle x5 increase in data?

How can calculate if I will be able to process those 400 messages per second? How can I estimate that my solution would handle x5 increase in data?
Test it! Start with a small scale, and do the math to extrapolate from there. Make your test environment as close to what it will be in production as feasible.
On a single host and single thread, the math is simple:
1000 / AvgTotalTimeMillis = AvgMessagesPerSecond, or
1000 / AvgMessagesPerSecond = AvgTotalTimeMillis
How to approach testing this:
Start with a single thread and host, and generate some timing metrics for each step that you outlined, along with a total time.
Figure out your average/max/min time, and how many messages per second that translates to
400 messages per second on a single thread & host would be under 3ms per message. Hopefully this makes it obvious you need multiple threads/hosts.
Scale up!
Now that you know how much a single thread can handle, figure out how many threads a single host can effectively handle (you'll need to experiment). Consider batching messages where possible - SQS provides batch operations.
Use math to calculate how many hosts you need
If you need 5X that number, go up from there
While you're doing this math, consider any limits of the systems you're using:
Review the throttling limits of SQS / S3 / Firehose / etc. If you plan to use Lambda to do the work instead of EC2, it has limits too. Make sure you're within those limits, and consider contacting AWS support if you are close to exceeding them.
A few other suggestions based on my experience:
Based on your workflow outline & details, using EC2 you can probably handle a decent number of threads per host
M5.large should be more than enough - you can probably go smaller, as the performance bottleneck will likely be networking I/O to fetch and send messages.
Consider using autoscaling to handle message spikes for when you need to increase throughput, though keep in mind autoscaling can take several minutes to kick in.

The only way to determine this is to create a test environment that mirrors your scenario.
If your solution is designed to handle messages in parallel, it should be possible to scale-up your system to handle virtually any workload.
A good architecture would be to use AWS Lambda functions to process the messages. Lambda defaults to 1000 concurrent functions. So, if a function takes 3 seconds to run, it would support 333 messages per second consistently. You can request for the Lambda concurrency to be increased to handle higher workloads.
If you are using Amazon EC2 instead of Lambda functions, then it would just be a matter of scaling-out and adding more EC2 instances with more workers to handle whatever workload you desired.

How does AWS Kinesis throttle write throughput?

AWS Kinesis has a fairly low write throughput of 1000 writes/sec and 1MB/writes-sec. How does Kinesis enforce this limit? If I were to try to do 1500 writes in a second, would the extra 500 writes be placed into some sort of queue or would they simply fail?

It looks like it simply fails and throws an exception.
An unsuccessfully processed record includes ErrorCode and ErrorMessage values. ErrorCode reflects the type of error and can be one of the following values: ProvisionedThroughputExceededException or InternalFailure. ErrorMessage provides more detailed information about the ProvisionedThroughputExceededException exception including the account ID, stream name, and shard ID of the record that was throttled. For more information about partially successful responses, see Adding Multiple Records with PutRecords in the Amazon Kinesis Data Streams Developer Guide.
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
How the rate limiting is done
Rate Limiting
The KPL includes a rate limiting feature, which limits per-shard throughput sent from a single producer. Rate limiting is implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes. Each successful write to an Kinesis data stream adds a token (or multiple tokens) to each bucket, up to a certain threshold. This threshold is configurable but by default is set 50% higher than the actual shard limit, to allow shard saturation from a single producer.
You can lower this limit to reduce spamming due to excessive retries. However, the best practice is for each producer is to retry for maximum throughput aggressively and to handle any resulting throttling determined as excessive by expanding the capacity of the stream and implementing an appropriate partition key strategy.
https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html

This depends on the way that you're writing the data.
If you're using PutRecord then any request that exceeds the limit will fail with ProvisionedThroughputExceededException and you'll have to retry the request. However, since round-trip times for a single request are on the order of 20-30 ms, you'll need to have a large number of clients to get throttled.
The PutRecords call has a much higher likelihood of being throttled, because you can send up to 500 records in a single request. And if it's throttled, the throttling may affect the entire request or individual records within the request (this could happen if one shard accepts records but another doesn't).
To deal with this, you need to examine the Records list from the PutRecords response. This array corresponds exactly with the Records list from the request, but contains PutRecordsResultEntry values.
If an entry has a SequenceNumber then you're OK: that record was written to a shard. If, however, it has an ErrorCode then you need to copy the record from the request and re-send it (assuming that the error code is throughput exceeded; you could also try resending if it's internal error, but that may not work).
You will need to loop, calling PutRecords until the response doesn't have any unsent messages.
Beware that, due to the possibility of individual records being throttled and resent, you can't guarantee the order that records will appear on a shard (they are stored in the shard in the order that they were received).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js