Limits for AWS Batch job details retention - amazon-web-services

I'm trying to understand how long the details associated with an AWS Batch job are retained. For example, the Kinesis limits page describes how each stream defaults to a 24-hour retention period that is extendable up to 7 days.
The AWS Batch limits page does not include any details about either the maximum time or the maximum count allowed for jobs. It does say that one million is the limit for SUBMITTED jobs, but it's unclear whether that applies exclusively to SUBMITTED or includes other states as well.
Does anybody know the details of batch job retention?

Job metadata for SUCCEEDED and FAILED jobs is retained for 24 hours. Metadata for jobs in SUBMITTED, PENDING, RUNNABLE, STARTING, and RUNNING remains in the queue until the job completes. Your AWS Batch jobs also log STDERR/STDOUT to CloudWatch Logs, where you control the retention policy.

From AWS Batch official doc - https://docs.aws.amazon.com/batch/latest/userguide/batch_user.pdf
Under Jobs -> Job States (Page 23)
FAILED
The job has failed all available attempts. The job state for FAILED jobs is persisted in AWS Batch for at least 24 hours.
Note
Logs for FAILED jobs are available in CloudWatch Logs; the log group is /aws/batch/job, and the log stream name format is first200CharsOfJobDefinitionName/default/ecs_task_id (this format may change in the future). After a job reaches the RUNNING status, you can programmatically retrieve its log stream with the DescribeJobs API operation. For more information, see View Log Data Sent to CloudWatch Logs in the Amazon CloudWatch Logs User Guide. By default, these logs are set to never expire, but you can modify the retention period. For more information, see Change Log Data Retention in CloudWatch Logs in the Amazon CloudWatch Logs User Guide.
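As a rough illustration of the DescribeJobs lookup mentioned in the note above, here is a minimal sketch. It assumes single-container jobs whose log stream appears under `container.logStreamName` in the response, and that you pass in a boto3 Batch client (the function name `log_streams_for_jobs` is my own, not part of any AWS SDK).

```python
from typing import Any, Dict, List, Optional


def log_streams_for_jobs(batch_client: Any, job_ids: List[str]) -> Dict[str, Optional[str]]:
    """Map each job ID to its CloudWatch Logs stream name, or None if not yet assigned."""
    streams: Dict[str, Optional[str]] = {}
    # DescribeJobs accepts up to 100 job IDs per call, so query in chunks.
    for start in range(0, len(job_ids), 100):
        chunk = job_ids[start:start + 100]
        response = batch_client.describe_jobs(jobs=chunk)
        for job in response.get("jobs", []):
            # Assumption: a single-container job; the stream name is only
            # present once the job has reached RUNNING.
            streams[job["jobId"]] = job.get("container", {}).get("logStreamName")
    return streams
```

You would call it as `log_streams_for_jobs(boto3.client("batch"), ["<job-id>"])`. Jobs whose metadata has already been purged simply do not appear in the result, so look the stream name up while the job record still exists.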

Related

AWS CloudWatch Logs Data limitation

Is there any limit on how much data can be sent to AWS CloudWatch Logs? My application produces about 6 million log records every 3 days. Will CloudWatch Logs be able to handle that much data?
Check out the AWS quotas page. Not sure what you mean by "60lac", but the limits on CloudWatch are more than adequate for the majority of use cases.
There is no published limit on the overall data volume held. There will be a practical limit somewhere, but a single AWS customer won't hit it. If you're using the PutLogEvents API you could be constrained by the limit of 5 requests per second per log stream, in which case consider using more streams or larger batches of events (up to 1 MB).
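To make the "larger batches" suggestion concrete, here is a minimal sketch of grouping events so each PutLogEvents call stays within the batch limits. It assumes the documented sizing rule (UTF-8 message bytes plus 26 bytes per event, at most 1,048,576 bytes and 10,000 events per batch). It does not handle a single oversized event, timestamp ordering, or the 24-hour span rule; `batch_log_events` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Iterator, List

MAX_BATCH_BYTES = 1_048_576   # PutLogEvents limit on total batch size
MAX_BATCH_EVENTS = 10_000     # PutLogEvents limit on events per batch
PER_EVENT_OVERHEAD = 26       # bytes counted per event on top of the message


def batch_log_events(events: Iterable[Dict]) -> Iterator[List[Dict]]:
    """Yield lists of events, each small enough for one PutLogEvents call."""
    batch: List[Dict] = []
    batch_bytes = 0
    for event in events:
        size = len(event["message"].encode("utf-8")) + PER_EVENT_OVERHEAD
        # Start a new batch when adding this event would break either limit.
        if batch and (batch_bytes + size > MAX_BATCH_BYTES or len(batch) >= MAX_BATCH_EVENTS):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded list can be passed as `logEvents` to one `put_log_events` call, which keeps the request count per stream low.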

Do I need to send the original application log files to S3 if I have Cloudwatch running?

I have my app writing logs to /var/log/my_app.log. I have logrotate set up to rotate the log daily, so presumably when the rotation condition is met it will copy my_app.log over to my_app<date>.log. I also have the CloudWatch agent on the same EC2 instance sending the files to CloudWatch Logs. There they will stay indefinitely, I assume (or until a retention time set in the AWS console). Is it correct to assume that CloudWatch will always have the first log created and logged regardless of how I rotate the actual log files on the EC2 instance? That is to say, no matter what happens with the rotated logs, will I always have ALL the logs that have been created because they've been sent to CloudWatch?
Any logs that are sent to CloudWatch will not be deleted because of log rotation. The CloudWatch Logs Agent FAQs section of the following page answers some important questions, including the log rotation naming schemes and the scenarios in which log events can be truncated or skipped:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Your assumption is correct on the log retention. CloudWatch logs are stored indefinitely by default.
Here is the quote from Amazon documentation
Log Retention – By default, logs are kept indefinitely and never expire. You can adjust the retention policy for each log group, keeping the indefinite retention, or choosing a retention period between 10 years and one day.
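If you want to change that default per log group from code rather than the console, something like the following sketch should work. It assumes you pass in a boto3 CloudWatch Logs client; `set_log_group_retention` is a hypothetical helper name, and the log group name in the usage line is only an example.

```python
from typing import Any, Optional


def set_log_group_retention(logs_client: Any, log_group: str, days: Optional[int]) -> None:
    """Set a retention period in days, or pass None to restore 'never expire'."""
    if days is None:
        # Removing the policy returns the group to indefinite retention.
        logs_client.delete_retention_policy(logGroupName=log_group)
    else:
        # CloudWatch Logs accepts only specific values (for example 1, 7, 30, 365);
        # the service rejects anything else, so nothing is validated locally.
        logs_client.put_retention_policy(logGroupName=log_group, retentionInDays=days)
```

For example, `set_log_group_retention(boto3.client("logs"), "/var/log/my_app.log", 30)` would expire events after 30 days, assuming the agent was configured with that log group name.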

How can we visualize the Dataproc job status in Google Cloud Platform?

How can we visualize (via Dashboards) the Dataproc job status in Google Cloud Platform?
We want to check whether jobs are running, along with their status (running, delayed, blocked). On top of that, we want to set up alerting (Stackdriver Alerting) as well.
This page lists all the metrics available in Stackdriver:
https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataproc
You could use cluster/job/submitted_count, cluster/job/failed_count and cluster/job/running_count to create the dashboard and metrics
Also, you could use cluster/job/completion_time to warn about long-running jobs and cluster/job/duration to check if jobs are enqueued in PENDING status for a long time.
cluster/job/completion_time is logged only after the job has completed. For example, if the job takes 7 hours to complete, the metric is only registered at the 7th hour.
Similarly, cluster/job/duration logs the time spent in each state only after the job leaves that state. If a job was in the PENDING state for 1 hour, you would only see this metric at the 60th minute.
Dataproc has an open issue to introduce more metrics that would help with this active alerting use case: https://issuetracker.google.com/issues/211910984

Cloudwatch monitor for Stl_Load_Errors

We use Kinesis Firehose to load data into a number of Redshift tables. There are monitors available to see successful deliveries. However, there is no monitor for checking if there are any errors in the delivery - the ones that get recorded to stl_load_errors table.
I do have an option to create a lambda that reads the stl_load_errors table and writes to cloudwatch metrics. But, I would like to know if there is any out of the box solution to monitor it.
In the Monitoring tab of the Firehose delivery stream that targets Redshift, check the DeliveryToRedshift Success (Average) metric.
Also see "Monitoring with Amazon CloudWatch Metrics".
Enable error logging if it is not already enabled, and check the error logs for delivery failures; see "Monitoring with Amazon CloudWatch Logs".
If I’ve made a bad assumption please comment and I’ll refocus my answer.

How does Amazon CloudWatch batch logs when streaming to AWS Lambda?

The AWS documentation indicates that multiple log event records are provided to Lambda when streaming logs from CloudWatch.
logEvents: The actual log data, represented as an array of log event records. The "id" property is a unique identifier for every log event.
How does CloudWatch group these logs?
Time? Count? Randomly, from my perspective?
Currently you get one Lambda invocation for every PutLogEvents batch that CloudWatch Logs receives for that log group. However, you should probably not rely on that, because AWS could change it at any time (for example, by batching more).
You can observe this behavior by running the CWL -> Lambda example in the AWS docs.
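Since the batch size is not something to rely on, the handler should just iterate over whatever arrives. Here is a minimal sketch of decoding the payload, assuming the usual subscription format where `event["awslogs"]["data"]` is base64-encoded, gzip-compressed JSON. The handler's return value is purely illustrative.

```python
import base64
import gzip
import json
from typing import Any, Dict


def decode_cloudwatch_logs_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return the decoded payload (logGroup, logStream, logEvents, ...)."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    payload = decode_cloudwatch_logs_event(event)
    # An invocation may carry one log event or many; never assume a fixed count.
    messages = [e["message"] for e in payload["logEvents"]]
    return {"logGroup": payload["logGroup"], "count": len(messages)}
```

Logging `len(payload["logEvents"])` on each invocation is an easy way to see how your own traffic ends up grouped.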
Some AWS services, such as Elastic Load Balancing, let you configure the log interval; there you choose between five-minute and sixty-minute intervals. You may not see a specific increment or parameter in the docs because the interval is configured per service.