Are there fixed conditions for how long a log stream is open? - amazon-web-services

I'm writing a concurrent tailing utility for watching multiple AWS CloudWatch log groups across many regions simultaneously. In CloudWatch Logs, a log group contains many log streams that are rotated occasionally, so to tail a log group one must find the latest log stream, read it in a loop, periodically check for a newer log stream, and then start reading that one in a loop.
I can't seem to find any documentation on this, but is there a set of published conditions from which I can conclude that a log stream has been "closed"? I'm assuming I'll need multiple tasks tailing multiple log streams in a group up until some cut-off point, but I don't know how to determine that a log stream is complete so that I can stop tailing it.
Does anyone know whether such published conditions exist?
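For reference, the per-stream read loop described above looks roughly like this with boto3. This is a minimal sketch; the region, group, and stream names are placeholders, error handling is omitted, and the "switch to a newer stream" step is exactly the part in question:

```python
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")  # region is a placeholder

def tail_stream(group, stream):
    token = None
    while True:
        kwargs = {"logGroupName": group, "logStreamName": stream, "startFromHead": True}
        if token:
            kwargs["nextToken"] = token
        resp = logs.get_log_events(**kwargs)
        for event in resp["events"]:
            print(event["timestamp"], event["message"])
        # An unchanged forward token only means "no newer events right now";
        # it is NOT a signal that the stream is closed.
        if resp["nextForwardToken"] == token:
            time.sleep(5)
        token = resp["nextForwardToken"]

tail_stream("/aws/lambda/my-function", "2016/01/01/[$LATEST]example")  # placeholder names
```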

I don't think you'll find that published anywhere.
If AWS had some mechanism to know that a log stream was "closed" or would no longer receive log entries, I believe their own console for a stream would make use of it somehow. As it stands, when you view even a very old stream in the console, it still shows the option to resume auto retry at the bottom.
I know this is not a direct answer to your question, but I believe it is strong indirect evidence that AWS can't tell when a log stream is "closed" either. Resuming auto retry on an old log stream generates needless traffic, so if they had a way to know the stream was "closed" they would presumably disable that option for such streams.
The documentation says:
A log stream is a sequence of log events that share the same source.
Since each new "source" creates a new log stream, and since CloudWatch supports many different services and options, there won't be a single answer; it depends on too many factors. For example, with the Lambda service, each Lambda container is a new source, and AWS Lambda may create new containers based on many factors such as execution volume, physical work in its data centers, outages, changes to the Lambda code, etc. And that is just one potential source of log streams.
You've probably explored options, but these may give some insights into ways to achieve what you're looking to do:
The CLI has a tail option that includes all log streams in a group: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/logs/tail.html (though if you're building your own utility the CLI likely won't be an option; a sketch of the equivalent API calls follows this list)
Some options are discussed at "how to view aws log real time (like tail -f)", but there is no mention of published conditions for when a stream is "closed"
When does AWS CloudWatch create new log streams? may yield some insights
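For completeness, here is a hedged sketch of roughly what the CLI's tail does under the hood: polling filter_log_events across the whole group instead of tracking individual streams (boto3; the group name, region, and poll interval are placeholders):

```python
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")  # region is a placeholder

def tail_group(group, poll_seconds=5):
    start = int(time.time() * 1000)   # only show events newer than "now"
    seen = set()                      # de-duplicate on eventId (unbounded here; prune in real code)
    while True:
        kwargs = {"logGroupName": group, "startTime": start}
        while True:
            resp = logs.filter_log_events(**kwargs)
            for e in resp["events"]:
                if e["eventId"] not in seen:
                    seen.add(e["eventId"])
                    print(e["logStreamName"], e["message"])
                    start = max(start, e["timestamp"])
            if "nextToken" not in resp:
                break
            kwargs["nextToken"] = resp["nextToken"]
        time.sleep(poll_seconds)

tail_group("/aws/lambda/my-function")  # placeholder group name
```

Polling the group this way sidesteps the "closed stream" question entirely, since events from any new stream simply show up in the results.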

Related

How to handle bad events in a batch job on EMR

I am running an EMR job which processes logs containing around 15-20M log events. Sometimes a few log events contain badly formatted data that breaks my pipeline. I am looking for options to drop those log events into a file or a queue, so I can verify them, report them to the corresponding service, and reprocess them later, probably not in the same pipeline, since the analysis would need some time for the logs to be corrected.
What are the best options that are available and widely used by companies running batch jobs?
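One common pattern matching what is described above is a "side output": tag each record as good or bad while parsing and write the bad ones to a separate location for later inspection. A minimal PySpark sketch, assuming JSON-line input and hypothetical S3 paths:

```python
# Side-output pattern: keep the main pipeline clean, park malformed records elsewhere.
# The bucket/paths and the parsing/validation logic are hypothetical placeholders.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-etl-with-bad-record-sink").getOrCreate()

raw = spark.sparkContext.textFile("s3://my-bucket/input/")  # hypothetical input path

def classify(line):
    """Tag each line as ('good', normalized_json) or ('bad', raw_line)."""
    try:
        event = json.loads(line)
        # add any schema/field checks you need here
        return ("good", json.dumps(event))
    except ValueError:
        return ("bad", line)

tagged = raw.map(classify).cache()

# Continue the normal pipeline with well-formed events only.
tagged.filter(lambda t: t[0] == "good").map(lambda t: t[1]) \
      .saveAsTextFile("s3://my-bucket/output/clean/")

# Park the malformed events for later inspection / reprocessing.
tagged.filter(lambda t: t[0] == "bad").map(lambda t: t[1]) \
      .saveAsTextFile("s3://my-bucket/output/bad-records/")
```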

Is an SQS needed with a Lambda in this use case?

I'm trying to build a flow that allows a user to enter data that is then stored in RDS. My question is: do I need to go USER -> SQS -> Lambda -> RDS, or is it better to go straight USER -> Lambda -> RDS and skip the queue entirely? Are there going to be scalability issues with the latter?
I do like that SQS can retry a large number of times to guarantee delivery of the data, but is there a similar way to retry with a Lambda alone? It's important that all of the data is stored, and in a timely manner. I'm struggling to see the tradeoffs between the two scenarios.
If anyone has any input on the situation, that would be amazing.
Are there going to be scalability issues with the latter?
It depends on multiple metrics, including traffic, spikes, size of the database, requests per minute, etc.
Putting SQS before Lambda lets you manage the number of database queries per unit of time according to your needs. It is a "queue" and you are consuming that queue. In some business cases it may not be useful (banking transactions, etc.) but in others (analytic calculations) it may be helpful. Instead of making a single insert whenever the Lambda is invoked, you can set a batch size and insert in batches (e.g. 10 records at once), which reduces the number of queries.
You can also define a dead letter queue for problematic data (records that couldn't make it to the database). It will be another queue that you can check later to identify problematic inputs. The documentation can be found here
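A minimal sketch of the batched SQS -> Lambda -> RDS insert described above, assuming an SQS event source mapping with a batch size of 10, a PostgreSQL RDS instance, and psycopg2; the table name and connection settings are placeholders:

```python
import json
import os
import psycopg2

# Connection is created outside the handler so warm invocations can reuse it.
conn = psycopg2.connect(os.environ["DATABASE_URL"])  # placeholder connection string

def handler(event, context):
    rows = []
    for record in event["Records"]:          # up to `batch size` SQS messages per invocation
        body = json.loads(record["body"])
        rows.append((body["user_id"], body["payload"]))

    # One INSERT round-trip for the whole batch instead of one query per message.
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO user_events (user_id, payload) VALUES (%s, %s)", rows
        )
    conn.commit()
    # Raising an exception here would make the whole batch visible again on the
    # queue (and, with a redrive policy, eventually land it in the dead letter queue).
    return {"inserted": len(rows)}
```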

Logging errors and exceptions on AWS Lambda

I have some AWS Lambda functions that run about twenty thousand times a day, so I would like to enable logging/alerting to monitor all the errors and exceptions.
The CloudWatch logs are too noisy, and it is difficult to see the errors.
Now I'm planning to write the logs to an AWS S3 bucket, but this will have an impact on performance.
What's the best way you suggest to log and alert the errors?
An alternative would be to leave everything as it is (from the application perspective) and look at Amazon CloudWatch Logs metric filters.
You use metric filters to search for and match terms, phrases, or values in your log events. When a metric filter finds one of the terms, phrases, or values in your log events, you can increment the value of a CloudWatch metric.
Once you have defined your filter, you can create a CloudWatch alarm on the metric and get notified as soon as your defined threshold is reached :-)
Edit
I didn't check the link from @Renato Gama. Sorry. Just follow the instructions behind that link and your problem should be solved easily...
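A hedged boto3 sketch of the metric filter + alarm approach described above; the log group, filter pattern, metric namespace, and SNS topic ARN are placeholders:

```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Count every log event that contains the word ERROR.
logs.put_metric_filter(
    logGroupName="/aws/lambda/my-function",        # placeholder log group
    filterName="ErrorCount",
    filterPattern="ERROR",
    metricTransformations=[{
        "metricName": "ErrorCount",
        "metricNamespace": "MyApp",                # placeholder namespace
        "metricValue": "1",
    }],
)

# Alarm (and notify an SNS topic) as soon as any errors show up in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="MyApp",
    MetricName="ErrorCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder ARN
)
```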
If you have not tried this already, I suggest creating CloudWatch alerts based on custom metric filters. Have a look here: https://www.opsgenie.com/blog/2014/08/how-to-use-cloudwatch-to-generate-alerts-from-logs
(Of course you don't have to use the OpsGenie service as suggested in the post I linked; you can implement anything that will help you debug the problems.)

How to handle reprocessing scenarios in AWS Kinesis?

I am exploring AWS Kinesis for a data processing requirement that replaces old batch ETL processing with a stream-based approach.
One of the key requirements for this project is the ability to reprocess data in cases when
A bug is discovered and fixed and the application is redeployed. Data needs to be reprocessed from the beginning.
New features are added and the history needs to be reprocessed either fully or partially.
The scenarios are very nicely documented here for Kafka - https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios.
I have seen the timestamp-based shard iterator in Kinesis, and I think a Kafka-like resetter tool could be built using the Kinesis APIs, but it would be great if something like this already exists. Even if it doesn't, it would be good to learn from those who have solved similar problems.
So, does anyone know of any existing resources, patterns and tools available to do this in Kinesis?
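For reference, a replay starting at a point in time can be sketched with the timestamp-based shard iterator mentioned above roughly like this (boto3; the stream name, replay window, and process() stub are placeholders, and a real resetter tool would also need checkpointing and handling for resharding):

```python
import datetime
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"                                                    # placeholder
REPLAY_FROM = datetime.datetime.utcnow() - datetime.timedelta(days=7)   # placeholder window

def process(data: bytes):
    """Placeholder: plug in the existing record-processing logic here."""
    print(len(data))

for shard in kinesis.list_shards(StreamName=STREAM)["Shards"]:
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard["ShardId"],
        ShardIteratorType="AT_TIMESTAMP",
        Timestamp=REPLAY_FROM,
    )["ShardIterator"]

    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        for record in resp["Records"]:
            process(record["Data"])
        if resp["MillisBehindLatest"] == 0:
            break          # caught up with the tip of this shard
        iterator = resp["NextShardIterator"]
```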
I have run into scenarios where I want to reprocess records that Kinesis has already processed, and I have used Kinesis-VCR to re-process them.
Kinesis-VCR records the Kinesis streams and maintains metadata about the files processed by Kinesis at a given time.
Later, we can use it to re-process/replay the events for any given time range.
Here is the GitHub link for it.
https://github.com/scopely/kinesis-vcr
Let me know if this works for you.
Thanks & Regards,
Srivignesh KN

AWS CloudWatchLog limit

I am trying to find a centralized solution to move my application logging out of the database (RDS).
I was thinking of using CloudWatch Logs, but noticed that there is a limit on PutLogEvents requests:
The maximum rate of a PutLogEvents request is 5 requests per second per log stream.
Even if I break my logs into many streams (based on EC2 instance and log type: error, info, warning, debug), the limit of 5 requests per second is still very restrictive for an active application.
The other solution is to somehow accumulate logs and send PutLogEvents with a batch of log records, but that means I am forced to use a database to accumulate those records.
So the questions are:
Maybe I'm wrong and the limit of 5 requests per second is not so restrictive?
Is there any other solution that I should consider, for example DynamoDB?
PutLogEvents is designed to put several events per call by definition (as per its name: PutLogEvent"S") :) The CloudWatch Logs agent does this on its own, so you don't have to worry about it.
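For illustration, a hedged sketch of what that batching looks like if you do call the API yourself with boto3 (the group/stream names and flush threshold are placeholders; older versions of the API also required passing a sequence token between calls):

```python
import time
import boto3

logs = boto3.client("logs")
GROUP, STREAM = "my-app", "my-instance"       # placeholder group/stream names

buffer = []

def log(message):
    """Accumulate records in memory instead of calling the API once per record."""
    buffer.append({"timestamp": int(time.time() * 1000), "message": message})
    if len(buffer) >= 100:                    # flush by size (could also flush by age)
        flush()

def flush():
    global buffer
    if buffer:
        # One PutLogEvents call carries the whole batch, staying well under the
        # per-stream request-rate limit.
        logs.put_log_events(logGroupName=GROUP, logStreamName=STREAM, logEvents=buffer)
        buffer = []
```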
However, please note: I don't recommend generating too many logs (e.g. don't run debug mode in production), as CloudWatch Logs can become pretty expensive as your log volume grows.
My advice would be to use a Logstash solution on an AWS instance.
Alternatively, you can run Logstash on another existing instance or container.
https://www.elastic.co/products/logstash
It is designed for this scope and it does it wonderfully.
CloudWatch is not primarily designed for your needs.
I hope this helps somehow.
If you are calling this API directly from your application, the short answer is that you need to batch your log events (the 5 requests per second limit applies to PutLogEvents calls, not to individual events).
If you are writing the logs to disk and then pushing them, there is already an agent that knows how to push the logs (http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/QuickStartEC2Instance.html)
Meta: I would suggest that you prototype this and ensure that it works for the log volume you have. Also keep in mind that, because of how the CloudWatch API works, only one application/user can push to a log stream at a time (see the sequence token you have to pass in), so you probably need to use multiple streams, one per user / maybe per log type, to ensure that your applications are not competing for the same log stream.
Meta meta: think about how your application behaves if the logging subsystem fails and whether you can live with the possibility of losing logs (i.e. is it critical for you to always have the guarantee that you will get the logs?). This will probably drive what you do / what solution you ultimately pick.