Best way to save "event" data on AWS - amazon-web-services

I have an app running on AWS and I need to save every "event" in a file.
An "event" happens, for instance, when a user logs in to the app. When that happens I need to save this information on a file (presumably I would need to save a time stamp and the session id)
I expect to have a lot of events (of the order of a million per month) and I was wondering what would be the best way to do this.
I thought of writing on S3, but I think I can't append to existing files.
Another option would be to redirect the "event" to the standard output, but would not be the smartest solution.
Any ideas? Also, this needs to be done in python.

There are bunch of options, and your choice would depend on following factors:
Scale of these events.
Cost you are willing to bear.
How you intend on consuming them.
Options:
Log every event into a Kinesis Firehose stream, which in turn can dump your data into S3 or Redshift, as per your configuration.
Setup an Elasticsearch cluster. Log all events in a file on disk, and use Logstash to asynchronously push these into the cluster.
Create a DynamoDB table and log one event per row.

Using CloudWatch Logs you can export the logs to S3. You can use the CloudWatch Logs agent to send your application log file to CloudWatch.

Related

Went over cloudwatch event size for structured application logs, now what?

We have a service that outputs application logs to cloudwatch. We structure the logs into json format, and output them through stdout, which is forwarded by fluentbit to cloudwatch. We then have a stream set up to forward the logs from cloudwatch to s3, followed by glue crawlers, Athena, and quick sight for dashboards.
We go all this working, and I just saw today that there is a 256kb limit in cloudwatch which we went over for some of our application logs. How else can we get our logs out of our service to s3 (or maybe a different data store?) for analysis? Is cloudwatch not the right approach for this? Other option I thought of us to break up the application logs into multiple events, but then we need to plumb through a joinable ID, as well as write etl logic that does more complex joins. Was hoping to avoid it unless it’s considered a better practice than what we are doing.
Thanks!

Storing CloudWatch Logs into S3 (some structured format)

I have a CloudWatch Log Group, this log group continuously receives logging information from my AWS services.
I want to extract some of the logging information from this log-group and want to store that data into S3 in some format (CSV, PARQUET).
I will then use Athena to query this logging data.
I want some sort of automatic mechanism to send these logs continuously to S3.
Can anyone suggest solution for this?
It looks like Athena is able to communicate directly with cloudwatch as shown here. Not sure how performant this is and how costly this turns out.
The other option is to configure Cloudwatch to send data to Firehose via Subscriptions which then dumps it to S3.

Periodic Read from AWS S3 and publish to SQS

I have an S3 bucket with different files. I need to read those files and publish SQS msg for each row in the file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task, read and publish. But can I was AWS for this purpose?
AWS Batch or AWS data pipeline or Lambda.?
I need to pass the date(filename) of the data to be read and published.
Edit : The data volume to be dealt is huge
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
At the core, you need to do two things:
Enumerate all of the object in a bucket in S3, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted, and continuing paging through files from a previous run.
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work to a few dozen files every day.
It all depends on your workload and needs.

Register AWS Redshift activity

As per AWS docs, there's no Redshift-Lambda integration yet.
What we would like to do is monitoring redshift activity in order to do something when a redshift table is created, a copy from S3 is made or a bulk insert is performed.
Is there a way to register this kind of activity, and then do something similar to run a lambda function ir order run a small script or so?
Redshift provides an event notification mechanism. You can find a full list of the event categories and messages here. If that covers the kind of information you are interested in you can simply have your Lambda function add the SNS topic used by Redshift for event notification as an event source and your Lambda function will get called every time an event is sent by Redshift.
You can enable audit logs that end up in s3.
All the info you want is also available in various admin tables with prefixes like stl_, stv_ and pg_. For example, COPY commands from S3 are recorded in stl_load_commits, and stl_utilitytext has info on non-select queries like CREATE.
As for triggering events, you could have S3 trigger a lambda when one of the log files lands or run occasional jobs that query the system tables and take action with something like cron jobs or airflow.

Using DynamoDB to replace logfiles

We are hosting our services in AWS beanstalk managed instances. That is forcing us to move away from files based logging to use database based logging.
Is DynamoDB a good choice for replacing file based logging. If so, what should be the primary key. I thought of using timestamp but multiple messages may be logged by the same service within the same timeStamp so that might not be reliable.
Any advice would be appreciated.
Don't use DynamoDB to store logs. You'll be paying for throughput and space needlessly.
Amazon CloudWatch has built-in logging capabilities.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html
Another alternative is a dedicated logging service such as Loggly which is cloud-based and can receive logs in many common formats, plus they have an API to send custom logs. In the web-based console, you can search and filter through the logs.
As an alternative, why don't you use cloudwatch? I ended up writing a whole app to consolidate logs across ec2 instances in a beanstalk app, then last year AWS opened up cloudwatch as a service, so I junked my stuff. You tell cloudwatch where your logs are on the instance, give it a log group and stream name, and all your logs are consolidated in one spot, in cloudwatch. You can also run alarms off them using the standard AWS setup. It's pretty slick, and easy - don't have to write a front end to do lookups, it's already there.
Don't know what you're using for logging - we are a node.js shop, used winston for logging, and there is a nice NPM module that works with Winston to log automatically, called winston-cloudwatch.