As per AWS docs, there's no Redshift-Lambda integration yet.
What we would like to do is monitor Redshift activity so that we can react when a Redshift table is created, a COPY from S3 is made, or a bulk insert is performed.
Is there a way to capture this kind of activity and then trigger something like a Lambda function in order to run a small script or so?
Redshift provides an event notification mechanism. You can find a full list of the event categories and messages here. If that covers the kind of information you are interested in, you can simply add the SNS topic Redshift uses for event notification as an event source for your Lambda function, and the function will be called every time Redshift sends an event.
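For example, a minimal handler subscribed to that SNS topic could look like the sketch below. The fields pulled out of the notification are assumptions; check them against a real Redshift event message before relying on them.

```python
import json

def lambda_handler(event, context):
    # SNS delivers one or more records; each carries the Redshift event in the message body.
    for record in event.get("Records", []):
        subject = record["Sns"].get("Subject", "")
        message = record["Sns"]["Message"]
        print(f"Redshift event received: {subject}")
        # The message is usually JSON, but verify the exact format against a real notification.
        try:
            detail = json.loads(message)
        except ValueError:
            detail = {"raw": message}
        # React to the event categories/messages you care about here.
        print(detail)
```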
You can enable audit logging, which delivers the logs to S3.
All the info you want is also available in various admin tables with prefixes like stl_, stv_ and pg_. For example, COPY commands from S3 are recorded in stl_load_commits, and stl_utilitytext has info on non-SELECT queries like CREATE.
As for triggering events, you could have S3 trigger a Lambda when one of the log files lands, or run periodic jobs that query the system tables and take action, scheduled with something like cron or Airflow.
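A minimal sketch of the S3-triggered variant, assuming the bucket is the one Redshift delivers its audit logs to and that an ObjectCreated notification is configured on it:

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Fired by an S3 ObjectCreated notification on the audit-log bucket.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(f"New Redshift audit log: s3://{bucket}/{key} ({obj['ContentLength']} bytes)")
        # Parse the (gzipped) log here and kick off whatever should happen
        # when you spot a CREATE TABLE or COPY statement.
```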
Related
Our Google Analytics data events are exported to BigQuery tables. I have reports that need to run when the events data arrives which are set up as AWS lambdas with python code (for various reasons and I can't immediately move these to be Google Cloud Functions etc).
Is it possible to have the creation of a table trigger a lambda? At present, I have a lambda periodically checking to see if the table has been created which seems suboptimal. Eventarc looks like it might possibly be the way to monitor for the creation event at the BigQuery end but it doesn't seem obvious how you'd interface with AWS.
Any genius ideas? I have dug repeatedly through StackOverflow, but can't see a match for this issue.
Eventarc isn't magic; it's only a wrapper around things that you can do and customize yourself (with a custom destination instead of Cloud Run).
Typically, Eventarc does the following:
Creates a Cloud Logging sink on a specific log filter (filter what you want to turn into custom events)
Sinks the filtered log entries into a Pub/Sub topic
Creates a Pub/Sub push subscription that invokes a Cloud Run HTTP endpoint.
You can create all of those pieces step by step yourself, and in the last one invoke your AWS Lambda instead of Cloud Run.
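As a rough sketch (not a definitive recipe), the last step could be a push subscription pointing at an HTTPS endpoint that fronts your Lambda (a Lambda Function URL or API Gateway). All names and URLs below are placeholders.

```python
from google.cloud import pubsub_v1

# Placeholders for illustration only.
project_id = "my-gcp-project"
topic_id = "bq-table-created"              # topic the Cloud Logging sink writes to
subscription_id = "bq-table-created-to-lambda"
# The Lambda must be reachable over HTTPS, e.g. via a Function URL or API Gateway.
push_endpoint = "https://example.execute-api.us-east-1.amazonaws.com/prod/bq-events"

subscriber = pubsub_v1.SubscriberClient()
topic_path = f"projects/{project_id}/topics/{topic_id}"
subscription_path = subscriber.subscription_path(project_id, subscription_id)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "push_config": pubsub_v1.types.PushConfig(push_endpoint=push_endpoint),
        }
    )
    print(f"Push subscription created: {subscription.name}")
```

On the AWS side the Lambda then just has to parse the Pub/Sub push payload (a JSON body with a base64-encoded message inside) and validate the caller however you see fit.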
But the difficulty is not there. The difficulty comes from the variety of table creation possibilities:
By API call (the table creation API)
By load job (loading a file into a table creates it automatically, without invoking the table creation API)
Directly in SQL with a CREATE TABLE statement (which can also sit inside a script, in dynamic SQL, ...)
And you might also want to capture the other creations (views, materialized views, procedures, functions, ...)
In the end, your current method (periodically querying the schema metadata and picking up the recent additions in a dataset) may well be the most efficient for the effort!
Is this possible?
I did my research, but these are the only events available for RDS:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Events.Messages.html
They are mostly maintenance-type events, but what I want is this: let's say I have an RDS Oracle table called Users. Whenever a record is inserted into the table, an event or stream can be picked up by a Lambda, which then takes the necessary action.
In short: no, not with the existing events you refer to. Those are for monitoring the RDS service itself, not what you actually use it for, i.e. auditing the contents (manipulation/tracking).
You can of course create notifications when an insert occurs, but you'll probably need to build/setup a few things.
A couple of ideas:
Build something closer to the database logic, i.e. in your code base add something that fires an SQS / SNS event.
If you can't (or don't want to) modify the logic that handles the database, maybe you could add a trigger that fires on INSERTs to the Users table. Unfortunately, I don't think there's support for executing a Lambda from a trigger in Oracle (as is currently possible with PostgreSQL).
Set up a database activity stream from RDS to Kinesis to monitor the INSERTs. This is a bit of additional infrastructure to set up, so it might be too much depending on your use case:
"Database Activity Streams is an Amazon RDS feature that provides a near real-time stream of the activity in your Oracle DB instance. Amazon RDS pushes activities to an Amazon Kinesis data stream."
From Kinesis, you can configure AWS Lambda to consume the stream and take action on INSERT events (see the skeleton after the references below).
Some references:
https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis-example.html
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/DBActivityStreams.Enabling.html
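A minimal, heavily simplified skeleton of such a consumer, assuming a Kinesis event source mapping on the activity stream. Note that activity stream payloads are additionally encrypted with the stream's KMS key, so real code needs the AWS Encryption SDK and kms:Decrypt; the envelope field names should be checked against the Activity Streams documentation.

```python
import base64
import json

def lambda_handler(event, context):
    # Invoked by the Kinesis event source mapping on the activity stream.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        envelope = json.loads(payload)
        # The databaseActivityEvents field is encrypted; decrypting it requires
        # the AWS Encryption SDK and the stream's KMS key. After decryption you
        # would filter for INSERT statements on the Users table and act on them.
        print(envelope.get("type"))
```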
I'm trying to create the Architecture on AWS where a lambda function runs SQL Code to refresh a materialized view on AWS Redshift. I would like the materialized view to refresh after the daily ETL processes have completed on the Redshift cluster. Is there a way of setting up the lambda function to be triggered after a particular SQL command on the Redshift Cluster has completed?
Unfortunately, I've only seen examples of people scheduling the Lambda Function to run on particular intervals/at a particular time. Any help would be much appreciated.
A couple of ways that this can be done (out of many):
Have the ETL process trigger the Lambda - this is straightforward if the ETL tool can generate the trigger, but organizational factors can make changing ETL frameworks difficult.
Use an S3 semaphore - have your ETL SQL UNLOAD some small data (like a text string of metadata) to S3, where the object's creation will trigger the Lambda (as sketched below). Insert the UNLOAD at the point in the ETL SQL where you want the update to occur.
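A minimal sketch of the semaphore-triggered Lambda, using the Redshift Data API to run the refresh. The cluster, database, secret and view names are placeholders, and the Data API call is asynchronous, so poll describe_statement if you need to confirm completion.

```python
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    # Triggered by the S3 ObjectCreated event on the semaphore object the ETL UNLOADs.
    response = redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",      # placeholder
        Database="analytics",                # placeholder
        SecretArn="arn:aws:secretsmanager:region:acct:secret:redshift-creds",  # or DbUser=...
        Sql="REFRESH MATERIALIZED VIEW my_schema.my_mv;",
    )
    print("Refresh submitted:", response["Id"])
```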
I'd like to perform the following tasks on a regular basis (e.g. every day at 6AM) using AWS:
get new set of data using API. This dataset is updated on a daily basis.
run a python script that would process the obtained dataset by the means of several python libraries like matplotlib, pandas, plotly
automatically send the output of the script, which would be a single pdf file or a html dashboard, via email to a group of specified recipients
I know how to perform all of the above items locally - my goal is to automate this routine. I'm new to AWS and would appreciate some advice on how to perform these tasks in a straightforward way. Based on the reading I did so far, it looks like the serverless approach may be able to do the job and also reduce the complexity, but I'm not sure which functionalities exactly I should use.
For scheduling you can use Amazon EventBridge.
You can schedule AWS Lambda or AWS Step Functions; both of these are serverless :).
You can have 3 Lambdas:
One to get the data and save it in S3/DynamoDB (if you want to persist the data).
A processor Lambda that saves the report to S3.
Another Lambda to send the email using AWS SES, which reads the report from S3 and sends it.
If you don't want to use Step Functions, you can start a Lambda from an S3 put event, or trigger one Lambda from another using the AWS SDK.
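For instance, the first Lambda could be as small as the sketch below, scheduled by an EventBridge rule; the API URL and bucket name are placeholders. The S3 put can then trigger the processor Lambda, as described above.

```python
import datetime
import urllib.request

import boto3

s3 = boto3.client("s3")

API_URL = "https://api.example.com/daily-dataset"   # placeholder
BUCKET = "my-reports-raw-data"                      # placeholder

def lambda_handler(event, context):
    # Scheduled by an EventBridge rule, e.g. cron(0 6 * * ? *).
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        body = resp.read()
    key = f"raw/{datetime.date.today().isoformat()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return {"saved": f"s3://{BUCKET}/{key}"}
```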
So there are different approaches you can take.
First off, I would create a Lambda. You can schedule the function to run on a cron schedule.
If the message you want to send is small:
I would create an SNS topic with an email fan-out.
Inside your Lambda you can then transform the data and send it out via SNS.
Otherwise:
I would use SES and send the mail via the SES SDK.
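The SES path might look roughly like this (the addresses are placeholders and must be verified identities in SES); for a PDF attachment you would need send_raw_email with a MIME message instead:

```python
import boto3

ses = boto3.client("ses")

def send_report(html_body: str):
    # Addresses are placeholders; they must be verified in SES
    # (or the account must be out of the SES sandbox).
    ses.send_email(
        Source="reports@example.com",
        Destination={"ToAddresses": ["team@example.com"]},
        Message={
            "Subject": {"Data": "Daily report"},
            "Body": {"Html": {"Data": html_body}},
        },
    )
```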
I have an app running on AWS and I need to save every "event" in a file.
An "event" happens, for instance, when a user logs in to the app. When that happens I need to save this information on a file (presumably I would need to save a time stamp and the session id)
I expect to have a lot of events (of the order of a million per month) and I was wondering what would be the best way to do this.
I thought of writing on S3, but I think I can't append to existing files.
Another option would be to redirect the "event" to standard output, but that would not be the smartest solution.
Any ideas? Also, this needs to be done in Python.
There are a bunch of options, and your choice will depend on the following factors:
The scale of these events.
The cost you are willing to bear.
How you intend to consume them.
Options:
Log every event into a Kinesis Firehose stream, which in turn can dump your data into S3 or Redshift, as per your configuration (a minimal sketch follows below).
Setup an Elasticsearch cluster. Log all events in a file on disk, and use Logstash to asynchronously push these into the cluster.
Create a DynamoDB table and log one event per row.
You can also use the CloudWatch Logs agent to send your application log file to CloudWatch Logs, and export the logs from there to S3.
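To illustrate the first option, a hedged sketch of logging events to Firehose from Python; the delivery stream name is a placeholder, and the stream itself would be configured separately (e.g. to deliver to S3):

```python
import json
import time

import boto3

firehose = boto3.client("firehose")

STREAM_NAME = "app-events"   # placeholder delivery stream

def log_event(session_id: str, event_type: str):
    record = {
        "timestamp": time.time(),
        "session_id": session_id,
        "event": event_type,
    }
    # Firehose batches records and writes them to S3 as objects,
    # so there is no need to append to existing files yourself.
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```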