Querying and updating Redshift through AWS Lambda

I am using a Step Function that passes a JSON object to the Lambda as its event (object data from an S3 upload). I have to check the JSON and compare two values in it (file name and eTag) against the data in my Redshift DB. If the entry does not exist, I have to classify the file to a different bucket and add an entry to the Redshift DB (versioning). Trouble is, I do not have a good idea of how I can query and update Redshift through Lambda. Can someone please give suggestions on what methods I should adopt? Thanks!
Edit: Should've mentioned the lambda is in Python

One way to achieve this use case is to write the Lambda function using the Java runtime API and, within the function, use a RedshiftDataClient object. Using this API, you can perform CRUD operations on a Redshift cluster.
To see examples:
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/example_code/redshift/src/main/java/com/example/redshiftdata
If you are unsure how to build a Lambda function using the Lambda Java runtime API that can invoke AWS services, please refer to:
Creating an AWS Lambda function that detects images with Personal Protective Equipment
This example shows you how to develop a Lambda function using the Java runtime API that invokes AWS services. So instead of invoking Amazon S3 or Rekognition, use the RedshiftDataClient within the Lambda function to perform Redshift CRUD operations.
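Since your edit says the Lambda is in Python, note that the same Redshift Data API is exposed in boto3 as the redshift-data client, so you don't need Java at all. A minimal sketch of the check-then-insert flow, assuming a hypothetical file_versions table and placeholder cluster/database/user names:

import time
import boto3

# Placeholder identifiers -- replace with your own cluster, database, and user.
CLUSTER_ID = "my-cluster"
DATABASE = "dev"
DB_USER = "awsuser"

client = boto3.client("redshift-data")

def wait_for(statement_id):
    # The Data API is asynchronous; poll until the statement finishes.
    while True:
        desc = client.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(0.5)

def version_file(file_name, etag):
    # Check whether this (file name, eTag) pair is already recorded.
    stmt = client.execute_statement(
        ClusterIdentifier=CLUSTER_ID, Database=DATABASE, DbUser=DB_USER,
        Sql="SELECT 1 FROM file_versions WHERE file_name = :name AND etag = :etag",
        Parameters=[{"name": "name", "value": file_name},
                    {"name": "etag", "value": etag}],
    )
    wait_for(stmt["Id"])
    result = client.get_statement_result(Id=stmt["Id"])
    if not result["Records"]:
        # No entry yet: record the new version.
        insert = client.execute_statement(
            ClusterIdentifier=CLUSTER_ID, Database=DATABASE, DbUser=DB_USER,
            Sql="INSERT INTO file_versions (file_name, etag) VALUES (:name, :etag)",
            Parameters=[{"name": "name", "value": file_name},
                        {"name": "etag", "value": etag}],
        )
        wait_for(insert["Id"])
    return bool(result["Records"])  # True if the entry already existed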

Related

Trigger a Custom Function Every X Hours in AWS

I am looking to trigger code every 1 hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
Is there a specific AWS service I would use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS Service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but that's where I was struggling to know if I could still use Lambda for that, and where my options were for storing the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every hour (a sketch tying the pieces together follows the storage options below).
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, noSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena.
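To make this concrete, here is a minimal sketch of such a Lambda handler, assuming a hypothetical API endpoint, a hard-coded zip code list, and a DynamoDB table named zip_data (all placeholders):

import json
import urllib.request
import boto3

# Hypothetical names -- substitute your own API endpoint and table.
API_URL = "https://example.com/api?zip={zip}"
TABLE = boto3.resource("dynamodb").Table("zip_data")

ZIP_CODES = ["10001", "94105", "60601"]

def handler(event, context):
    # Invoked on a schedule (e.g. rate(1 hour) in EventBridge Scheduler);
    # the event payload itself is unused here.
    for zip_code in ZIP_CODES:
        with urllib.request.urlopen(API_URL.format(zip=zip_code)) as resp:
            data = json.load(resp)
        # Store one item per zip code, overwriting the previous run's item.
        TABLE.put_item(Item={"zip": zip_code, "payload": json.dumps(data)})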

AWS Lambda and loading of reference data

I'm building a Lambda function in AWS that needs to load reference data from a MySQL database. There is no real issue right now as it's a very limited amount of data. But what is best practice here? Is there a way to keep this data within Lambda (or some other similar functionality) so that I don't need to request it for every invocation of the function? I'm using Node.js, though I don't think that affects this question.
Many thanks,
Marcus
There is no built-in persistent storage for Lambda. Any data that you would like to keep reliably between invocations (not counting temporary persistence due to the Lambda execution context) has to be stored outside of Lambda itself.
You already store it in MySQL, but other popular choices are:
SSM Parameter Store
S3
EFS
DynamoDB
ElastiCache if you really need fast access to the data.
Since you already get the data from MySQL, the only advantage of using SSM or DynamoDB would be that you can use the AWS API to access and update the data, or inspect/modify it in the AWS Console. You also wouldn't need to bundle a MySQL client with your function or establish any connections to the database.
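That said, you can exploit the execution-context reuse mentioned above as a best-effort cache, so the data is only fetched on cold starts. Your question uses Node.js, but the pattern looks the same in any runtime; here is a Python sketch, assuming a hypothetical SSM parameter name holding the reference data as JSON:

import json
import boto3

ssm = boto3.client("ssm")

# Module-level variable: it survives between invocations that reuse the
# same execution context, so the parameter is only fetched on a cold start.
_reference_data = None

def get_reference_data():
    global _reference_data
    if _reference_data is None:
        # Hypothetical parameter name -- replace with your own.
        param = ssm.get_parameter(Name="/myapp/reference-data")
        _reference_data = json.loads(param["Parameter"]["Value"])
    return _reference_data

def handler(event, context):
    # Cached after the first call in each container.
    data = get_reference_data()
    return {"rows": len(data)}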

Serverless-ly Query External REST API from AWS and Store Results in S3?

Given a REST API, outside of my AWS environment, which can be queried for json data:
https://someExternalApi.com/?date=20190814
How can I set up a serverless job in AWS to hit the external endpoint on a periodic basis and store the results in S3?
I know that I can instantiate an EC2 instance and just set up a cron. But I am looking for a serverless solution, which seems more idiomatic.
Thank you in advance for your consideration and response.
Yes, you absolutely can do this, and probably in several different ways!
The pieces I would use would be:
CloudWatch Event using a cron-like schedule, which then triggers...
A Lambda function (with the right IAM permissions) that calls the API using e.g. Python requests or an equivalent HTTP library, and then uses the AWS SDK to write the results to an S3 bucket of your choice, and
An S3 bucket ready to receive!
This should be all you need to achieve what you want.
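For illustration, a minimal sketch of such a Lambda function in Python, assuming a placeholder bucket name and the date-stamped URL from your question:

import datetime
import urllib.request
import boto3

s3 = boto3.client("s3")
BUCKET = "my-results-bucket"  # hypothetical bucket name

def handler(event, context):
    # Build the dated URL the question describes, e.g. ?date=20190814.
    date = datetime.datetime.utcnow().strftime("%Y%m%d")
    url = f"https://someExternalApi.com/?date={date}"
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    # Store the raw JSON response under a per-date key.
    s3.put_object(Bucket=BUCKET, Key=f"results/{date}.json", Body=body)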
I'm going to skip the implementation details, as they are largely outside the scope of your question. As such, I'm going to assume your function is already written and targets Node.js.
AWS can do this on its own, but to make it simpler, I'd recommend using the Serverless Framework. We're going to assume you're using it.
Assuming you're entirely new to serverless, the first thing you'll need to do is to create a handler:
serverless create --template "aws-nodejs" --path my-service
This creates a service based on the aws-nodejs template on the provided path. In there, you will find serverless.yml (the configuration for your function) and handler.js (the code itself).
Assuming your function is exported as crawlSomeExternalApi on the handler export (module.exports.crawlSomeExternalApi = () => {...}), the functions entry in your serverless file would look like this if you wanted to invoke it every 3 hours:
functions:
  crawl:
    handler: handler.crawlSomeExternalApi
    events:
      - schedule: rate(3 hours)
That's it! All you need now is to deploy it with serverless deploy -v
Under the hood, what this does is create a CloudWatch schedule entry for your function. An example of it can be found in the Serverless documentation.
First thing you need is a Lambda function. Implement your logic, of hitting the API and writing data to S3 or whatever, inside the Lambda function. Next, you need a schedule to periodically trigger your Lambda function. A schedule expression can trigger an event periodically using either a cron expression or a rate expression. The Lambda function you created earlier should be configured as the target of this CloudWatch rule.
The resulting flow will be: CloudWatch invokes the Lambda function whenever the rule fires, and Lambda then performs your logic.
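For reference, a sketch of wiring this up with boto3 rather than the console, assuming placeholder rule and function names/ARNs:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical ARN -- replace with your function's ARN.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:my-crawler"

# 1. A rule that fires on a schedule (cron expressions also work).
rule = events.put_rule(Name="hourly-crawl", ScheduleExpression="rate(1 hour)")

# 2. Point the rule at the Lambda function.
events.put_targets(Rule="hourly-crawl",
                   Targets=[{"Id": "crawler", "Arn": FUNCTION_ARN}])

# 3. Allow CloudWatch Events to invoke the function.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN, StatementId="allow-events",
    Action="lambda:InvokeFunction", Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)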

How to invoke Athena automatically from Lambda when objects are updated in an S3 bucket?

I have the following 2 use cases:
Case 1. I need to call the Lambda alone to invoke Athena to perform a query on S3 data. Question: how do I invoke the Lambda alone via an API?
Case 2. I need the Lambda function to invoke Athena whenever a file is copied to the same S3 bucket that is already mapped to Athena.
I am referring to the following link to perform the Lambda operation over Athena:
https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/
For case 2, here is an example of what I want to integrate:
The file in s3-1 is sales.csv, and I would update the sales details by copying data from another bucket, s3-2. The schema/columns defined in the s3-1 data would remain the same.
So when I copy a file to the same S3 data that is mapped to Athena, the Lambda should call Athena to perform the query.
I would appreciate it if you can suggest a better way to achieve the above cases.
Thanks
Case 1
An AWS Lambda function can be directly invoked via the invoke() command. This can be done via the AWS Command-Line Interface (CLI) or from a programming language using an AWS SDK.
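A minimal sketch using boto3, assuming a hypothetical function name and payload:

import json
import boto3

lambda_client = boto3.client("lambda")

# "my-athena-function" is a placeholder -- use your function's name or ARN.
response = lambda_client.invoke(
    FunctionName="my-athena-function",
    InvocationType="RequestResponse",  # synchronous; use "Event" for async
    Payload=json.dumps({"query_date": "20190814"}),
)
print(json.load(response["Payload"]))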
Case 2
An Amazon S3 event can be configured on a bucket to automatically trigger an AWS Lambda function when a file is uploaded. The event provides the bucket name and file name (object name) to the Lambda function.
The Lambda function can extract these details from the event record and can then use that information in an Amazon Athena command.
Please note that, if the file name is different each time, a CREATE TABLE command would be required before a SELECT command can query the data.
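A sketch of such a handler, assuming a hypothetical database name and results location; the query itself is a placeholder to adapt to your sales table:

import boto3

athena = boto3.client("athena")

def handler(event, context):
    # The S3 event record carries the bucket and key of the newly copied file.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Placeholder query over the table already mapped to this bucket;
    # the database name and output location are hypothetical.
    athena.start_query_execution(
        QueryString="SELECT * FROM sales LIMIT 10",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
    )
    print(f"Started query for s3://{bucket}/{key}")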
General Comments
A Lambda function can run for a maximum of 15 minutes, so make sure the Athena queries do not take longer than this. This is not a particularly efficient use of an AWS Lambda function, because it will be billed for the duration of the function call even if it is just waiting for Athena to finish.
Another option would be to have the Lambda function directly process the file, assuming that the query is not particularly complex. For example, the Lambda function could download the file to temporary storage (512MB by default), read through the file, do some calculations (e.g. add up the totals of some columns), then store the results somewhere.
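A sketch of this direct-processing approach, assuming the file is a CSV with a hypothetical amount column:

import csv
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # /tmp is the only writable path in Lambda (512MB by default).
    local_path = "/tmp/input.csv"
    s3.download_file(bucket, key, local_path)

    # Hypothetical column name -- adjust to your file's schema.
    total = 0.0
    with open(local_path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["amount"])

    # Store the result somewhere, e.g. back to S3 next to the input file.
    s3.put_object(Bucket=bucket, Key=f"{key}.total.txt",
                  Body=str(total).encode())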
The next step would be to create an endpoint for your Lambda; you can use API Gateway (aws-apigateway) for that.
Alternatively, you can invoke the Lambda from the AWS Console or the AWS CLI in order to test it.

Configure and Deploy Lambda Pipeline in code

I was wondering if there are any AWS services or projects which allow us to configure a data pipeline using AWS Lambdas in code. I am looking for something like the below. Assume there is a library called pipeline:
from pipeline import connect, s3, lambda, deploy
p = connect(s3('input-bucket/prefix'),
            lambda(myPythonFunc, dependencies=[list_of_dependencies]),
            s3('output-bucket/prefix'))
deploy(p)
There can be many variations of this idea, of course. This use case assumes only one S3 bucket, for example; there could be a list of input S3 buckets.
Can this be done by AWS Data Pipeline? The documentation I have (quickly) read says that Lambda is used to trigger a pipeline.
I think the closest thing available is the state machine functionality within the newly released AWS Step Functions. With these you can coordinate multiple steps that transform your data. I don't believe they support standard event sources, so you would have to create a standard Lambda function (potentially using the Serverless Application Model) to read from S3 and trigger your state machine.
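For a flavor of what this looks like in code, here is a sketch that registers a minimal one-step state machine with boto3, assuming placeholder role and function ARNs:

import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical ARNs -- replace with your own role and function ARNs.
ROLE_ARN = "arn:aws:iam::123456789012:role/states-execution-role"
FUNC_ARN = "arn:aws:lambda:us-east-1:123456789012:function:myPythonFunc"

# A minimal pipeline expressed in Amazon States Language: one Task state
# that runs the Lambda function; more states could be chained with "Next".
definition = {
    "StartAt": "Transform",
    "States": {
        "Transform": {
            "Type": "Task",
            "Resource": FUNC_ARN,
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="my-pipeline",
    definition=json.dumps(definition),
    roleArn=ROLE_ARN,
)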