I want to periodically insert data from S3 (or other fonts) into Amazon Redshift, i.e., when data is added to my S3 bucket, I want an option to add it automatically to my Amazon Redshift cluster.
My preferred method for doing this is to establish a trigger that fire every time a file is created in a part of a bucket. This trigger creates an event that initiates a Lambda function that issues the desired SQL to Redshift. (Or if the work that is needed in Redshift is complex or long running I will use a step function but this is rare.)
Example setups for this:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
https://64lines.medium.com/building-a-aws-lambda-function-to-run-aws-redshift-sql-scripts-in-python-7468b7c2fdea
I'd start simple if you can and work up to Redshift Data API and Step functions.
You can automate the insertion of data from S3 with a scheduled Lambda that triggers periodically. This might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
Related
What is the easiest way to automatically ingest csv-data from a S3 bucket into a Timestream database ?
I have a s3-bucket which continuasly is generating csv files inside a folder structure. I want to save these files inside a timestream database so i can visualize them inside my grafana instance.
I already tried to do that via a Glue crawler but that wont wont for me. Is there any workaround or tutorial on how to solve this task ?
I do this using a Lambda function, an SNS topic and a queue.
New files in my bucket triggers a notification on an SNS Topic
The notification gets added to an SQS queue.
The lambda function consumes the queue, recovers the bucket and key of the new s3 object, downloads the csv file, does some processing and ingests the data into timestream. The lambda is implemented in Python.
This has been working ok, with the caveat that large files may not ingest fully within the lambda 15 minute limit. Timestream is not super fast. It gets better by using multi-valued records, as well as using the "common attributes' feature of the timestream client in boto3.
(it should be noted that the lambda can be triggered directly by the S3 bucket, if one prefers. Using a queue allows a bit more flexibility, such as being able to manually add files to the queue for reprocessing)
I have a use case where my redshift cluster is private and supports only VPN connection to the VPC. I need to send data from kinesis firehose which is in another VPC. I found out that we need to make redshift public or attach an internet gateway to make this happen but I can't use internet gateway. I need to connect to redshift from kinesis firehose with VPN only. I am not able to figure out any way to do this.
As you are already aware, you cannot use a private Redshift cluster in a VPC as a target for Firehose without Internet access. There is no direct solution for this as detailed here and here.
That said, I can think of at least two work arounds that might suffice.
You can have Firehose target S3. Then setup a private link access to S3 from the private VPC and setup an event to copy the data into the Redshift cluster on an acceptable cadence. I think this is probably the best option.
You MIGHT be able to setup Firehose with a lambda processor that feeds the records into Redshift. The reason I say "might" is because the lambda will also need to be within the VPC and will need to be able to keep up with the Firehose flow. This could be fraught with performance issues, and potentially expensive. And Redshift isn't really meant to have high write transactions as a data warehouse. This is the worst option.
Firehose aggregates data in S3 and then triggers a COPY command in Redshift. As you don't have a network path from Firehose to Redshift this fails. However, Firehose can just stop at placing the data in S3.
Now you just need a way to trigger Redshift to COPY the data. There are a number of ways to do this but the easiest way is to use Lambda (in your Redshift VPC) to issue the COPY commands. You will need to decide on when the Lambda should run - Firehose uses two parameters to determine when a COPY should be issued; time since last COPY and data size not yet copied. You can emulate this behavior if you like but the simplest way is to just issue COPYs on some regular time interval, like every 5 min.
To do this you set up CloudWatch to trigger your Lambda every 5 min. The
Lambda looks in the Firehose location in S3 and lists all the files
renames (moves) all these files to put them in a new uniquely named
S3 "subfolder"
issues the COPY command to Redshift to ingest from this "subfolder"
Upon successful ingestion these files can be moved again, left in
the above "subfolder" or deleted
The reason to rename/move the files in S3 is to ensure that each run of the Lambda is operating on a unique set of files and that files aren't ingested more than once.
There is a service that generates data in S3 bucket that is used for warehouse querying. Data is inserted into S3 as daily mechanism.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS service account as it is based on information present in my service account. Classification needs to happens in my service account as it is specific to my team/service. The service generating the data in S3 is neither concerned about the classification nor has the data to make classification decision.
Each S3 file consists of json objects (record) in it. For every record, I need to look into a dynamodb table. Based on whether data exists in Dynamo table, I need to include an additional attribute to the json object and store the list into another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that will copy the files from Source S3 bucket into a bucket (lets say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda to read the records in the json and compare with dynamodb table to determine classification and write to updated record to another bucket (lets say Bucket B).
I have few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around lambdas getting timed out due large amount of data, especially for the second lambda that needs to compare against DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed-up the function if it is taking longer than 15 minutes.
There is a maximum of 512MB of storage space made available for a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
I am trying to take sql data stored in a csv file in an s3 bucket and transfer the data to AWS Redshift and automate that process. Would writing etl scripts with lambda/glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from s3 to Redshift.
Tried using AWS Pipeline but that is not available in my region. I also tried to use the AWS documentation for Lambda and Glue but don't know where to find the exact solution to the problem
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simply Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog
We have a hundreds of thousands of .csv files stored in S3 that contain at least several data records each. (each record is its own row)
I am trying to design a migration strategy to transform all the records in the .csv files and put them into DynamoDB. During the migration, I'd also like to ensure that if any new .csv gets added to the S3 bucket, we automatically trigger a lambda or something to do the transformation and write to DynamoDB as well.
Eventually we'd stop writing to S3 entirely, but initially we need to keep those writes and any writes to S3 to also trigger a write to DynamoDB. Does anyone know of any good strategies for doing this? (Is there something like DynamoDB streams except for S3?) Any strategies for getting the existing things in .csv in S3 over to DynamoDB in general?
AWS has many tools you can use to solve this problem. Here are a few.
You could use AWS Database Migration Service. It supports migrating data from S3 and into DynamoDB. This AWS product is designed specifically for your use case, and it handles pretty much everything.
Once the migration has started, DMS manages all the complexities of the migration process including automatically replicating data changes that occur in the source database during the migration process.
S3 can publish events to trigger a lambda function which can be used to continuously replicate the data to DynamoDB.
AWS Data Pipelines basically does batch ETL jobs, which could move your data all at once from S3 to DynamoDB. You might also be able to run periodic sync jobs if you can tolerate a delay in replicating data to DynamoDB.
AWS Glue can crawl your data, process it, and store it in another location. I think it would provide you with an initial load plus the ongoing replication. While it could work, it’s designed more for unstructured data, and you have CSV files which are usually structured.
I’d recommend using AWS Database Migration Service because it’s the one-stop solution, but if you can’t use it for some reason, there are other options.
I don't know if DynamoDB has "load records from CSV" feature (RedShift does).
If it does not, then you could roll your own. Write a Python function that imports the csv and boto3 modules, takes as input an S3 path (inside an event dictionary). The function would them download the file from S3 to temp dir, parse it with csv, then use boto3 to insert into DynamoDB.
To get the history loaded, write a function that uses `boto3' to read the list of objects in S3, then call the first function to upload to DynamoDB.
To get future files loaded, install the first function as a Lambda function, and add a trigger from S3 Object Creation events to run the function whenever a new object is put onto S3.