I have multiple files in different S3 buckets. I need to move these files to Amazon Aurora PostgreSQL every day on a schedule. Every day I will get a new file and, depending on the data, rows will be inserted or updated. I was using Glue for inserts, but Glue doesn't seem to be the right option for upserts. Is there a better way to handle this? I saw that a load command from S3 to RDS might solve the issue, but I couldn't find enough details on it. Any recommendations, please?
You can trigger a Lambda function from S3 events; it can then process the file(s) and insert them into Aurora. Alternatively, you can create a cron-type function that runs daily on whatever schedule you define.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
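For the upsert part, one common PostgreSQL approach is `INSERT ... ON CONFLICT ... DO UPDATE`. Below is a minimal sketch of what the Lambda (or cron-type job) could do with the contents of one CSV file. It assumes the target table has a unique constraint on the key columns, that the CSV header matches the column names, and that `cursor` is a DB-API cursor using the `%s` placeholder style (as psycopg2 does). The function names and the table name in the usage note are hypothetical.

```python
import csv
import io


def build_upsert_sql(table, columns, key_columns):
    """Build an INSERT ... ON CONFLICT statement for PostgreSQL.

    Identifiers are interpolated directly, so they must come from trusted
    configuration, never from the file contents.
    """
    update_columns = [c for c in columns if c not in key_columns]
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    conflict_target = ", ".join(key_columns)
    if update_columns:
        assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_columns)
        action = f"DO UPDATE SET {assignments}"
    else:
        # Every column is part of the key, so there is nothing to update.
        action = "DO NOTHING"
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_target}) {action}"
    )


def upsert_csv(cursor, csv_text, table, key_columns):
    """Upsert every row of a CSV document; returns the number of rows sent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if not columns:
        return 0
    sql = build_upsert_sql(table, columns, key_columns)
    rows = [tuple(row[c] for c in columns) for row in reader]
    if rows:
        cursor.executemany(sql, rows)
    return len(rows)
```

In a Lambda you would read the object body from S3, decode it, and call something like `upsert_csv(cursor, text, "customers", ["id"])` inside a transaction. For large files, loading into a staging table first and upserting from there is likely to be faster than row-by-row statements.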
I am looking to trigger code every hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
1. Which AWS service would I use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
2. How could I schedule this service to run every X hours? Do I have to use another AWS service to call my Lambda function (assuming that's the right answer to #1)?
3. Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but I was struggling to work out whether I could still use Lambda for that. After that, I needed to know what my options were for storing the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena
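As a rough sketch of the scheduling piece, this is how the hourly EventBridge Scheduler schedule could be defined with boto3. The client is passed in so the request can be inspected without touching AWS. The schedule name, the ARNs, and the `zip_codes` payload shape are all placeholders, and the role is assumed to be one that EventBridge Scheduler is allowed to assume in order to invoke the function.

```python
import json


def hourly_schedule_request(name, lambda_arn, role_arn, zip_codes):
    """Build the keyword arguments for the scheduler client's create_schedule."""
    return {
        "Name": name,
        # Fire at minute 0 of every hour.
        "ScheduleExpression": "cron(0 * * * ? *)",
        # Run exactly on schedule rather than within a flexible window.
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": lambda_arn,
            "RoleArn": role_arn,
            # This JSON document becomes the Lambda's `event` argument.
            "Input": json.dumps({"zip_codes": zip_codes}),
        },
    }


def create_hourly_schedule(scheduler_client, name, lambda_arn, role_arn, zip_codes):
    """Create the schedule using a boto3 'scheduler' client."""
    request = hourly_schedule_request(name, lambda_arn, role_arn, zip_codes)
    return scheduler_client.create_schedule(**request)
```

With real credentials you would call `create_hourly_schedule(boto3.client("scheduler"), ...)` once, from a deployment script. If the zip code list is long, it may be easier to keep it in S3 or DynamoDB and pass only a reference in the input.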
I want to periodically insert data from S3 (or other sources) into Amazon Redshift, i.e., when data is added to my S3 bucket, I want an option to add it automatically to my Amazon Redshift cluster.
My preferred method for doing this is to establish a trigger that fires every time a file is created in a part of a bucket. This trigger creates an event that initiates a Lambda function that issues the desired SQL to Redshift. (Or, if the work that is needed in Redshift is complex or long running, I will use a Step Function, but this is rare.)
Example setups for this:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
https://64lines.medium.com/building-a-aws-lambda-function-to-run-aws-redshift-sql-scripts-in-python-7468b7c2fdea
I'd start simple if you can and work up to the Redshift Data API and Step Functions.
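To make the trigger idea concrete, here is a minimal sketch of the Lambda side using the Redshift Data API. It turns each record of the S3 event into a `COPY` statement and submits it. The table name, IAM role ARN, cluster identifier, database, and user are placeholders, the files are assumed to be CSV with a header row, and `data_api` is assumed to be a boto3 `redshift-data` client passed in by the caller.

```python
from urllib.parse import unquote_plus


def copy_statements(event, table, iam_role_arn):
    """Build one COPY statement per object in an S3 event notification."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Double any single quotes so the key is a valid SQL string literal.
        location = f"s3://{bucket}/{key}".replace("'", "''")
        statements.append(
            f"COPY {table} FROM '{location}' "
            f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV IGNOREHEADER 1;"
        )
    return statements


def load_new_objects(event, data_api, cluster_id, database, db_user,
                     table, iam_role_arn):
    """Submit the COPY statements and return the Data API statement ids."""
    statement_ids = []
    for sql in copy_statements(event, table, iam_role_arn):
        response = data_api.execute_statement(
            ClusterIdentifier=cluster_id,
            Database=database,
            DbUser=db_user,
            Sql=sql,
        )
        statement_ids.append(response["Id"])
    return statement_ids
```

The Data API call is asynchronous, so a successful submission does not mean the load finished. Checking the statement status afterwards is where Step Functions can help for longer-running work.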
You can automate the insertion of data from S3 with a scheduled Lambda that triggers periodically. This might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
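A scheduled Lambda needs some way to decide which objects are new. One simple option, sketched below, is to list the prefix and keep objects modified since a cutoff time. The `s3_client` is assumed to be a boto3 S3 client; the bucket and prefix are placeholders. A pure time window can miss or repeat files if a run is delayed, so tracking loaded keys in a table or using a manifest is probably more robust.

```python
from datetime import datetime, timedelta, timezone


def daily_cutoff(now):
    """Cutoff for a job that runs once a day: 24 hours before `now`."""
    return now - timedelta(days=1)


def keys_modified_since(s3_client, bucket, prefix, cutoff):
    """Return the keys under `prefix` last modified at or after `cutoff`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix have no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                keys.append(obj["Key"])
    return keys
```

The scheduled function would then call `keys_modified_since(s3, "my-bucket", "incoming/", daily_cutoff(datetime.now(timezone.utc)))` and issue one load per key, or one load for a manifest listing them all.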
I have a table running on AWS RDS. I want to use AWS DMS to export all the data in the table every week. Each week after the export I will truncate the table, so the source table will hold only new data for the next run. I plan to use the DMS task to safely offload the data from the RDS table.
I have configured an RDS source and an S3 bucket as the target to export data as CSV. The replication type is full load only, so it migrates existing data (no ongoing replication).
But the problem I found is that DMS keeps dropping the old LOADXXXXXXX.csv file from the target S3 bucket whenever I perform the reload-target operation on the DMS task the following week.
How can I achieve my goal? How do I configure AWS DMS to keep multiple full load files in the same S3 target destination?
I was able to keep the old load file in S3 with a bit of a trick at the S3 target end. It is true that AWS DMS doesn't provide anything to keep the old load file after restarting the DMS task. But if you turn on versioning in the target S3 bucket, then you can keep that old load file as a previous version.
This solution was able to fulfill my requirements.
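For reference, a minimal sketch of the versioning trick with boto3. The `s3_client` is assumed to be a boto3 S3 client, and the bucket name and prefix are placeholders. The listing helper reads only the first page of results, which should be enough for a handful of weekly load files.

```python
def enable_versioning(s3_client, bucket):
    """Turn on versioning so overwritten load files survive as old versions."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def list_load_file_versions(s3_client, bucket, prefix):
    """Return (key, version_id, is_latest) for each stored version under prefix."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=prefix)
    return [
        (v["Key"], v["VersionId"], v["IsLatest"])
        for v in response.get("Versions", [])
    ]
```

An older export can then be fetched by passing its `VersionId` to `get_object`.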
This is said in another thread [here][1]:
For DMS the incremental counter is started over from 1 each time the task is run. It does not have a "Don't override existing objects" feature.
And to this day, it's not possible to change the file naming. So, to do this, you are forced to have your execution results in different folders.
[1]: https://stackoverflow.com/a/60385265
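One way to get a different folder per execution is to change the S3 target endpoint's `BucketFolder` setting before each weekly run. The sketch below is an assumption-laden illustration, not something confirmed in the thread: `dms_client` is assumed to be a boto3 DMS client, the task is assumed to be stopped while the endpoint is modified, and `base_settings` is assumed to hold the endpoint's existing S3 settings so they are sent again along with the new folder. The folder naming scheme is hypothetical.

```python
from datetime import date


def run_folder(base_folder, run_date):
    """Folder for one execution, e.g. 'exports/2024-01-07'."""
    return f"{base_folder.rstrip('/')}/{run_date.isoformat()}"


def point_endpoint_at_run_folder(dms_client, endpoint_arn, base_settings,
                                 base_folder, run_date):
    """Update the S3 target endpoint so this run writes to its own folder."""
    folder = run_folder(base_folder, run_date)
    dms_client.modify_endpoint(
        EndpointArn=endpoint_arn,
        # Resend the existing settings together with the new folder.
        S3Settings={**base_settings, "BucketFolder": folder},
    )
    return folder
```

After the endpoint is updated, the task can be started again with the reload-target operation as before.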
We have hundreds of thousands of .csv files stored in S3 that contain at least several data records each. (Each record is its own row.)
I am trying to design a migration strategy to transform all the records in the .csv files and put them into DynamoDB. During the migration, I'd also like to ensure that if any new .csv gets added to the S3 bucket, we automatically trigger a lambda or something to do the transformation and write to DynamoDB as well.
Eventually we'd stop writing to S3 entirely, but initially we need to keep those writes and any writes to S3 to also trigger a write to DynamoDB. Does anyone know of any good strategies for doing this? (Is there something like DynamoDB streams except for S3?) Any strategies for getting the existing things in .csv in S3 over to DynamoDB in general?
AWS has many tools you can use to solve this problem. Here are a few.
You could use AWS Database Migration Service. It supports migrating data from S3 and into DynamoDB. This AWS product is designed specifically for your use case, and it handles pretty much everything.
Once the migration has started, DMS manages all the complexities of the migration process including automatically replicating data changes that occur in the source database during the migration process.
S3 can publish events to trigger a lambda function which can be used to continuously replicate the data to DynamoDB.
AWS Data Pipeline basically does batch ETL jobs, which could move your data all at once from S3 to DynamoDB. You might also be able to run periodic sync jobs if you can tolerate a delay in replicating data to DynamoDB.
AWS Glue can crawl your data, process it, and store it in another location. I think it would provide you with an initial load plus the ongoing replication. While it could work, it’s designed more for unstructured data, and you have CSV files which are usually structured.
I’d recommend using AWS Database Migration Service because it’s the one-stop solution, but if you can’t use it for some reason, there are other options.
I don't know if DynamoDB has a "load records from CSV" feature (Redshift does).
If it does not, then you could roll your own. Write a Python function that imports the csv and boto3 modules and takes an S3 path as input (inside an event dictionary). The function would then download the file from S3 to a temp dir, parse it with csv, then use boto3 to insert into DynamoDB.
To get the history loaded, write a function that uses `boto3` to read the list of objects in S3, then call the first function to upload to DynamoDB.
To get future files loaded, install the first function as a Lambda function, and add a trigger from S3 Object Creation events to run the function whenever a new object is put onto S3.
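The roll-your-own approach above can be sketched as follows. This version reads each object into memory instead of a temp dir, which should be fine for small files. It assumes every CSV has a header row whose columns include the table's key attributes, stores all values as strings, and does no transformation. `s3_client` is assumed to be a boto3 S3 client and `table` a boto3 DynamoDB `Table` resource; both are passed in by the caller.

```python
import csv
import io
from urllib.parse import unquote_plus


def load_csv_object(s3_client, table, bucket, key):
    """Read one CSV object from S3 and write each row as a DynamoDB item."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    reader = csv.DictReader(io.StringIO(body.decode("utf-8")))
    count = 0
    # batch_writer groups the puts into batched write requests.
    with table.batch_writer() as batch:
        for row in reader:
            batch.put_item(Item=row)
            count += 1
    return count


def handle_s3_event(event, s3_client, table):
    """Process every object named in an S3 object-created event."""
    total = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        total += load_csv_object(s3_client, table, bucket, key)
    return total


def backfill(s3_client, table, bucket, prefix=""):
    """Load the existing .csv objects by listing the bucket page by page."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                total += load_csv_object(s3_client, table, bucket, obj["Key"])
    return total
```

The Lambda handler would be a thin wrapper that creates the client and table once and calls `handle_s3_event(event, s3, table)`. With hundreds of thousands of files, the backfill will probably need to run outside Lambda or be split by prefix to stay within the run-time limit.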
I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that polls Redshift periodically and unloads the data from Redshift onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
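If you go the Lambda route, the function body can be quite small. Here is a rough sketch that wraps a query in `UNLOAD` and submits it through the Redshift Data API; a schedule (for example, an EventBridge rule) would invoke it. The query, S3 prefix, IAM role ARN, cluster identifier, database, and user are placeholders, the CSV output options are just one possible choice, and `data_api` is assumed to be a boto3 `redshift-data` client.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Wrap a SELECT in an UNLOAD statement that writes CSV files to S3."""
    # The query is passed to UNLOAD as a string literal, so any single
    # quotes inside it have to be doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV HEADER ALLOWOVERWRITE;"
    )


def run_unload(data_api, cluster_id, database, db_user,
               select_sql, s3_prefix, iam_role_arn):
    """Submit the UNLOAD and return the Data API statement id."""
    response = data_api.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=build_unload_sql(select_sql, s3_prefix, iam_role_arn),
    )
    return response["Id"]
```

Putting the run date into `s3_prefix` keeps each export separate, which may make the downstream transformations easier to rerun.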
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident that it would solve your use case.