Can we connect Amazon S3 buckets in two different regions and migrate CSV file data into an Amazon RDS database in one particular region? I am trying to use AWS Glue.
There are certainly different ways to solve this use case. You can use AWS Glue, or you can build a workflow using AWS Step Functions. For example, you can write a series of Lambda functions that read CSV files from an Amazon S3 bucket, extract the values, and then write them to an Amazon RDS database. Both approaches are valid.
See these docs for reference:
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
Keep in mind, however, that a workflow is not ideal when your data set is so large that processing it exceeds the 15-minute timeout that Lambda imposes. In that case, you should use AWS Glue.
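For the Step Functions route, here is a minimal sketch of one such Lambda that reads a CSV object from S3 and inserts its rows into a MySQL-flavored RDS database. The event shape, table, and column names are hypothetical, and it assumes the pymysql library is bundled with the function.

```python
import csv
import io
import os

import boto3
import pymysql  # assumed to be bundled with the Lambda deployment package

# The S3 client region should match the bucket's region (or pass region_name=...).
s3 = boto3.client("s3")


def handler(event, context):
    # Bucket and key are passed in by the Step Functions state (hypothetical shape).
    bucket = event["bucket"]
    key = event["key"]

    # Read the CSV object from S3 into memory and skip the header row.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(body)))[1:]

    # Connect to the RDS (MySQL) instance; connection details come from env vars.
    conn = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    try:
        with conn.cursor() as cur:
            # Hypothetical two-column target table matching the CSV layout.
            cur.executemany(
                "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", rows
            )
        conn.commit()
    finally:
        conn.close()

    return {"inserted": len(rows)}
```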
Related
We need to transfer nearly 1000 GB of data from one AWS S3 bucket to another. It's a one-time transfer. I have found several solutions: one is same-region replication, another is transferring the data using the AWS CLI. What would be the best solution for this task?
Given that you want to migrate data that already exists in the source bucket, you should use a one-time S3 Batch Replication job.
You might also find this blog post useful.
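Batch Replication jobs are usually created from the S3 console (or via a replication rule plus a Batch Operations job). If you prefer to script the one-time transfer, a closely related option is an S3 Batch Operations copy job created with boto3; the sketch below uses a CSV manifest and the S3PutObjectCopy operation, with all bucket names, the manifest ETag, and the role ARN as placeholders.

```python
import boto3

s3control = boto3.client("s3control", region_name="us-east-1")

# Placeholder identifiers -- replace with your own account, buckets, and role.
ACCOUNT_ID = "123456789012"
ROLE_ARN = "arn:aws:iam::123456789012:role/s3-batch-ops-role"

response = s3control.create_job(
    AccountId=ACCOUNT_ID,
    ConfirmationRequired=False,
    Priority=10,
    RoleArn=ROLE_ARN,
    # Copy every object listed in the manifest into the destination bucket.
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::my-destination-bucket",
        }
    },
    # CSV manifest of "bucket,key" rows describing the source objects.
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::my-manifest-bucket/manifest.csv",
            "ETag": "example-etag-of-the-manifest-object",
        },
    },
    # Write a completion report so you can verify the one-time transfer.
    Report={
        "Bucket": "arn:aws:s3:::my-report-bucket",
        "Enabled": True,
        "Format": "Report_CSV_20180820",
        "Prefix": "batch-copy-reports",
        "ReportScope": "AllTasks",
    },
)
print("Created job:", response["JobId"])
```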
I have a use case where my Redshift cluster is private and supports only VPN connections into its VPC. I need to send data to it from Kinesis Data Firehose, which is in another VPC. I found out that we would need to make Redshift public or attach an internet gateway to make this happen, but I can't use an internet gateway. I need to connect to Redshift from Firehose with VPN only, and I am not able to figure out any way to do this.
As you are already aware, you cannot use a private Redshift cluster in a VPC as a target for Firehose without Internet access. There is no direct solution for this as detailed here and here.
That said, I can think of at least two workarounds that might suffice.
You can have Firehose target S3, then set up private link access to S3 from the private VPC and set up an event to copy the data into the Redshift cluster on an acceptable cadence. I think this is probably the best option.
You MIGHT be able to set up Firehose with a Lambda processor that feeds the records into Redshift. The reason I say "might" is that the Lambda will also need to be within the VPC and will need to be able to keep up with the Firehose flow. This could be fraught with performance issues and potentially expensive, and Redshift, being a data warehouse, isn't really meant to handle a high rate of small write transactions. This is the worst option.
Firehose aggregates data in S3 and then triggers a COPY command in Redshift. Since you don't have a network path from Firehose to Redshift, this fails. However, Firehose can simply stop at placing the data in S3.
Now you just need a way to trigger Redshift to COPY the data. There are a number of ways to do this, but the easiest is to use Lambda (in your Redshift VPC) to issue the COPY commands. You will need to decide when the Lambda should run - Firehose uses two parameters to determine when a COPY should be issued: time since last COPY and data size not yet copied. You can emulate this behavior if you like, but the simplest approach is to just issue COPYs on some regular time interval, such as every 5 minutes.
To do this you set up CloudWatch to trigger your Lambda every 5 minutes. The Lambda:
1. looks in the Firehose location in S3 and lists all the files
2. renames (moves) all these files to put them in a new, uniquely named S3 "subfolder"
3. issues the COPY command to Redshift to ingest from this "subfolder"
Upon successful ingestion these files can be moved again, left in the above "subfolder", or deleted.
The reason to rename/move the files in S3 is to ensure that each run of the Lambda is operating on a unique set of files and that files aren't ingested more than once.
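A rough sketch of that scheduled Lambda, assuming psycopg2 is bundled with the function and that the bucket, prefixes, IAM role, and table name below are placeholders:

```python
import os
import time

import boto3
import psycopg2  # assumed to be bundled with the Lambda deployment package

s3 = boto3.client("s3")

BUCKET = os.environ["BUCKET"]              # bucket Firehose delivers to
INCOMING_PREFIX = "firehose/incoming/"     # Firehose delivery prefix (placeholder)


def handler(event, context):
    # 1. List the files Firehose has delivered since the last run.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=INCOMING_PREFIX):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    if not keys:
        return {"copied": 0}

    # 2. Move them into a uniquely named "subfolder" so this run owns them.
    batch_prefix = f"firehose/batches/{int(time.time())}/"
    for key in keys:
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": key},
            Key=batch_prefix + key.split("/")[-1],
        )
        s3.delete_object(Bucket=BUCKET, Key=key)

    # 3. Issue one COPY for the whole batch from inside the Redshift VPC.
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"COPY my_table FROM 's3://{BUCKET}/{batch_prefix}' "
                f"IAM_ROLE '{os.environ['COPY_ROLE_ARN']}' FORMAT AS JSON 'auto'"
            )
        conn.commit()
    finally:
        conn.close()

    return {"copied": len(keys)}
```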
I currently have the following created in the AWS us-east-1 region, and per the request of our AWS architect I need to move it all to us-east-2, completely, and continue developing in us-east-2 only. What are the easiest options, with the least work and coding (as this is a one-time deal), to move the following?
S3 bucket with a ton of folders and files.
Lambda function.
AWS Glue database with a ton of crawlers.
AWS Athena with a ton of tables.
Thank you so much for taking a look at my little challenge :)
There is no easy answer for your situation. There are no simple ways to migrate resources between regions.
Amazon S3 bucket
You can certainly create another bucket and then copy the content across, either by using the AWS Command-Line Interface (CLI) aws s3 sync command or, for a huge number of files, by using S3DistCp running under Amazon EMR.
If there are previous Versions of objects in the bucket, it's not easy to replicate them. Hopefully you have Versioning turned off.
Also, it isn't easy to get the same bucket name in the other region. Hopefully you will be allowed to use a different bucket name. Otherwise, you'd need to move the data elsewhere, delete the bucket, wait a day, create the same-named bucket in another region, then copy the data across.
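If you'd rather script the copy with boto3 than use the CLI, a minimal sketch (bucket names are placeholders) could look like the following; the managed copy call switches to multipart copies for large objects.

```python
import boto3

SRC_BUCKET = "my-bucket-us-east-1"   # placeholder names
DST_BUCKET = "my-bucket-us-east-2"

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="us-east-2")

paginator = src.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        # Managed copy; multipart is used automatically for large objects.
        dst.copy(
            CopySource={"Bucket": SRC_BUCKET, "Key": obj["Key"]},
            Bucket=DST_BUCKET,
            Key=obj["Key"],
            SourceClient=src,  # the source bucket lives in a different region
        )
        print("copied", obj["Key"])
```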
AWS Lambda function
If it's just a small number of functions, you could simply recreate them in the other region. If the code is stored in an Amazon S3 bucket, you'll need to move the code to a bucket in the new region.
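For a handful of functions, the recreation can even be scripted; here is a rough sketch that assumes a plain zip-packaged function (the function name is a placeholder):

```python
import urllib.request

import boto3

FUNCTION_NAME = "my-function"  # placeholder

src = boto3.client("lambda", region_name="us-east-1")
dst = boto3.client("lambda", region_name="us-east-2")

# Fetch the existing configuration and a presigned URL for the deployment package.
fn = src.get_function(FunctionName=FUNCTION_NAME)
config = fn["Configuration"]
zip_bytes = urllib.request.urlopen(fn["Code"]["Location"]).read()

# Recreate the function in the new region with the same basic settings.
dst.create_function(
    FunctionName=config["FunctionName"],
    Runtime=config["Runtime"],
    Role=config["Role"],           # IAM roles are global, so the same ARN works
    Handler=config["Handler"],
    Timeout=config["Timeout"],
    MemorySize=config["MemorySize"],
    Code={"ZipFile": zip_bytes},
)
```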
AWS Glue
Not sure about this one. If you're moving the data files, you'll need to recreate the database anyway. You'll probably need to create new jobs in the new region (but I'm not that familiar with Glue).
Amazon Athena
If your data is moving, you'll need to recreate the tables anyway. You can use the Athena interface to show the DDL commands required to recreate a table. Then, run those commands in the new region, pointing to the new S3 bucket.
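You can also pull the DDL programmatically with SHOW CREATE TABLE; a small boto3 sketch (database, table, and output location are placeholders):

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Ask Athena for the DDL of an existing table (names are placeholders).
query = athena.start_query_execution(
    QueryString="SHOW CREATE TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = query["QueryExecutionId"]

# Wait for the query to finish.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Each result row holds one line of the CREATE TABLE statement.
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
ddl = "\n".join(row["Data"][0].get("VarCharValue", "") for row in rows)
print(ddl)  # edit the LOCATION to point at the new bucket, then run it in us-east-2
```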
AWS Support
If this is an important system for your company, it would be prudent to subscribe to AWS Support. They can provide advice and guidance for these types of situations, and might even have some tools that can assist with a migration. The cost of support would be minor compared to the savings in your time and effort.
Could you create CloudFormation stacks (from the existing resources) using the console, then copy the contents of those stacks and run them in the other region (replacing values where they need to be)?
See this link: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/resource-import-new-stack.html
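Once a stack exists in us-east-1, copying its template across regions can also be scripted; a rough boto3 sketch (the stack name is a placeholder):

```python
import json

import boto3

STACK_NAME = "my-stack"  # placeholder

src = boto3.client("cloudformation", region_name="us-east-1")
dst = boto3.client("cloudformation", region_name="us-east-2")

# Pull the template from the existing stack.
body = src.get_template(StackName=STACK_NAME)["TemplateBody"]
if not isinstance(body, str):          # JSON templates come back as a dict
    body = json.dumps(body)

# Create the same stack in the new region (adjust parameters/names as needed).
dst.create_stack(
    StackName=STACK_NAME,
    TemplateBody=body,
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```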
I am new to AWS and trying to find a way to load data from S3 into RDS. In my current approach I am using an EC2 instance to do that (where my app is running). I was thinking of doing it through Lambda, but my data will have around 22 million records and my current approach takes 4 hours, while the Lambda timeout is 15 minutes (so the Lambda approach does not work in this case).
The problem with my current approach is that these data files only arrive maybe once a month, and I don't want to have an EC2 instance running just for this task. Any alternatives in the serverless world would be helpful. Thank you.
Note: the data is loaded from S3 to RDS based on SQS, i.e. my application pulls messages from SQS, which then triggers loading the data into RDS.
Please try AWS DMS (Database Migration Service) for this. You need to create a DMS task with the S3 bucket information as the source endpoint and your RDS database details as the target endpoint.
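A rough boto3 sketch of the S3 source endpoint and the migration task, assuming the RDS target endpoint and the replication instance already exist (all identifiers, ARNs, and the external table definition below are placeholders):

```python
import json

import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Describes the layout of the CSV files in the bucket (placeholder definition).
table_definition = {
    "TableCount": 1,
    "Tables": [{
        "TableName": "my_table",
        "TablePath": "my_schema/my_table/",
        "TableOwner": "my_schema",
        "TableColumns": [
            {"ColumnName": "id", "ColumnType": "INT8", "ColumnNullable": "false", "ColumnIsPk": "true"},
            {"ColumnName": "value", "ColumnType": "STRING", "ColumnLength": "50"},
        ],
        "TableColumnsTotal": "2",
    }],
}

# Source endpoint: the S3 bucket holding the CSV files.
source = dms.create_endpoint(
    EndpointIdentifier="s3-source",
    EndpointType="source",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-csv-bucket",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
        "ExternalTableDefinition": json.dumps(table_definition),
    },
)

# The migration task ties the S3 source to the existing RDS target endpoint.
dms.create_replication_task(
    ReplicationTaskIdentifier="s3-to-rds-full-load",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:RDS-TARGET",  # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",    # placeholder
    MigrationType="full-load",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```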
I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer that data to AWS Redshift, and automate the process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried the AWS documentation for Lambda and Glue but couldn't find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
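As a rough sketch of the S3-event-triggered variant, assuming psycopg2 is packaged with the function (the table name, IAM role, and connection details are placeholders):

```python
import os
from urllib.parse import unquote_plus

import psycopg2  # assumed to be packaged with the Lambda deployment


def handler(event, context):
    # S3 put events carry the bucket and (URL-encoded) key of the new file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # Load just the file that triggered the event (placeholder table/role).
            cur.execute(
                f"COPY my_table FROM 's3://{bucket}/{key}' "
                f"IAM_ROLE '{os.environ['COPY_ROLE_ARN']}' "
                "FORMAT AS CSV IGNOREHEADER 1"
            )
        conn.commit()
    finally:
        conn.close()
```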
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog