Data Copy from Azure Blob to S3 and Synapse to Redshift - amazon-web-services

There is a requirement to copy 10 TB of data from Azure Blob to S3 and another 10 TB of data from Synapse to Redshift.
What is the best way to achieve these 2 migrations?

For Redshift, you could export the Azure Synapse Analytics data to Blob Storage in a compatible format, ideally compressed, and then copy it to S3. Importing data from S3 into Redshift is pretty straightforward.
You may need a VM instance to read from Azure Storage and write to AWS S3 (it doesn't matter where the instance runs). The simplest option seems to be using the default CLIs (Azure and AWS) to read the content onto the migration instance and write it to the target bucket. Personally, though, I'd consider writing an application that records checkpoints, so that if the migration is interrupted for any reason it doesn't need to start from scratch.
There are a few options you can tune depending on the files to move: whether there are many small files or fewer large files, which region you are moving from and to, and so on.
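The checkpoint idea could look roughly like the sketch below. It is only an illustration: `migrate`, `load_checkpoint`, `save_checkpoint` and the `copy_one` callback are hypothetical names, and `copy_one` stands in for whatever actually downloads one blob from Azure and uploads it to S3.

```python
import json
import os
from typing import Callable, Iterable, Set


def load_checkpoint(path: str) -> Set[str]:
    """Return the set of keys already copied (empty if no checkpoint exists yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_checkpoint(path: str, done: Set[str]) -> None:
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def migrate(keys: Iterable[str], copy_one: Callable[[str], None], checkpoint_path: str) -> int:
    """Copy every key not yet in the checkpoint; return how many were copied this run."""
    done = load_checkpoint(checkpoint_path)
    copied = 0
    for key in keys:
        if key in done:
            continue  # transferred in an earlier run, skip it
        copy_one(key)  # hypothetical: read from Azure Blob, write to S3
        done.add(key)
        save_checkpoint(checkpoint_path, done)
        copied += 1
    return copied
```

Rerunning `migrate` with the same checkpoint file after a failure picks up at the first key that was not recorded. With a very large number of small files, saving the checkpoint every N keys instead of after each one would probably be a better trade-off.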
https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/
You may also consider using S3 Transfer Acceleration, which may or may not help.
Please note that every large cloud provider charges for outbound data transfer (egress); for 10 TB the cost may be considerable.

Related

Why do we first need to unload data from redshift to S3

I am trying to consume some data in Redshift using SageMaker to train a model. After some research, I found that the best way to do so is to first unload the data from Redshift to an S3 bucket. I assume SageMaker has an API to interact with Redshift directly, so why do we need to unload it to an S3 bucket first?
UNLOADing is a best practice and generally the method that the docs promote, for reasons of efficiency and performance. Redshift is a cluster with a single leader node and multiple compute nodes. S3 is also a cluster: a distributed object store. Having multiple compute nodes connect to S3 in parallel when moving data is far faster than pulling everything through the leader node, and less of a burden on the database.
Tools that you may be using with SageMaker (like EMR) are also clusters and will also benefit from multiple parallel connections to S3.
The larger the amount of data being moved the greater this benefit will be.
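As a rough illustration of what such an unload looks like, the helper below assembles a Redshift UNLOAD statement as a string. The function name, the bucket prefix and the IAM role ARN are placeholders for illustration, not values from the question, and the statement still has to be run through whatever SQL client you use.

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift UNLOAD statement that writes Parquet files under s3_prefix."""
    # UNLOAD takes the query as a quoted string, so single quotes inside it are doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    # Placeholder table, bucket and role; replace with your own.
    print(build_unload_sql(
        "SELECT * FROM training_data WHERE label = 'spam'",
        "s3://my-bucket/unload/training_",
        "arn:aws:iam::123456789012:role/MyRedshiftRole",
    ))
```

By default UNLOAD writes several files in parallel from the compute nodes, which is the behaviour the answer above relies on.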

Data Ingestion in Amazon Redshift

I have multiple data sources from which I need to build and implement a DWH in AWS. I have one challenge with one of my unstructured data sources (data coming from different APIs). How can I ingest data from this source into Amazon Redshift? Can we first pull it into an Amazon S3 bucket and then integrate S3 with Amazon Redshift? What is a better approach?
Yes, S3 first. Your APIs can write to S3, and/or you can use a service like Kinesis (with or without Firehose) to populate S3. From there it is just work in Redshift.
Without knowing more about the sources, yes, S3 is likely the right approach. Whether you require latency in seconds, minutes or hours will be an important consideration.
If latency is not a driving concern, simply:
Set up an S3 bucket to use as a destination for your initial source(s).
Create tables in your Redshift database (loading data from S3 to Redshift requires a pre-existing destination table).
Use the COPY command to load from S3 to Redshift.
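The last two steps could be sketched as follows. This assumes the APIs land JSON files in S3; the table definition, column names, bucket path and role ARN are all made up for illustration, and `build_copy_sql` is a hypothetical helper that only assembles the statement text.

```python
# Hypothetical destination table; COPY needs it to exist before loading.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS api_events (
    event_id   VARCHAR(64),
    event_time TIMESTAMP,
    payload    VARCHAR(65535)
);
""".strip()


def build_copy_sql(table: str, s3_path: str, iam_role_arn: str,
                   data_format: str = "JSON 'auto'") -> str:
    """Return a Redshift COPY statement that loads every file under s3_path."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {data_format};"
    )


if __name__ == "__main__":
    print(CREATE_TABLE_SQL)
    print(build_copy_sql(
        "api_events",
        "s3://my-ingest-bucket/api_events/",
        "arn:aws:iam::123456789012:role/MyRedshiftRole",
    ))
```

With `JSON 'auto'`, COPY matches top-level JSON keys to column names; for CSV files you would pass something like `data_format="CSV"` instead.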
As noted, there may be value in Kinesis, especially if you're working with real-time data streams (the service recently introduced support for skipping S3 and streaming directly to Redshift).
S3 is probably the easier approach, if you're not trying to analyze real-time streams.

Read S3 Bucket from EC2 for ML Training

I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data.
Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data into my EC2.
You can develop an enhancement to your code using boto.
But if you want access to your S3 as if it was another local filesystem I would consider s3fs-fuse, explained further here.
Another option would be to use the aws-cli to sync your data to a local folder.
How can I do this? I currently only see ways to cp my S3 data into my EC2.
S3 is an object storage system. It does not allow direct access to or reading of files like a regular file system.
Thus, to read your files, you need to download them first (downloading in parts is also possible), or have some third-party software do it for you, like s3fs-fuse. You can download them to your instance, or store them in an external file system (e.g. EFS).
It's not clear from your question whether you have one 50GB CSV file or multiple smaller ones. If you have one large 50GB CSV file and not all of it is needed at once, you can reduce the amount of data read by using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
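A minimal sketch of an S3 Select call from Python, assuming boto3 is installed and the CSV has a header row. `select_rows` is a hypothetical helper; the client is passed in, so with boto3 it would be created as `boto3.client("s3")`. The bucket, key and column names in the usage comment are placeholders.

```python
def select_rows(s3_client, bucket: str, key: str, expression: str) -> str:
    """Run an S3 Select SQL expression on a CSV object and return the matching rows as text."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: the first line of the CSV holds the column names.
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The payload is a stream of events; only "Records" events carry row data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join the bytes before decoding so multi-byte characters split across events survive.
    return b"".join(chunks).decode("utf-8")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   rows = select_rows(boto3.client("s3"), "data", "train.csv",
#                      "SELECT s.text, s.label FROM s3object s WHERE s.label = '1'")
```

Only the selected rows travel over the network, which is where the saving comes from when the training script needs just a slice of the file.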

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was going through AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
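To get a feel for how long that is, here is a back-of-the-envelope estimate. `transfer_hours` is a hypothetical helper, and the `efficiency` factor is an assumption standing in for protocol overhead and a link that is shared with other traffic.

```python
def transfer_hours(size_bytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to upload size_bytes over a link of link_mbps megabits/s."""
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency must be in (0, 1]")
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    one_tb = 1e12  # 1 TB in decimal bytes
    for mbps in (10, 100, 1000):
        print(f"{mbps:>5} Mbps: {transfer_hours(one_tb, mbps):.1f} hours")
```

At a sustained 100 Mbps, 1 TB works out to roughly 22 hours; at 10 Mbps it is more than nine days, which is the point where shipping a device starts to look reasonable.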
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that the transfer needs to be encrypted and the data size is 1TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5TB, so you wouldn't run into trouble. I don't know if your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted you should be fine (if you're worried, you can encrypt your files beforehand, and they will then also be encrypted while stored, for example if this is a backup of something). To get to the point, you can use API tools for the transfer, or file-explorer type tools that can also connect to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage and transfer depends on how frequently you need the data. If it is just a backup kept just in case, archiving to Glacier is much cheaper.
1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways.
Using the AWS CLI, we can copy files from local to S3
AWS Transfer using FTP or SFTP (AWS SFTP)
There are tools like the CloudBerry clients that have a UI
You can use the AWS DataSync tool

Aws: best approach to process data from S3 to RDS

I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30GB) from S3, apply some transformations and load it into RDS MySQL, and I want this process to be repeatable.
I thought that the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found Coursera's dataduct wrapper, but after some research it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should continue trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they're simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving it from S3 to RDS to run SQL-like queries.
You can use Redshift Spectrum to read and query information from S3 now.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
Step 1. Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
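Steps 3 and 4 come down to a few SQL statements. The sketch below only assembles them as strings; the schema, table and column names, the Glue/Athena catalog database and the S3 location are all placeholders, and `spectrum_statements` is a hypothetical helper. Steps 1 and 2 (creating the IAM role and attaching it to the cluster) are assumed to be done already.

```python
from typing import List


def spectrum_statements(schema: str, catalog_db: str, iam_role_arn: str,
                        table: str, s3_location: str) -> List[str]:
    """Return the SQL for an external schema, an external CSV table, and a sample query."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
    # Hypothetical columns; they must match the layout of the CSV files in S3.
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    id     INTEGER,\n"
        "    name   VARCHAR(256),\n"
        "    amount DECIMAL(12,2)\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        "TABLE PROPERTIES ('skip.header.line.count'='1');"
    )
    query = f"SELECT name, SUM(amount) AS total FROM {schema}.{table} GROUP BY name;"
    return [create_schema, create_table, query]


if __name__ == "__main__":
    for stmt in spectrum_statements("spectrum", "spectrumdb",
                                    "arn:aws:iam::123456789012:role/MySpectrumRole",
                                    "sales", "s3://my-bucket/sales/"):
        print(stmt, end="\n\n")
```

The data stays in S3 throughout; only the table definition lives in the catalog, so rerunning the query after new files arrive under the same prefix picks them up.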
Or you can use Athena to query the data in S3 if Redshift is too much horsepower for the job.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
You could use an ETL tool to do the transformations on your csv data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations and then the tool will load the data into your MySQL database. For example there is Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages), and has JDBC/ODBC compliant drivers. With this you could create a script that would perform your transformations and then load the data into your MySQL database. And you would be using familiar SQL (I'm assuming you already can create SQL scripts) so there isn't a big learning curve.
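For comparison, the same read, transform and load shape can be sketched in plain Python. This is not Scriptella, just a stand-in to show the idea: the file is streamed in batches so a 30GB CSV never has to fit in memory. The column names and the `transform` rule are invented, and `conn` is assumed to be any DB-API connection (for MySQL drivers the placeholder in `insert_sql` is typically `%s`; the built-in sqlite3 module uses `?`).

```python
import csv
from itertools import islice
from typing import Iterable, Tuple


def transform(row: dict) -> Tuple[int, str, int]:
    """Hypothetical transformation: trim the name and convert the amount to cents."""
    return (int(row["id"]), row["name"].strip(), round(float(row["amount"]) * 100))


def load_csv(lines: Iterable[str], conn, insert_sql: str, batch_size: int = 1000) -> int:
    """Stream CSV lines, transform each row, and insert in batches; return the row count."""
    reader = csv.DictReader(lines)
    total = 0
    while True:
        # Pull at most batch_size rows so memory use stays flat.
        batch = [transform(row) for row in islice(reader, batch_size)]
        if not batch:
            break
        cursor = conn.cursor()
        cursor.executemany(insert_sql, batch)
        conn.commit()  # commit per batch so a failure loses at most one batch
        total += len(batch)
    return total
```

`lines` can be an open local file or any iterator of text lines, so the CSV could be downloaded from S3 first or streamed line by line. Because each batch is committed separately, rerunning after a failure needs either an idempotent insert or a record of how many rows were already loaded.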