I am confused about how we can send data from a MySQL database deployed on an EC2 instance to an Amazon Redshift cluster.
What are the ways to do this?
Possible solutions:
The easiest solution would be AWS Data Pipeline.
Write the output of your SQL query to a CSV file --> zip it (if the data is huge) --> upload it to S3 --> use the Redshift COPY command to load all of these records in bulk into Redshift.
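For the last step, here is a minimal sketch of the COPY command, assuming the query output was uploaded as a gzip-compressed CSV (the table, bucket, and IAM role names are hypothetical):

-- Load a gzipped CSV from S3 into an existing Redshift table in one parallel bulk load.
COPY my_redshift_table
FROM 's3://my-example-bucket/exports/mysql_dump.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
GZIP
IGNOREHEADER 1;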
Related
Is there a way to export data from a SQL Server query to an AWS S3 bucket as a CSV file?
I created the bucket:
arn:aws:s3:::s3tintegration
https://s3tintegration.s3.sa-east-1.amazonaws.com/key_prefix/
Can anybody help me?
If you are looking for an automated solution, then there are several options in AWS.
Schedule or trigger a Lambda function that will connect to RDS, execute the query, and save the result as a CSV file in an S3 bucket. Please remember that the Lambda function has to be in the same VPC and subnet as your SQL Server.
If you have a query that takes a long time, you can use AWS Glue to run the task and write the output to S3 in CSV format. Glue can use a JDBC connection as well.
You can also use DMS, which will connect to SQL Server as the source and S3 as the target in CSV format. Note that DMS can migrate a full table or part of it, but not the result of a query.
If you are familiar with big data tooling, you can also use Hive to run your query and write the result to S3 in CSV format.
The quickest and easiest way to start is Lambda.
I am a new AWS user and got confused about its services. In our company we store our data in S3, so I created a bucket in S3 and created an AWS Glue crawler to load this table into a Redshift table (what we normally do in our company), which I can successfully see in Redshift.
Based on my research, the Glue crawler should create metadata related to my data in the Glue Data Catalog, which again I am able to see. Here is my question: how does my crawler work, and does it load S3 data into Redshift? Does my company need a special configuration that lets me load data into Redshift?
Thanks
AWS Glue does not natively interact with Amazon Redshift.
Load data from Amazon S3 to Amazon Redshift using AWS Glue - AWS Prescriptive Guidance provides an example of using AWS Glue to load data into Redshift, but it simply connects to it like a generic JDBC database.
It appears that you can Query external data using Amazon Redshift Spectrum - Amazon Redshift, but this is Redshift using the AWS Glue Data Catalog to access data stored in Amazon S3. The data is not "loaded" into Redshift. Rather, the External Table definition in Redshift tells it how to access the data directly in S3. This is very similar to Amazon Athena, which queries data stored in S3 without having to load it into a database. (Think of Redshift Spectrum as being Amazon Athena inside Amazon Redshift.)
So, there are basically two ways to query data using Amazon Redshift:
Use the COPY command to load the data from S3 into Redshift and then query it, OR
Keep the data in S3, use CREATE EXTERNAL TABLE to tell Redshift where to find it (or use an existing definition in the AWS Glue Data Catalog), then query it without loading the data into Redshift itself.
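As an illustration of the second option, here is a hypothetical sketch where the data stays in S3 and Redshift only stores the table definition. The schema, table, column, and bucket names are made up, and an external schema is assumed to already exist:

-- Define an external table over CSV files sitting in S3; no data is copied into Redshift.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   INTEGER,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/sales/';

-- The data is then queried in place, directly against S3:
SELECT COUNT(*) FROM spectrum_schema.sales;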
I figured out what the tables I was seeing in Redshift after running the crawler actually were. In fact, I had created an external table in Redshift, not loaded the table into Redshift.
I have some JSON files in S3, and I was able to create databases and tables in Amazon Athena from those data files. That is done; my next target is to copy those tables into Amazon Redshift. There are other tables in Amazon Athena which I created based on those data files. I mean, I created three tables using the data files in S3, and later I created new tables using those 3 tables. So at the moment I have 5 different tables which I want to create in Amazon Redshift, with or without data.
I checked the COPY command in Amazon Redshift, but there is no COPY from Amazon Athena. Here is the available list:
COPY from Amazon S3
COPY from Amazon EMR
COPY from Remote Host (SSH)
COPY from Amazon DynamoDB
If there are no other solutions, I plan to create new JSON files in S3 buckets based on the newly created tables in Amazon Athena. Then we can easily copy those from S3 into Redshift, right? Are there any other good solutions for this?
If your S3 files are in a suitable format, you can use Redshift Spectrum.
1) Set up a Hive metadata catalog of your S3 files, using AWS Glue if you wish.
2) Set up Redshift Spectrum to see that data inside Redshift (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html); see the sketch after step 3.
3) Use CTAS to create a copy inside Redshift:
create table redshift_table as select * from redshift_spectrum_schema.redshift_spectrum_table;
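For step 2, a hypothetical sketch of exposing the Glue/Athena catalog database inside Redshift so that the CTAS above can read from it (the schema name, database name, and IAM role are made up):

-- Map the Glue Data Catalog database used by Athena into Redshift as an external schema.
CREATE EXTERNAL SCHEMA redshift_spectrum_schema
FROM DATA CATALOG
DATABASE 'my_athena_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

Once the external schema exists, the Athena tables appear under redshift_spectrum_schema and the create table ... as select in step 3 copies them into native Redshift tables.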
I have a CSV file which will be present (a new file daily) in an S3 bucket. From here I am trying to use AWS Glue to extract, transform and load it into an AWS Aurora database. The Aurora DB is designed as a normalized relational database, so I have to load the CSV into it with information mapped between multiple tables.
Steps that I am trying:
1) Modify the Python script to perform the load operation.
I wanted to know if there is any other way to achieve this load operation.
RDS Aurora provides a built-in feature where you can load data from a CSV file residing in an S3 bucket using "LOAD DATA FROM S3 INTO TABLE". You need to add the appropriate IAM roles and configure them in the Aurora parameter groups.
We have been using this feature for the past year, and it is working fine. You can also do the reverse, i.e. unload data from a table into an S3 bucket. Please check the following link for more information and testing. Hope I got your question right?
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
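As a rough sketch (the bucket, file, table, and column names are made up, and the IAM role plus the aurora_load_from_s3_role parameter must already be configured as described in the link above):

-- Aurora MySQL: load a CSV file from S3 directly into a table, skipping the header row.
LOAD DATA FROM S3 's3://my-example-bucket/daily/input.csv'
INTO TABLE staging_orders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(order_id, customer_id, order_date, amount);

The column list at the end is optional, but it helps when the CSV column order does not match the table.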
Thanks,
Yuva
I have a local Hadoop cluster and want to load data into Amazon Redshift. Informatica/Talend is not an option considering the costs, so can we leverage Sqoop to export the tables from Hive into Redshift directly? Does Sqoop connect to Redshift?
The most efficient way to load data into Amazon Redshift is by placing data into Amazon S3 and then issuing the COPY command in Redshift. This performs a parallel data load across all Redshift nodes.
While Sqoop might be able to insert data into Redshift by using traditional INSERT SQL statements, this is not an efficient way to get data into Redshift.
The preferred method would be:
Export the data into Amazon S3 in CSV format (preferably gzip or bzip2 compressed)
Trigger a COPY command in Redshift
You should be able to export data to S3 by copying data to a Hive External Table in CSV format.
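A hypothetical HiveQL sketch of that export step, assuming the cluster has S3 connectivity (e.g. the s3a filesystem) configured; the table, column, and bucket names are made up:

-- Write the query output as gzip-compressed CSV so Redshift's COPY can read it directly.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE EXTERNAL TABLE export_sales (
    sale_id   INT,
    sale_date STRING,
    amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://my-example-bucket/exports/sales/';

INSERT OVERWRITE TABLE export_sales
SELECT sale_id, sale_date, amount FROM warehouse.sales;

The resulting files under that S3 prefix can then be loaded with the COPY command described above.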
Alternatively, Redshift can load data from HDFS. It needs some additional setup to grant Redshift access to the EMR cluster. See the Redshift documentation: Loading Data from Amazon EMR
The COPY command does not support upserts; it simply loads the data as many times as you run it, and you end up with duplicate rows. A better way is to use a Glue job and modify it to update-else-insert, or to use a Lambda function to upsert into Redshift.
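A hypothetical sketch of the common staging-table upsert pattern in Redshift (the table, key column, bucket, and IAM role names are made up):

BEGIN;

-- Stage the incoming batch in a temporary table with the same structure as the target.
CREATE TEMP TABLE staging (LIKE target_table);

COPY staging
FROM 's3://my-example-bucket/incremental/batch.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
GZIP;

-- Delete rows that are about to be replaced, then insert the new versions.
DELETE FROM target_table
USING staging
WHERE target_table.id = staging.id;

INSERT INTO target_table
SELECT * FROM staging;

END;

The temporary table disappears at the end of the session, and wrapping the steps in one transaction keeps other queries from seeing the target table between the delete and the insert.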