Copying an entire schema from one Redshift cluster to another

I would like to copy my schema's table structure, without the data, from one Redshift cluster to another so that both clusters have the same structure.
I understand we can UNLOAD from one cluster to an S3 bucket and then COPY into another cluster. However, this applies to a single table. Does it apply to a whole schema as well? I have 262 tables in my current cluster.
Would UNLOAD and COPY allow me to copy my schema's tables to another cluster?

UNLOAD and COPY do not define or copy a schema.
Instead, I think you could use Amazon Redshift as a source for the AWS Schema Conversion Tool. This will 'extract' the schema for a table and give you the SQL DDL commands that can be used to create the table in another Amazon Redshift cluster.
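Alternatively, if you install the AdminViews from the amazon-redshift-utils repository, a query along these lines could dump the DDL for every table in a schema in one pass (the admin schema location and the 'my_schema' name here are assumptions):
-- Generate CREATE TABLE statements for all tables in one schema,
-- using the v_generate_tbl_ddl admin view from amazon-redshift-utils
select ddl
from admin.v_generate_tbl_ddl
where schemaname = 'my_schema'  -- assumed schema name
order by tablename, seq;
Saving that output and replaying it against the target cluster would recreate all 262 tables without the data.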

Related

How to create tables automatically in Redshift through AWS Glue based on RDS data source

I have dozens of tables in my data source (RDS) and I am ingesting all of this data into Redshift through AWS Glue. I am currently manually creating tables in Redshift (through SQL) and then proceeding with the Crawler and AWS Glue to fill in the Redshift tables with the data flowing from RDS.
Is there a way I can create these target tables within Redshift automatically (based on the tables I have in RDS, as these will just be exact copies initially), rather than creating each one manually with SQL in the Redshift Query Editor?
Thanks in advance,

How to directly copy Amazon Athena tables into Amazon Redshift?

I have some JSON files in S3 and I was able to create databases and tables in Amazon Athena from those data files. That's done; my next target is to copy those tables into Amazon Redshift. There are also other tables in Amazon Athena which I created based on those data files: I created three tables from the data files in S3, and later I created new tables from those 3 tables. So at the moment I have 5 different tables which I want to create in Amazon Redshift, with or without data.
I checked the COPY command in Amazon Redshift, but there is no COPY command for Amazon Athena. Here is the list of available sources:
COPY from Amazon S3
COPY from Amazon EMR
COPY from Remote Host (SSH)
COPY from Amazon DynamoDB
If there is no other solution, I plan to create new JSON files in S3 buckets based on the newly created tables in Amazon Athena. Then we could easily copy those from S3 into Redshift, right? Are there any other good solutions for this?
If your S3 files are in a suitable format you can use Redshift Spectrum.
1) Set up a Hive metadata catalog of your S3 files, using AWS Glue if you wish.
2) Set up Redshift Spectrum to see that data inside Redshift (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html)
3) Use CTAS to create a copy inside Redshift:
create table redshift_table as select * from redshift_spectrum_schema.redshift_spectrum_table;
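For step 2, the external schema can be created from within Redshift itself; a minimal sketch, where the Glue database name and IAM role ARN are placeholders:
-- Expose a Glue/Hive catalog database to Redshift as an external schema
create external schema redshift_spectrum_schema
from data catalog
database 'my_glue_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;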

How to import Hive tables from a Hadoop data lake to AWS RDS?

I need suggestions on importing data from a Hadoop data lake (Kerberos-authenticated) to AWS. All the Hive tables should land in S3 and then need to be loaded into AWS RDS.
I have considered the following options:
1) AWS Glue?
2) Spark connecting to the Hive metastore?
3) Connecting to Impala from AWS?
There are around 50 tables to be imported. How can I maintain the schema? Is it better to import the data and then create a separate schema in RDS?
Personally, I would dump a list of all tables that need to be moved.
From that, run SHOW CREATE TABLE over them all, and save the queries.
Run distcp, or however else you want to move the data into S3 / EBS.
Edit each create table command to specify a Hive table location that's in the cloud data store. I believe you'll need to make all of these external tables, since you cannot put data directly into the Hive warehouse directory and have the metastore immediately know of it.
Run all the commands on the AWS Hive connection.
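A minimal HiveQL sketch of the DDL steps above (the database, table, column, and bucket names are all assumptions):
-- On the source cluster: capture the existing DDL for a table
show create table datalake_db.events;

-- On the AWS Hive connection: recreate it as an external table in S3
create external table datalake_db.events (
  event_id bigint,
  event_time timestamp,
  payload string
)
stored as parquet
location 's3://my-bucket/datalake/events/';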
I have coworkers who have used CircusTrain.
Impala and Spark are meant for processing. You're going to need to deal mostly with the Hive metastore here.

What are the steps to use Redshift Spectrum?

Currently I am using Amazon Redshift as well as Amazon S3 to store data. Now I want to use Spectrum to improve performance, but I am confused about how to use it properly.
If I am using SQL Workbench, can I create the external schema from there, or do I need to create it from the AWS console or Athena?
Do I need to have Athena in a specific region? Is it possible to use Spectrum without Athena?
When I try to create an external schema through SQL Workbench, it throws the error "CREATE EXTERNAL SCHEMA is not enabled". How can I enable this?
Please help if you have used Spectrum, and let me know the detailed steps for using it.
Redshift Spectrum requires an external data catalog that contains the definition of the table. It is this data catalog that contains the reference to the files in S3, rather than the external table definition in Redshift. This data catalog can be defined in Elastic MapReduce as a Hive Catalog (good if you have an existing EMR deployment) or in Athena (good if you don't have EMR or don't want to get into managing Hadoop). The Athena route can be managed fully by Redshift, if you wish.
It looks to me like your issue is one of four things. Either:
Your Redshift cluster is not in an AWS region that currently supports Athena and Spectrum.
Your Redshift cluster version doesn't support Spectrum yet (it requires version 1.0.1294 or later).
Your IAM policies don't allow Redshift control over Athena.
You're not using the CREATE EXTERNAL DATABASE IF NOT EXISTS parameter on your CREATE EXTERNAL SCHEMA statement.
To allow Redshift to manage Athena you'll need to attach an IAM policy to your Redshift cluster that allows it Full Control over Athena, as well as Read access to the S3 bucket containing your data.
Once that's in place, you can create your external schema as you have been already, ensuring that the CREATE EXTERNAL DATABASE IF NOT EXISTS argument is also passed. This makes sure that the external database is created in Athena if you don't have a pre-existing configuration: http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html
Finally, run your CREATE EXTERNAL TABLE statement, which will transparently create the table metadata in the Athena data catalog: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
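A sketch of those two statements together (the role ARN, database, bucket, and column definitions are all assumptions):
-- Create the external schema, creating the Athena database if needed
create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;

-- Define an external table whose data lives in S3
create external table spectrum_schema.sales (
  sale_id integer,
  sale_time timestamp,
  amount decimal(10,2)
)
stored as parquet
location 's3://my-bucket/sales/';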

Can we use Sqoop to export data from Hadoop (Hive) to Amazon Redshift?

I have a local Hadoop cluster and want to load data into Amazon Redshift. Informatica/Talend is not an option considering the costs, so can we leverage Sqoop to export the tables from Hive into Redshift directly? Does Sqoop connect to Redshift?
The most efficient way to load data into Amazon Redshift is by placing data into Amazon S3 and then issuing the COPY command in Redshift. This performs a parallel data load across all Redshift nodes.
While Sqoop might be able to insert data into Redshift using traditional INSERT SQL statements, this is not an efficient way to load Redshift.
The preferred method would be:
Export the data to Amazon S3 in CSV format (preferably compressed as .gz or .bz2)
Trigger a COPY command in Redshift
You should be able to export the data to S3 by copying it into a Hive external table stored in CSV format.
Alternatively, Redshift can load data from HDFS. It needs some additional setup to grant Redshift access to the EMR cluster. See the Redshift documentation: Loading Data from Amazon EMR.
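A minimal COPY sketch for the S3 route (the table, bucket, and role names are assumptions):
-- Parallel load of gzipped CSV files from S3 into a Redshift table
copy my_table
from 's3://my-bucket/export/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
csv gzip;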
Note that the COPY command does not support upserts; it simply loads the data as many times as you run it, and you end up with duplicate rows. A better way is to use a Glue job modified to update-else-insert, or a Lambda function that upserts into Redshift.
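The usual pattern behind such an upsert is a staged merge; a sketch, assuming a target table my_table keyed on an id column:
begin;
-- load the incoming rows into a temporary staging table
create temp table staging (like my_table);
copy staging
from 's3://my-bucket/export/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
csv gzip;
-- remove existing versions of the incoming rows, then insert the new ones
delete from my_table using staging where my_table.id = staging.id;
insert into my_table select * from staging;
drop table staging;
end;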