Use a Redshift Spectrum external table defined in the Glue Data Catalog

I have a table defined in Glue data catalog that I can query using Athena. As there is some data in the table that I want to use with other Redshift tables, can I access the table defined in Glue data catalog?
What will be the create external table query to reference the table definition in Glue catalog?

From the AWS documentation (Creating External Schemas):
create external schema athena_schema from data catalog
database 'sampledb'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
region 'us-east-2';
This creates a schema athena_schema that points to the sampledb database in Athena / Glue.
You need to grant appropriate access to the IAM role you specify: the Redshift cluster needs to be able to assume the role, and the role needs access to Glue.
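As a rough sketch of what this enables: once the external schema exists, the Glue tables can be referenced as athena_schema.<table_name> and joined with ordinary Redshift tables, with no separate CREATE EXTERNAL TABLE statement. The names public.customers, my_glue_table, customer_id and event_time below are made up for illustration; only athena_schema comes from the statement above.

```sql
-- Hypothetical names: public.customers is a local Redshift table,
-- athena_schema.my_glue_table is a table already defined in the Glue database.
SELECT c.customer_id,
       c.customer_name,
       g.event_time
FROM public.customers AS c
JOIN athena_schema.my_glue_table AS g
  ON g.customer_id = c.customer_id
LIMIT 10;
```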

Related

How does a Glue crawler load data into a Redshift table?

I am a new AWS user and am confused about its services. Our company stores its data in S3, so I created a bucket in S3 and an AWS Glue crawler to load this table into a Redshift table (which is what we normally do in our company), and I can successfully see it in Redshift.
Based on my research, the Glue crawler should create metadata about my data in the Glue Data Catalog, which I am also able to see. Here is my question: how does my crawler work, and does it load S3 data into Redshift? Does my company need a special configuration that lets me load data into Redshift?
Thanks
AWS Glue does not natively interact with Amazon Redshift.
The AWS Prescriptive Guidance pattern "Load data from Amazon S3 to Amazon Redshift using AWS Glue" provides an example of using AWS Glue to load data into Redshift, but it simply connects to it like a generic JDBC database.
It appears that you can query external data using Amazon Redshift Spectrum (see "Query external data using Amazon Redshift Spectrum" in the Amazon Redshift documentation), but this is Redshift using the AWS Glue Data Catalog to access data stored in Amazon S3. The data is not "loaded" into Redshift. Rather, the external table definition in Redshift tells it how to access the data directly in S3. This is very similar to Amazon Athena, which queries data stored in S3 without having to load it into a database. (Think of Redshift Spectrum as being Amazon Athena inside Amazon Redshift.)
So, there are basically two ways to query data using Amazon Redshift:
Use the COPY command to load the data from S3 into Redshift and then query it, OR
Keep the data in S3, use CREATE EXTERNAL TABLE to tell Redshift where to find it (or use an existing definition in the AWS Glue Data Catalog), then query it without loading the data into Redshift itself.
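A minimal sketch of the two options side by side. It assumes simple comma-delimited files with a header row and no quoted fields. The bucket path, role ARN, table names and columns are all placeholders, and spectrum_schema is assumed to be an external schema that already exists.

```sql
-- Option 1: load the data into a regular Redshift table with COPY.
CREATE TABLE public.sales_loaded (
  sale_id  INTEGER,
  sold_at  TIMESTAMP,
  amount   DECIMAL(8,2)
);

COPY public.sales_loaded
FROM 's3://my-example-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- Option 2: leave the data in S3 and describe it with an external table.
CREATE EXTERNAL TABLE spectrum_schema.sales_external (
  sale_id  INTEGER,
  sold_at  TIMESTAMP,
  amount   DECIMAL(8,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/sales/'
TABLE PROPERTIES ('skip.header.line.count'='1');

-- Either table can then be queried with ordinary SQL.
SELECT COUNT(*) FROM public.sales_loaded;
SELECT COUNT(*) FROM spectrum_schema.sales_external;
```

With option 1 the rows are stored in Redshift; with option 2 every query reads the files in S3 at run time.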
I figured out why I was seeing the tables in Redshift after running the crawler: I had in fact created an external table in Redshift, not stored the table in Redshift.

How to create tables automatically in Redshift through AWS Glue based on RDS data source

I have dozens of tables in my data source (RDS) and I am ingesting all of this data into Redshift through AWS Glue. I am currently manually creating tables in Redshift (through SQL) and then proceeding with the Crawler and AWS Glue to fill in the Redshift tables with the data flowing from RDS.
Is there a way I can create these target tables within Redshift automatically (based on the tables I have in RDS, as these will just be an exact same copy initially) and not manually create each one of them with SQL in the Redshift Query Editor section?
Thanks in advance,

Populate external schema table in Redshift from S3 bucket file

I am new to AWS and trying to figure out how to populate a table within an external schema residing in Amazon Redshift. I used AWS Glue to create a table from a .csv file that sits in an S3 bucket. I can query the newly created table via Amazon Athena.
My task is to take the data and populate a table living in a Redshift external schema. I tried creating a job within Glue, but had no luck.
This is where I am stuck. Am I supposed to first create an empty destination table that mirrors the table that I can query using Athena?
Thank you to anyone in advance who might be able to assist!!!
Redshift Spectrum and Athena both use the Glue Data Catalog for external tables. When you create a new Redshift external schema that points at your existing Glue catalog, the tables it contains will immediately exist in Redshift.
-- Create the Redshift Spectrum schema
-- (the account ID is omitted from the ARN below; a real role ARN includes your AWS account ID)
CREATE EXTERNAL SCHEMA IF NOT EXISTS my_redshift_schema
FROM DATA CATALOG DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam:::role/MyIAMRole'
;
-- Review the schema info
SELECT *
FROM svv_external_schemas
WHERE schemaname = 'my_redshift_schema'
;
-- Review the tables in the schema
SELECT *
FROM svv_external_tables
WHERE schemaname = 'my_redshift_schema'
;
-- Confirm that the table returns data
SELECT *
FROM my_redshift_schema.my_external_table LIMIT 10
;
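If the goal is instead to have a copy of the data stored inside Redshift, one possible approach is to select from the external table into a regular table. This is only a sketch: public.my_local_table is a made-up name, and the schema and table names are reused from the statements above.

```sql
-- Create a local Redshift table from the external table in a single step.
CREATE TABLE public.my_local_table AS
SELECT *
FROM my_redshift_schema.my_external_table;

-- Later refreshes could append rows to the existing local table.
INSERT INTO public.my_local_table
SELECT *
FROM my_redshift_schema.my_external_table;
```

No empty destination table is needed for the first statement, since CREATE TABLE AS derives the column definitions from the query.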

How do you connect to an external schema/table on Redshift Spectrum through AWS Quicksight?

I have spun up a Redshift cluster and added my S3 external schema by running
CREATE EXTERNAL SCHEMA s3 FROM DATA CATALOG
DATABASE '<aws_glue_db>'
IAM_ROLE '<redshift_s3_glue_iam_role_arn>';
to access the AWS Glue Data Catalog. Everything is fine on Redshift: I can query the data and all is well. In QuickSight, however, the table is recognized but is empty.
Do I have to move the data into Redshift? If so, would the only reason to use Redshift be to process Parquet files?
You should be able to select from external tables in Redshift. I think the role you're using is missing access to S3.
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cross-account-glue-s3/
In the end I just wrote a custom SQL expression to select the relevant fields.
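For illustration, a custom SQL query of this kind might look like the sketch below. Only the schema name s3 comes from the question; the table and column names are hypothetical. The first statement is an optional check, run in Redshift, that the external table and its S3 location are visible.

```sql
-- Optional check in Redshift: list the external tables and their S3 locations.
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 's3';

-- Custom SQL for the QuickSight dataset (hypothetical table and columns).
SELECT order_id,
       order_date,
       total_amount
FROM s3.my_external_table;
```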

What are the steps to use Redshift Spectrum?

Currently I am using Amazon Redshift as well as Amazon S3 to store data. Now I want to use Spectrum to improve performance, but I am confused about how to use it properly.
If I am using SQL Workbench, can I create the external schema from there, or do I need to create it from the AWS console or Athena?
Do I need to have Athena in a specific region? Is it possible to use Spectrum without Athena?
When I try to create an external schema through SQL Workbench, it throws the error "CREATE EXTERNAL SCHEMA is not enabled". How can I enable this?
Please help if you have used Spectrum, and let me know the detailed steps to use it.
Redshift Spectrum requires an external data catalog that contains the definition of the table. It is this data catalog that contains the reference to the files in S3, rather than the external table definition in Redshift. This data catalog can be defined in Elastic MapReduce as a Hive Catalog (good if you have an existing EMR deployment) or in Athena (good if you don't have EMR or don't want to get into managing Hadoop). The Athena route can be managed fully by Redshift, if you wish.
It looks to me like your issue is one of four things. Either:
Your Redshift cluster is not in an AWS region that currently supports Athena and Spectrum.
Your Redshift cluster version doesn't support Spectrum yet (it requires 1.0.1294 or later).
Your IAM policies don't allow Redshift control over Athena.
You're not using the CREATE EXTERNAL DATABASE IF NOT EXISTS parameter on your CREATE EXTERNAL SCHEMA statement.
To allow Redshift to manage Athena you'll need to attach an IAM role to your Redshift cluster whose policy allows it Full Control over Athena, as well as Read access to the S3 bucket containing your data.
Once that's in place, you can create your external schema as you have been already, ensuring that the CREATE EXTERNAL DATABASE IF NOT EXISTS argument is also passed. This makes sure that the external database is created in Athena if you don't have a pre-existing configuration: http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html
Finally, run your CREATE EXTERNAL TABLE statement, which will transparently create the table metadata in the Athena data catalog: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
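Put together, the steps above might look roughly like the following sketch. The schema, database, table, column and bucket names and the role ARN are all placeholders, and the data is assumed to be stored as Parquet.

```sql
-- Step 1: create the external schema, creating the external database
-- in the data catalog if it does not already exist.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Step 2: define an external table over the files in S3.
CREATE EXTERNAL TABLE spectrum_schema.page_views (
  user_id    BIGINT,
  page_url   VARCHAR(2048),
  viewed_at  TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/page_views/';

-- Step 3: query it like any other table.
SELECT COUNT(*)
FROM spectrum_schema.page_views;
```

These statements can be run from any SQL client connected to the cluster, including SQL Workbench. CREATE EXTERNAL TABLE cannot run inside a transaction block, so autocommit may need to be enabled in the client.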