Currently I am using Amazon Redshift as well as Amazon S3 to store data. Now I want to use Spectrum to improve performance, but I am confused about how to use it properly.
If I am using SQL Workbench, can I create the external schema from there, or do I need to create it from the AWS console or Athena?
Do I need to have Athena in a specific region? Is it possible to use Spectrum without Athena?
When I try to create an external schema through SQL Workbench, it throws the error "CREATE EXTERNAL SCHEMA is not enabled". How can I enable this?
If someone has used Spectrum, please let me know the detailed steps to use it.
Redshift Spectrum requires an external data catalog that contains the definition of the table. It is this data catalog that contains the reference to the files in S3, rather than the external table definition in Redshift. This data catalog can be defined in Elastic MapReduce as a Hive Catalog (good if you have an existing EMR deployment) or in Athena (good if you don't have EMR or don't want to get into managing Hadoop). The Athena route can be managed fully by Redshift, if you wish.
It looks to me like your issue is one of four things. Either:
Your Redshift cluster is not in an AWS region that currently supports Athena and Spectrum.
Your Redshift cluster version doesn't support Spectrum yet (it requires 1.0.1294 or later).
Your IAM policies don't allow Redshift control over Athena.
You're not using the CREATE EXTERNAL DATABASE IF NOT EXISTS parameter on your CREATE EXTERNAL SCHEMA statement.
To allow Redshift to manage Athena you'll need to attach an IAM policy to your Redshift cluster that allows it Full Control over Athena, as well as Read access to the S3 bucket containing your data.
Once that's in place, you can create your external schema as you have been already, ensuring that the CREATE EXTERNAL DATABASE IF NOT EXISTS argument is also passed. This makes sure that the external database is created in Athena if you don't have a pre-existing configuration: http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html
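As a rough sketch (the schema name, catalog database name, and IAM role ARN below are placeholders, not values from your setup), the statement typically looks like this:
-- Creates an external schema in Redshift backed by the Athena/Glue data catalog;
-- CREATE EXTERNAL DATABASE IF NOT EXISTS creates the catalog database if it is missing.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;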
Finally, run your CREATE EXTERNAL TABLE statement, which will transparently create the table metadata in the Athena data catalog: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
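For example, a minimal sketch assuming comma-delimited files under a placeholder bucket path (adjust the columns and format to your data):
-- Defines an external table whose data lives in S3; no data is loaded into Redshift.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales/';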
Related
I am a new AWS user and got confused about its services. In our company we store our data in S3, so I created an S3 bucket and an AWS Glue crawler to load this table into a Redshift table (what we normally do in our company), which I can successfully see in Redshift.
Based on my research, the Glue crawler should create metadata related to my data in the Glue Data Catalog, which again I am able to see. Here is my question: how does my crawler work, and does it load S3 data into Redshift? Does my company need a special configuration that lets me load data into Redshift?
Thanks
AWS Glue does not natively interact with Amazon Redshift.
Load data from Amazon S3 to Amazon Redshift using AWS Glue - AWS Prescriptive Guidance provides an example of using AWS Glue to load data into Redshift, but it simply connects to it like a generic JDBC database.
It appears that you can Query external data using Amazon Redshift Spectrum - Amazon Redshift, but this is Redshift using the AWS Glue Data Catalog to access data stored in Amazon S3. The data is not "loaded" into Redshift. Rather, the External Table definition in Redshift tells it how to access the data directly in S3. This is very similar to Amazon Athena, which queries data stored in S3 without having to load it into a database. (Think of Redshift Spectrum as being Amazon Athena inside Amazon Redshift.)
So, there are basically two ways to query data using Amazon Redshift:
Use the COPY command to load the data from S3 into Redshift and then query it, OR
Keep the data in S3, use CREATE EXTERNAL TABLE to tell Redshift where to find it (or use an existing definition in the AWS Glue Data Catalog), then query it without loading the data into Redshift itself.
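A minimal sketch of both approaches (the table names, bucket paths, and IAM role ARN are placeholders):
-- Option 1: COPY the S3 data into a native Redshift table and query it there.
COPY public.events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS PARQUET;

-- Option 2: leave the data in S3 and define an external (Spectrum) table over it.
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id   BIGINT,
    event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';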
I figured out what I was actually seeing in Redshift after running the crawler: I had created an external table in Redshift rather than storing the table in Redshift itself.
I am trying to create an external table in Amazon Redshift using the statement
mentioned at this link.
In my case I want the LOCATION to be parameterized instead of a static value.
I am using DBeaver for Amazon Redshift.
If your partitions are Hive-compatible (<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which will add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
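For instance, in the Athena query editor (the table name is a placeholder):
-- Scans the table's S3 location and registers any Hive-style partition
-- directories (e.g. dt=2021-01-01/) that are not yet in the catalog.
MSCK REPAIR TABLE my_table;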
You can also try using partition projection if you don't use Hive-compatible partitions; there you define the structure of the file locations in relation to the partition columns and parameters.
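Partition projection is configured through Athena table properties; a hedged sketch for a table partitioned by a dt date column (the table name, column name, date range, and S3 template are assumptions for illustration):
-- Tells Athena to derive partition values from the key layout instead of the catalog.
ALTER TABLE my_table SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2020-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-bucket/data/${dt}/'
);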
If those don't work with you, you can use AWS Glue Crawlers which supposedly automatically detect partitions: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work for you, well, then your problem is very specific. I suggest rolling up your sleeves and writing some code, deployed on Lambda or an AWS Glue Python Shell job. Here's a bunch of examples where other people tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/@alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070
I have spun up a Redshift cluster and added my S3 external schema by running
CREATE EXTERNAL SCHEMA s3 FROM DATA CATALOG
DATABASE '<aws_glue_db>'
IAM_ROLE '<redshift_s3_glue_iam_role_arn>';
to access the AWS Glue Data Catalog. Everything is fine on Redshift, I can query data and all is well. On Quicksight, however, the table is recognized but is empty.
Do I have to move the data into Redshift? If so, would the only reason I should be using Redshift be to process Parquet files?
You should be able to select from external tables in Redshift. I think the role you're using is missing access to S3:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cross-account-glue-s3/
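If it helps to confirm what Redshift sees on its side, you can inspect the external schema's settings, including the IAM role it uses (assuming the schema is named s3 as above):
-- esoptions includes the IAM role ARN attached to the external schema
SELECT schemaname, databasename, esoptions
FROM svv_external_schemas
WHERE schemaname = 's3';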
In the end I just wrote a custom SQL expression to select the relevant fields
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations, this is a prototype and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schemas in the path; instead type MyDatabase/%. For information about which JDBC data stores support schemas, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error. I wouldn't categorize it as a syntax error. When I ran into this, it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
I would like to export data from an Amazon Redshift table into an external table stored in Amazon S3. Every hour, I want to export rows from the Redshift source into the external table target.
What kind of options exist in AWS to achieve this?
I know that there is the UNLOAD command that allows me to export data to S3, but I don't think it would work for storing the data in an external table (which is also partitioned). Or is Amazon EMR the only method to get this working?
Amazon Redshift Spectrum external tables are read-only. You cannot update them from Redshift (eg via INSERT commands).
Therefore, you would need a method to create the files directly in S3.
UNLOAD can certainly do this, but it cannot save the data in a partition structure.
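For reference, a minimal UNLOAD sketch (the table, time filter, bucket path, and role ARN are placeholders):
-- Writes the query results as Parquet files under the S3 prefix,
-- but in a flat layout rather than Hive-style partition directories.
UNLOAD ('SELECT * FROM public.events WHERE event_time >= DATEADD(hour, -1, GETDATE())')
TO 's3://my-bucket/exports/events_'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS PARQUET;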
Amazon EMR would, indeed, be a good option. These days it is charged per-second, so it would only need to run long enough to export the data. You could use your preferred tool (eg Hive or Spark) to export the data from Redshift, then write it into a partitioned external table.
For example, see: Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning | AWS Big Data Blog
Another option might be AWS Glue. I'm not too familiar with it, but it can output into partitions, so this might be an even easier method to accomplish your goal!
See: Managing Partitions for ETL Output in AWS Glue - AWS Glue
It's now possible to insert into an external table, since June 2020 I think:
https://aws.amazon.com/about-aws/whats-new/2020/06/amazon-redshift-now-supports-writing-to-external-tables-in-amazon-s3/
And here's the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_INSERT_external_table.html
Basically there are 2 ways:
INSERT INTO external_schema.table_name { select_statement }
Or
CREATE EXTERNAL TABLE AS { SELECT }
Typically you specify the Glue database name in your Redshift external schema (e.g. my_stg), so any external table you create inside the Redshift external schema already knows the Glue catalog database name.
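A rough sketch of both forms (the schema, table, column names, and bucket path are placeholders):
-- Write query results into an existing external table's S3 location
INSERT INTO my_stg.events_ext
SELECT event_id, event_time
FROM public.events;

-- Or create a new partitioned external table directly from a query
-- (the partition column must come last in the SELECT list)
CREATE EXTERNAL TABLE my_stg.events_by_day
PARTITIONED BY (event_date)
STORED AS PARQUET
LOCATION 's3://my-bucket/events_by_day/'
AS SELECT event_id, event_time, TRUNC(event_time) AS event_date
FROM public.events;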
That's good news since the OP's question is from 2018 👍