I am a new AWS user and I'm confused about its services. In our company we store our data in S3, so I created an S3 bucket and an AWS Glue crawler to load this table into a Redshift table (what we normally do in our company), and I can indeed see the table in Redshift.
Based on my research, the Glue crawler should create metadata about my data in the Glue Data Catalog, which I am also able to see. Here is my question: how does my crawler work, and does it load the S3 data into Redshift? Does my company have some special configuration that lets me load data into Redshift?
Thanks
AWS Glue does not natively interact with Amazon Redshift.
Load data from Amazon S3 to Amazon Redshift using AWS Glue - AWS Prescriptive Guidance provides an example of using AWS Glue to load data into Redshift, but it simply connects to it like a generic JDBC database.
It appears that you can Query external data using Amazon Redshift Spectrum - Amazon Redshift, but this is Redshift using the AWS Glue Data Catalog to access data stored in Amazon S3. The data is not "loaded" into Redshift. Rather, the External Table definition in Redshift tells it how to access the data directly in S3. This is very similar to Amazon Athena, which queries data stored in S3 without having to load it into a database. (Think of Redshift Spectrum as being Amazon Athena inside Amazon Redshift.)
So, there are basically two ways to query data using Amazon Redshift:
Use the COPY command to load the data from S3 into Redshift and then query it, OR
Keep the data in S3, use CREATE EXTERNAL TABLE to tell Redshift where to find it (or use an existing definition in the AWS Glue Data Catalog), then query it without loading the data into Redshift itself.
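A rough sketch of both options in SQL (the bucket, IAM role ARN, and column names below are placeholders, not details from the question):

-- Option 1: load the data into Redshift with COPY
copy my_schema.my_table
from 's3://my-bucket/my-prefix/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as parquet;

-- Option 2: keep the data in S3 and define an external table
-- (assumes an external schema, e.g. spectrum_schema, already exists)
create external table spectrum_schema.my_table (
    id   bigint,
    name varchar(100)
)
stored as parquet
location 's3://my-bucket/my-prefix/';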
I figured out why I was seeing the tables in Redshift after running the crawler. In fact, I had created an external table in Redshift rather than actually storing the table in Redshift.
Related
I have dozens of tables in my data source (RDS) and I am ingesting all of this data into Redshift through AWS Glue. I am currently manually creating tables in Redshift (through SQL) and then proceeding with the Crawler and AWS Glue to fill in the Redshift tables with the data flowing from RDS.
Is there a way I can create these target tables in Redshift automatically (based on the tables I have in RDS, since initially they will just be exact copies), rather than manually creating each one with SQL in the Redshift Query Editor?
Thanks in advance,
In AWS Glue jobs, there are two approaches for retrieving data from a database or S3: 1) using a crawler, or 2) using a direct connection to the database or S3.
So my question is: how is a crawler any better than connecting directly to a database and retrieving the data?
AWS Glue crawlers do not retrieve the actual data. A crawler accesses your data stores, works through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can be scheduled to run periodically so that they detect newly available data as well as changes to existing data, including changes to table definitions. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
The AWS Glue Data Catalog becomes a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and Amazon S3. AWS Glue crawlers help in building this metadata repository.
I have some JSON files in S3 and I was able to create databases and tables in Amazon Athena from those data files. That part is done; my next target is to copy those tables into Amazon Redshift. There are also other tables in Athena that I created on top of those data files: I created three tables from the data files in S3, and later I created new tables from those three tables. So at the moment I have five different tables that I want to create in Amazon Redshift, with or without data.
I checked the COPY command in Amazon Redshift, but there is no COPY command for Amazon Athena. Here is the available list:
COPY from Amazon S3
COPY from Amazon EMR
COPY from Remote Host (SSH)
COPY from Amazon DynamoDB
If there is no other solution, I plan to write the newly created Athena tables out as new JSON files in S3 buckets. Then we can easily COPY those from S3 into Redshift, right? Are there any other good solutions for this?
If your S3 files are in an OK format, you can use Redshift Spectrum.
1) Set up a Hive metadata catalog of your S3 files, using AWS Glue if you wish.
2) Set up Redshift Spectrum to see that data inside Redshift (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html)
3) Use CTAS to create a copy inside Redshift:
create table redshift_table as select * from redshift_spectrum_schema.redshift_spectrum_table;
I have spun up a Redshift cluster and added my S3 external schema by running
CREATE EXTERNAL SCHEMA s3 FROM DATA CATALOG
DATABASE '<aws_glue_db>'
IAM_ROLE '<redshift_s3_glue_iam_role_arn>';
to access the AWS Glue Data Catalog. Everything is fine on Redshift; I can query the data and all is well. On QuickSight, however, the table is recognized but appears empty.
Do I have to move the data into Redshift? If so, would the only reason for using Redshift be to process Parquet files?
You should be able to select from external tables in Redshift; I think the role you're using is missing access to S3.
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cross-account-glue-s3/
In the end I just wrote a custom SQL expression to select the relevant fields.
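For reference, a minimal sketch of what such a custom SQL expression might look like against the external schema created above (the table and column names here are made up):

select relevant_field_1, relevant_field_2
from s3.my_external_table
where relevant_field_1 is not null;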
I have a local Hadoop cluster and want to load data into Amazon Redshift. Informatica/Talend is not an option considering the costs, so can we leverage Sqoop to export tables from Hive into Redshift directly? Does Sqoop connect to Redshift?
The most efficient way to load data into Amazon Redshift is by placing data into Amazon S3 and then issuing the COPY command in Redshift. This performs a parallel data load across all Redshift nodes.
While Sqoop might be able to insert data into Redshift by using traditional INSERT SQL commands, it is not a good way to insert data into Redshift.
The preferred method would be:
Export the data to Amazon S3 in CSV format (preferably compressed with gzip or bzip2)
Trigger a COPY command in Redshift
You should be able to export data to S3 by copying data to a Hive External Table in CSV format.
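A minimal sketch of the COPY step, assuming gzip-compressed CSV files were exported to a hypothetical bucket and prefix (names and IAM role ARN are placeholders):

copy my_schema.my_table
from 's3://my-bucket/hive-export/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as csv
gzip;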
Alternatively, Redshift can load data from HDFS. It needs some additional setup to grant Redshift access to the EMR cluster. See the Redshift documentation: Loading Data from Amazon EMR.
The COPY command does not support upserts; it simply loads the data as many times as you run it, and you end up with duplicates. A better way is to use a Glue job and modify it to do an update-else-insert, or use a Lambda function to upsert into Redshift.
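One common pattern for an upsert (merge) into Redshift, whether a Glue job or a Lambda function drives it, is to COPY into a staging table and then merge into the target. A minimal sketch, with hypothetical table, bucket, and key names:

begin;

create temp table staging (like my_schema.target_table);

-- load the new batch into the staging table only
copy staging
from 's3://my-bucket/incremental/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as csv;

-- remove existing rows that are about to be replaced, then insert the new versions
delete from my_schema.target_table
using staging
where my_schema.target_table.id = staging.id;

insert into my_schema.target_table
select * from staging;

drop table staging;
commit;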