Here's my issue with unsigned integers in our source database (MySQL RDS):
I use AWS DMS to do an initial load of the source table, and the target is S3 (Zone1 of our data lake), saved as Parquet. I can then crawl it with Glue and query the table in Athena. All good so far.
I then created a Glue job to read that Zone1 Data Catalog table and output to another S3 bucket (our Zone2). However, the Glue job fails with: Parquet type not supported: INT64 (UINT_64)
Does anyone have a workaround that I can put into the Glue job to 'cast' this data type to something else?
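For what it's worth, one workaround pattern that gets suggested is to bypass schema inference in the Glue job by reading the Zone1 Parquet with an explicit, signed schema, letting the unsigned BIGINT land in DECIMAL(20,0), which is wide enough for the full unsigned 64-bit range. This is only a sketch under that assumption (whether your Glue/Spark version accepts the files this way needs testing), and the column names and bucket paths below are made up:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

# Hypothetical columns; the unsigned BIGINT column is declared as DECIMAL(20,0).
schema = StructType([
    StructField("id", DecimalType(20, 0), True),
    StructField("name", StringType(), True),
])

# Read Zone1 with the explicit schema instead of letting Spark infer it from
# the Parquet footers, then write the result out to Zone2.
df = spark.read.schema(schema).parquet("s3://zone1-bucket/my_table/")
df.write.mode("overwrite").parquet("s3://zone2-bucket/my_table/")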
I am a new AWS user and got confused about the services. In our company we store our data in S3, so I created an S3 bucket and an AWS Glue crawler to load this table into a Redshift table (what we normally do in our company), and I can indeed see the table in Redshift.
Based on my research, the Glue crawler should create metadata about my data in the Glue Data Catalog, which I am also able to see. Here is my question: how does my crawler work, and does it load S3 data into Redshift? Does my company have some special configuration that lets me load data into Redshift?
Thanks
AWS Glue does not natively interact with Amazon Redshift.
Load data from Amazon S3 to Amazon Redshift using AWS Glue - AWS Prescriptive Guidance provides an example of using AWS Glue to load data into Redshift, but it simply connects to it like a generic JDBC database.
It appears that you can Query external data using Amazon Redshift Spectrum - Amazon Redshift, but this is Redshift using the AWS Glue Data Catalog to access data stored in Amazon S3. The data is not "loaded" into Redshift. Rather, the External Table definition in Redshift tells it how to access the data directly in S3. This is very similar to Amazon Athena, which queries data stored in S3 without having to load it into a database. (Think of Redshift Spectrum as being Amazon Athena inside Amazon Redshift.)
So, there are basically two ways to query data using Amazon Redshift:
Use the COPY command to load the data from S3 into Redshift and then query it, OR
Keep the data in S3, use CREATE EXTERNAL TABLE to tell Redshift where to find it (or use an existing definition in the AWS Glue Data Catalog), then query it without loading the data into Redshift itself.
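To make the two options concrete, here is a minimal sketch that submits both statements through the Redshift Data API from Python; the cluster, bucket, schema, and IAM role names are placeholders, and option 2 assumes the external schema has already been created with CREATE EXTERNAL SCHEMA:

import boto3

# Redshift Data API client; cluster/database/user below are placeholders.
client = boto3.client("redshift-data")

# Option 1: COPY the S3 data into a native Redshift table, then query it there.
copy_sql = """
    COPY my_schema.sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;
"""

# Option 2: leave the data in S3 and define an external table (Redshift Spectrum);
# alternatively, point an external schema at an existing Glue Data Catalog database.
external_sql = """
    CREATE EXTERNAL TABLE spectrum_schema.sales (
        sale_id BIGINT,
        amount  DOUBLE PRECISION
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales/';
"""

for sql in (copy_sql, external_sql):
    client.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )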
I figured out what I was actually seeing in Redshift after running the crawler: in fact, I had created an external table in Redshift, not loaded the table into Redshift.
I have data that is coming into an S3 bucket and I would like to run a query on it every hour. The data comes in as JSON. I crawl it, run a job on the data to transform it to ORC format, and crawl it again to create a table that's faster for queries than the original JSONs (as they are deeply nested). I'm trying to query the data with Athena. I have managed to link the previous steps together using Lambda and CloudWatch Events.
The problem here is that the last crawler is supposed to create new tables instead of just new partitions of the same table, so the table name is not known before the list of jobs runs. I found that you can listen for the creation of a new table and for the completion of a crawler, but according to Amazon's documentation, the log entry for the end of a crawler's run doesn't contain the name of the newly created table. Is there a way to get this table name dynamically and query it using Lambda or Athena? Thanks
Why not invoke the Lambda function from the Glue job after the crawler completes? The table name is the folder in the S3 bucket in which you stored the ORC data. Since that write is done in the Glue job, you already have the folder name, which you can pass to the Lambda function from the Glue job.
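A rough sketch of that idea from the Glue job side, assuming the job already knows which output folder it wrote the ORC data to (the function name, database, and folder below are hypothetical):

import json
import boto3

lambda_client = boto3.client("lambda")

# The crawler typically names the table after the S3 folder holding the ORC
# output, so the Glue job can derive the table name from the prefix it wrote to.
output_folder = "orc_output_2024_01_01"   # hypothetical folder name
table_name = output_folder.lower()

# Asynchronously invoke the Lambda function that runs the Athena query,
# passing the table name in the payload.
lambda_client.invoke(
    FunctionName="run-athena-query",      # hypothetical function name
    InvocationType="Event",
    Payload=json.dumps({"database": "my_db", "table": table_name}),
)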
I have a service running that populates my S3 bucket with compressed log files, but the log files do not have a fixed schema, while Athena expects a fixed schema (which I wrote while creating the table).
So my question is, as in the title: is there any way I can query data with a dynamic schema? If not, is there another service like Athena that can do the same thing?
Amazon Athena can't do that by itself, but you can configure an AWS Glue crawler to automatically infer the schema of your JSON files. The crawler can run on a schedule, so your files will be indexed automatically even if the schema changes. Athena will use the Glue data catalog if AWS Glue is available in the region you're running Athena in.
See Cataloging Tables with a Crawler in the AWS Glue docs for the details on how to set that up.
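For reference, a scheduled crawler can also be created programmatically with boto3; the names, role, and path below are placeholders, and the same options are available in the console:

import boto3

glue = boto3.client("glue")

# Create a crawler that re-scans the JSON logs every hour and updates the
# Glue Data Catalog table as the schema evolves.
glue.create_crawler(
    Name="logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/logs/"}]},
    Schedule="cron(0 * * * ? *)",  # hourly
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)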
I'm trying to create an AWS Glue ETL job that would load data from Parquet files stored in S3 into a Redshift table.
The Parquet files were written using pandas with the 'simple' file schema option into multiple folders in an S3 bucket.
The layout looks like this:
s3://bucket/parquet_table/01/file_1.parquet
s3://bucket/parquet_table/01/file_2.parquet
s3://bucket/parquet_table/01/file_3.parquet
s3://bucket/parquet_table/02/file_1.parquet
s3://bucket/parquet_table/02/file_2.parquet
s3://bucket/parquet_table/02/file_3.parquet
I can use an AWS Glue crawler to create a table in the AWS Glue Data Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL job that would copy the same table to Redshift.
If I crawl a single file, or if I crawl multiple files in one folder, it works; as soon as there are multiple folders involved, I get the error below:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
Similar issues appear if, instead of the 'simple' schema, I use 'hive'. Then there are multiple folders and also empty Parquet files that throw
java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)
Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?
Redshift doesn't support parquet format. Redshift Spectrum does. Athena also supports parquet format.
The error you're facing occurs because, when Spark/Glue reads the Parquet files from S3, it expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll have to lay out the S3 hierarchy as Hive-style partitions, something like below:
s3://your-bucket/parquet_table/id=1/file1.parquet
s3://your-bucket/parquet_table/id=2/file2.parquet
and so on.
Then use the path below to read all the files in the bucket:
location : s3://your-bucket/parquet_table
If the data in S3 is partitioned the above way, you won't face any issues.
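If the files are being written from pandas anyway, one way to get that layout is to pass partition_cols so the key=value folders are created for you; this is a sketch with made-up column and bucket names, and it assumes pyarrow and s3fs are installed:

import pandas as pd

# Hypothetical data; "id" becomes the partition key.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "value": ["a", "b", "c", "d"],
})

# Writes s3://your-bucket/parquet_table/id=1/... and id=2/..., which is the
# Hive-style layout that Spark/Glue expects.
df.to_parquet(
    "s3://your-bucket/parquet_table/",
    engine="pyarrow",
    partition_cols=["id"],
)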
I'm running a pyspark job which creates a dataframe and stores it to S3 as below:
df.write.saveAsTable(table_name, format="orc", mode="overwrite", path=s3_path)
I can read the ORC file without a problem, just by using spark.read.orc(s3_path), so there is schema information in the ORC file, as expected.
However, I'd really like to view the dataframe contents using Athena. Clearly, if I wrote to my Hive metastore, I could call Hive and do a show create table ${table_name}, but that's a lot of work when all I want is a simple schema.
Is there another way?
One of the approaches would be to set up a Glue crawler for your S3 path, which would create a table in the AWS Glue Data Catalog. Alternatively, you could create the Glue table definition via the Glue API.
The AWS Glue Data Catalog is fully integrated with Athena, so you would see your Glue table in Athena, and be able to query it directly:
http://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
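If you'd rather skip the crawler, the table definition for the ORC output can be registered directly through the Glue API, roughly like this; the database, table, columns, and S3 location are placeholders for whatever your job wrote:

import boto3

glue = boto3.client("glue")

# Register the ORC output as an external table in the Glue Data Catalog so
# Athena can query it without running a crawler.
glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_table",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},       # hypothetical columns
                {"Name": "payload", "Type": "string"},
            ],
            "Location": "s3://my-bucket/my_table/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
            },
        },
    },
)

The crawler route is usually less work if the schema changes over time, since it keeps the catalog entry in sync for you.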