AWS Glue Crawler creates a table for every file

I created a test Redshift cluster and enabled audit logging on the database. This produces connection logs, user logs and user activity logs (details about the logs are available here). The logs are written to the S3 bucket at the following location:
s3://bucket_name/AWSLogs/123456789012/redshift/<region>/<year>/<month>/<date>/*_<log_type>_<timestamp>.gz
Next, I created a Glue Crawler, pointed the data store to s3://bucket_name/AWSLogs/123456789012/redshift, and left the remaining options at their default values.
When I run the Crawler, it creates a separate table for every log file. Instead, I expect it to create 3 tables (one each for the user log, user activity log and connection log).
Following are some things I tried with no success:
Updated the data store to point to a prefix further inside the bucket, like s3://bucket_name/AWSLogs/123456789012/redshift/<region>.
Grouping behavior: create a single schema for each S3 path
Configuration options: add new columns only
Am I missing something here? Thank you.

You can't keep files with all 3 schemas under one folder. They should be in separate folders before running the crawler at the root folder.
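In boto3 terms, the advice above amounts to giving the crawler one include path per log type. This is only a sketch under that assumption; the folder layout, crawler name, role and catalog database below are placeholders, and the Configuration line is the "combine compatible schemas" grouping option mentioned in the question.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="redshift-audit-logs",  # placeholder crawler name
    Role="AWSGlueServiceRoleDefault",  # placeholder Glue service role
    DatabaseName="redshift_audit",  # placeholder catalog database
    Targets={
        "S3Targets": [
            # One folder per log type, after the files have been separated out.
            {"Path": "s3://bucket_name/redshift-logs/connectionlog/"},
            {"Path": "s3://bucket_name/redshift-logs/userlog/"},
            {"Path": "s3://bucket_name/redshift-logs/useractivitylog/"},
        ]
    },
    # The "create a single schema for each S3 path" grouping option from the question.
    Configuration='{"Version":1.0,"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas"}}',
)
glue.start_crawler(Name="redshift-audit-logs")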

Glue crawler creating tables from files inside folders

I'm trying to crawl an S3 bucket with multiple folders, each one containing some CSV files extracted by a Glue Job from Amazon RDS.
At the moment, this is basically the schema for S3:
s3://bucket/folder_table_x/files
s3://bucket/folder_table_y/files
The goal is to crawl this bucket and its folders to create a new database and then query it via Amazon Athena.
But I'm getting tables like the following (that is, named after the file, not the folder):
run-unnamed-1-part-r-00000
Most of the tables are being created correctly, but I'm not able to deal with some of them.
I've already set the table level to 2 (is that right?) and also enabled the option that says "Create a single schema for each S3 path".
The files that are being created as tables contain only the header and no data.
Can anyone help?
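For reference, the two settings mentioned in the question ("table level" and "Create a single schema for each S3 path") map to the crawler's Configuration JSON. A minimal boto3 sketch follows, with a placeholder crawler name and the table level of 2 taken from the question:

import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="rds-export-crawler",  # placeholder crawler name
    # Table level 2 creates tables one level below the bucket, i.e. at
    # s3://bucket/folder_table_x/, and CombineCompatibleSchemas is the
    # "Create a single schema for each S3 path" option.
    Configuration=(
        '{"Version":1.0,'
        '"Grouping":{"TableGroupingPolicy":"CombineCompatibleSchemas",'
        '"TableLevelConfiguration":2}}'
    ),
)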

Why is my AWS Glue crawler not creating any tables?

I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations; this is a prototype and I simply want to dump the DB to S3 to start testing the various toolchains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and that the include path lists the data target you want, e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schema in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error, though I wouldn't categorize it as a syntax error. When I ran into this it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
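A hedged boto3 sketch of the include-path fix described above; the crawler name, role, catalog database and connection name are placeholders, and the path shows the SQL Server form with the default dbo schema:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="rds-crawler",  # placeholder crawler name
    Role="AWSGlueServiceRoleDefault",  # placeholder Glue service role
    DatabaseName="glue_catalog_db",  # placeholder catalog database
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-rds-connection",  # placeholder Glue connection
                # MySQL/Oracle have no schema in the path: "database_name/%".
                # Engines with schemas (e.g. SQL Server) need it: "database_name/dbo/%".
                "Path": "database_name/dbo/%",
            }
        ]
    },
)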

Athena can't resolve CSV files from AWS DMS

I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC) files. According to my tests, I have the following files:
testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz
After DMS is running properly, I trigger an AWS Glue Crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so that Athena users will be able to build queries in our S3-based data lake.
Unfortunately the crawlers are not building the correct table schema for the tables stored in S3.
For the example above, it creates two tables for Athena:
addresses
20180405_205807186_csv_gz
The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not capable of merging the two pieces of information (taking the initial load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).
I also tried to create the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/.
But it does not yield the desired output.
From the blog post:
When you query data using Amazon Athena (later in this post), you simply point the folder location to Athena, and the query results include existing and new data inserts by combining data from both files.
Am I missing something?
The AWS Glue crawler is not able to reconcile the different schemas in the initial LOAD CSVs and the incremental CDC CSVs for each table. This blog post from AWS and its associated CloudFormation templates demonstrate how to use AWS Glue jobs to process and combine these two types of DMS target outputs.
Athena will combine the files in an S3 prefix if they have the same structure. The blog only speaks to inserts of new data in the CDC files. You'll have to build a process to merge the CDC files; not what you wanted to hear, I'm sure.
From the blog post:
"When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."

Scheduling data extraction from AWS Redshift to S3

I am trying to build a job for extracting data from Redshift and writing the same data to S3 buckets.
So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and store the results directly in S3. I am looking for a solution that can be parameterised and scheduled in AWS.
Consider using AWS Data Pipeline for this.
AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. A pipeline contains the business logic of the work required, for example extracting data from Redshift to S3. You can schedule a pipeline to run as often as you require, e.g. daily.
You define the pipeline yourself, and you can even version-control it. You can prepare a pipeline definition in a browser using the Data Pipeline Architect, or compose it in a JSON file locally on your computer. A pipeline definition is composed of components, such as a Redshift database, an S3 node and a SQL activity, as well as parameters, for example to specify the S3 path to use for the extracted data.
The AWS Data Pipeline service handles scheduling, dependencies between components in your pipeline, monitoring and error handling.
For your specific use case, I would consider the following options:
Option 1
Define a pipeline with the following components: SQLDataNode and S3DataNode. The SQLDataNode references your Redshift database and the SELECT query to use to extract your data. The S3DataNode points to the S3 path to be used to store your data. You add a CopyActivity to copy data from the SQLDataNode to the S3DataNode. When such a pipeline runs, it retrieves data from Redshift using the SQLDataNode and copies that data to the S3DataNode using the CopyActivity. The S3 path in the S3DataNode can be parameterised so it is different every time you run the pipeline.
Option 2
First, define a SQL query with an UNLOAD statement to be used to unload your data to S3. Optionally, you can save it in a file and upload it to S3. Use a SQLActivity component to specify the SQL query to execute against the Redshift database. The SQL query in the SQLActivity can be a reference to the S3 path where you stored your query (optionally), or just the query itself. Whenever the pipeline runs, it connects to Redshift and executes the SQL query, which stores the data in S3.
Constraint of option 2: in an UNLOAD statement, the S3 path is static. If you plan to store every data extract in a separate S3 path, you will have to modify the UNLOAD statement to use a different S3 path every time you run it, which is not an out-of-the-box feature.
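As a standalone illustration of the statement a SQLActivity would execute (not the Data Pipeline wiring itself), here is a hedged Python sketch that builds a date-stamped UNLOAD and runs it with psycopg2; the cluster endpoint, credentials, IAM role, bucket and table are placeholders.

from datetime import date
import psycopg2

# Placeholder bucket; the date component makes each run write to a new path.
s3_path = f"s3://my-extract-bucket/exports/{date.today():%Y/%m/%d}/"

unload_sql = f"""
UNLOAD ('SELECT * FROM public.my_table')
TO '{s3_path}'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
GZIP
ALLOWOVERWRITE;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123xyz0.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="admin",
    password="...",  # placeholder credentials
)
with conn, conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()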
Where do these pipelines run?
On an EC2 instance with Task Runner, a tool provided by AWS to run data pipelines. You can start that instance automatically at the time the pipeline runs, or you can reference an already running instance with Task Runner installed on it. You have to make sure that the EC2 instance is allowed to connect to your Redshift database.
Relevant documentation:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
I think Pawel has answered this correctly; I'm just adding details on option two for anyone who wants to implement this:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Edit each field in this JSON file (after copying it to your favorite editor), update the fields that have "$NEED_TO_UPDATE_THIS_WITH_YOURS" with the correct values for your AWS environment, and save it as data_pipeline_template.json somewhere on your computer
Go back to the AWS Console, click on "Load Local File" for the source field and upload the JSON file
If you are not able to upload it because you are getting errors related to your database instances etc., then follow these steps:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Populate all the fields manually (see below)
Click on "Edit in Architect" at the bottom of the page
Implement the same activities and resources as below; again, make sure you are adding the correct values, such as your database JDBC connection, etc.
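For anyone who prefers to script the same setup instead of clicking through the console, here is a rough boto3 sketch of Option 2: a daily-scheduled SqlActivity that runs an UNLOAD. All names, credentials and values are placeholders, and the field names follow the Data Pipeline object reference linked above, so double-check them against your environment.

import boto3

dp = boto3.client("datapipeline")

# Create an empty pipeline, then push a definition containing a daily schedule,
# the Redshift database, an EC2 resource for Task Runner, and the SqlActivity.
pipeline_id = dp.create_pipeline(name="redshift-unload", uniqueId="redshift-unload-1")["pipelineId"]

pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        {"key": "pipelineLogUri", "stringValue": "s3://my-extract-bucket/logs/"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 Day"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    {"id": "Ec2Instance", "name": "Ec2Instance", "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "terminateAfter", "stringValue": "30 Minutes"},
    ]},
    {"id": "RedshiftCluster", "name": "RedshiftCluster", "fields": [
        {"key": "type", "stringValue": "RedshiftDatabase"},
        {"key": "clusterId", "stringValue": "my-redshift-cluster"},
        {"key": "databaseName", "stringValue": "dev"},
        {"key": "username", "stringValue": "admin"},
        {"key": "*password", "stringValue": "..."},
    ]},
    {"id": "UnloadActivity", "name": "UnloadActivity", "fields": [
        {"key": "type", "stringValue": "SqlActivity"},
        {"key": "database", "refValue": "RedshiftCluster"},
        {"key": "runsOn", "refValue": "Ec2Instance"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "script", "stringValue":
            "UNLOAD ('SELECT * FROM public.my_table') "
            "TO 's3://my-extract-bucket/exports/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole' GZIP ALLOWOVERWRITE;"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
dp.activate_pipeline(pipelineId=pipeline_id)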

How to run AWS Glue jobs on incremental data in S3?

On the first day I put my data in Folder 1 in S3 and ran the job from Glue, and I got the expected output.
On the second day I put my data in Folder 2 under the same parent folder and ran the job from Glue; the Folder 1 data was duplicated in the output, and the output for the data in Folder 2 also appeared.
How can I avoid duplicating the data from Folder 1?
Have you enabled the bookmark in your AWS Glue Job? Enabling the bookmark will cause Glue to keep track of what it has already loaded. If you ever have to reload all your data, there's a "reset bookmark" option on the Jobs menu.
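For completeness, a hedged boto3 sketch of enabling the bookmark for a run and resetting it; the job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Enable the job bookmark for a run: Glue then skips data it has already processed.
glue.start_job_run(
    JobName="my-incremental-job",  # placeholder job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# Equivalent of the console's "reset bookmark": the next run reprocesses everything.
glue.reset_job_bookmark(JobName="my-incremental-job")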