I'm trying to crawl an S3 bucket with multiple folders, each containing some CSV files extracted by a Glue job from Amazon RDS.
At the moment, this is basically the layout in S3:
s3://bucket/folder_table_x/files
s3://bucket/folder_table_y/files
The goal is to crawl these buckets and folders to create a new database and then query it via Amazon Athena.
But I'm getting tables like the following (i.e., named after the file, not the folder):
run-unnamed-1-part-r-00000
Most of the tables are being created correctly, but I can't deal with some of them.
I've already set the table level to 2 (is that right?) and also enabled the option "Create a single schema for each S3 path".
The files that end up as tables like this contain only the header, no data.
Can anyone help?
Related
My question is about the files saved after running a query in AWS Athena. Athena saves two files in the S3 bucket connected to the workgroup, but I only want the csv.metadata file. Is there a way to create only the csv.metadata file instead of both?
Thanks
I exported my SQL DB to S3 in CSV format. Each table was exported into separate CSV files saved in Amazon S3. Now, can I run a query against that S3 bucket that joins multiple tables (multiple CSV files in S3) and returns a result set? How can I do that and save the result in a separate CSV file?
The steps are:
Put all files related to one table into a separate folder (directory path) in the S3 bucket. Do not mix files from multiple tables in the same folder, because Amazon Athena will assume they all belong to one table.
Use a CREATE TABLE statement to define a new table in Amazon Athena, specifying where the files are kept via the LOCATION 's3://bucket_name/[folder]/' parameter. This tells Athena which folder to read the data from.
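That CREATE TABLE step might look something like the following sketch — the table name, columns, and delimiter settings are assumptions for illustration, not from the question:

```sql
-- Hypothetical example: table and column names are placeholders
CREATE EXTERNAL TABLE addresses (
  id INT,
  street STRING,
  city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/addresses/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

The skip.header.line.count property keeps Athena from reading the CSV header row as data.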
Or, instead of using CREATE TABLE, an easier way is:
Go to the AWS Glue management console
Select Create crawler
Select Add a data source and provide the location in S3 where the data is stored
Provide other information as prompted (you'll figure it out)
Then run the crawler: AWS Glue will look at the data files in the specified folder and automatically create a table for that data. The table will appear in the Amazon Athena console.
Once you have created the tables, you can use normal SQL to query and join the tables.
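For instance, a join can be written back to S3 as a new table with a CTAS statement, which also covers the "save in a separate csv file" part of the question. This is a hedged sketch: the table names, columns, and output location are made up for illustration:

```sql
-- Hypothetical example: names, columns, and output location are placeholders
CREATE TABLE joined_result
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://bucket_name/results/joined_result/'
) AS
SELECT c.id, c.name, a.city
FROM customers c
JOIN addresses a ON a.customer_id = c.id;
```

Athena writes the delimited result files to the external_location you specify (note that CTAS text output may be compressed by default).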
Directly related to: Crawler is creating a table with weird suffix to the name
I have an AWS glue crawler, crawling an S3 bucket.
I changed the location of the data to a different S3 bucket, updated the crawler, and completely deleted the old tables, using DROP TABLE and making sure the tables no longer exist in the Glue Data Catalog, but whenever I run the crawler again it creates tables with a hash in the suffix.
Is there a way to prevent this behaviour?
I am trying to use an AWS Glue crawler to create tables in Athena.
The source I am pulling from is a PostgreSQL server. The crawler is able to parse the tables, create metadata, and show the tables and columns in the Glue Data Catalog, but the tables are not added in Athena, despite the fact that I have added the target database from Athena.
Not sure why this is happening.
Also, if I choose a CSV source from S3, it is able to create a table in Athena with _csv as a suffix.
Any help?
Athena doesn't recognize my Postgres tables added by Glue either. My guess is that Athena is meant for querying data stored on S3, so it doesn't work for database sources.
Also, to be able to query your CSV files on S3, the files need to be under a folder crawled by Glue. If you crawl just a single file with Glue, Athena will return 0 records for the query.
I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC) files. From my tests, I have the following files:
testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz
Once DMS is running properly, I trigger an AWS Glue crawler to build the Data Catalog for the S3 bucket containing the MySQL replication files, so that Athena users can build queries against our S3-based data lake.
Unfortunately the crawler is not building the correct table schema for the tables stored in S3.
For the example above, it creates two tables in Athena:
addresses
20180405_205807186_csv_gz
The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not capable of merging the two pieces of information (taking the initial load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).
I also tried to create the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/.
But it does not yield the desired output.
From the blog post:
"When you query data using Amazon Athena (later in this post), you simply point the folder location to Athena, and the query results include existing and new data inserts by combining data from both files."
Am I missing something?
The AWS Glue crawler is not able to reconcile the different schemas of the initial LOAD CSVs and the incremental CDC CSVs for each table. This blog post from AWS and its associated CloudFormation templates demonstrate how to use AWS Glue jobs to process and combine these two types of DMS target outputs.
Athena will combine the files in an S3 folder if they have the same structure. The blog only covers inserts of new data in the CDC files. You'll have to build a process to merge the CDC files yourself. Not what you wanted to hear, I'm sure.
From the blog post:
"When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."
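One way such a merge could be sketched is as an Athena view over the two tables the crawler produced. Everything here is an assumption for illustration — the table names, the key column, the op values, and the ordering column all depend on your DMS settings:

```sql
-- Hypothetical sketch: table names, key column, and ordering column are placeholders.
-- Assumes one table for the full load and one for the CDC files, and that each
-- CDC row carries an op column (I/U/D) plus a change timestamp.
CREATE OR REPLACE VIEW addresses_current AS
SELECT id, street, city
FROM (
  SELECT id, street, city, op,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_ts DESC) AS rn
  FROM (
    SELECT id, street, city, 'I' AS op, TIMESTAMP '1970-01-01 00:00:00' AS change_ts
    FROM addresses_full_load
    UNION ALL
    SELECT id, street, city, op, change_ts
    FROM addresses_cdc
  ) unioned
) ranked
WHERE rn = 1 AND op <> 'D';
```

The view keeps only the most recent version of each row and drops rows whose latest operation is a delete. A Glue job that compacts the CDC files into the base data, as the linked blog post demonstrates, is the more robust approach.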