AWS Glue Crawler creating temp_tables in Athena - amazon-web-services

I have a setup with a few crawlers crawling a few buckets and generating tables in Glue that can then be queried from the Athena query engine.
I have noticed recently that a lot of tables have been popping up, none of them located in the buckets I am crawling, and all looking like system-generated tables:
temp_table_*
More than 10 appear to be created every day, and when I look at their details they are generated from the Athena query results location s3://aws-athena-query-results-*
Is there a reason these are created? Do I have to clean them up manually? Is there a way to stop them being generated, or to ignore them?
They are cluttering the view of the tables that matter in Athena.
Thanks in advance for any assistance.

Related

How to deal with failing Athena queries as AWS Glue data catalog metadata size grows large?

Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is to use Athena and query the information_schema database. The article below came up frequently in my research and is written by Amazon's team:
Querying AWS Glue Data Catalog
However, under the section titled Considerations and limitations the following is written:
Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
Unfortunately, the article gives no indication of what constitutes a "large amount of metadata" or exactly which errors can occur when the metadata is large and one needs to query it.
My question is, how to deal with the issue related to the ever growing size of data catalog's metadata so that one would never encounter errors when using Athena to query the metadata?
Is there a best practice for this? Or perhaps a better solution for getting the same metadata that querying the catalog through Athena provides, without a great many API calls (using boto3, Hive DDL, etc.)?
I talked to AWS Support and did some research on this. Here's what I gathered:
The information_schema is built at query execution time, there doesn't seem to be any caching.
If you access information_schema.tables, it will make separate calls for each schema you have to the Hive Metastore (Glue Data Catalog).
If you access information_schema.columns, it will make separate calls for each schema and each table in that schema you have to the Hive Metastore.
These queries are affected by the general service quotas. In this case, DML queries like your select must finish within 30 minutes.
If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this may result in slow performance. As a rough guesstimate, support told me you should be fine as long as you have fewer than ~10,000 tables, which should be the case for most people.
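The per-schema / per-table call pattern above can be turned into a back-of-the-envelope estimate. This is my own sketch based on what support described, not an official formula:

```python
# Rough sketch (my own estimate, not from AWS docs): how many Hive
# Metastore (Glue Data Catalog) calls an information_schema query
# triggers, given the per-schema / per-table pattern described above.

def calls_for_tables(num_schemas: int) -> int:
    # information_schema.tables: one call per schema
    return num_schemas

def calls_for_columns(tables_per_schema: list[int]) -> int:
    # information_schema.columns: one call per schema, plus one per
    # table in that schema
    return len(tables_per_schema) + sum(tables_per_schema)

# Example: 50 schemas with 200 tables each -> 50 + 10000 = 10050 calls,
# right around the ~10,000-table ballpark quoted by support.
print(calls_for_columns([200] * 50))
```

Under this model the cost of information_schema.columns grows linearly with your total table count, which is why the table count, not the schema count, is the number to watch.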

Creating a new file in aws s3 (parquet/csv) via queries in aws Athena

Goal/Help Requested:
Given a bunch of tables in Athena, write an SQL query that writes a new file to S3 on a schedule, to be used for other purposes.
Limited AWS background; I am currently teaching myself, but could use some help if possible.
Background
I have a very large number of Athena tables that draw upon a very large number of s3 folders with extensive nesting.
For example, a 'sales' Athena table that draws on a 'sales' S3 bucket nested countries > regions > states > cities > stores > week > day. I have other similar types of information.
I want to write various SQL queries that take this information, apply some logic, and write single files back into S3 to be used for various other purposes I have.
What I'm looking for
Any links or direction on how to do this. Again, I'm pretty new to AWS; maybe this is a simple ask, I just can't articulate it properly.
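No answer is included above, but one common mechanism for writing query results back to S3 is Athena's CREATE TABLE AS SELECT (CTAS), which materializes a query's output as Parquet or CSV files at an S3 location you choose. A minimal sketch that assembles such a statement (the table, bucket, and query names are placeholders I made up):

```python
# Hypothetical sketch: building an Athena CTAS statement that writes
# query results to S3 as Parquet. All names/paths are placeholders.

def build_ctas(table: str, select_sql: str, s3_location: str,
               fmt: str = "PARQUET") -> str:
    # CTAS writes the SELECT's output to external_location in the
    # given format and registers the result as a new table.
    return (
        f"CREATE TABLE {table}\n"
        f"WITH (format = '{fmt}',\n"
        f"      external_location = '{s3_location}')\n"
        f"AS {select_sql}"
    )

query = build_ctas(
    "sales_summary",
    "SELECT country, SUM(amount) AS total FROM sales GROUP BY country",
    "s3://my-output-bucket/sales_summary/",
)
print(query)
```

For the scheduling part, a typical setup is an EventBridge rule triggering a Lambda that submits the query via Athena's StartQueryExecution API, though the details depend on your environment.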

Row level changes captured via AWS DMS

I am trying to migrate a database using AWS DMS. The source is Azure SQL Server and the destination is Redshift. Is there any way to know which rows were updated or inserted? We don't have any audit columns in the source database.
Redshift doesn’t track changes, and you would need audit columns to do this at the user level. You may be able to deduce changes from Redshift query history and saved data input files, but this will be solution-dependent. Query history can be obtained in a couple of ways, but both require some action. The first is to review the query logs, but these are only retained for a few days; if you need to look back further, you need a process that saves these tables so the information isn’t lost. The other is to turn on Redshift audit logging to S3, but this must be enabled before you run queries on Redshift. There may be some logging from DMS that could help, but I think the bottom-line answer is that row-level change tracking is not something that is on in Redshift by default.
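For the second option above, audit logging is switched on per cluster via the Redshift EnableLogging API. A sketch of the parameters involved, with the cluster and bucket names as placeholders (the actual call is shown commented out, since it needs AWS credentials):

```python
# Sketch (placeholder names): turning on Redshift audit logging to S3
# ahead of time, so query history survives beyond the few days kept in
# the cluster's system tables. Here we only assemble the parameters for
# boto3's Redshift enable_logging call.

params = {
    "ClusterIdentifier": "my-cluster",         # placeholder cluster name
    "BucketName": "my-redshift-audit-logs",    # placeholder S3 bucket
    "S3KeyPrefix": "audit/",                   # optional key prefix
}

# import boto3
# boto3.client("redshift").enable_logging(**params)

print(sorted(params))
```

The S3 bucket must already exist and grant Redshift permission to write to it; logs only start accumulating from the moment logging is enabled, which is why this has to be set up before the queries you want to audit.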

Athena not collecting results for a portion of files in Amazon S3

I've tried to find similar issues on here/online, but came up short.
I have Athena pointing to a folder in Amazon S3 which itself contains folders/partitions each with a single .tsv inside (e.g. s3://my_bucket/partition/file.tsv). Athena is able to collect results for the majority of the files in the bucket, but doesn't collect results for a small number of them.
I've run the repair code (MSCK REPAIR TABLE) and checked Glue to make sure it is seeing the partitions (it is). I also checked the Amazon knowledge center (https://aws.amazon.com/premiumsupport/knowledge-center/athena-empty-results/). Not sure what else might be causing the issue.
It turned out that the columns of the tables (pulled from an API) were in a different order in the files that were not working. Running the queries on a different field returned results. The solution was to enforce a consistent column order after collecting data from the API.
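The failure mode here is that Athena maps delimited-file columns by position, not by header name, so a file whose columns arrive in a different order silently puts values in the wrong table columns. A small illustration (the schema and rows are made up):

```python
# Illustration of the failure mode above: TSV columns are mapped to
# table columns by position, so a reordered file silently produces
# wrong values rather than an obvious error.

schema = ["id", "name", "amount"]  # column order the table expects

def parse_row(line: str) -> dict:
    # Position-based mapping, as Athena does for TSV/CSV files
    return dict(zip(schema, line.rstrip("\n").split("\t")))

good = parse_row("1\talice\t9.99")   # file matches the schema order
bad = parse_row("alice\t1\t9.99")    # API returned the columns reordered

print(good["id"])  # '1'     -- correct
print(bad["id"])   # 'alice' -- wrong value lands in the wrong column
```

This also explains why querying a different field "worked": the misplaced values are still queryable, they just answer the wrong question.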

AWS Glue crawler need to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claim to solve many problems, but in fact solve few. If you're slightly outside the scope of what they were designed for, you're out of luck. There might be a way to configure one to do what you want, but in my experience, trying to make Glue crawlers do things that aren't perfectly aligned with their design is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case, Glue crawlers also provide very little value. You probably have a better idea of what the schema should look like than Glue will ever be able to figure out.
I suggest that you manually create the table, and write a one-off script that lists all the partition locations on S3 that you want to include in the table and generates ALTER TABLE ADD PARTITION … SQL, or Glue API calls, to add those partitions to the table.
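Such a one-off script can be quite short. A sketch, assuming a single `dt` partition column and with the table and bucket names as placeholders (a real script would first list the prefixes from S3 rather than hard-coding them):

```python
# One-off sketch of the script described above (table/bucket names are
# placeholders): given known partition locations on S3, emit
# ALTER TABLE ... ADD PARTITION statements to run in Athena.

def add_partition_sql(table: str,
                      partitions: list[tuple[str, str]]) -> list[str]:
    # partitions: (partition value, S3 location) pairs, keyed here on a
    # single 'dt' partition column for simplicity
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}') LOCATION '{loc}'"
        for dt, loc in partitions
    ]

stmts = add_partition_sql("sales", [
    ("2020-02-03", "s3://my-bucket/sales/dt=2020-02-03/"),
    ("2020-02-04", "s3://my-bucket/sales/dt=2020-02-04/"),
])
for s in stmts:
    print(s)
```

IF NOT EXISTS makes the script safe to re-run; the same pairs could equally be fed to Glue's BatchCreatePartition API instead of Athena DDL.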
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
One way to do what you want is to use one of the tables created by the crawler as an example and create a similar table manually (in AWS Glue -> Tables -> Add tables, or in Athena itself, with:
CREATE EXTERNAL TABLE `tablename`(
  `column1` string,
  `column2` string, ...
)
-- plus the ROW FORMAT / STORED AS / LOCATION / TBLPROPERTIES clauses
-- taken from an existing table's generated DDL
To use an existing table as the example, you can see the query used to create it: in Athena, go to Database, select your database from the Glue Data Catalog, click the three dots next to the crawler-created table you chose as the example, and click the "Generate Create Table DDL" option. It generates the full query for you; modify it as necessary (I believe you mostly need to look at the LOCATION and TBLPROPERTIES parts).
When you run this modified query in Athena, a new table will appear in the Glue Data Catalog. But it will not have any information about your S3 files and partitions, and the crawler most likely will not update the metastore for you. So in Athena you can run the "MSCK REPAIR TABLE tablename;" query (it's not very efficient, but it works for me), and it will add the missing file information. In the Results tab you will see something like (if you use partitions on S3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.