AWS Athena - duplicate columns due to partitioning - amazon-web-services

We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly.
The thing is that we have a column named 'foo' that comes from the Avro schema, and we also have something like 'foo=XXXX' in the S3 bucket path, to get Hive partitions.
What we did not know is that the crawler then creates a table with two columns of the same name, hence our issue when querying the table:
HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns
Is there a way to tell Glue to map the partition 'foo' to another column name like 'bar'?
That way we would avoid having to reprocess our data by specifying a new partition name in the S3 bucket path.
Or any other suggestions?

Glue Crawlers are pretty terrible; this is just one of the many ways they create unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, and then write a script that lists your files on S3 and makes the Glue API calls (BatchCreatePartition), or executes ALTER TABLE … ADD PARTITION … statements in Athena; see the sketch below.
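A minimal sketch of such a script, using boto3; the database, table, and bucket names are hypothetical placeholders, and the storage descriptor assumes Avro data as in your case:

import boto3

glue = boto3.client("glue")

# Hypothetical names: adjust the database, table, and bucket to your setup.
DATABASE = "mydb"
TABLE = "mytable"
BUCKET = "my-bucket"

def add_partition(foo_value: str) -> None:
    # Register a single Hive-style partition (foo=<value>) in the Glue catalog.
    glue.batch_create_partition(
        DatabaseName=DATABASE,
        TableName=TABLE,
        PartitionInputList=[{
            "Values": [foo_value],
            "StorageDescriptor": {
                "Location": f"s3://{BUCKET}/data/foo={foo_value}/",
                "InputFormat": "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.avro.AvroSerDe"
                },
            },
        }],
    )

add_partition("XXXX")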
Whenever new data is added on S3, just add the new partitions with the API call or an Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notifications to run Lambda functions that make the Glue API calls instead. Almost all solutions are better than Glue Crawlers.
The beauty of Athena and the Glue catalog is that it's all just metadata, so it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3; you just need a different table and a different mechanism to add partitions to it.

You can fix this by updating the schema of the Glue table and renaming the duplicate column:
Open the AWS Glue console.
Choose the table name from the list, and then choose Edit schema.
Choose the column name foo (not the partition column foo), enter a new name, and then choose Save.
Reference:
Resolve HIVE_INVALID_METADATA error
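If you'd rather script the rename than use the console, here is a minimal boto3 sketch; the database and table names are hypothetical:

import boto3

glue = boto3.client("glue")

DATABASE = "mydb"   # hypothetical
TABLE = "mytable"   # hypothetical

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Rename the data column 'foo'; the partition key of the same name stays as-is.
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] == "foo":
        col["Name"] = "bar"

# update_table expects a TableInput, so keep only the fields it accepts.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
glue.update_table(
    DatabaseName=DATABASE,
    TableInput={k: v for k, v in table.items() if k in allowed},
)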

Related

Possible to copy data from one S3 bucket to another via Hive?

How can I query one bucket via Hive and copy the results to another bucket in S3?
I have a DDL set up to run Avro queries, but I want to transfer the subset of results from my filter to a new bucket/location in S3.
You can just use a CREATE TABLE AS SELECT (CTAS) statement; in Presto (which Athena is built on), a CTAS can write its results from one catalog or location to another.
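For example, a sketch of an Athena CTAS run through boto3; all database, table, and bucket names here are placeholders:

import boto3

athena = boto3.client("athena")

# CTAS writes the filtered result set to a new S3 location in one statement.
query = """
CREATE TABLE mydb.filtered_copy
WITH (
    format = 'AVRO',
    external_location = 's3://target-bucket/filtered/'
) AS
SELECT *
FROM mydb.source_table
WHERE some_column = 'some_value'
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},
)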

AWS Glue Crawler query

I have a few AWS Glue crawlers set up to crawl CSVs in S3 to populate my tables in Athena.
My scenario and question:
I replace the .csv files in S3 daily with updated versions. Do I have to run the existing crawlers again, perhaps on a schedule, to update the tables in Athena with the latest content? Or is the crawler only required to run if the schema changes, such as additional columns being added? I just want to ensure that my tables in Athena always output all of the data as per the updated CSVs. I rarely make any schema changes to the table structures. If the crawlers are only required to run when actual structure changes take place, then I would prefer to run them a lot less frequently.
When a Glue crawler runs, the following actions take place:
It classifies data to determine the format, schema, and associated properties of the raw data
Groups data into tables or partitions
Writes metadata to the Data Catalog
Athena references the schema of tables created in the Data Catalog, but it reads the files in the specified S3 data source directly at query time, so replacing the CSVs with updated contents is picked up without a re-crawl. If the schema remains constant, you can reduce how often the crawler runs.
You can also refer to the documentation here to understand how Glue crawlers and CSV files work with Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
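If you do want to automate the occasional runs, here is a small boto3 sketch; the crawler name is hypothetical:

import boto3

glue = boto3.client("glue")

# Run the crawler on demand, e.g. only after a known schema change.
glue.start_crawler(Name="my-csv-crawler")

# Or keep a schedule, but make it infrequent (here: Mondays at 00:00 UTC).
glue.update_crawler(Name="my-csv-crawler", Schedule="cron(0 0 ? * MON *)")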

While running AWS Athena query, query says Zero Records Returned

SELECT * FROM "sampledb"."parquetcheck" limit 10;
I am trying to use a Parquet file in S3; I created a table for it in AWS Athena, and the table was created perfectly.
However, when I run the select query, it says "Zero Records Returned",
although my Parquet file in S3 has data.
I have created the partition too, and IAM has full access to Athena.
If your specified column names are correct, then you may need to load the partitions using: MSCK REPAIR TABLE EnterYourTableName;. This will add the new partitions to the Glue catalog.
If any of the above fails, you can create a temporary Glue crawler to crawl your table, and then validate the metadata in Athena by clicking the three dots next to the table name and selecting Generate Create Table DDL. You can then compare any differences in the DDL.
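If you want to script the partition load rather than paste the statement into the console, a minimal boto3 sketch; the query-results bucket is a placeholder:

import boto3

athena = boto3.client("athena")

# Runs MSCK REPAIR TABLE so Athena picks up the Hive-style partitions on S3.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE parquetcheck",
    QueryExecutionContext={"Database": "sampledb"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},
)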

Pipelining Athena query after ETL from Glue crawler

I have data coming into an S3 bucket and I would like to run a query on it every hour. The data comes in as JSON. I crawl it, run a job on the data to transform it to ORC format, and crawl it again to create a table that's faster to query than the original JSON (as it is deeply nested). I'm trying to query the data with Athena. I have managed to link the previous steps together using Lambda and CloudWatch Events.
The problem here is that the last crawler is supposed to create new tables instead of just new partitions of the same table, so the table name is not known before the list of jobs runs. I found that you can listen for the creation of a new table and for the completion of a crawler, but the log for the end of a crawler's run doesn't contain the name of the newly created table (per Amazon's documentation). Is there a way to get this table name dynamically and query it using Lambda or Athena? Thanks
Why not invoke the Lambda from the Glue job after the crawler completes? The table name is the folder in the S3 bucket in which you stored the ORC data. Since this is done in a Glue job, I believe you already have the folder name, which you can pass to the Lambda from the Glue job.
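A rough sketch of that hand-off; the function, database, folder, and bucket names are all hypothetical:

import json
import boto3

# At the end of the Glue job: the S3 output folder doubles as the table name.
lambda_client = boto3.client("lambda")
output_folder = "orc_output"  # whatever folder the job wrote the ORC data to
lambda_client.invoke(
    FunctionName="run-athena-query",   # hypothetical Lambda
    InvocationType="Event",            # asynchronous, fire-and-forget
    Payload=json.dumps({"table_name": output_folder}),
)

# In the Lambda: read the table name from the event and query it.
def handler(event, context):
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=f'SELECT * FROM "mydb"."{event["table_name"]}" LIMIT 10',
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},
    )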

Query CSV tables stored in S3 through Athena

Recently we started to store our backups in AWS S3. They are all CSV files that we need to query through AWS Athena.
We tried to insert the tables one by one, but it's taking too long; it is a fair amount of data. Is there any API we can use, or something that is already set up?
We were about to do something with Spark, but maybe there is a simpler way, or something that has already been done.
Thanks
You can simply create an external table on top of the CSV files with the required properties, as sketched below.
Reference: Create External Table on AWS Athena
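As a sketch, the DDL for a simple CSV-backed table might look like this, submitted through boto3; the database, table, columns, and bucket paths are placeholders for your backup layout:

import boto3

athena = boto3.client("athena")

# Assumes a 'backups' database already exists in the Glue catalog.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS backups.mytable (
    id         INT,
    name       STRING,
    created_at STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://my-backup-bucket/mytable/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "backups"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},
)

Looping this over your backup prefixes would register all the tables in one go, instead of inserting them one by one.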
You can also use Glue Crawler and configure it to automatically populate the tables for you.
Reference: Cataloging tables with a crawler
There are different AWS SDKs available (here) to automate your tasks, like uploading files to S3, creating Athena tables, or cataloging tables through a Glue crawler.