AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3 - amazon-web-services

Part One :
I tried glue crawler to run on dummy csv loaded in s3 it created a table but when I try view table in athena and query it it shows Zero Records returned.
But the demo data of ELB in Athena works fine.
Part Two (Scenario:)
Suppose I Have a excel file and data dictionary of how and what format data is stored in that file , I want that data to be dumped in AWS Redshift What would be best way to achieve this ?

I experienced the same issue. You need to give the folder path instead of the real file name to the crawler and run it. I tried with feeding folder name to the crawler and it worked. Hope this helps. Let me know. Thanks,

I experienced the same issue. try creating separate folder for single table in s3 buckets than rerun the glue crawler.you will get a new table in glue data catalog which has the same name as s3 bucket folder name .

Delete Crawler ones again create Crawler(only one csv file should be not more available in s3 and run the crawler)
important note
one CSV file run it we can view the records in Athena.

I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.

The structure of the s3 bucket / folder is very important :
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>

Solution: Select path of folder even if within folder you have many files. This will generate one table and data will be displayed.

So in many such cases using EXCLUDE PATTERN in Glue Crawler helps me.
This is sure that instead of directly pointing the crawler to the file, we should point it to the directory and even by doing so when we do not get any records, Exclude Pattern comes to rescue.
You will have to devise some pattern by which only the file which u want gets crawled and rest are excluded. (suggesting to do this instead of creating different directories for each file and most of the times in production bucket, doing such changes is not feasible )
I was having data in S3 bucket ! There were multiple directories and inside each directory there were snappy parquet file and json file. The json file was causing the issue.
So i ran the crawler on the master directory that was containing many directories and in the EXCLUDE PATTERN i gave - * / *.json
And this time, it did no create any table for the json file and i was able to see the records of the table using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html

Pointing glue crawler to the S3 folder and not the acutal file did the trick.

Here's what worked for me: I needed to move all of my CSVs into their own folders, just pointing Glue Crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.

Related

Athena query error HIVE_BAD_DATA: Not valid Parquet file . csv / .metadata

I'm creating an app that works with AWS Athena on compressed Parquet (SNAPPY) data.
It works almost fine, however, after every query execution, 2 files get uploaded to the S3_OUTPUT_BUCKET of type csv and metadata. (as it should)
These 2 files break the execution of the next query.
I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://MY_OUTPUT_BUCKET/logs/QUERY_NAME/2022/08/07/tables/894a1d10-0c1d-4de1-9e61-13b2b0f79e40.metadata expected magic number: PAR1 got: HP
I need to manually delete those files for the next query to work.
Any suggestions on how to make this work?
(I know I cannot exclude those files with a regex etc.. but I don't want to delete the files manually for the app to work)
I read everything about the output files but it didn't help. ( Working with query results, recent queries, and output files )
Any help is appreciated.
While setting up Athena for execution, we need to specify where the metadata and csv from the query execution are written into. This needs to be written into a different folder than the table location.
Go to Athena Query Editor > Settings > Manage
and edit Query Result Location to be another S3 bucket than the table or a different folder within the same bucket.

Excluded folder in glue crawler throws HIVE_BAD_DATA error in Athena

I'm trying to create a glue crawler to crawl a specific path pattern. I have the following paths:
bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet
The same pattern is repeated every day, i.e. we have the above for
bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*
I only want to crawl what's in the **/predictions folders each day. I've set up a glue crawler pointing to bucket/inference/, and have the following exclude patterns:
**/modelling/**
**/extract/**
The logs correctly show that the bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet files are being excluded, and the DDL metadata shows that it's picking up the correct number of objects and rows in the data.
However, when I go to SELECT * in Athena, I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
I've tried every combo of the above exclude patterns, but it always seems to be picking up what's in the modelling folder, despite the logs explicitly excluding it. Am I missing something here?
Many thanks.
This is a known issue with Athena. From AWS troubleshooting documentation:
Athena does not recognize exclude patterns that you specify an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.
Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)

Create Athena table using s3 source data

Below is given the s3 path where I have stored the files obtained at the end of a process. The below-provided path is dynamic, that is, the value of the following fields will vary - partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under output/ as well as under intermediate_results/ directories, for each val1-val2.
Each file is a CSV.
But I am not much familiar with AWS Athena so I'm unable to figure out the way to implement this. I would really appreciate any kind of help. Thanks!
Use CREATE TABLE - Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a Location of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage format is not compatible with your desired use for Amazon Athena. You would need to put the data into separate paths for each table.

how to create multiple table from multiple folder with one location path and athena should also work on it with glue crawler

I have tried this not achieving required results-
I have multiple CSV files in a folder of s3 bucket but when it creates multiple table for it then Athena returns zero results so I made a different folder for each file then it works fine.
problem-
but if in future more folders will be added then I have to go to crawler and have to add a new location path for each newly added folder so is there any way to do it automatically or some other way to do it. I am using glue crawler and s3 bucket athena for query run on multiple CSV files.
In general a table needs all of its files to be in a directory, and no other files to be in that directory.
There is however, a mechanism that makes it possible to create tables that include just specific files. You can read more about that in the the second part of this answer: Partition Athena query by S3 created date (scroll down a bit after the horizontal rule). You can also find an example in the S3 Inventory documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html

how do I automatically save the s3 bucket names in glue pyspark script?

My problem now is I have tons of JSON files in a S3 bucket (which contains several sub buckets).
I want to explode it and save to a new flat file with one of the column telling me which sub bucket the records were originally from.
How do I do it in SQL to automatically get that information? THanks!!
I am using glue pyspark , btw.
from comments -> You can use input_file_name()column