LOAD DATA FROM S3 PREFIX using bucket root only loads some files

I have 2M+ records across ~600 CSV files, all at the root level of a single bucket - not in any subfolders. The file names all start with a unique ID number of 3-6 digits. If I run the following command:
LOAD DATA FROM S3 PREFIX 's3://my-bucket/'
IGNORE INTO TABLE `my_table`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES;
Only about 500k records are loaded into the table. But if I run a sequence of commands with prefixes 1 through 9, I eventually end up with the expected row count loaded into the table.
LOAD DATA FROM S3 PREFIX 's3://my-bucket/1'
...
LOAD DATA FROM S3 PREFIX 's3://my-bucket/2'
...
LOAD DATA FROM S3 PREFIX 's3://my-bucket/3'
...
...
LOAD DATA FROM S3 PREFIX 's3://my-bucket/9'
According to the docs, it does not appear that you can use a wildcard (*) in the prefix string. I'm at a loss as to why this isn't behaving as expected.

Update: figured out the issue. The files were being overwritten/replaced as part of an update process. If a file/object was in the middle of being written to, the LOAD from S3 would stop on that file. The solution was to prefix the updated files with a timestamp instead of writing on top of the same file names over and over.
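For reference, the fixed flow is just the same statement pointed at a completed, timestamp-prefixed batch; the timestamp format below is only an illustration:
LOAD DATA FROM S3 PREFIX 's3://my-bucket/20181001T0800-'
IGNORE INTO TABLE `my_table`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES;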

Related

Athena query error HIVE_BAD_DATA: Not valid Parquet file .csv / .metadata

I'm creating an app that works with AWS Athena on compressed Parquet (SNAPPY) data.
It works almost fine; however, after every query execution, 2 files of type .csv and .metadata get written to the S3_OUTPUT_BUCKET (as they should).
These 2 files break the execution of the next query.
I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://MY_OUTPUT_BUCKET/logs/QUERY_NAME/2022/08/07/tables/894a1d10-0c1d-4de1-9e61-13b2b0f79e40.metadata expected magic number: PAR1 got: HP
I need to manually delete those files for the next query to work.
Any suggestions on how to make this work?
(I know I cannot exclude those files with a regex, etc., but I don't want to have to delete the files manually for the app to work.)
I read everything about the output files, but it didn't help (see "Working with query results, recent queries, and output files").
Any help is appreciated.
When setting up Athena, you need to specify where the .metadata and .csv files from each query execution are written. This query result location must be different from the table's data location.
Go to Athena Query Editor > Settings > Manage
and edit Query result location to point to another S3 bucket than the table, or to a different folder (prefix) within the same bucket.

Excluded folder in glue crawler throws HIVE_BAD_DATA error in Athena

I'm trying to create a glue crawler to crawl a specific path pattern. I have the following paths:
bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet
The same pattern is repeated every day, i.e. we have the above for
bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*
I only want to crawl what's in the **/predictions folders each day. I've set up a glue crawler pointing to bucket/inference/, and have the following exclude patterns:
**/modelling/**
**/extract/**
The logs correctly show that the bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet files are being excluded, and the DDL metadata shows that it's picking up the correct number of objects and rows in the data.
However, when I go to SELECT * in Athena, I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
I've tried every combo of the above exclude patterns, but it always seems to be picking up what's in the modelling folder, despite the logs explicitly excluding it. Am I missing something here?
Many thanks.
This is a known issue with Athena. From AWS troubleshooting documentation:
Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.
Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
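If physically relocating the excluded folders is not practical, one alternative (not from the quoted answer, and only a sketch) is to skip the crawler for this table and declare it by hand with partition projection, so that Athena only ever lists the daily prediction/ prefixes. The column names below are placeholders for the actual schema of predictions.parquet, and the projection properties assume the yyyy/MM/dd folder layout shown above:
CREATE EXTERNAL TABLE predictions (
  -- placeholder columns; replace with the real schema of predictions.parquet
  id string,
  prediction double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://bucket/inference/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2022/04/01,NOW',
  'projection.dt.format' = 'yyyy/MM/dd',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://bucket/inference/${dt}/prediction/'
);
With the location template, Athena resolves each dt partition directly to its prediction/ folder and never touches modelling/ or extract/.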

create tables from S3 bucket file

In my S3 bucket I have several files with different schemas.
s3://folder/file1.csv
s3://folder/file2.csv
s3://folder/file3.csv
s3://folder/file4.csv
All files contain fields I need, but number of columns differs.
I tried this for one of the files, but the created table remains empty:
CREATE EXTERNAL TABLE test1 (
app_id string,
app_version string
)
row format delimited fields terminated by ','
LOCATION 's3://folder/file4.csv';
MSCK REPAIR TABLE test1;
Can I create 3 tables from these files? Or can I put the fields I need from all the files into one table?
You cannot define a file as a LOCATION for Amazon Athena. It will result in this error message:
Can't make directory for path 's3://my-bucket/foo.csv' since it is a file
You should put each file in a separate folder and then set the LOCATION to the folder. All files in that folder (even if it is just one file) will be scanned for each query.
Also, there is no need to call MSCK REPAIR TABLE unless the table contains partitions.
By the way, this line:
LOCATION 's3://folder/file4.csv'
should also specify the bucket name:
LOCATION 's3://my-bucket/folder/file4.csv'
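Putting the answer together, the corrected DDL would look something like this (my-bucket is a placeholder, and file4.csv is assumed to have been moved into its own folder):
CREATE EXTERNAL TABLE test1 (
  app_id string,
  app_version string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/folder/file4/';
No MSCK REPAIR TABLE is needed afterwards, since the table is not partitioned.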

S3 avoid loading of duplicate files

I have the following workflow. I need to identify duplicate files on S3 in order to avoid duplicates in my destination (Redshift).
1. Load files to S3 every 4 hours from an FTP server (file storage structure: year/month/date/hour/minute/filename).
2. Load S3 to Redshift once all of the files are pulled (for that interval).
This is a continuous job that runs every 4 hours.
Problem:
Sometimes files with the same content but different file names are present on S3. These files can belong to different intervals or different days. For example, if a file, say one.csv, arrives on 1st Oct 2018 and contains 1,2,3,4, then it is possible that on 10th Oct 2018 a file arrives with the same content 1,2,3,4 but with a different file name.
I want to avoid loading this file to S3 if the contents are the same.
I know that I can use a file hash to identify two identical files, but my problem is how to achieve this on S3, and with so many files.
What would be the best approach?
Basically, I want to avoid loading data to S3 that is already present.
You can add another table in Redshift (or anywhere else really, such as MySQL or DynamoDB) which will contain the ETag/MD5 hash of the files already uploaded.
You probably already have a script which runs every 4 hours and loads data into Redshift. In that same script, after data is loaded successfully into Redshift, make an entry in this table. Also, add a check against this new table in the same script before loading data into Redshift.
You need to make sure that you seed this new table with the ETags of the files you have already loaded into Redshift.
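A rough sketch of that bookkeeping in SQL (the table and column names are made up for illustration, and the ETag values would come from the S3 listing in the load script):
-- one-time: tracking table for files that have already been loaded
CREATE TABLE loaded_files (
    etag      VARCHAR(64),
    file_name VARCHAR(1024),
    loaded_at TIMESTAMP
);
-- before loading a candidate file, check whether its ETag has been seen before
SELECT COUNT(*) FROM loaded_files WHERE etag = '<etag-of-candidate-file>';
-- after a successful COPY into Redshift, record the file
INSERT INTO loaded_files VALUES ('<etag-of-candidate-file>', 's3://bucket/2018/10/01/08/00/one.csv', GETDATE());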

Regex for s3 directory path

I want to use Regex to find an S3 directory path in AWS Data Pipeline.
This is for an S3 Data Node. And then I will do a Redshift Copy from S3 to a Redshift table.
Example S3 path: S3://foldername/hh=10
Can we use a regex to find hh=##, where ## could be any number from 0-24?
The goal is to copy all the files in folders where the name is hh=1, hh=2, hh=3, etc. (hh is hour)
Here's a bit of regex that will capture the last 1 or 2 digits after 'hh=', at the end of the line.
/hh=(\d{1,2})$/