create tables from S3 bucket file - amazon-web-services

In my S3 bucket I have several files with different schemas.
s3://folder/file1.csv
s3://folder/file2.csv
s3://folder/file3.csv
s3://folder/file4.csv
All files contain fields I need, but the number of columns differs.
I tried this for one of the files, but the created table remains empty:
CREATE EXTERNAL TABLE test1 (
app_id string,
app_version string
)
row format delimited fields terminated by ','
LOCATION 's3://folder/file4.csv';
MSCK REPAIR TABLE test1;
Can I create 3 tables from these files? Or can I put the fields I need from all the files into one table?

You cannot use a file as the LOCATION for an Amazon Athena table. It will result in this error message:
Can't make directory for path 's3://my-bucket/foo.csv' since it is a file
You should put each file in a separate folder and then set the LOCATION to the folder. All files in that folder (even if it is just one file) will be scanned for each query.
Also, there is no need to call MSCK REPAIR TABLE unless the table is partitioned.
By the way, this line:
LOCATION 's3://folder/file4.csv'
should also specify the bucket name:
LOCATION 's3://my-bucket/folder/file4.csv'
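As a minimal sketch of the fix (the bucket and folder names here are illustrative), move file4.csv into its own folder and point LOCATION at that folder:
CREATE EXTERNAL TABLE test1 (
app_id string,
app_version string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/folder/file4/';  -- a folder that contains only file4.csv
Repeating this once per file, each in its own folder, gives you one table per file.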

Related

Athena query error HIVE_BAD_DATA: Not valid Parquet file (.csv / .metadata)

I'm creating an app that works with AWS Athena on compressed Parquet (SNAPPY) data.
It works almost fine; however, after every query execution, two files get uploaded to the S3_OUTPUT_BUCKET: a .csv and a .metadata file (as they should be).
These 2 files break the execution of the next query.
I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://MY_OUTPUT_BUCKET/logs/QUERY_NAME/2022/08/07/tables/894a1d10-0c1d-4de1-9e61-13b2b0f79e40.metadata expected magic number: PAR1 got: HP
I need to manually delete those files for the next query to work.
Any suggestions on how to make this work?
(I know I cannot exclude those files with a regex etc., but I don't want to have to delete the files manually for the app to work.)
I read everything about the output files, but it didn't help (see Working with query results, recent queries, and output files).
Any help is appreciated.
When setting up Athena, you need to specify where the .metadata and .csv files from each query execution are written. This must be a different folder than the table location.
Go to Athena Query Editor > Settings > Manage
and edit Query Result Location to be a different S3 bucket than the table's, or a different folder within the same bucket.
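For example, with illustrative names, a safe layout keeps the two apart:
table data:     s3://my-bucket/tables/mytable/
query results:  s3://my-bucket/athena-results/
If the Query Result Location falls under the table's LOCATION, the .csv and .metadata result files get read as table data on the next query, which is exactly what produces the HIVE_BAD_DATA error above.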

How to create multiple tables from multiple folders with one location path, so Athena also works on them with a Glue crawler

I have tried this without achieving the required results.
I have multiple CSV files in one folder of an S3 bucket, but when the crawler creates multiple tables from it, Athena returns zero results; when I made a separate folder for each file, it works fine.
Problem:
If more folders are added in the future, I will have to go back to the crawler and add a new location path for each newly added folder. Is there any way to do this automatically, or some other way to handle it? I am using a Glue crawler and Athena on an S3 bucket to run queries over multiple CSV files.
In general, a table needs all of its files to be in a directory, and no other files to be in that directory.
There is, however, a mechanism that makes it possible to create tables that include just specific files. You can read more about that in the second part of this answer: Partition Athena query by S3 created date (scroll down a bit after the horizontal rule). You can also find an example in the S3 Inventory documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html

Selecting specific files for Athena

While creating a table in Athena, I am not able to create a table from specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For example:
s3://bucketname/prefix/year_2019*.csv
The documentation is very clear that this is not allowed.
From https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html:
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.
I would like to know if the community has found a work-around :)
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is you create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stack Overflow answer: https://stackoverflow.com/a/55069330/1109
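A minimal sketch of that setup, with illustrative table, column, and bucket names:
CREATE EXTERNAL TABLE year_2019 (
-- illustrative columns; use the real schema of the CSV files
col1 string,
col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/symlinks/year_2019/';
The location s3://bucketname/symlinks/year_2019/ holds a single symlink.txt file listing the S3 URIs of the files to include, one per line:
s3://bucketname/prefix/year_2019_01.csv
s3://bucketname/prefix/year_2019_02.csv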

Using tar.gz file as a source for Amazon Athena

If I define *.tsv files on Amazon S3 as the source for an Athena table and use OpenCSVSerde or LazySimpleSerDe as the deserializer, it works correctly. But if I use *.tar.gz files that contain *.tsv files, I see several strange rows in the table (e.g. a row that contains the tsv file name, and several empty rows). What is the right way to use tar.gz files in Athena?
The problem is tar: it adds extra records (the tar headers, which include the file name, show up as rows). Athena can open *.gz files, but not tar archives, so in this case I have to use *.gz instead of *.tar.gz.
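For instance, once the TSV files are repackaged as plain gzip (names here are illustrative), an ordinary delimited table works, since Athena decompresses *.gz files transparently:
CREATE EXTERNAL TABLE tsv_data (
col1 string,
col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/tsv/';  -- contains file1.tsv.gz, file2.tsv.gz, ...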

AWS Athena data input location

According to the docs, when you create a table in Athena, you need to specify the location of the input data in the S3 bucket. You can only specify the S3 location containing the files, not a specific file to be used. For example, I have many files like type1.log.gz, type2.log.gz, type3.log.gz of different formats at the location my-bucket/logs/.
Currently the location given is 's3://my-bucket/logs/'
So is it possible to specify which file (say type2.log.gz) should be used?
Or do I have to copy the file (type2.log.gz) to another location containing no other files and specify that path?
Athena expects all of the data within an S3 location to have the same schema. This is a big help when you have a very large table, as it can be broken into many files that Athena can read in parallel, or when you want to add data to an existing table. However, that does mean that you simply can't use Athena in a situation where one S3 location has files with different schemas.
In your case, you would need to move the file you want to query to a different location, and then create a table pointing to that location. For example, if you copy it to s3://my-bucket/logs/type2/type2.log.gz, the table should point to s3://my-bucket/logs/type2/.
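A hedged sketch of that table, assuming type2.log.gz is a gzip-compressed, tab-delimited log (the column names are illustrative):
CREATE EXTERNAL TABLE type2_logs (
-- illustrative columns; match them to the actual format of type2.log.gz
request_time string,
status string,
message string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/logs/type2/';  -- contains only type2.log.gz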
No, it is not possible. You are obliged to copy the file to an external bucket.
Ref: confirmed by AWS