Selecting specific files for athena - amazon-athena

When creating a table in Athena, I am not able to create tables from specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For example:
s3://bucketname/prefix/year_2019*.csv
The documentation is very clear about this: it is not allowed.
From:
https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
Athena reads all files in an Amazon S3 location you specify in the
CREATE TABLE statement, and cannot ignore any files included in the
prefix. When you create tables, include in the Amazon S3 path only the
files you want Athena to read. Use AWS Lambda functions to scan files
in the source location, remove any empty files, and move unneeded
files to another location.
I would like to know if the community has found some workaround :)

Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is you create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
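A rough sketch of what that DDL can look like (the table name, columns, and S3 paths below are made up for illustration; the column list has to match your CSV files):

CREATE EXTERNAL TABLE year_2019_data (
  id string,
  value string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/symlinks/year_2019/';

-- s3://bucketname/symlinks/year_2019/symlink.txt would then list one S3 URI per line, e.g.:
-- s3://bucketname/prefix/year_2019_01.csv
-- s3://bucketname/prefix/year_2019_02.csv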
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stack Overflow answer: https://stackoverflow.com/a/55069330/1109

Related

Create Athena table using s3 source data

Below is the S3 path where I have stored the files obtained at the end of a process. The path is dynamic; the values of the following fields will vary - partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under output/ as well as under intermediate_results/ directories, for each val1-val2.
Each file is a CSV.
But I am not very familiar with AWS Athena, so I am unable to figure out how to implement this. I would really appreciate any help. Thanks!
Use CREATE TABLE - Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a Location of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage format is not compatible with your desired use for Amazon Athena. You would need to put the data into separate paths for each table.
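For example, if the data were reorganized so the final results and the intermediate results live under separate, non-nested prefixes, each could get its own table. A minimal sketch, assuming plain comma-separated CSV (the partner/customer values and column definitions are placeholders):

CREATE EXTERNAL TABLE final_results (
  col1 string,
  col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket/partner_a/data/customer_x/final_results/';

CREATE EXTERNAL TABLE intermediate_results (
  col1 string,
  col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket/partner_a/data/customer_x/intermediate_results/';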

How to create multiple tables from multiple folders with one location path, so that Athena also works on them with a Glue crawler

I have tried this without achieving the required results:
I have multiple CSV files in a folder of an S3 bucket, but when the crawler creates multiple tables from them, Athena returns zero results, so I made a separate folder for each file and then it works fine.
Problem:
But if more folders are added in the future, I will have to go to the crawler and add a new location path for each newly added folder. Is there any way to do this automatically, or some other way to do it? I am using a Glue crawler and an S3 bucket with Athena to run queries on multiple CSV files.
In general a table needs all of its files to be in a directory, and no other files to be in that directory.
There is, however, a mechanism that makes it possible to create tables that include just specific files. You can read more about it in the second part of this answer: Partition Athena query by S3 created date (scroll down a bit after the horizontal rule). You can also find an example in the S3 Inventory documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html

Can AWS Glue Crawler handle different file types in same folder?

I have reports delivered to S3 in the following structure:
s3://chum-bucket/YYYY/MM/DD/UsageReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/SearchReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/TimingReportYYYYMMDD.zip
The YYYY MM DD vary per day. The YYYYMMDD in the filename is there because the files all go into one directory on a server before they are moved to S3.
I want to have 1 or 3 crawlers that deliver 3 tables to the catalog, one for each type of report. Is this possible? I can't seem to specify
s3://chum-bucket/**/UsageReport*.zip
s3://chum-bucket/**/SearchReport*.zip
s3://chum-bucket/**/TimingReport*.zip
I can write one crawler that excludes SearchReport and TimingReport, and therefore crawls the UsageReport only. Is that the best way?
Or do I have to completely re-do the bucket / folder / file name design?
Amazon Redshift loads all files in a given path, regardless of filename.
Redshift will not take advantage of partitions (Redshift Spectrum will, but not a normal Redshift COPY statement), but it will read files from any subdirectories within the given path.
Therefore, if you want to load the data into separate tables (UsageReport, SearchReport, TimingReport), they need to be in separate paths (directories). All files within the designated directory hierarchy must be in the same format and will be loaded into the same table via the COPY command.
An alternative is to point to specific files using a manifest file, but this can get messy.
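A rough sketch of the manifest approach (the table name, paths, and IAM role ARN are placeholders; the manifest is a JSON file in S3 listing the exact objects to load):

-- Manifest contents, e.g. at s3://chum-bucket/manifests/usage_report.manifest:
--   {"entries": [
--     {"url": "s3://chum-bucket/2019/01/02/UsageReport20190102.zip", "mandatory": true},
--     {"url": "s3://chum-bucket/2019/01/03/UsageReport20190103.zip", "mandatory": true}
--   ]}
-- Note: COPY reads the listed objects as-is, so they must be in a format and
-- compression that COPY supports.
COPY usage_report
FROM 's3://chum-bucket/manifests/usage_report.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
MANIFEST
CSV;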
Bottom line: Move the files to separate directories.

AWS Athena data input location

According to the docs, when you create a table in Athena, you need to specify the location of the input data in the S3 bucket. You can only specify the S3 location containing the files, not a specific file to be used. For example, I have many files like type1.log.gz, type2.log.gz, type3.log.gz of different formats at the location my-bucket/logs/.
Currently the location given is 's3://my-bucket/logs/'
So is it possible to specify which file (say type2.log.gz) should be used?
Or do I have to copy the file (type2.log.gz) to another location that has no other files and specify its path?
Athena expects all of the data within an S3 location to have the same schema. This is a big help when you have a very large table, as it can be broken into many files that Athena can read in parallel, or when you want to add data to an existing table. However, that does mean that you simply can't use Athena in a situation where one S3 location has files with different schemas.
In your case, you would need to move the file you want to query to a different location, and then create a table pointing to its location--e.g. if you copy to s3://my-bucket/logs/type2/type2.log.gz, the table should point to s3://my-bucket/logs/type2.
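As a rough sketch (the column definitions are hypothetical and have to match the actual log format):

CREATE EXTERNAL TABLE type2_logs (
  log_time string,
  message string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/logs/type2/';
-- Athena decompresses gzip text files automatically, so type2.log.gz can stay compressed.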
No, it is not possible. You are obliged to copy the file to a separate bucket or prefix.
Ref : Confirmed by AWS

Retaining source file name while importing data from s3 to Redshift

I have a large number of files in an S3 bucket and usually import them into Redshift. Since the number of files is large, I need a column in the Redshift table that contains the source file name from the S3 location.
Is there any way to accomplish this?
Agree with Ketan that this is currently not possible in Redshift. If this is what you want to achieve, it is possible in one of two ways:
Read the S3 files programmatically, write new S3 files with the file name as a column, and load the new files.
Alternatively, use Hive. Create an external table on the S3 bucket location and use INPUT__FILE__NAME to get the file names, create a new table, and then write it back to S3. You can also do some pre-processing in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
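A rough HiveQL sketch of that approach (table names, columns, and paths are hypothetical):

CREATE EXTERNAL TABLE raw_events (
  col1 string,
  col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/events/';

-- Write the data back to S3 together with its source file name,
-- then COPY the result into Redshift instead of the original files.
INSERT OVERWRITE DIRECTORY 's3://my-bucket/events_with_filename/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT col1, col2, INPUT__FILE__NAME FROM raw_events;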
Hope this helps.
That isn't possible. During a Copy operation, Redshift only loads file contents into a table; it doesn't provide access to S3 file names.
To achieve what you want, you need to preprocess the data to add additional information inside the files.