Can AWS Glue Crawler handle different file types in same folder? - amazon-web-services

I have reports delivered to S3 in the following structure:
s3://chum-bucket/YYYY/MM/DD/UsageReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/SearchReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/TimingReportYYYYMMDD.zip
The YYYY MM DD vary per day. The YYYMMDD in the filename is there because the files all go into one directory on a server before they are moved to S3.
I want to have 1 or 3 crawlers that deliver 3 tables to the catalog, one for each type of report. Is this possible? I can't seem to specify
s3://chum-bucket/**/UsageReport*.zip
s3://chum-bucket/**/SearchReport*.zip
s3://chum-bucket/**/TimingReport*.zip
I can write one crawler that excludes SearchReport and TimingReport, and therefore crawls the UsageReport only. Is that the best way?
Or do I have to completely re-do the bucket / folder / file name design?

Amazon Redshift loads all files in a given path, regardless of filename.
Redshift will not take advantage of partitions (Redshift Spectrum will, but not a normal Redshift COPY statement), but it will read files from any subdirectories within the given path.
Therefore, if you want to load the data into separate tables (UsageReport, SearchReport, TimingReport), the they need to be in separate paths (directories). All files within the designated directory hierarchy must be in the same format and will be loaded into the same table via the COPY command.
An alternative is that you could point to a specific file using manifest files, but this can get messy.
Bottom line: Move the files to separate directories.

Related

Create Athena table using s3 source data

Below is given the s3 path where I have stored the files obtained at the end of a process. The below-provided path is dynamic, that is, the value of the following fields will vary - partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under output/ as well as under intermediate_results/ directories, for each val1-val2.
Each file is a CSV.
But I am not much familiar with AWS Athena so I'm unable to figure out the way to implement this. I would really appreciate any kind of help. Thanks!
Use CREATE TABLE - Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a Location of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage format is not compatible with your desired use for Amazon Athena. You would need to put the data into separate paths for each table.

Copy ~200.000 of s3 files to new prefixes

I have ~200.000 s3 files that I need to partition, and have made an Athena query to produce a target s3 key for each of the original s3 keys. I can clearly create a script out of this, but how to make the process robust/reliable?
I need to partition csv files using info inside each csv so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
And I can make a single big script to copy all through Athena and bit of SQL, but I am concerned about doing this reliably so that I can be sure that all are copied across, and not have the script fail, timeout etc. Should I "just run the script"? From my machine or better to put it in an ec2 1st? These kinds of questions
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the content of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open ยท PyPI is a great library for reading from an S3 object without having to download it first.)

how to create multiple table from multiple folder with one location path and athena should also work on it with glue crawler

I have tried this not achieving required results-
I have multiple CSV files in a folder of s3 bucket but when it creates multiple table for it then Athena returns zero results so I made a different folder for each file then it works fine.
problem-
but if in future more folders will be added then I have to go to crawler and have to add a new location path for each newly added folder so is there any way to do it automatically or some other way to do it. I am using glue crawler and s3 bucket athena for query run on multiple CSV files.
In general a table needs all of its files to be in a directory, and no other files to be in that directory.
There is however, a mechanism that makes it possible to create tables that include just specific files. You can read more about that in the the second part of this answer: Partition Athena query by S3 created date (scroll down a bit after the horizontal rule). You can also find an example in the S3 Inventory documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html

Selecting specific files for athena

While creating a table in Athena, I am not able to create tables using specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For e.g.
s3://bucketname/prefix/year_2019*.csv
The documentation is very clear about it and it is not allowed.
From:
https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
Athena reads all files in an Amazon S3 location you specify in the
CREATE TABLE statement, and cannot ignore any files included in the
prefix. When you create tables, include in the Amazon S3 path only the
files you want Athena to read. Use AWS Lambda functions to scan files
in the source location, remove any empty files, and move unneeded
files to another location.
I will like to know if the community has found some work-around :)
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is you create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stackoverflow response: https://stackoverflow.com/a/55069330/1109

S3 avoid loading of duplicate files

I have the following work flow.
I need to identify duplicate files on S3 in order to avoid duplicates on my destination ( Redshift ).
Load files to S3 every 4 hours from FTP Server ( File storage structure : year/month/date/hour/minute/filename)
Load S3 to Redshift once all of the files are pulled ( for that interval )
This is a continuous job that will be running every 4 hour.
Problem :
Some times the files with same content but different file names are present on S3. These files can belong to different intervals or different days. For example if a files arrives say one.csv on 1st Oct 2018 and contains 1,2.3,4 as a content then it is possible that on 10th Oct 2018 a file may arrive with same content 1,2,3,4 but with different file name.
I want to avoid to load this file to S3 if the contents are same.
I know that I can use file hash to identify the two identical files, But my problem is how to achieve this on S3 and that too with so much of files.
What will be the best approach to proceed ?
Bascially, I want to avoid loading of data to S3 that is already present.
You can add another table in redshift ( or anywhere else actually like MySQL or dynamodb ) which will contain Etag/md5 hash of files uploaded.
You might already be having a script which is running every 4 hours and is loading data into redshift. In this same script, after data is loaded successfully into redshift; just make an entry into this table. Also, put a check in this same script(from this new table) before loading data into Redshift.
You need to make sure, that you load this new table with all the Etags of files you have already loaded into redshift.