Inventory List from multiple S3 buckets using Athena

I am trying to query the AWS S3 Inventory List using Athena. I can do this if I have only one source bucket, but I am not sure how to configure it to work with multiple source buckets.
We are using all the default configuration options, with CSV as the data format. The S3 Inventory destination bucket name pattern for Hive is like this:
destination-prefix/source-bucket/config-ID/hive/dt=YYYY-MM-DD-HH-MM/symlink.txt
So when I create an Athena table I have to use a static Hive path.
CREATE EXTERNAL TABLE your_table_name(
-- column names
)
PARTITIONED BY (dt string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
So if I want to query inventory data for multiple source buckets, it seems like I have to create a table for each "source-bucket".
Alternatively, without using Athena, I tried to do this with the AWS CLI:
aws s3 ls s3://our-bucket-name/prefix/abc --recursive | awk '$1 > "2019-04-01"'
But this lists every single file first, as there is no option to set --include or --exclude with s3 ls.
Finally, the questions are:
Can I configure S3 Inventory to generate the inventory for multiple S3 buckets so that it puts everything into the same "hive" directory (i.e. ignores the "source-bucket" prefix while generating the inventory)?
Is it possible to configure Athena to read from multiple hive locations? But with the possibility of new buckets getting created and dropped, I guess this gets ugly.
Is there any alternative way to query the inventory list, other than Athena, the AWS CLI, or writing custom code that uses the manifest.json file to get these CSV files?

You can't make S3 Inventory create one inventory for multiple buckets, unfortunately. You can, however, splice the inventories together into one table.
The guide you link to says to run MSCK REPAIR TABLE … to load your inventories. I would recommend not doing that, because it creates a table where each partition represents the inventory at some point in time. That is something you might want if you need to compare what's in a bucket from day to day or week to week, but it's probably not what you want most of the time; most of the time you want to know what's in the bucket right now. To get multiple inventories into the same table, you should also not run that command.
First you change how you create the table slightly:
CREATE EXTERNAL TABLE your_table_name(
-- column names
)
PARTITIONED BY (bucket_name string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
Notice that I changed the partitioning from dt string to bucket_name string.
Next you add the partitions manually:
ALTER TABLE your_table_name
ADD PARTITION (bucket_name = 'some-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID1/hive/dt=YYYY-MM-DD/'
    PARTITION (bucket_name = 'another-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID2/hive/dt=YYYY-MM-DD/';
The locations should be the S3 URIs to the latest dates under the "hive" prefix of the inventory for each bucket.
The downside of this is that when new inventories are delivered you will need to update the table to point to these new locations. You can do this by first dropping the partitions:
ALTER TABLE your_table_name
DROP PARTITION (bucket_name = 'some-bucket'),
     PARTITION (bucket_name = 'another-bucket');
and then adding them again using the same SQL as above, but with new S3 URIs.
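For example, one refresh cycle could look like the sketch below. The dt=2019-05-01 dates and the some-bucket/another-bucket paths are hypothetical placeholders; substitute the latest delivery prefix of each bucket's inventory.
-- Hypothetical refresh after new inventory deliveries; dates and paths are placeholders.
ALTER TABLE your_table_name
DROP PARTITION (bucket_name = 'some-bucket'),
     PARTITION (bucket_name = 'another-bucket');

ALTER TABLE your_table_name
ADD PARTITION (bucket_name = 'some-bucket') LOCATION 's3://destination-prefix/some-bucket/config-ID1/hive/dt=2019-05-01/'
    PARTITION (bucket_name = 'another-bucket') LOCATION 's3://destination-prefix/another-bucket/config-ID2/hive/dt=2019-05-01/';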

Related

AWS Athena Partitioning

I've read Partitioning data in Athena; however, it is not clear how to create partitions for a table when S3 has the following structure:
aws s3 ls s3://xxx-s3-zzz-datalake-prod/yyy/2022/09/
PRE 01/
PRE 02/
PRE 03/
PRE 04/
PRE 05/
PRE 06/
etc...
How could I create partitions for such a structure? Is it possible? Or should I rename it to
aws s3 ls s3://xxx-s3-yyy-datalake-prod/zzz/2022/09/
PRE day=01/
PRE day=02/
PRE day=03/
etc...
and then add PARTITIONED BY (day int) ?
The partitions can be added in both cases, but you have to use a different method in each case. If you have data in the format below:
aws s3 ls s3://xxx-s3-zzz-datalake-prod/yyy/2022/09/
PRE 01/
PRE 02/
PRE 03/
PRE 04/
PRE 05/
PRE 06/
etc...
then you can only add the partition information to the table manually, using a query like the one below:
ALTER TABLE orders ADD
PARTITION (day = '01') LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/2022/09/01'
PARTITION (day = '02') LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/2022/09/02';
Refer to the link below for more information.
https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
Also try adding more partition columns to your table by including year and month in the PARTITIONED BY clause, since adding only a day will not do much good.
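As a hedged sketch of that idea (the orders table name and the order_id column are illustrative, not from the original question), a table partitioned by year, month and day could be created and loaded like this:
CREATE EXTERNAL TABLE orders(
order_id string -- illustrative column; real column definitions go here
)
PARTITIONED BY (year string, month string, day string)
-- options ignored
LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/';

ALTER TABLE orders ADD
PARTITION (year = '2022', month = '09', day = '01') LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/2022/09/01/'
PARTITION (year = '2022', month = '09', day = '02') LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/2022/09/02/';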
In the case of the structure below, it is straightforward and easy:
aws s3 ls s3://xxx-s3-yyy-datalake-prod/zzz/2022/09/
PRE day=01/
PRE day=02/
PRE day=03/
etc...
Here you can run MSCK REPAIR TABLE <table-name>, which automatically populates the table with partition information, because the structure follows the Hive key=value format. The same information can also be added by a Glue crawler.
The link below has more explanation of Hive-style and non-Hive-style partitioning formats.
https://aws.amazon.com/premiumsupport/knowledge-center/athena-create-use-partitioned-tables/
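For example, a minimal sketch assuming the renamed day=NN layout from the question (the orders table name and order_id column are illustrative):
CREATE EXTERNAL TABLE orders(
order_id string -- illustrative column; real column definitions go here
)
PARTITIONED BY (day int)
-- options ignored
LOCATION 's3://xxx-s3-yyy-datalake-prod/zzz/2022/09/';

-- Discovers day=01/, day=02/, ... automatically, because the prefixes
-- follow the Hive key=value convention.
MSCK REPAIR TABLE orders;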

AWS Glue - custom s3 partition to single table

I have multiple files stored in an S3 bucket under uniquely named folders, which I would expect AWS Glue to put into a single table; instead it creates one table per file. Any ideas how to configure the crawler to get a single table?
The current S3 structure is s3://bucket_name/YYYYMMDDUUID/data.json:
20210801123123cfec/data.json
20210808876551cedc/data.json
....
20210810112313feed/data.json
The JSON schema is definitely not the problem; it is similar across files. For example, when I change the folder names from the custom names to "1", "2", etc., I get a single table with multiple partitions.

Does AWS Spectrum really need = in the S3 location to understand it as Hive format?

I ran some tests with Spectrum.
I created two AWS Glue crawlers.
The first one, called hive-tst, scans:
s3://hive-test/type='a'/year='2021'/month='01'
s3://hive-test/type='b'/year='2021'/month='01'
s3://hive-test/type='c'/year='2021'/month='01'
s3://hive-test/type='d'/year='2021'/month='01'
s3://hive-test/type='e'/year='2021'/month='01'
The second one scans:
s3://non-hive-test/a/2021/01
s3://non-hive-test/b/2021/01
s3://non-hive-test/c/2021/01
s3://non-hive-test/d/2021/01
s3://non-hive-test/e/2021/01
Both have two files in each bucket partition; both files are Parquet files of 50 MB.
Then I ran a test querying the first partition of each Spectrum table:
select distinct event from test.hive_tst;
It took 8s 272ms
select distinct partition_0 from test.nonhive_tst;
It took 8s 66ms
So it doesn't seem that adding the = improves performance.
I also checked that both tables have the Hive format in their partitions.
select *
from svv_external_partitions
where schemaname='test'
and tablename='hive_tst';
values:            ["a","2021","01"]
location:          s3://hive-test/event=a/year=2021/month=01/
input_format:      org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output_format:     org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serialization_lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
select *
from svv_external_partitions
where schemaname='test'
and tablename='nonhive_tst';
values:            ["a","2021","01"]
location:          s3://hive-test/a/2021/01/
input_format:      org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output_format:     org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serialization_lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Maybe the data volume in the folders is not enough to test it properly, but everything (the execution times and the partition format reported by svv_external_partitions) seems the same.
Then the question is:
Does AWS Spectrum really need = in the S3 location to understand it as Hive format?
Finally, after a lot of searching and reading, I came to a conclusion:
Both S3 buckets have partitions, and since we use AWS Glue, all partitions are added automatically.
The only difference is that a prefix such as year=2020 follows the Hive naming convention, so AWS Glue knows how to handle it when adding the partitions, and the partitions then get a meaningful name such as year instead of partition_x.
So, answering the question "Does AWS Spectrum really need = in the S3 location to understand it as Hive format?":
No, you don't need it for Spectrum to understand the Hive format, but you do need it for the partitions to be understood with the Hive naming convention.
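To make the practical difference concrete, here is a sketch of the same filter against both test tables (assuming the crawler mapped the year and month path segments of the non-Hive layout to partition_1 and partition_2):
-- Hive-style layout: partition columns keep the names from the path.
select count(*) from test.hive_tst where year = '2021' and month = '01';
-- Non-Hive layout: the crawler assigns generic partition column names.
select count(*) from test.nonhive_tst where partition_1 = '2021' and partition_2 = '01';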
Sources
My own test above
AWS Glue partitioned data post
Hive naming convention on S3

How to create multiple tables from multiple folders with one location path, so that Athena also works on them with a Glue crawler

I have tried this without achieving the required results.
I have multiple CSV files in one folder of an S3 bucket, but when the crawler creates multiple tables for it, Athena returns zero results; so I made a separate folder for each file, and then it works fine.
Problem:
But if more folders are added in the future, I have to go back to the crawler and add a new location path for each newly added folder. Is there any way to do this automatically, or some other way to handle it? I am using a Glue crawler, an S3 bucket, and Athena to run queries on multiple CSV files.
In general a table needs all of its files to be in a directory, and no other files to be in that directory.
There is, however, a mechanism that makes it possible to create tables that include just specific files. You can read more about that in the second part of this answer: Partition Athena query by S3 created date (scroll down a bit after the horizontal rule). You can also find an example in the S3 Inventory documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
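For reference, a rough sketch of that symlink mechanism (the table name, column, and location are illustrative): the table's LOCATION holds symlink.txt files whose lines are the S3 URIs of the data files that should belong to the table.
CREATE EXTERNAL TABLE selected_files(
col1 string -- illustrative column; real column definitions go here
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-bucket/symlink-prefix/';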

AWS Athena Returning Zero Records from Tables Created by Glue Crawler from Input CSV in S3

Part One:
I ran a Glue crawler on a dummy CSV loaded into S3. It created a table, but when I try to view the table in Athena and query it, it shows "Zero records returned".
But the demo ELB data in Athena works fine.
Part Two (scenario):
Suppose I have an Excel file and a data dictionary describing how, and in what format, the data is stored in that file, and I want that data to be dumped into AWS Redshift. What would be the best way to achieve this?
I experienced the same issue. You need to give the crawler the folder path instead of the actual file name and then run it. I tried feeding the folder name to the crawler and it worked. Hope this helps.
I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.
Delete the crawler and create it again (there should be no more than one CSV file available in S3), then run the crawler. Important note: with a single CSV file per run, we can view the records in Athena.
I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
The structure of the S3 bucket / folder is very important:
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Solution: select the path of the folder even if the folder contains many files. This will generate one table and the data will be displayed.
In many such cases, using an exclude pattern in the Glue crawler helps me.
It is true that instead of pointing the crawler directly at the file we should point it at the directory, but when even that does not return any records, an exclude pattern comes to the rescue.
You will have to devise a pattern by which only the files you want get crawled and the rest are excluded. (I suggest doing this instead of creating a separate directory for each file; most of the time, making such changes in a production bucket is not feasible.)
I had data in an S3 bucket. There were multiple directories, and inside each directory there were a Snappy-compressed Parquet file and a JSON file. The JSON file was causing the issue.
So I ran the crawler on the parent directory that contained the many sub-directories, and as the exclude pattern I gave */*.json
And this time it did not create a table for the JSON files, and I was able to see the records of the table using Athena.
For reference: https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
Pointing the Glue crawler to the S3 folder and not the actual file did the trick.
Here's what worked for me: I needed to move all of my CSVs into their own folders; just pointing the Glue crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.