Partition csv data in s3 bucket for querying using Athena - amazon-web-services

I have CSV log data arriving every hour in a single S3 bucket, and I want to partition it to improve query performance, as well as convert it to Parquet.
Also, how can I add partitions automatically for new logs as they arrive?
Note:
csv file names follow a standard date format
files are written by an external source and cannot be changed to write into folders; they only land in the root of the bucket
I want to convert the CSV files to Parquet as a separate step

It appears that your situation is:
Objects are being uploaded to an Amazon S3 bucket
You would like those objects to be placed in a path hierarchy to support Amazon Athena partitioning
You could configure an Amazon S3 event to trigger an AWS Lambda function whenever a new object is created.
The Lambda function would:
Read the filename (or the contents of the file) to determine where it should be placed in the hierarchy
Perform a CopyObject() to put the object in the correct location (S3 does not have a 'move' command)
Delete the original object with DeleteObject()
Be careful that the copy does not itself generate an event that triggers the Lambda function again (e.g. copy into a different folder or bucket), otherwise an infinite loop would occur.
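A minimal sketch of such a Lambda function in Python/boto3, assuming the filenames start with a YYYY-MM-DD date and that the partitioned copies go to a separate, hypothetical destination bucket so the copy does not re-trigger the function:

import re
import urllib.parse
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-partitioned-logs"  # placeholder; must differ from the trigger bucket/prefix

def lambda_handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Filenames are assumed to begin with a YYYY-MM-DD date, e.g. 2019-07-14-0300.csv
        m = re.match(r"(\d{4})-(\d{2})-(\d{2})", key)
        if not m:
            continue  # skip anything that doesn't match the expected naming
        year, month, day = m.groups()
        # Hive-style prefix so Athena can treat year/month/day as partition keys
        dest_key = f"logs/year={year}/month={month}/day={day}/{key}"
        # S3 has no 'move': copy to the partitioned location, then delete the original
        s3.copy_object(Bucket=DEST_BUCKET, Key=dest_key,
                       CopySource={"Bucket": src_bucket, "Key": key})
        s3.delete_object(Bucket=src_bucket, Key=key)

With that layout in place, new partitions can be registered in Athena with MSCK REPAIR TABLE or an ALTER TABLE ADD PARTITION statement.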
When you wish to convert the CSV files to Parquet, see:
Converting to Columnar Formats - Amazon Athena
Using AWS Athena To Convert A CSV File To Parquet | CloudForecast Blog
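For the conversion itself, the CREATE TABLE AS SELECT (CTAS) approach described in those links can also be kicked off from code; a rough sketch using boto3, where the database, table, and bucket names are placeholders:

import boto3

athena = boto3.client("athena")

# CTAS query that rewrites a CSV-backed table as partitioned Parquet (all names are placeholders).
# Note: Athena requires the partition columns to be the last columns returned by the SELECT.
ctas = """
CREATE TABLE logs.logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-partitioned-logs/parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT * FROM logs.logs_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)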

Related

AWS DMS - How to write RDS table data to a single S3 target file?

I have successfully setup DMS to copy data from RDS (SQL Server) to S3 in csv format (Full load). However, upon running the task, DMS copies the source table and creates multiple csv files in S3 for the single table. Is there any way to make sure that for 1 table, DMS only creates one target csv file in S3?
The first full load operation will load all data into one file.
For ongoing replication (CDC), the migrated data has a different format: each record carries an additional operation flag, like this:
I: for an inserted record
U: for an updated record
D: for a deleted record
So they cannot simply be merged into one file.
You can do this using Lambda, but it's not a good approach:
Add a trigger so the Lambda function fires whenever a change is made in the S3 bucket that contains the csv files
In the Lambda function, handle the file for each of the cases above and merge them yourself (a rough sketch follows below)
I suggest using another database target such as MySQL or Postgres instead, as they support all of these operations natively.
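For illustration only, a rough sketch of that merge step, assuming the CDC CSV files carry the operation flag in the first column and the table's primary key in the second; both column positions are assumptions and would need adjusting for a real table:

import csv

def merge_cdc_files(full_load_path, cdc_paths, output_path):
    # Replay DMS CDC rows (I/U/D flags) on top of the full-load CSV to produce one merged file
    rows = {}

    # Full-load file: no operation flag; the primary key is assumed to be the first column
    with open(full_load_path, newline="") as f:
        for row in csv.reader(f):
            rows[row[0]] = row

    # CDC files: first column is the op flag, second is the primary key (assumed layout)
    for path in cdc_paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                op, key, data = row[0], row[1], row[1:]
                if op in ("I", "U"):
                    rows[key] = data
                elif op == "D":
                    rows.pop(key, None)

    with open(output_path, "w", newline="") as f:
        csv.writer(f).writerows(rows.values())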

AWS 100 TB data transformation at rest S3

I have about 50 TB of data in an S3 bucket, and the bucket doesn't have any partitioning. The files are JSON files of approximately 100 KB each.
I need to partition this data and put it in a different S3 bucket, either storing it in a yyyy/mm/dd/filename.json structure, or adding a custom metadata field containing the file's original LastModified date and moving it to the different bucket.
I have looked into options like:
Doing it with a Spark cluster, mounting both buckets as DBFS, then doing the transformation and copying to the destination bucket.
I have also tried writing a Lambda function that can do the same for a given file and invoking it from another program (a sketch of the per-file copy appears after this list); 1000 files take about 15 seconds to copy.
I also looked into generating an S3 Inventory and running a job on it, but it's not customizable enough to add metadata or create a partition structure, so to speak.
Is there an obvious choice I may be missing, or are there better ways to do this?
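For reference, the per-file work in that Lambda option amounts to a single copy_object call; a minimal sketch, with hypothetical bucket names, that derives the yyyy/mm/dd prefix from the object's original LastModified date and also attaches it as custom metadata:

import boto3

s3 = boto3.client("s3")

SRC_BUCKET = "my-unpartitioned-bucket"   # placeholder
DEST_BUCKET = "my-partitioned-bucket"    # placeholder

def repartition_object(key):
    # Use the object's original LastModified date to build the yyyy/mm/dd prefix
    last_modified = s3.head_object(Bucket=SRC_BUCKET, Key=key)["LastModified"]
    dest_key = last_modified.strftime("%Y/%m/%d/") + key
    # Copy into the destination bucket, replacing the metadata with the original date
    s3.copy_object(
        Bucket=DEST_BUCKET,
        Key=dest_key,
        CopySource={"Bucket": SRC_BUCKET, "Key": key},
        Metadata={"original-lastmodifieddate": last_modified.isoformat()},
        MetadataDirective="REPLACE",
    )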

How can I download s3 bucket data?

I'm trying to find some way to export data about an S3 bucket's contents, such as file path, file name, metadata tags, last modified date, and file size, to something like a .csv, .xml, or .json file. Is there any way to generate this without having to step through and hand-generate it manually?
Please note I'm not trying to download all the files; rather, I'm trying to find a way to export the information about those files that is presented in the S3 console.
Yes!
From Amazon S3 Inventory - Amazon Simple Storage Service:
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
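An inventory configuration can be created in the console or from code; a sketch using boto3 with placeholder bucket names, requesting a daily CSV report with a few extra fields:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",  # placeholder: the bucket to inventory
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # Fields to include alongside each object key in the report
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-reports",  # placeholder destination bucket ARN
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
    },
)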

AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

I'm trying to create an AWS Glue ETL Job that would load data from Parquet files stored in S3 into a Redshift table.
The Parquet files were written using pandas with the 'simple' file schema option into multiple folders in an S3 bucket.
The layout looks like this:
s3://bucket/parquet_table/01/file_1.parquet
s3://bucket/parquet_table/01/file_2.parquet
s3://bucket/parquet_table/01/file_3.parquet
s3://bucket/parquet_table/02/file_1.parquet
s3://bucket/parquet_table/02/file_2.parquet
s3://bucket/parquet_table/02/file_3.parquet
I can use an AWS Glue Crawler to create a table in the AWS Glue Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL Job that would copy the same table to Redshift.
If I crawl a single file, or multiple files in one folder, it works; as soon as there are multiple folders involved, I get the above-mentioned error
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
Similar issues appear if I use the 'hive' schema instead of 'simple'. Then we have multiple folders, and also empty Parquet files that throw
java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)
Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?
Redshift doesn't support parquet format. Redshift Spectrum does. Athena also supports parquet format.
The error you're facing occurs because, when Spark/Glue reads the Parquet files from S3, it expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll need to lay out the S3 hierarchy in Hive-style partitions, something like below:
s3://your-bucket/parquet_table/id=1/file1.parquet
s3://your-bucket/parquet_table/id=2/file2.parquet
and so on..
then use the path below to read all the files in the bucket:
location : s3://your-bucket/parquet_table
If the data in S3 is partitioned this way, you won't face any issues.
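Since the files are already being written with pandas, one way to produce that key=value layout is to let pandas/pyarrow create the partition folders; a small sketch, assuming a column named id to partition on (the column and bucket names are placeholders, and pyarrow/s3fs must be installed):

import pandas as pd

# Example frame; in practice this is the data currently written into the 01/, 02/, ... folders
df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "b", "c"]})

# partition_cols writes Hive-style key=value folders, e.g. s3://your-bucket/parquet_table/id=1/...
df.to_parquet(
    "s3://your-bucket/parquet_table",
    engine="pyarrow",
    partition_cols=["id"],
    index=False,
)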

COPY csv data into Redshift using LAMBDA

I'd like to create a Lambda function that COPYs into Redshift the data of a file PUT into a specific S3 bucket, but I can't figure out how to do that.
So far, I created a Lambda function that triggers whenever a .csv file is PUT into the S3 bucket, and I managed to COPY the data from a local .csv file to Redshift.
Now I'd like help with how to COPY the data using a Lambda function. I searched the internet but couldn't find proper examples using Lambda.
I use PowerShell to export data from SQL Server and upload it to an S3 bucket. I then have a Lambda function with an S3 PUT trigger that executes a stored procedure in Redshift containing the COPY command in a dynamic statement, in order to load data into different tables. It triggers every time a file is uploaded, so multiple tables, multiple times.
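A minimal sketch of a Lambda handler that issues the COPY itself, using the Redshift Data API so no database driver needs to be bundled; the cluster, database, user, table, and IAM role names are all placeholders:

import urllib.parse
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # COPY the uploaded CSV into a staging table; table name and IAM role are placeholders
    copy_sql = (
        f"COPY staging.my_table "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role' "
        f"FORMAT AS CSV IGNOREHEADER 1;"
    )

    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="mydb",                 # placeholder
        DbUser="lambda_loader",          # placeholder
        Sql=copy_sql,
    )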
The solution approach:
The Lambda trigger fires whenever a file is placed in S3.
The data pipeline is automated as below:
Flat file -> S3 -> Lambda reads the COPY command stored in DynamoDB and executes it -> data loaded into Redshift