Push RedShift table to S3 by doing some aggregation as CSV - amazon-web-services

I have been looking for the best way to programmatically pull a Redshift table (the table needs to be aggregated) into S3.
What would be the best solution? For Athena to S3 I found this article; however, I could not find any information on doing it from Redshift to S3.
https://www.datastackpros.com/2020/07/export-athena-view-as-csv-to-aws-s3.html
It would be a daily ingestion, and the CSV file should be overwritten each time.
Thanks

There are 2 ways that come to mind right away - UNLOAD and CREATE EXTERNAL TABLE. Each has its pros and cons. Your use case isn't completely clear as to what you need the resulting file(s) to look like but let me take a guess.
I expect you need a single CSV file (with or without header row?) for other tools to read / use. In this case I'd use UNLOAD with PARALLEL OFF to save the result of the query to S3. This will produce 1 file in S3 ONLY IF the resulting size is below the maximum file size (6.2 GB by default); larger results are split across several files.
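As a rough sketch of that UNLOAD, the snippet below only builds the SQL text; running it is left to whatever Redshift client or scheduler you already use. The table, bucket and IAM role are made-up placeholders, and adding HEADER and ALLOWOVERWRITE is my assumption based on the "daily, overwritten" requirement in the question.

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role_arn: str,
                     header: bool = True) -> str:
    """Return a Redshift UNLOAD statement that writes CSV serially."""
    # UNLOAD takes the query as a quoted string, so single quotes
    # inside the query have to be doubled.
    escaped_query = query.replace("'", "''")
    options = ["FORMAT AS CSV"]
    if header:
        options.append("HEADER")
    # PARALLEL OFF writes serially; ALLOWOVERWRITE replaces existing files.
    options += ["PARALLEL OFF", "ALLOWOVERWRITE"]
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        + "\n".join(options)
        + ";"
    )


# Hypothetical table, bucket and role, for illustration only.
sql = build_unload_sql(
    query=("SELECT region, SUM(amount) AS total FROM sales "
           "WHERE status = 'paid' GROUP BY region"),
    s3_prefix="s3://example-bucket/exports/daily_sales_",
    iam_role_arn="arn:aws:iam::123456789012:role/example-unload-role",
)
print(sql)
```

Note that the TO value is a prefix, not a file name: with PARALLEL OFF the object ends up named like the prefix followed by 000, so downstream tools should look for that key.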

Related

AWS Athena - What happens when you add new files to S3 folder

I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries? Or do you have to run Glue or something to process them? So for example, what if I set up a Lambda function to extract a new CSV every hour, or even every 5 minutes, to that same S3 directory?
Does Athena actually load the data into some database somewhere in order to do fast performing queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files you may want to consider partitioning your table to optimize query performance, see:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't have any caching; every query hits the S3 location of the table.
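If you do partition the table, files written under a brand-new partition prefix are only visible once that partition has been registered (unless partition projection is configured). A minimal sketch for an hourly export, assuming a hypothetical table called events partitioned by dt and hour under a made-up bucket:

```python
from datetime import datetime, timezone


def hourly_partition(base: str, table: str, ts: datetime) -> tuple:
    """Return (S3 prefix, Athena DDL) for the hour that contains ts."""
    dt, hour = ts.strftime("%Y-%m-%d"), ts.strftime("%H")
    # Hive-style key=value folders, one prefix per hour.
    prefix = f"{base.rstrip('/')}/dt={dt}/hour={hour}/"
    ddl = (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}', hour = '{hour}') "
        f"LOCATION '{prefix}'"
    )
    return prefix, ddl


# Hypothetical bucket and table names.
prefix, ddl = hourly_partition(
    "s3://example-bucket/events",
    "events",
    datetime(2024, 1, 15, 13, 5, tzinfo=timezone.utc),
)
print(prefix)
print(ddl)
```

The Lambda would write its CSV under the returned prefix and run the DDL once per new hour; IF NOT EXISTS makes repeated runs harmless.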

How to create AWS Athena table via Glue crawler when the s3 data store has both json and .gz compressed files?

I have two problems in my intended solution:
1.
My S3 store structure is as follows:
mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz
All JSON files have the same schema, and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.
I have already tried with just one file format, e.g. if the files are all json or all gz the crawler works perfectly, but I am looking for a solution that can automatically process either type of file. I am open to writing a custom script or using any out-of-the-box solution, but I need pointers on where to start.
2.
The second issue is that my JSON data has a field (column) which the crawler interprets as struct data, but I want that field's type to be string. The reason is that if the type remains struct, the date/hour partitions get a mismatch error, since the struct obviously does not have the same internal schema across the files. I have tried to make a custom classifier, but it has no option for describing data types.
I would suggest skipping using a crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially adding partitions, but it's much less pain than trying to make a crawler do what you want it to do.
You can of course also create the table from Athena, that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
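To give an idea of what the Glue API route involves, here is a sketch that only builds the request payloads from S3 keys laid out like the ones in the question. The bucket name, the column names (id, payload) and the extra other.json key are invented for illustration, and I'm assuming the OpenX JSON SerDe hands a nested object back as JSON text when the column is declared as string. The boto3 calls are left as comments so the snippet runs without AWS access.

```python
import re

PARTITION_RE = re.compile(r"date=(?P<date>[^/]+)/hour=(?P<hour>[^/]+)/")


def storage_descriptor(location, columns):
    """Storage settings for line-delimited JSON (plain or gzipped)."""
    return {
        "Location": location,
        "Columns": [{"Name": n, "Type": t} for n, t in columns],
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
    }


def table_input(name, location, columns):
    """TableInput for glue.create_table, partitioned by date and hour."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "date", "Type": "string"},
                          {"Name": "hour", "Type": "string"}],
        "StorageDescriptor": storage_descriptor(location, columns),
    }


def partition_inputs(bucket, keys, columns):
    """One PartitionInput per distinct date/hour prefix found in keys."""
    seen = {}
    for key in keys:
        match = PARTITION_RE.search(key)
        if match is None:
            continue  # key is not inside a date=/hour= folder
        values = (match["date"], match["hour"])
        location = f"s3://{bucket}/{key[:match.end()]}"
        seen.setdefault(values, {
            "Values": list(values),
            "StorageDescriptor": storage_descriptor(location, columns),
        })
    return list(seen.values())


# Hypothetical columns; "payload" is the struct-like field kept as string.
columns = [("id", "string"), ("payload", "string")]
keys = [
    "mainfolder/date=2019-01-01/hour=14/abcd.json",
    "mainfolder/date=2019-01-01/hour=14/other.json",
    "mainfolder/date=2019-01-01/hour=13/abcd2.json.gz",
    "mainfolder/date=2019-01-15/hour=13/abcd74.json.gz",
]
table = table_input("events", "s3://example-bucket/mainfolder/", columns)
partitions = partition_inputs("example-bucket", keys, columns)

# With boto3 (not executed here; the database name is a placeholder):
# glue = boto3.client("glue")
# glue.create_table(DatabaseName="example_db", TableInput=table)
# glue.batch_create_partition(DatabaseName="example_db", TableName="events",
#                             PartitionInputList=partitions)
```

batch_create_partition accepts a limited number of partitions per call, so a real script would send the list in chunks.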
A crawler will not handle compressed and uncompressed data together, so it will not work out of the box.
It is better to write a Spark job in Glue and read the files with spark.read.json.

Loading parquet file from S3 to DynamoDB

I have been looking at options to load (basically empty and restore) a Parquet file from S3 to DynamoDB. The Parquet file itself is created by a Spark job that runs on an EMR cluster. Here are a few things to keep in mind:
I cannot use AWS Data Pipeline.
The file is going to contain millions of rows (say 10 million), so I need an efficient solution. I believe the boto API (even with batch write) might not be that efficient?
Are there any other alternatives?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries to DynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse across rows, a bit of Scala that takes a row, builds an entry for DynamoDB and PUTs it should be enough.
BTW: Use DynamoDB on demand here, as it handles peak loads well without you having to commit to some SLA.
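The answer talks about Scala; the same idea in Python might look like the sketch below, which is an illustration rather than a tuned loader. It relies on boto3's Table.batch_writer(), which groups puts into batch requests and resends unprocessed items. The table name and S3 path in the comments are placeholders, and FakeTable exists only so the function can be tried without AWS.

```python
from typing import Iterable, Mapping


def write_partition(rows: Iterable[Mapping], table) -> int:
    """Write every row of one Spark partition to DynamoDB; return the count."""
    written = 0
    # One batch writer per partition, so requests are batched and the
    # client is reused for all rows handled by this worker task.
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=dict(row))
            written += 1
    return written


class FakeTable:
    """Stand-in for a boto3 DynamoDB Table, for local testing only."""

    def __init__(self):
        self.items = []

    def batch_writer(self):
        return self

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def put_item(self, Item):
        self.items.append(Item)


fake = FakeTable()
count = write_partition([{"pk": "a", "n": 1}, {"pk": "b", "n": 2}], fake)
print(count, fake.items)

# In a Spark job (not executed here; names are hypothetical):
# def save(rows):
#     import boto3
#     table = boto3.resource("dynamodb").Table("example-table")
#     write_partition((r.asDict() for r in rows), table)
#
# spark.read.parquet("s3://example-bucket/path/").foreachPartition(save)
```

Creating the boto3 resource inside the per-partition function sidesteps having to serialize a client from the driver to the workers.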
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create desired dataframe
Use .withColumn to create a new column in the same dataframe, and use psf.collect_list to convert the data to the desired collection/JSON format in that column.
Drop all unnecessary (tabular) columns and keep only the JSON-format columns in the Spark dataframe.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDDs. The RDD interface, even in Scala, is 2-3 times slower than the Dataframe API in any language.
The Dataframe API's performance is programming-language agnostic, as long as you don't use UDFs.

Best storage format to backup hive internal table

I have one hive internal table which has around 500 million records.
My hive is deployed on top of AWS EMR. I do not want to keep the AWS EMR always running. Hence I want to backup the hive internal table data.
One easy way of doing it is to create an external table pointing to an S3 location and then move all records into that external table using an insert command.
Whenever I need the internal table back, I can use this external S3 table to get all the data back.
Since this table's only purpose is backup, I want to ask which STORED AS format would be the best choice for me.
Hive as of now supports the following formats:
TEXTFILE
SEQUENCEFILE
ORC
PARQUET
AVRO
RCFILE
Also, is there any other way to back up internal tables besides the approach mentioned above?
In Short
I'd think changing the file format (among the list you mentioned) will not make much difference in size. But the file size and the type of access you want on that file play a crucial role in your cloud account billing.
So consider the following:
Compression - to reduce the size
Amazon Glacier - a more cost-effective option than S3 in AWS, as the data is unlikely to be accessed (archival)
Things to consider when choosing a solution: how much time can you afford
to access the file from archival storage,
to convert the data format back to a Hive managed table (if you change it during archival),
to uncompress the data (each compression codec is a trade-off between time and size).
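To get a feel for that time/size trade-off on your own data, a quick comparison like the one below can help. It uses the codecs in Python's standard library (gzip, bz2, lzma) as stand-ins; Hadoop codecs such as Snappy or LZO are not included, and the sample bytes are synthetic, so treat the numbers as indicative only.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data: bytes) -> dict:
    """Measure size ratio and (de)compression time for a few codecs."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        decompress_s = time.perf_counter() - start

        # A backup is only useful if the round trip is lossless.
        assert restored == data
        results[name] = {
            "ratio": len(packed) / len(data),
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results


# Synthetic, highly repetitive sample; real table exports will differ.
sample = b"2019-01-01,14,some repeated value\n" * 50_000
results = compare_codecs(sample)
for name, r in results.items():
    print(f"{name}: ratio={r['ratio']:.4f} "
          f"compress={r['compress_s']:.3f}s decompress={r['decompress_s']:.3f}s")
```

Running it on a representative export of the table gives a better basis for choosing than generic benchmarks.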
Extended answer
Here are some of the file formats with their decompression speed and space efficiency; pick the compression format that is available to you and balanced (in terms of time/space, as per the questions above).
More compression and decompression benchmarks at

Data Transformation in AWS EMR without using Scala or Python

I have a star-schema-like database structure: one fact table holding all the IDs & skeys, and multiple dimension tables holding the actual id, code and description for the IDs referred to in the fact table.
We are moving all these tables (fact & dimensions) to S3 (cloud) individually, and each table's data is split into multiple Parquet files in its S3 location (one S3 object per table).
Query: I need to perform a transformation in the cloud, i.e. I need to strip out all the IDs & skeys referred to in the fact table, replace them with the actual codes residing in the dimension tables, and store the final output as another file back in S3. This file will later be consumed by Redshift for analytics.
My Doubt:
What's the best way to achieve this? I don't need the raw data (skeys & IDs) in Redshift, for cost and storage optimization.
Do we need to first combine these split Parquet files into one large file before performing the data transformation? Also, after the transformation I am planning to save the final output in Parquet format, but the catch is that Redshift doesn't allow copy of Parquet files, so is there a workaround for that?
I am not a hardcore programmer and want to avoid using Scala/Python in EMR, but I am good at SQL, so is there a way to perform the data transformation in the cloud through SQL on EMR and save the output into a file or files? Please advise.
You should be able to run Redshift-style queries directly against your S3 Parquet data by using Amazon Athena.
Some information on that:
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
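As a sketch of what the SQL-only route could look like in Athena: a CREATE TABLE AS SELECT joins the fact table to its dimensions and writes the result back to S3 as Parquet. The Python wrapper below just assembles the statement; the SQL inside is the part that matters. All table, column and bucket names are invented, and it assumes the fact and dimension tables are already defined in Athena over their S3 locations. A table defined over a prefix reads all the Parquet files under it, so the split files do not need to be merged first.

```python
def denormalize_ctas(target: str, output_location: str) -> str:
    """Build an Athena CTAS that swaps surrogate keys for codes."""
    return f"""
CREATE TABLE {target}
WITH (format = 'PARQUET', external_location = '{output_location}') AS
SELECT p.product_code,
       s.store_code,
       f.quantity,
       f.amount
FROM fact_sales f
JOIN dim_product p ON f.product_skey = p.product_skey
JOIN dim_store s ON f.store_skey = s.store_skey
""".strip()


# Hypothetical target table and output prefix.
ctas = denormalize_ctas("sales_denormalized",
                        "s3://example-bucket/denormalized/sales/")
print(ctas)
```

The same statement can be pasted straight into the Athena query editor, so no Python or Scala is strictly required to run it.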