Redshift: how to handle files in S3 that are not overwritten by UNLOAD - amazon-web-services

I have a table in Amazon Redshift, and my use case is to unload this table daily, partitioned by row number, to multiple S3 locations. The files from those S3 locations are then loaded into DynamoDB via the S3DataNode of Data Pipeline, with its DirectoryPath parameter pointing to the S3 locations updated in the step above.
unload('select * from test_table where (row_num%100) >= 0 and (row_num%100)<=30')
to 's3://location1/file_'
iam_role 'arn:aws:iam::xxxxxxxxxx'
parallel off
ALLOWOVERWRITE;
unload('select * from test_table where (row_num%100) >= 31 and (row_num%100)<=100')
to 's3://location2/file_'
iam_role 'arn:aws:iam::xxxxxxxxxx'
parallel off
ALLOWOVERWRITE;
Day 1: there is a huge number of rows in Redshift, and the unload creates file_000, file_001, file_002 and file_003 in one of the S3 locations (say location2).
Day 2: there are far fewer rows in Redshift, and the unload creates only file_000. Because of the ALLOWOVERWRITE option I'm using on the unload, only file_000 is overwritten; the other files created on Day 1 (file_001, file_002, file_003) still exist, and this results in already existing items from file_002 and file_003 being loaded into DynamoDB again.
How can I copy only the freshly overwritten files to DynamoDB?
(Note: if I use MANIFEST on the unload, can the Data Pipeline S3DataNode copy only the files listed in the generated manifest to DynamoDB?)
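For reference, adding MANIFEST to the unload makes Redshift write an extra object, s3://location1/file_manifest, listing only the files produced by that particular run; a sketch of the first unload with the option added (the role ARN and paths are the same placeholders as above):
unload('select * from test_table where (row_num%100) >= 0 and (row_num%100)<=30')
to 's3://location1/file_'
iam_role 'arn:aws:iam::xxxxxxxxxx'
manifest
parallel off
ALLOWOVERWRITE;
-- writes s3://location1/file_manifest alongside the data files, listing only this run's output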

Related

How do I remove old data from a Glue Studio ETL job

I have a CSV file in an S3 bucket. I am using Glue Studio to take that CSV and create various partitions in an S3 bucket so that I can speed up my Athena queries.
However, when the job runs, it's creating new files in the partitions and it retains the previous data.
Is there a way to remove the data from the glue job's previous run before adding the new partitioned data?

Querying multiple CSV files in S3 through Athena

I exported my SQL DB into S3 in CSV format. Each table is exported into separate CSV files and saved in Amazon S3. Now, can I run a query against that S3 bucket that joins multiple tables (multiple CSV files in S3) and returns a result set? How can I do that and save the result as a separate CSV file?
The steps are:
Put all files related to one table into a separate folder (directory path) in the S3 bucket. Do not mix files from multiple tables in the same folder because Amazon Athena will assume they all belong to one table.
Use a CREATE TABLE statement to define a new table in Amazon Athena, and specify where the files are kept via the LOCATION 's3://bucket_name/[folder]/' parameter. This tells Athena which folder to use when reading the data (a sketch of the DDL follows these steps).
Or, instead of using CREATE TABLE, an easier way is:
Go to the AWS Glue management console
Select Create crawler
Select Add a data source and provide the location in S3 where the data is stored
Provide other information as prompted (you'll figure it out)
Then, run the crawler and AWS Glue will look at the data files in the specified folder and will automatically create a table for that data. The table will appear in the Amazon Athena console.
Once you have created the tables, you can use normal SQL to query and join the tables.
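As a rough sketch of those steps, the DDL and a join could look like this; the table names, columns and S3 paths below are made up, and the CSV files are assumed to have a header row and no quoted commas:
-- Hypothetical tables; each folder holds the CSV files for one exported table.
CREATE EXTERNAL TABLE orders (
    order_id    int,
    customer_id int,
    amount      double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1');

CREATE EXTERNAL TABLE customers (
    customer_id int,
    name        string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/customers/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- A normal SQL join across the two sets of CSV files.
SELECT c.name, SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;
Athena also writes each query's result set to its configured query results location in S3 as a CSV file, which covers saving the output as a separate CSV.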

How to directly copy Amazon Athena tables into Amazon Redshift?

I have some JSON files in S3, and I was able to create databases and tables in Amazon Athena from those data files. That's done; my next target is to copy those tables into Amazon Redshift. There are other tables in Athena which I created based on those data files: I created three tables from the data files in S3, and later I created new tables from those three tables. So at the moment I have five different tables which I want to create in Amazon Redshift, with or without data.
I checked the COPY command in Amazon Redshift, but there is no COPY from Amazon Athena. Here is the available list:
COPY from Amazon S3
COPY from Amazon EMR
COPY from Remote Host (SSH)
COPY from Amazon DynamoDB
If there is no other solution, I plan to export new JSON files for the newly created Athena tables into S3 buckets; then we can easily COPY those from S3 into Redshift, can't we? Are there any better solutions for this?
If your S3 files are in a supported format, you can use Redshift Spectrum.
1) Set up a Hive metadata catalog of your S3 files, using AWS Glue if you wish.
2) Set up Redshift Spectrum to see that data inside Redshift (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html); a sketch of the external schema follows the CTAS example below.
3) Use CTAS to create a copy inside Redshift:
create table redshift_table as select * from redshift_spectrum_schema.redshift_spectrum_table;
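For step 2, the external schema can point directly at the Glue/Athena data catalog where the tables were created; a minimal sketch, in which the schema name, database name and role ARN are placeholders:
-- Expose the existing Glue/Athena database inside Redshift (names are placeholders).
create external schema redshift_spectrum_schema
from data catalog
database 'my_athena_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;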

Amazon Athena not returning recent data when partitions are partially loaded

I have defined a partitioned table which points to an S3 bucket that uses date partitioning. I have data for the past 3 months in the S3 bucket. I have loaded the partitions for the 1st month, but I haven't loaded the partitions for the past 2 months using msck repair table or alter table commands. When I query the table, the data for the past 2 months is not loaded from S3; only the most recent partitioned data shows up in the query results. Is this expected? If so, why?
I tried creating another partitioned table for the same S3 bucket, but this time I did not load any of the partitions. When I query that table, I get the most recent records.
Yes it is expected.
Athena uses metadata to recognize data in S3. The most important piece of metadata used to locate data in S3 is the partition: Athena keeps details about all partitions in its metadata and uses that partition info to reach the corresponding folder in S3 and fetch the data.
If you add more files to an existing partition: since the partition is already in the Athena metadata, all new files are picked up automatically, because Athena reads every file in the partition's S3 folder using the partition metadata and S3 location.
If you add files under a new partition: the partition is not in the Athena metadata, so Athena doesn't know how to locate the corresponding folder in S3 and therefore doesn't read data from it.
There are three ways to recognize new partitions:
1. Run a Glue crawler over the S3 bucket; it will refresh the partition metadata.
2. Use the ALTER TABLE ... ADD PARTITION command in Athena to add new partitions (see the sketch after this list).
3. Use MSCK REPAIR TABLE when your S3 folders follow the Hive key=value naming convention; it scans the table location and adds any partitions missing from the metadata.
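For options 2 and 3, the statements look roughly like this (table name, partition key and S3 path are placeholders):
-- Option 2: register one specific partition explicitly.
ALTER TABLE my_table ADD IF NOT EXISTS
    PARTITION (dt = '2023-09-01') LOCATION 's3://my-bucket/data/dt=2023-09-01/';

-- Option 3: scan the table's location and add every Hive-style (key=value)
-- partition folder that is missing from the metadata.
MSCK REPAIR TABLE my_table;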

Automatically load data into Redshift with the COPY function

The Amazon Redshift documentation states that the best way to load data into the database is by using the COPY function. How can I run it automatically every day, whenever a data file is uploaded to S3?
The longer version: I have launched a Redshift cluster and set up the database. I have created an S3 bucket and uploaded a CSV file. Now from the Redshift Query editor, I can easily run the COPY function manually. How do I automate this?
Before you finalize your approach, you should consider the important points below:
If possible, compress the CSV files with gzip before ingesting them into the corresponding Redshift tables. This reduces the file sizes by a good margin and improves overall ingestion performance.
Finalize the compression scheme for the table columns. If you want Redshift to do this job, automatic compression can be enabled with COMPUPDATE ON in the COPY command; refer to the AWS documentation.
Now, to answer your question:
Since you have already created an S3 bucket for this, create a directory for each table and place your files there. If your input files are large, split them into multiple files (choose the number of files according to the number of nodes you have, to enable better parallel ingestion; refer to the AWS docs for more details).
Your copy command should look something like this :
PGPASSWORD=<password> psql -h <host> -d <dbname> -p 5439 -U <username> \
  -c "copy <table_name> from 's3://<bucket>/<table_dir_path>/' \
      credentials 'aws_iam_role=<iam role identifier to ingest s3 files into redshift>' \
      delimiter ',' region '<region>' GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1"
The next step is to create a Lambda function and enable an event notification (via SNS) on the S3 bucket, so that the Lambda is triggered as soon as new files arrive. An alternative is to have a CloudWatch scheduled rule run the Lambda.
The Lambda (Java/Python or any language) reads the S3 file locations, connects to Redshift and ingests the files into the tables using the COPY command.
Lambda has a 15-minute limit; if that is a concern for you, Fargate would be a better fit. Running the jobs on EC2 will cost more than Lambda or Fargate (in case you forget to turn the EC2 machine off).
You could create an external table over your bucket; Redshift automatically scans all the files in the bucket. Bear in mind that query performance may not be as good as with data loaded via COPY, but what you gain is that no scheduler is needed.
Also, once you have an external table, you can load it into Redshift once with a single CREATE TABLE AS SELECT ... FROM your_external_table. The benefit of that approach is that it's idempotent: you don't need to keep track of your files, and it will always load all data from all files in the bucket (a sketch follows).
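A rough sketch of that alternative, assuming an external schema s3_ext has already been created with create external schema ... from data catalog, and that the files sit under a single prefix; the table, columns and path below are made up:
-- External table over the raw CSV files in the bucket (names/paths are placeholders).
create external table s3_ext.daily_events (
    event_id   bigint,
    event_time timestamp,
    payload    varchar(1000)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/daily-events/'
table properties ('skip.header.line.count' = '1');

-- One-off, idempotent load: always reads whatever files are currently in the bucket.
create table daily_events as
select * from s3_ext.daily_events;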